For months now we’ve been seeing that the kernel sometimes can’t connect, leaving our students dead in the water. Sometimes a page reload solves the issue, sometimes it doesn’t. We’ve turned our system inside out but can’t figure out what might be causing this. We do have a somewhat unusual setup, but we can’t see how that might influence things.
What we have:
We use a custom system that creates new hubs automatically (it’s a long story as to why we do this). We have a number of machines with dedicated ‘agents’ that create and start those hubs. Each of those machines has haproxy installed with 20 pre-configured routes. Each route binds to a public SSL-enabled port and has a static forward to an internal port, which is where we run our hubs.
We run the latest versions of the hub and labs. We did create custom lab images, since our faculty wanted custom libraries installed and it was easier to pre-install them for students than to have them do this in every notebook.
The configuration for one of our haproxy backend entries looks something like this:
backend jp00 from unnamed_defaults_1
    mode http
    option http-server-close
    option forwardfor
    option redispatch
    http-request set-header X-Client-IP %[src]
    server hub00 127.0.0.1:10000
    retries 3
    http-response set-header Content-Security-Policy "frame-ancestors *"
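To make the routing easier to picture, here is a minimal Python sketch that renders frontend/backend pairs shaped like the entry above. It is not our actual tooling. The names `fe00`, `PUBLIC_BASE` and `CERT_PATH` are assumptions made up for illustration; only the `jp00` / `hub00` / `127.0.0.1:10000` pattern comes from our real config, and the `from unnamed_defaults_1` suffix is left out.

```python
# Sketch only: PUBLIC_BASE, CERT_PATH and the "feNN" names are hypothetical.
PUBLIC_BASE = 8000                          # assumed first public SSL port
INTERNAL_BASE = 10000                       # hub00 listens on 127.0.0.1:10000
CERT_PATH = "/etc/haproxy/certs/site.pem"   # assumed certificate bundle
ROUTE_COUNT = 20                            # routes per machine

# Directives copied from the backend entry shown in the post.
BACKEND_OPTIONS = [
    "mode http",
    "option http-server-close",
    "option forwardfor",
    "option redispatch",
    "http-request set-header X-Client-IP %[src]",
    "retries 3",
    'http-response set-header Content-Security-Policy "frame-ancestors *"',
]


def render_route(index: int) -> str:
    """Render one public-port -> internal-hub mapping as haproxy stanzas."""
    name = f"{index:02d}"
    lines = [
        f"frontend fe{name}",
        "    mode http",
        f"    bind *:{PUBLIC_BASE + index} ssl crt {CERT_PATH}",
        f"    default_backend jp{name}",
        "",
        f"backend jp{name}",
    ]
    lines += [f"    {option}" for option in BACKEND_OPTIONS]
    lines.append(f"    server hub{name} 127.0.0.1:{INTERNAL_BASE + index}")
    return "\n".join(lines)


def render_all(count: int = ROUTE_COUNT) -> str:
    """Render all routes for one machine, separated by blank lines."""
    return "\n\n".join(render_route(i) for i in range(count))
```

Calling `print(render_all())` would emit 20 such pairs. The point is that every hub on a machine is reachable under the same hostname and differs only by port.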
We use Canvas as our LMS and our system is designed to embed Jupyter Notebooks directly in a Canvas page. Our main service creates a page that is added to the Canvas page through LTI. This page then asks the main service to create a full url to the hub, which is then loaded in a sub frame. All servers involved should have the proper frame-ancestors setting.
What we’re seeing (two things):
- Random kernel not connecting
Every now and then a student will start their notebook and the kernel won’t connect. The only evidence we have is that we see the websocket retrying and giving up.
In the docker log of the lab belonging to the student we see that token authentication failed. The token is said to be invalid, which we can demonstrate is not the case, since we can make API calls using the created token. Also, the problem sometimes resolves itself, so the token must be present and valid.
In the docker lab log this appears as:
[W 2025-03-20 18:19:58.224 ServerApp] Token stored in cookie may have expired
[W 2025-03-20 18:19:58.224 ServerApp] Couldn't authenticate WebSocket connection
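For reference, the kind of API check we mean when we say the token works can be sketched roughly as follows. This is a simplified stand-in, not our production code. It assumes the token is sent in an `Authorization: token ...` header to the hub’s REST endpoint `/hub/api/user`; the host name and token value in the usage note are placeholders.

```python
import urllib.error
import urllib.request


def build_request(base_url: str, path: str, token: str) -> urllib.request.Request:
    """Build a GET request that authenticates with an API token header."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(url, headers={"Authorization": f"token {token}"})


def check_token(base_url: str, path: str, token: str, timeout: float = 10.0) -> int:
    """Return the HTTP status the server gives for this token.

    200 means the token was accepted; 401 or 403 means it was rejected.
    """
    request = build_request(base_url, path, token)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what we care about.
        return err.code
```

A call such as `check_token("https://hub.example.edu:8000", "/hub/api/user", "<token>")` returning 200 while the websocket is being refused at the same moment would suggest that the token itself is fine and that the cookie the browser sends is the part that is stale.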
We also see strange things related to cookies. In the network panel we see cookies being created that are already invalid upon creation.
Furthermore there is some strangeness with the endpoint the hub uses for session cookies. We see two endpoints being used:
For some reason one includes the port and the other does not, and we see two complete duplicates of the session cookies. We suspect this might be the culprit, but we have no clue what’s causing this or how to change it (or whether we should).
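One thing that may matter here: browsers scope cookies by host, domain and path, but not by port (RFC 6265 says cookies do not provide isolation by port). If several of our hubs are reached under the same hostname and differ only by port, they could be overwriting each other’s cookies. Below is a minimal sketch for checking captured `Set-Cookie` headers for that pattern. The host names and cookie values in the usage note are invented, and the default path is simplified to `/`, which real browsers derive from the request URL.

```python
from collections import defaultdict
from http.cookies import SimpleCookie
from urllib.parse import urlsplit


def cookie_keys(origin_url: str, set_cookie: str):
    """Yield the (domain, path, name) keys a browser would store cookies under.

    The port of origin_url is deliberately ignored, as browsers do.
    Simplification: a missing Path attribute is treated as "/".
    """
    host = urlsplit(origin_url).hostname
    jar = SimpleCookie()
    jar.load(set_cookie)
    for name, morsel in jar.items():
        domain = morsel["domain"].lstrip(".") or host
        yield (domain, morsel["path"] or "/", name)


def find_collisions(observations):
    """Return cookie keys that were set by more than one host:port origin.

    observations is a list of (response_url, set_cookie_header) pairs,
    for example copied from the browser's network panel.
    """
    seen = defaultdict(list)
    for url, header in observations:
        origin = urlsplit(url).netloc  # host:port, kept only for reporting
        for key in cookie_keys(url, header):
            if origin not in seen[key]:
                seen[key].append(origin)
    return {key: origins for key, origins in seen.items() if len(origins) > 1}
```

For example, feeding it `jupyterhub-session-id` cookies set by `https://hub.example.edu:8001/hub/login` and `https://hub.example.edu:8002/hub/login` reports one collision listing both origins, while the same cookie name set from a different hostname is not reported.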
- (secondary issue) On Chrome (on some machines) the hub goes into an infinite redirect. Our embedded Canvas iframe page loads fine. The hub url seems to load, but when it goes through its own redirects (from hub to login to lab) it instead gets stuck and redirects from hub to login, back to hub and so on.
If we take the hub url we created and assigned to the iframe and load that in a separate tab it works fine. On Firefox we’ve never seen this issue.
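To compare the iframe case with the separate-tab case, the redirect chain can be traced step by step. Here is a minimal sketch, assuming a caller-supplied `fetch` callback (hypothetical, not a library function) that performs one request without following redirects and returns the status code and `Location` header. Running it once with the browser’s cookies and once without might show whether the loop appears as soon as the cookies are missing, as they would be if Chrome blocks them inside the frame.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

# fetch(url) -> (status_code, location_header_or_None); supplied by the caller.
Fetch = Callable[[str], Tuple[int, Optional[str]]]

REDIRECT_CODES = (301, 302, 303, 307, 308)


def trace_redirects(start_url: str, fetch: Fetch, max_hops: int = 20) -> Tuple[List[str], bool]:
    """Follow redirects manually and report whether they loop.

    Returns (chain, looped). looped is True when a URL repeats or when
    max_hops redirects were followed without reaching a final page.
    """
    chain = [start_url]
    url = start_url
    for _ in range(max_hops):
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return chain, False           # reached a non-redirect response
        url = urljoin(url, location)      # resolve relative Location headers
        looped = url in chain
        chain.append(url)
        if looped:
            return chain, True            # same URL seen twice: a loop
    return chain, True                    # too many hops: treat as a loop
```

The `fetch` callback can be backed by any HTTP client, or by a table of responses copied from the network panel, which keeps the tracing logic testable without touching a live hub.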
Some of the things we’ve tried (that didn’t work):
- Removed all add-ons from the browsers we’re testing with
- Tried different operating systems
- Stripped the haproxy config to not have any extraneous settings. For example, we had a cookie set per backend. Our main hypothesis is that all of this is cookie related
- Removed any code that might interfere with the websocket or the operation of the hub and lab. For example, the frame that holds the hub used to send a keepalive. We removed all interaction with the server after the hub url is assigned to the frame.