Kernel not able to connect randomly

For months now we’ve been seeing the kernel intermittently fail to connect, which leaves our students dead in the water. Sometimes a page reload solves the issue, sometimes it doesn’t. We’ve turned our system inside out but can’t figure out what might be causing this. We do have a somewhat unusual setup, but we can’t see how that would influence things.

What we have:

We use a custom system that creates new hubs automatically (it’s a long story as to why we do this). We have a number of machines with dedicated ‘agents’ that create and start those hubs. Those machines each have haproxy installed with 20 pre-configured routes. Each route binds to a public SSL enabled port and then has a static forward to an internal port, which is where we run our hubs.

We run the latest versions of the hub and labs. We did create custom lab images, since our faculty wanted custom libraries installed and it was easier to pre-install those for students than to have them do this for every Notebook.

The configuration for one of our haproxy backend entries looks something like this:

backend jp00 from unnamed_defaults_1
  mode http
  option http-server-close
  option forwardfor
  option redispatch
  http-request set-header X-Client-IP %[src]
  server hub00 127.0.0.1:10000
  retries 3
  http-response set-header Content-Security-Policy "frame-ancestors *"

We use Canvas as our LMS and our system is designed to embed Jupyter Notebooks directly in a Canvas page. Our main service creates a page that is added to the Canvas page through LTI. This page then asks the main service to create a full url to the hub, which is then loaded in a sub frame. All servers involved should have the proper frame-ancestors setting.

What we’re seeing (two things):

  1. Random kernel not connecting

Every now and then a student will start their notebook and the kernel won’t connect. The only evidence we have is that we see the websocket retrying and giving up.

In the docker log of the student’s lab we see that token authentication failed. The token is reported as invalid, which we can demonstrate is not the case, since we can make API calls using that same token. Also, the problem sometimes resolves itself, so the token must be present and valid.

In the docker lab log this appears as:

[W 2025-03-20 18:19:58.224 ServerApp] Token stored in cookie may have expired
[W 2025-03-20 18:19:58.224 ServerApp] Couldn't authenticate WebSocket connection

We also see strange things related to cookies. In the network panel we see cookies being created that are already invalid upon creation.
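A cookie that arrives with `Max-Age=0` or an `Expires` date in the past is the standard way for a server to delete a cookie, so some of these may be deliberate clearing cookies rather than faults. A minimal sketch for checking a `Set-Cookie` value copied from the network panel, using only the Python standard library (the function name `expired_on_arrival` is made up for illustration):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from http.cookies import SimpleCookie


def expired_on_arrival(set_cookie_value, now=None):
    """Map each cookie name in a Set-Cookie value to True if it is already expired."""
    now = now or datetime.now(timezone.utc)
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    result = {}
    for name, morsel in jar.items():
        max_age = morsel["max-age"]
        expires = morsel["expires"]
        if max_age != "":
            # Max-Age takes precedence over Expires when both are present.
            result[name] = int(max_age) <= 0
        elif expires:
            when = parsedate_to_datetime(expires)
            if when.tzinfo is None:
                # Treat a date without an offset as UTC.
                when = when.replace(tzinfo=timezone.utc)
            result[name] = when <= now
        else:
            # No expiry attributes: a session cookie, valid until the browser closes.
            result[name] = False
    return result
```

If every "invalid upon creation" cookie turns out to be a clearing cookie that is followed by a fresh one with the same name, that part is probably normal login behaviour and not the cause.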

Furthermore there is some strangeness with the endpoint the hub uses for session cookies. We see two endpoints being used:

For some reason one of them includes the port and the other does not, and we see two complete duplicates of the session cookies. We suspect this might be the culprit, but we have no clue what’s causing it or how to change it (or whether we should).

  2. (secondary issue) On Chrome (on some machines) the hub goes into an infinite redirect. Our embedded Canvas iframe page loads fine. The hub url seems to load, but when it goes through its own redirects (from hub to login to lab) it gets stuck and redirects from hub to login, back to hub, and so on.

If we take the hub url we created and assigned to the iframe and load that in a separate tab it works fine. On Firefox we’ve never seen this issue.

Some of the things we’ve tried (that didn’t work):

  1. Removed all add-ons from the browsers we’re testing with
  2. Tried different operating systems
  3. Stripped the haproxy config of any extraneous settings. For example, we had a cookie set per backend. Our main hypothesis is that all of this is cookie related
  4. Removed any code that might interfere with the websocket or the operation of the hub and lab. For example, the frame that holds the hub used to send a keepalive. We removed all interaction with the server after the hub url is assigned to the frame

How many front-end Haproxies do you have? Are they load-balanced and users connect to them randomly, or independent and a user only ever connects to a single Haproxy instance?

A user only ever connects to a dedicated url. They will hit haproxy on that server but there is no load balancing going on at the backend. In fact this is what one of my entries looks like:

backend jp00
    option forwardfor
    http-request set-header X-Client-IP %[src]
    http-response set-header Content-Security-Policy "frame-ancestors *"
    option http-server-close
    option redispatch
    retries 3
    server hub00 127.0.0.1:10000

The JupyterHub logs are the most likely to be helpful here.

Do you have any cookie or token-related configuration in your jupyterhub config? Especially related to expiration (max_age_days, etc.)?

Can you share JupyterHub logs surrounding the

Token stored in cookie may have expired

message? In ServerApp logs leading up to this, do you also see

No Hub user identified for request

prior to the “Token stored…” message in the ServerApp logs?
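To answer that quickly across many lab containers, something like the following could scan a saved ServerApp log. It is only a sketch: the line pattern is inferred from the two log lines quoted earlier in this thread, and `find_expired_token_events` and its `lookback` parameter are hypothetical names.

```python
import re
from collections import deque

# Pattern inferred from lines such as:
# [W 2025-03-20 18:19:58.224 ServerApp] Token stored in cookie may have expired
LOG_LINE = re.compile(
    r"^\[(?P<level>[A-Z]) "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<app>\w+)\] (?P<msg>.*)$"
)
EXPIRED = "Token stored in cookie may have expired"
NO_USER = "No Hub user identified for request"


def find_expired_token_events(lines, lookback=5):
    """Return (timestamp, preceded_by_no_user) for each 'may have expired' line.

    preceded_by_no_user is True when one of the previous `lookback`
    parsed log messages contains the 'No Hub user identified' text.
    """
    recent = deque(maxlen=lookback)
    events = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip tracebacks and other non-standard lines
        msg = match.group("msg")
        if EXPIRED in msg:
            preceded = any(NO_USER in earlier for earlier in recent)
            events.append((match.group("ts"), preceded))
        recent.append(msg)
    return events
```

Feeding it the output of `docker logs` for one lab container gives a list of timestamps that can then be lined up against the Hub log.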

This sounds like an oauth cookie contains an expired token, in which case refreshing the page would go through oauth again, set a new cookie, and be valid for another period of time. You can trigger this manually by visiting the tokens page and revoking oauth tokens while you have a JupyterLab session open (it may take up to 5 minutes for the auth cache to expire before it errors).

Is there any other system or service trying to clean up tokens?

If a token is immediately invalid upon login, checking with the JupyterHub logs from oauth up to and including the 403 GET /hub/api/user would help. If refreshing the page doesn’t trigger a new oauth, that would be very weird and good to know.
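To reproduce that check outside the browser, a token can be sent straight to the Hub’s `/hub/api/user` endpoint. This is a rough sketch using only the standard library. It assumes the Hub is served at the root of the given URL (no extra base URL prefix), and the helper names are invented for this example:

```python
import json
import urllib.error
import urllib.request


def build_whoami_request(hub_base_url, token):
    """Build a GET request for /hub/api/user authenticated with the given token."""
    url = hub_base_url.rstrip("/") + "/hub/api/user"
    return urllib.request.Request(url, headers={"Authorization": f"token {token}"})


def check_token(hub_base_url, token, timeout=10):
    """Return (status_code, user_model_or_None) for the given token."""
    request = build_whoami_request(hub_base_url, token)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, json.load(response)
    except urllib.error.HTTPError as err:
        # A 403 here means the Hub itself rejects the token.
        return err.code, None
```

Running `check_token` with the token from a failing session, at the moment the websocket is being refused, would show whether the Hub or only the single-user server considers it invalid.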

Cookies are per host, not per port, so all ports on the same host share the same cookies. The browser is doing this, not anything on the server side. It is odd that your server appears to be being accessed on multiple ports directly from the browser.
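As a simplified illustration of that rule (it ignores the `Domain`, `Path` and `Secure` attributes, and the hostnames are placeholders), two URLs end up in the same cookie scope whenever their hostnames match, regardless of port:

```python
from urllib.parse import urlsplit


def cookie_host(url):
    """Return the lowercased hostname, which is what browsers key cookies on."""
    return urlsplit(url).hostname


def share_cookies(url_a, url_b):
    """True if both URLs are on the same host, whatever their ports are."""
    return cookie_host(url_a) == cookie_host(url_b)


if __name__ == "__main__":
    # Same host on two different ports: the browser uses one cookie scope.
    print(share_cookies("https://hub.example.edu:8443/hub/", "https://hub.example.edu/hub/"))
    # Different hosts: separate cookie scopes.
    print(share_cookies("https://hub.example.edu:8443/hub/", "https://other.example.edu:8443/hub/"))
```

With several hubs behind different ports on one machine, this means cookies with the same name and path set by one hub could overwrite those set by another in the same browser, which may be worth ruling out.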

For the redirect, the Hub logs will be the most useful to figure out what’s going on. What Authenticator are you using? Seeing a URL with /hub/?token=... doesn’t make a lot of sense, since JupyterHub doesn’t allow logging in with a token in the URL by default. But it would seem that however the Authenticator is getting a user from URL parameters is not quite behaving as it should. I suspect this is in the Authenticator itself.