After running JupyterHub for several months, we recently noticed the following warnings with high (>5s) durations:
[W 2025-10-14 11:57:17.713 JupyterHub metrics:404] Event loop was unresponsive for at least 1.33s!
[W 2025-10-14 11:59:09.961 JupyterHub metrics:404] Event loop was unresponsive for at least 7.73s!
[W 2025-10-14 11:59:13.949 JupyterHub metrics:404] Event loop was unresponsive for at least 3.94s!
[W 2025-10-14 12:01:43.139 JupyterHub metrics:404] Event loop was unresponsive for at least 1.32s!
[W 2025-10-14 12:17:24.924 JupyterHub metrics:404] Event loop was unresponsive for at least 5.59s!
[W 2025-10-14 12:25:53.979 JupyterHub metrics:404] Event loop was unresponsive for at least 8.69s!
[W 2025-10-14 12:25:57.142 JupyterHub metrics:404] Event loop was unresponsive for at least 3.11s!
[W 2025-10-14 12:26:17.224 JupyterHub metrics:404] Event loop was unresponsive for at least 1.01s!
[W 2025-10-14 12:26:19.054 JupyterHub metrics:404] Event loop was unresponsive for at least 1.78s!
[W 2025-10-14 12:42:19.750 JupyterHub metrics:404] Event loop was unresponsive for at least 6.51s!
[W 2025-10-14 12:42:25.384 JupyterHub metrics:404] Event loop was unresponsive for at least 5.58s!
[W 2025-10-14 12:42:32.950 JupyterHub metrics:404] Event loop was unresponsive for at least 7.52s!
[W 2025-10-14 13:06:49.393 JupyterHub metrics:404] Event loop was unresponsive for at least 18.45s!
[W 2025-10-14 13:06:57.693 JupyterHub metrics:404] Event loop was unresponsive for at least 4.69s!
[W 2025-10-14 13:07:00.573 JupyterHub metrics:404] Event loop was unresponsive for at least 2.83s!
Because some stalls lasted 40 to 120 seconds, the hub crashed twice (reason: Error, exit code: 137). We already tried to investigate the logs. However, many errors occurred at the same time (e.g., [W 2025-10-14 11:34:21.951 JupyterHub proxy:944] api_request to the proxy failed with status code 599, retrying..., API requests from the culler timing out, hub-managed services taking several seconds to respond, …), most likely because the event loop was unresponsive, so it’s hard to pinpoint the root cause.
So what does it mean that the event loop is unresponsive? And which factors influence the responsiveness?
We also see increased hub response latency during the warnings, but don’t know if that’s the cause or impact.
I remember this sort of message from our JupyterHub deployment at my previous job. Which proxy variant are you using: CHP or Traefik? In our case, this behaviour appeared after a few months of uninterrupted running of the JupyterHub and proxy services. CHP has a known memory-leak issue, and I also noticed a couple of times that CHP had too many open file descriptors due to unclosed sockets.
“Event loop unresponsive” means the loop is blocked by some synchronous function call. In a regular scenario that can be a blocking function doing heavy computational work. However, that won’t be the case for JupyterHub, as there are no compute-intensive tasks here. I assume the blocking is caused by some socket-related work.
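To illustrate, here is a minimal sketch (my own reproduction, not JupyterHub’s actual code) of how such warnings get produced: a watchdog coroutine expects to wake every `interval` seconds, and any synchronous call running on the loop delays that wake-up by roughly its own duration.

```python
import asyncio
import time

async def watchdog(interval, lags):
    while True:
        start = time.monotonic()
        await asyncio.sleep(interval)
        # how much later than scheduled did we actually wake up?
        lags.append(time.monotonic() - start - interval)

async def main():
    lags = []
    task = asyncio.create_task(watchdog(0.1, lags))
    await asyncio.sleep(0.15)   # watchdog runs normally, lag is near zero
    time.sleep(0.5)             # blocking call: nothing else runs meanwhile
    await asyncio.sleep(0.15)   # watchdog recovers and reports the stall
    task.cancel()
    return max(lags)

worst_lag = asyncio.run(main())
print(f"worst observed loop lag: {worst_lag:.2f}s")
```

The watchdog never sees *what* blocked the loop, only that it was blocked, which matches what you observe: the warning reports the stall, while its cause has to be found elsewhere.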
I recommend looking into the memory usage and open file descriptors of JupyterHub and CHP the next time this problem occurs!
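For a quick spot-check, a stdlib-only snippet like the following (Linux-only, since it reads `/proc`; run it inside the hub or proxy container, or pass the target process’s pid) collects the two numbers suggested above:

```python
import os
import resource

def resource_snapshot(pid=None):
    """Return open-fd count, RSS, and the fd soft limit for a process (Linux)."""
    pid = pid if pid is not None else os.getpid()
    # open file descriptors: one entry per fd under /proc/<pid>/fd
    n_fds = len(os.listdir(f"/proc/{pid}/fd"))
    # resident set size in kB: the VmRSS field in /proc/<pid>/status
    rss_kb = 0
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                rss_kb = int(line.split()[1])
                break
    soft_limit, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {"pid": pid, "open_fds": n_fds, "rss_kb": rss_kb, "fd_limit": soft_limit}

print(resource_snapshot())
```

If `open_fds` creeps toward `fd_limit` over days of uptime, that points at the leaked-socket scenario described above.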
Unfortunately, I’m back with the same issue. Sometimes, the event loop becomes unresponsive, which leads to crashes. JupyterHub is no longer reachable, and the logs are flooded with timeout errors.
Since the logs won’t help me find the cause, I guess I’ll have to profile our prod deployment. Does someone already have experience profiling JupyterHub (on Kubernetes), e.g., using cProfile or pyinstrument? I’d also be glad about any further hints and possible causes.
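Before full profiling, a cheaper first step worth trying (this is plain asyncio, not a JupyterHub feature): asyncio’s debug mode logs every callback that blocks the loop longer than `slow_callback_duration`, including its source location, which often identifies the blocking call directly. On Kubernetes you could enable it for the hub by setting `PYTHONASYNCIODEBUG=1` on the pod. A self-contained demonstration:

```python
import asyncio
import logging
import time

# capture asyncio's slow-callback warnings so we can inspect them
slow_messages = []

class _Capture(logging.Handler):
    def emit(self, record):
        slow_messages.append(record.getMessage())

logging.getLogger("asyncio").addHandler(_Capture())

def blocking_callback():
    time.sleep(0.3)  # stands in for whatever blocks the real hub

async def main():
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.1  # warn above 100 ms
    loop.call_soon(blocking_callback)
    await asyncio.sleep(0.5)

# debug=True makes the loop time every callback it executes
asyncio.run(main(), debug=True)
print(slow_messages)
```

For a running pod where you can’t restart with debug enabled, a sampling profiler that attaches to an existing pid (e.g. `py-spy dump --pid <pid>` to see where the process is stuck right now) might also help, assuming the container allows the required ptrace capability.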
Edit: I just restarted the pod and experienced a four-minute lag during startup…
...
[D 2026-02-04 15:13:58.503 JupyterHub app:1998] Connecting to db: postgresql+psycopg2://
[D 2026-02-04 15:13:58.603 JupyterHub orm:1509] database schema version found: 4621fec11365
[D 2026-02-04 15:13:58.647 JupyterHub app:2338] Loading roles into database
[D 2026-02-04 15:13:58.647 JupyterHub app:2347] Loading role jupyterhub-idle-culler
[D 2026-02-04 15:13:58.650 JupyterHub app:2347] Loading role admin
[D 2026-02-04 15:13:58.650 JupyterHub app:2349] Overriding default role admin
[D 2026-02-04 15:13:58.650 JupyterHub app:2347] Loading role user
[D 2026-02-04 15:13:58.650 JupyterHub app:2349] Overriding default role user
[D 2026-02-04 15:13:58.651 JupyterHub app:2347] Loading role prometheus-client
[D 2026-02-04 15:13:58.651 JupyterHub app:2347] Loading role admin-reader
[D 2026-02-04 15:13:58.652 JupyterHub app:2347] Loading role announcement
[I 2026-02-04 15:17:56.993 JupyterHub app:2919] Creating service jupyterhub-idle-culler without oauth.
[I 2026-02-04 15:17:56.997 JupyterHub app:2881] Creating service profilemanagement with oauth_client_id=service-profilemanagement
[I 2026-02-04 15:17:57.000 JupyterHub provider:663] Updating oauth client service-profilemanagement
[I 2026-02-04 15:17:57.200 JupyterHub app:2919] Creating service prometheus-client without oauth.
[I 2026-02-04 15:17:57.202 JupyterHub app:2919] Creating service admin-reader without oauth.
...
It looks like the startup delay occurs in the area above: almost four minutes pass between loading the last role (15:13:58) and creating the first service (15:17:56).
How many users or groups do you have, and what authenticator are you using? Does the authenticator make any external calls? Can you also check there’s no evidence of your database temporarily running out of resources?
We have about 8000 users, two groups, and are using the generic OAuthenticator. The database had consistent CPU and memory utilization.
Out of curiosity, I just restarted the hub again, and the startup delay dropped from 4 minutes to 14 seconds. So it’s definitely some sort of database issue (Postgres, by the way). However, since it’s not due to missing resources, I guess I have to dig deeper into the database and monitor how long each query runs. I’ll report back once the monitoring is up and the issue re-occurs…
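For the query-timing part, one option (a sketch under the assumption that you can hook the hub’s engine, e.g. from `jupyterhub_config.py`; JupyterHub talks to Postgres through SQLAlchemy, and the 0.5s threshold is my own arbitrary choice) is SQLAlchemy’s cursor-execute events:

```python
import time
from sqlalchemy import create_engine, event, text

slow_queries = []

def attach_query_timer(engine, threshold=0.5):
    """Record every statement on `engine` that takes longer than `threshold` seconds."""

    @event.listens_for(engine, "before_cursor_execute")
    def before(conn, cursor, statement, parameters, context, executemany):
        conn.info["query_start"] = time.monotonic()

    @event.listens_for(engine, "after_cursor_execute")
    def after(conn, cursor, statement, parameters, context, executemany):
        elapsed = time.monotonic() - conn.info.pop("query_start")
        if elapsed >= threshold:
            slow_queries.append((elapsed, statement))

# demo with in-memory SQLite; in production, pass your postgresql+psycopg2 URL
engine = create_engine("sqlite://")
attach_query_timer(engine, threshold=0.0)  # record everything for the demo
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
print(slow_queries)
```

Alternatively, Postgres can do this server-side without touching the hub at all: setting `log_min_duration_statement` (in milliseconds) makes the database log every statement exceeding that duration.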