It’s a bit of a difficult one to describe, but I’m running JupyterHub on EKS using Karpenter for autoscaling.
Occasionally, a user will get kicked out of the hub. This typically happens in the following order:
Proxy pod has an ECONNRESET or ECONNREFUSED error (example included below)
16:09:25.536 [ConfigProxy] error: 503 GET /user/{username}/api/kernels/b2630080-e9eb-4d61-858e-0759be01d816/channels read ECONNRESET
503 errors are reported on the hub and, accordingly, on the frontend.
I’ve now enabled intense debugging by setting
debug:
  enabled: true
on the Helm chart. Waiting for the error to pop up again so that I can report this in greater depth.
Off the bat, I’ve noticed the following logs on my Hub:
[D 2025-02-10 20:05:30.428 JupyterHub log:191] 200 GET /hub/health (@10.0.128.38) 0.65ms
[D 2025-02-10 20:05:32.429 JupyterHub log:191] 200 GET /hub/health (@10.0.128.38) 0.86ms
[D 2025-02-10 20:05:34.429 JupyterHub log:191] 200 GET /hub/health (@10.0.128.38) 1.48ms
[D 2025-02-10 20:05:34.430 JupyterHub log:191] 200 GET /hub/health (@10.0.128.38) 1.60ms
[D 2025-02-10 20:05:36.427 JupyterHub log:191] 200 GET /hub/health (@10.0.128.38) 0.65ms
[D 2025-02-10 20:05:38.106 JupyterHub reflector:374] pods watcher timeout
[D 2025-02-10 20:05:38.106 JupyterHub reflector:289] Connecting pods watcher
What’s the pods watcher and why is it timing out? Is this normal?
Happy to share any additional information to help figure this issue out. If anyone has experienced this, what are other steps that I can take?
My current helm configuration:
debug:
  enabled: true
cull:
  enabled: false
proxy:
  chp:
    nodeSelector:
      node.kubernetes.io/instance-type: t3.large
  https:
    enabled: true
    hosts:
      - ...
    letsencrypt:
      contactEmail: ...
  debug:
    enabled: true
scheduling:
  userScheduler:
    enabled: false
hub:
  nodeSelector:
    node.kubernetes.io/instance-type: t3.large
  existingSecret: jupyter-oauth-config
  db:
    pvc:
      storageClassName: gp2
  extraConfig:
    auth: |
      <CUSTOM AUTH SCRIPT>
prePuller:
  hook:
    enabled: false
singleuser:
  storage:
    type: dynamic
    capacity: "30Gi"
    dynamic:
      volumeNameTemplate: "jhub-{username}"
      storageClass: jupyter-sc
      pvcNameTemplate: "jhub-{username}"
    extraVolumeMounts:
      - mountPath: /home/jovyan/shared
        name: shared-data
        subPath: shared
    extraVolumes:
      - name: shared-data
        persistentVolumeClaim:
          claimName: jupyter-efs-shared
  profileList:
    ...
Could you share more logs from when users get kicked out? Ideally logs that precede the event and logs after it, in DEBUG mode.
Hey @mahendrapaipuri , thanks for responding. I’ve attached some logs below.
The hub:
2025-02-13 13:22:46.370 [D 2025-02-13 18:22:46.369 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.64ms
2025-02-13 13:22:46.120 [D 2025-02-13 18:22:46.120 JupyterHub reflector:289] Connecting events watcher
2025-02-13 13:22:46.120 [D 2025-02-13 18:22:46.120 JupyterHub reflector:374] events watcher timeout
2025-02-13 13:22:45.093 [D 2025-02-13 18:22:45.093 JupyterHub user:1005] Finished stopping sofia
2025-02-13 13:22:45.084 [D 2025-02-13 18:22:45.084 JupyterHub user:1002] Deleting oauth client jupyterhub-user-sofia
2025-02-13 13:22:45.079 [D 2025-02-13 18:22:45.078 JupyterHub user:982] Stopping sofia
2025-02-13 13:22:45.075 [D 2025-02-13 18:22:45.075 JupyterHub proxy:880] Proxy: Fetching DELETE http://proxy-api:8001/api/routes/user/sofia
2025-02-13 13:22:45.075 [I 2025-02-13 18:22:45.075 JupyterHub proxy:356] Removing user sofia from proxy (/user/sofia/)
2025-02-13 13:22:45.075 [W 2025-02-13 18:22:45.075 JupyterHub base:1154] User sofia server stopped, with exit code: 1
2025-02-13 13:22:44.369 [D 2025-02-13 18:22:44.369 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.55ms
2025-02-13 13:22:42.588 [I 2025-02-13 18:22:42.588 JupyterHub log:191] 200 GET /hub/error/503?url=%2Fuser%2Fsofia%2Fapi%2Fkernels%3F1739470944550 (@10.0.0.114) 0.97ms
2025-02-13 13:22:42.588 [D 2025-02-13 18:22:42.588 JupyterHub pages:584] No template for 503
2025-02-13 13:22:42.586 [I 2025-02-13 18:22:42.586 JupyterHub log:191] 200 GET /hub/error/503?url=%2Fuser%2Fsofia%2Fapi%2Fterminals%3F1739470943736 (@10.0.0.114) 1.02ms
2025-02-13 13:22:42.585 [D 2025-02-13 18:22:42.585 JupyterHub pages:584] No template for 503
2025-02-13 13:22:42.584 [I 2025-02-13 18:22:42.584 JupyterHub log:191] 200 GET /hub/error/503?url=%2Fuser%2Fsofia%2Flab%2Fapi%2Fworkspaces%3F1739470943751 (@10.0.0.114) 1.47ms
2025-02-13 13:22:42.583 [D 2025-02-13 18:22:42.583 JupyterHub pages:584] No template for 503
2025-02-13 13:22:42.370 [D 2025-02-13 18:22:42.370 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.97ms
2025-02-13 13:22:42.369 [D 2025-02-13 18:22:42.369 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.83ms
2025-02-13 13:22:40.369 [D 2025-02-13 18:22:40.369 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.59ms
2025-02-13 13:22:39.239 [D 2025-02-13 18:22:39.239 JupyterHub proxy:392] Checking routes
2025-02-13 13:22:39.226 [D 2025-02-13 18:22:39.226 JupyterHub proxy:880] Proxy: Fetching GET http://proxy-api:8001/api/routes
2025-02-13 13:22:38.744 [I 2025-02-13 18:22:38.743 JupyterHub log:191] 200 GET /hub/error/503?url=%2Fuser%2Fsofia%2Fapi%2Fmetrics%2Fv1%3F1739470939774 (@10.0.0.114) 1.23ms
2025-02-13 13:22:38.743 [D 2025-02-13 18:22:38.743 JupyterHub pages:584] No template for 503
2025-02-13 13:22:38.742 [I 2025-02-13 18:22:38.742 JupyterHub log:191] 200 GET /hub/error/503?url=%2Fuser%2Fsofia%2Fapi%2Fmetrics%2Fv1%3F1739470939876 (@10.0.0.114) 1.55ms
2025-02-13 13:22:38.741 [D 2025-02-13 18:22:38.741 JupyterHub pages:584] No template for 503
2025-02-13 13:22:38.369 [D 2025-02-13 18:22:38.369 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.60ms
2025-02-13 13:22:36.480 [D 2025-02-13 18:22:36.480 JupyterHub reflector:289] Connecting pods watcher
2025-02-13 13:22:36.480 [D 2025-02-13 18:22:36.480 JupyterHub reflector:374] pods watcher timeout
2025-02-13 13:22:36.369 [D 2025-02-13 18:22:36.369 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.57ms
2025-02-13 13:22:36.103 [D 2025-02-13 18:22:36.103 JupyterHub reflector:289] Connecting events watcher
2025-02-13 13:22:36.103 [D 2025-02-13 18:22:36.103 JupyterHub reflector:374] events watcher timeout
2025-02-13 13:22:34.966 [I 2025-02-13 18:22:34.966 JupyterHub log:191] 200 GET /hub/error/503?url=%2Fuser%2Fsofia%2Fapi%2Fsessions%3F1739470944550 (@10.0.0.114) 1.45ms
2025-02-13 13:22:34.965 [D 2025-02-13 18:22:34.965 JupyterHub pages:584] No template for 503
2025-02-13 13:22:34.369 [D 2025-02-13 18:22:34.369 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.54ms
2025-02-13 13:22:32.370 [D 2025-02-13 18:22:32.370 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.96ms
2025-02-13 13:22:32.370 [D 2025-02-13 18:22:32.369 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.79ms
2025-02-13 13:22:30.370 [D 2025-02-13 18:22:30.370 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.62ms
2025-02-13 13:22:28.369 [D 2025-02-13 18:22:28.369 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.64ms
2025-02-13 13:22:26.470 [D 2025-02-13 18:22:26.470 JupyterHub reflector:289] Connecting pods watcher
2025-02-13 13:22:26.470 [D 2025-02-13 18:22:26.469 JupyterHub reflector:374] pods watcher timeout
2025-02-13 13:22:26.370 [D 2025-02-13 18:22:26.370 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.73ms
2025-02-13 13:22:26.088 [D 2025-02-13 18:22:26.088 JupyterHub reflector:289] Connecting events watcher
2025-02-13 13:22:26.088 [D 2025-02-13 18:22:26.087 JupyterHub reflector:374] events watcher timeout
2025-02-13 13:22:24.982 [I 2025-02-13 18:22:24.982 JupyterHub log:191] 200 GET /hub/error/503?url=%2Fuser%2Fsofia%2Fapi%2Fkernels%2F431d4c9f-446e-462e-b892-b5cb10a3a236%2Fchannels%3Fsession_id%3D743ae84d-8c0a-4a9d-9d7b-501b23435ae1 (@10.0.0.114) 1.52ms
2025-02-13 13:22:24.981 [D 2025-02-13 18:22:24.981 JupyterHub pages:584] No template for 503
2025-02-13 13:22:24.369 [D 2025-02-13 18:22:24.368 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.61ms
2025-02-13 13:22:22.370 [D 2025-02-13 18:22:22.370 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.93ms
2025-02-13 13:22:22.370 [D 2025-02-13 18:22:22.369 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.88ms
2025-02-13 13:22:20.368 [D 2025-02-13 18:22:20.368 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.52ms
2025-02-13 13:22:18.369 [D 2025-02-13 18:22:18.369 JupyterHub log:191] 200 GET /hub/health (@10.0.128.210) 0.64ms
2025-02-13 13:22:16.459 [D 2025-02-13 18:22:16.459 JupyterHub reflector:289] Connecting pods watcher
2025-02-13 13:22:16.459 [D 2025-02-13 18:22:16.459 JupyterHub reflector:374] pods watcher timeout
The proxy:
2025-02-13 13:22:54.480 18:22:54.480 [ConfigProxy] debug: PROXY WEB /user/sofia/lab/api/workspaces to http://hub:8081
2025-02-13 13:22:45.078 18:22:45.078 [ConfigProxy] info: 204 DELETE /api/routes/user/sofia
2025-02-13 13:22:45.078 18:22:45.078 [ConfigProxy] info: Removing route /user/sofia
2025-02-13 13:22:43.927 18:22:43.927 [ConfigProxy] debug: PROXY WEB /user/sofia/api/metrics/v1 to http://10.0.128.108:8888
2025-02-13 13:22:43.926 18:22:43.926 [ConfigProxy] debug: PROXY WEB /user/sofia/api/metrics/v1 to http://10.0.128.108:8888
2025-02-13 13:22:42.589 18:22:42.589 [ConfigProxy] debug: Not recording activity for status 503 on /user/sofia
2025-02-13 13:22:42.587 18:22:42.587 [ConfigProxy] debug: Not recording activity for status 503 on /user/sofia
2025-02-13 13:22:42.585 18:22:42.585 [ConfigProxy] debug: Not recording activity for status 503 on /user/sofia
2025-02-13 13:22:42.583 18:22:42.583 [ConfigProxy] debug: Requesting custom error page: http://hub:8081/hub/error/503?url=%2Fuser%2Fsofia%2Fapi%2Fkernels%3F1739470944550
2025-02-13 13:22:42.583 18:22:42.583 [ConfigProxy] error: 503 GET /user/sofia/api/kernels connect EHOSTUNREACH 10.0.128.108:8888
2025-02-13 13:22:42.582 18:22:42.582 [ConfigProxy] debug: Requesting custom error page: http://hub:8081/hub/error/503?url=%2Fuser%2Fsofia%2Fapi%2Fterminals%3F1739470943736
2025-02-13 13:22:42.582 18:22:42.582 [ConfigProxy] error: 503 GET /user/sofia/api/terminals connect EHOSTUNREACH 10.0.128.108:8888
2025-02-13 13:22:42.581 18:22:42.580 [ConfigProxy] debug: Requesting custom error page: http://hub:8081/hub/error/503?url=%2Fuser%2Fsofia%2Flab%2Fapi%2Fworkspaces%3F1739470943751
2025-02-13 13:22:42.579 18:22:42.579 [ConfigProxy] error: 503 GET /user/sofia/lab/api/workspaces connect EHOSTUNREACH 10.0.128.108:8888
2025-02-13 13:22:39.230 18:22:39.229 [ConfigProxy] info: 200 GET /api/routes
2025-02-13 13:22:38.744 18:22:38.744 [ConfigProxy] debug: Not recording activity for status 503 on /user/sofia
2025-02-13 13:22:38.743 18:22:38.743 [ConfigProxy] debug: Not recording activity for status 503 on /user/sofia
2025-02-13 13:22:38.740 18:22:38.740 [ConfigProxy] debug: Requesting custom error page: http://hub:8081/hub/error/503?url=%2Fuser%2Fsofia%2Fapi%2Fmetrics%2Fv1%3F1739470939774
2025-02-13 13:22:38.740 18:22:38.740 [ConfigProxy] error: 503 GET /user/sofia/api/metrics/v1 connect EHOSTUNREACH 10.0.128.108:8888
2025-02-13 13:22:38.739 18:22:38.739 [ConfigProxy] debug: Requesting custom error page: http://hub:8081/hub/error/503?url=%2Fuser%2Fsofia%2Fapi%2Fmetrics%2Fv1%3F1739470939876
2025-02-13 13:22:38.739 18:22:38.739 [ConfigProxy] error: 503 GET /user/sofia/api/metrics/v1 connect EHOSTUNREACH 10.0.128.108:8888
2025-02-13 13:22:34.968 18:22:34.968 [ConfigProxy] debug: Not recording activity for status 503 on /user/sofia
2025-02-13 13:22:34.963 18:22:34.963 [ConfigProxy] debug: Requesting custom error page: http://hub:8081/hub/error/503?url=%2Fuser%2Fsofia%2Fapi%2Fsessions%3F1739470944550
2025-02-13 13:22:34.963 18:22:34.963 [ConfigProxy] error: 503 GET /user/sofia/api/sessions connect EHOSTUNREACH 10.0.128.108:8888
2025-02-13 13:22:25.482 18:22:25.481 [ConfigProxy] debug: PROXY WS /user/sofia/api/kernels/d686c622-bfe9-4446-9cf3-866de0000525/channels to http://10.0.128.108:8888
2025-02-13 13:22:25.201 18:22:25.201 [ConfigProxy] debug: PROXY WEB /user/sofia/api/kernels/431d4c9f-446e-462e-b892-b5cb10a3a236 to http://10.0.128.108:8888
2025-02-13 13:22:24.979 18:22:24.979 [ConfigProxy] debug: Requesting custom error page: http://hub:8081/hub/error/503?url=%2Fuser%2Fsofia%2Fapi%2Fkernels%2F431d4c9f-446e-462e-b892-b5cb10a3a236%2Fchannels%3Fsession_id%3D743ae84d-8c0a-4a9d-9d7b-501b23435ae1
2025-02-13 13:22:24.979 18:22:24.979 [ConfigProxy] error: 503 GET /user/sofia/api/kernels/431d4c9f-446e-462e-b892-b5cb10a3a236/channels connect EHOSTUNREACH 10.0.128.108:8888
2025-02-13 13:22:24.697 18:22:24.697 [ConfigProxy] debug: PROXY WEB /user/sofia/api/sessions to http://10.0.128.108:8888
2025-02-13 13:22:24.696 18:22:24.696 [ConfigProxy] debug: PROXY WEB /user/sofia/api/kernels to http://10.0.128.108:8888
2025-02-13 13:22:24.692 18:22:24.692 [ConfigProxy] debug: PROXY WEB /user/sofia/api/contents/shared/adata/adata_bi2024_unbiased_CRISPR/mphages/pngs to http://10.0.128.108:8888
2025-02-13 13:22:23.896 18:22:23.896 [ConfigProxy] debug: PROXY WEB /user/sofia/lab/api/workspaces to http://10.0.128.108:8888
2025-02-13 13:22:23.882 18:22:23.881 [ConfigProxy] debug: PROXY WEB /user/sofia/api/terminals to http://10.0.128.108:8888
2025-02-13 13:22:20.022 18:22:20.022 [ConfigProxy] debug: PROXY WEB /user/sofia/api/metrics/v1 to http://10.0.128.108:8888
2025-02-13 13:22:19.922 18:22:19.922 [ConfigProxy] debug: PROXY WEB /user/sofia/api/metrics/v1 to http://10.0.128.108:8888
2025-02-13 13:22:19.066 18:22:19.066 [ConfigProxy] debug: PROXY WEB /user/sofia/api/me to http://10.0.128.108:8888
2025-02-13 13:22:19.062 18:22:19.061 [ConfigProxy] debug: PROXY WEB /user/sofia/api/kernelspecs to http://10.0.128.108:8888
2025-02-13 13:22:18.786 18:22:18.786 [ConfigProxy] debug: PROXY WS /user/sofia/api/kernels/431d4c9f-446e-462e-b892-b5cb10a3a236/channels to http://10.0.128.108:8888
2025-02-13 13:22:14.792 18:22:14.792 [ConfigProxy] debug: PROXY WEB /user/sofia/api/metrics/v1 to http://10.0.128.108:8888
The single-user pod:
2025-02-13 13:25:06.421 [I 2025-02-13 18:25:06.421 LabApp] Extension Manager is 'pypi'.
2025-02-13 13:25:06.420 [I 2025-02-13 18:25:06.420 LabApp] JupyterLab application directory is /opt/conda/envs/jupyter_env/share/jupyter/lab
2025-02-13 13:25:06.420 [I 2025-02-13 18:25:06.420 LabApp] JupyterLab extension loaded from /opt/conda/envs/jupyter_env/lib/python3.9/site-packages/jupyterlab
2025-02-13 13:25:06.415 [I 2025-02-13 18:25:06.415 ServerApp] jupyterhub | extension was successfully loaded.
2025-02-13 13:25:06.415 [I 2025-02-13 18:25:06.414 JupyterHubSingleUser] Updating Hub with activity every 300 seconds
2025-02-13 13:25:06.414 [W 2025-02-13 18:25:06.414 JupyterHubSingleUser] jupyterhub version 4.0.2 != jupyterhub-singleuser version 5.2.1. This could cause failure to authenticate and result in redirect loops!
2025-02-13 13:25:06.406 [I 2025-02-13 18:25:06.406 ServerApp] jupyter_server_terminals | extension was successfully loaded.
2025-02-13 13:25:06.405 [I 2025-02-13 18:25:06.405 ServerApp] jupyter_resource_usage | extension was successfully loaded.
2025-02-13 13:25:06.404 [I 2025-02-13 18:25:06.404 ServerApp] jupyter_lsp | extension was successfully loaded.
2025-02-13 13:25:06.401 [I 2025-02-13 18:25:06.401 ServerApp] notebook_shim | extension was successfully loaded.
2025-02-13 13:25:06.381 [I 2025-02-13 18:25:06.381 ServerApp] notebook_shim | extension was successfully linked.
2025-02-13 13:25:06.139 [I 2025-02-13 18:25:06.139 ServerApp] jupyterlab_jupytext | extension was successfully linked.
2025-02-13 13:25:06.139 [I 2025-02-13 18:25:06.139 ServerApp] jupyterlab | extension was successfully linked.
2025-02-13 13:25:06.136 [W 2025-02-13 18:25:06.136 LabApp] 'extra_template_paths' was found in both NotebookApp and ServerApp. This is likely a recent change. This config will only be set in NotebookApp. Please check if you should also config these traits in ServerApp for your purpose.
2025-02-13 13:25:06.134 [I 2025-02-13 18:25:06.134 ServerApp] jupyterhub | extension was successfully linked.
2025-02-13 13:25:06.131 [I 2025-02-13 18:25:06.131 JupyterHubSingleUser] Using default url from server extension lab: /lab
2025-02-13 13:25:06.131 [I 2025-02-13 18:25:06.131 JupyterHubSingleUser] Starting jupyterhub single-user server extension version 5.2.1
2025-02-13 13:25:06.131 [I 2025-02-13 18:25:06.130 ServerApp] jupyter_server_terminals | extension was successfully linked.
2025-02-13 13:25:06.126 [I 2025-02-13 18:25:06.126 ServerApp] jupyter_resource_usage | extension was successfully linked.
2025-02-13 13:25:06.126 [I 2025-02-13 18:25:06.126 ServerApp] jupyter_lsp | extension was successfully linked.
2025-02-13 13:22:15.946 [I 2025-02-13 18:22:15.932 ServerApp] Kernel shutdown: 864fb092-8e4d-41e8-af20-a2f3af908607
2025-02-13 13:22:15.946 [I 2025-02-13 18:22:15.932 ServerApp] Kernel shutdown: 6f7e24a0-ae0d-4273-a442-5996f3abad90
2025-02-13 13:22:15.946 [I 2025-02-13 18:22:15.932 ServerApp] Kernel shutdown: d686c622-bfe9-4446-9cf3-866de0000525
2025-02-13 13:22:15.946 [I 2025-02-13 18:22:15.932 ServerApp] Kernel shutdown: 77623e53-c80a-4682-937b-a7d2da91c481
2025-02-13 13:22:15.946 [I 2025-02-13 18:22:15.932 ServerApp] Kernel shutdown: 21b00990-f0e8-45b9-a9a5-4d7dcf1a908a
2025-02-13 13:22:15.946 [I 2025-02-13 18:22:15.931 ServerApp] Kernel shutdown: 431d4c9f-446e-462e-b892-b5cb10a3a236
2025-02-13 13:22:15.946 [I 2025-02-13 18:22:15.931 ServerApp] Kernel shutdown: df8548fd-6b3d-4013-b418-880c6e00ff6c
2025-02-13 13:22:15.946 [I 2025-02-13 18:22:15.931 ServerApp] Kernel shutdown: 358d7cbb-400e-49a3-bd33-d6c2eb3ab740
2025-02-13 13:22:15.946 [I 2025-02-13 18:22:15.931 ServerApp] Shutting down 8 kernels
2025-02-13 13:22:15.946 [I 2025-02-13 18:22:15.930 ServerApp] Shutting down 7 extensions
2025-02-13 13:22:15.946 [C 2025-02-13 18:22:15.929 ServerApp] received signal 15, stopping
2025-02-13 13:22:14.867 [I 2025-02-13 18:22:14.867 ServerApp] 200 GET /user/sofia/api/metrics/v1?1739470934650 (sofia@10.0.128.124) 73.74ms
2025-02-13 13:22:14.778 [I 2025-02-13 18:22:14.770 ServerApp] 200 GET /user/sofia/api/metrics/v1?1739470934549 (sofia@10.0.128.124) 77.80ms
2025-02-13 13:22:14.518 [I 2025-02-13 18:22:14.517 ServerApp] 200 GET /user/sofia/api/kernels?1739470934139 (sofia@10.0.128.124) 3.22ms
2025-02-13 13:22:14.516 [I 2025-02-13 18:22:14.516 ServerApp] 200 GET /user/sofia/api/sessions?1739470934139 (sofia@10.0.128.124) 2.22ms
2025-02-13 13:22:14.498 [I 2025-02-13 18:22:14.498 ServerApp] 200 GET /user/sofia/api/contents/shared/adata/adata_bi2024_unbiased_CRISPR/mphages/pngs?content=1&hash=0&1739470934133 (sofia@10.0.128.124) 221.94ms
2025-02-13 13:22:13.744 [I 2025-02-13 18:22:13.744 ServerApp] 200 GET /user/sofia/lab/api/workspaces?1739470933597 (sofia@10.0.128.124) 3.96ms
2025-02-13 13:22:13.731 [I 2025-02-13 18:22:13.731 ServerApp] 200 GET /user/sofia/api/terminals?1739470933587 (sofia@10.0.128.124) 1.54ms
2025-02-13 13:22:09.647 [I 2025-02-13 18:22:09.646 ServerApp] 200 GET /user/sofia/api/metrics/v1?1739470929428 (sofia@10.0.128.124) 75.68ms
2025-02-13 13:22:09.544 [I 2025-02-13 18:22:09.544 ServerApp] 200 GET /user/sofia/api/metrics/v1?1739470929322 (sofia@10.0.128.124) 76.88ms
2025-02-13 13:22:04.405 [I 2025-02-13 18:22:04.404 ServerApp] 200 GET /user/sofia/api/metrics/v1?1739470924151 (sofia@10.0.128.124) 73.63ms
2025-02-13 13:22:04.317 [I 2025-02-13 18:22:04.316 ServerApp] 200 GET /user/sofia/api/metrics/v1?1739470924093 (sofia@10.0.128.124) 76.33ms
2025-02-13 13:22:04.108 [I 2025-02-13 18:22:04.108 ServerApp] 200 GET /user/sofia/api/sessions?1739470923742 (sofia@10.0.128.124) 3.11ms
2025-02-13 13:22:04.106 [I 2025-02-13 18:22:04.106 ServerApp] 200 GET /user/sofia/api/kernels?1739470923742 (sofia@10.0.128.124) 1.88ms
2025-02-13 13:22:04.087 [I 2025-02-13 18:22:04.087 ServerApp] 200 GET /user/sofia/api/contents/shared/adata/adata_bi2024_unbiased_CRISPR/mphages/pngs?content=1&hash=0&1739470923737 (sofia@10.0.128.124) 201.03ms
2025-02-13 13:22:03.591 [I 2025-02-13 18:22:03.591 ServerApp] 200 GET /user/sofia/lab/api/workspaces?1739470923444 (sofia@10.0.128.124) 4.19ms
2025-02-13 13:22:03.581 [I 2025-02-13 18:22:03.581 ServerApp] 200 GET /user/sofia/api/terminals?1739470923434 (sofia@10.0.128.124) 1.54ms
2025-02-13 13:21:59.148 [I 2025-02-13 18:21:59.148 ServerApp] 200 GET /user/sofia/api/metrics/v1?1739470918907 (sofia@10.0.128.124) 88.32ms
2025-02-13 13:21:59.054 [I 2025-02-13 18:21:59.048 ServerApp] 200 GET /user/sofia/api/metrics/v1?1739470918827 (sofia@10.0.128.124) 76.24ms
2025-02-13 13:21:53.903 [I 2025-02-13 18:21:53.903 ServerApp] 200 GET /user/sofia/api/metrics/v1?1739470913684 (sofia@10.0.128.124) 64.50ms
2025-02-13 13:21:53.824 [I 2025-02-13 18:21:53.823 ServerApp] 200 GET /user/sofia/api/metrics/v1?1739470913591 (sofia@10.0.128.124) 88.97ms
2025-02-13 13:21:53.728 [I 2025-02-13 18:21:53.728 ServerApp] 200 GET /user/sofia/api/sessions?1739470913323 (sofia@10.0.128.124) 3.25ms
2025-02-13 13:21:53.727 [I 2025-02-13 18:21:53.727 ServerApp] 200 GET /user/sofia/api/kernels?1739470913323 (sofia@10.0.128.124) 1.98ms
2025-02-13 13:21:53.708 [I 2025-02-13 18:21:53.708 ServerApp] 200 GET /user/sofia/api/contents/shared/adata/adata_bi2024_unbiased_CRISPR/mphages/pngs?content=1&hash=0&1739470913316 (sofia@10.0.128.124) 248.10ms
Thanks for the logs @Dylan_Mendonca .
The single-user pod logs that you posted are not from the single-user pod; I think you miscopied them from the JupyterHub logs. There might be more clues in the single-user pod logs about why the pod was stopped!
Whoops, you’re right, I just edited my comment above. Looks like a signal 15 on a singleuser pod triggers the entire cascade?
I’ve heard that these errors are fairly generic and can cover a lot of different underlying causes - how do I diagnose this?
Cheers for the logs.
Seems like your single-user pod received the signal at 2025-02-13 18:22:15.929. The JupyterHub logs that you posted only start at 2025-02-13 18:22:16.459. Could you post hub logs from, say, 2025-02-13 18:22:13.000, to check whether there are any events leading up to this signal being sent to the single-user server?
minrk
February 14, 2025, 8:52am
What’s the pods watcher and why is it timing out? Is this normal?
The answer to the second question is: probably.
The pods watcher is how the Hub notices when pods are started/stopped/etc. When the watcher has not received any events in a certain period of time (the timeout you see in the logs), the watcher can’t know if that is because there is a communication problem and events aren’t being delivered (this is known to happen), or if there just weren’t any events in that time (more likely for low activity Hubs). The vast majority of the time, this message only indicates idleness and is not a problem. But if pods are actually stopping and the Hub isn’t noticing, this is a symptom of the real issue.
A 503 like this can happen when the Hub thinks a user’s server is running but it isn’t. It happens when there is a record for the user’s server (/user/name/) routed to the user’s server, but the server is not there or not responding.
There are a few typical paths to this error:
the server has stopped or hung, but the pod has not been deleted. This can happen if the notebook server process shuts down the http server socket, but the process does not fully exit for some reason (could be a hang or deadlock in a server extension, for example). If we attached Kubernetes health checks to user pods, this would solve itself when Kubernetes notices the server is unresponsive (a rough sketch of such a probe is at the end of this post). The risk is that slowness or hangs in the notebook server could result in premature shutdown. The immediate fix is to try to identify what is still running and preventing process exit.
The pod has been deleted, but JupyterHub didn’t notice. This would be a problem in what’s called the “pod reflector”, and is harder to debug. The timeout logs in your hub indicate that either there are no pod events, or pod events are not being delivered.
(much less likely, but not impossible) networking configuration changes in Kubernetes changed the IP for running pods. I don’t know exactly how this would come about.
All 3 of these cases should be fixable either by a manual server restart from the Hub page (users can do this themselves), or automatically caught and fixed when the Hub reconciles its state after a Hub pod restart.
EHOSTUNREACH suggests case 2 or 3, because the IP isn’t reachable at all, while case 1 will usually report as ECONNREFUSED or ECONNRESET.
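For anyone who wants to experiment with the health-check idea from case 1: KubeSpawner’s extra_container_config merges extra fields into the notebook container spec, so a liveness probe could be attached from the Z2JH values. This is an untested sketch, not something the chart does for you; the port and timings are assumptions to tune for your setup:
hub:
  config:
    KubeSpawner:
      extra_container_config:
        # Keys follow the Kubernetes camelCase container spec,
        # as I understand KubeSpawner's extra_container_config.
        livenessProbe:
          # A TCP check on the notebook port (8888 by default in Z2JH) catches the
          # "http socket closed but process still alive" case; it will not catch
          # hangs where the socket stays open.
          tcpSocket:
            port: 8888
          initialDelaySeconds: 60
          periodSeconds: 30
          failureThreshold: 5
As noted above, the trade-off is that a slow or briefly hung server could be restarted prematurely, so err on the side of generous thresholds.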
Hi @minrk , thank you for the detailed explanations and feedback!
The existing solution is, as you mentioned, to turn the server on and off via the Hub page. We’ve been doing this, but the problem manifests pretty frequently, which is why we want to address it.
A few follow up questions:
I’ve noticed that users open JupyterHub in many browser tabs. I’m wondering if this interferes with HTTP/S routing (highly unlikely, because I think it just routes to the IP address for /user/name), but just thought I’d ask?
Is there any impact from reading large files in Jupyter (e.g. loading a huge CSV file in the Jupyter file explorer)? Do you think this could have an impact on the proxy and cause networking issues?
How can I enable debug logging on the single user pods?
I’ll try to find more instances of these errors and figure out how to improve my logging. Will update here with findings.
Any additional avenues to explore/strategies to consider would be much appreciated!
We were using an old helm chart version (like 3.1).
Bumped it up to 4.1.0 - hopefully this fixes things?
minrk
February 19, 2025, 10:13am
Quite possibly! One can always hope.
No, that is unlikely. Though more open tabs make it more likely that requests will be made to servers that are in the process of shutting down, which should be expected to produce error messages (initially 502/503 as the server shuts down, then 424 once it is fully unregistered).
Yes, this is possible. How large? The Jupyter server is a fairly simple single-threaded async server. File reads are async in the default implementation (if you have a non-default contents_manager_class, this may be relevant), but assembling the response happens in the main thread. This should manifest as ETIMEDOUT, I would think.
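If slow responses like that ever do surface as proxy-side timeouts, one knob that may be worth knowing about is configurable-http-proxy’s timeout flags, which the chart can pass through. A hedged sketch; the values are illustrative, not a recommendation:
proxy:
  chp:
    extraCommandLineFlags:
      # configurable-http-proxy takes timeouts in milliseconds
      - "--timeout=120000"        # drop the client-facing connection after this long
      - "--proxy-timeout=120000"  # give up waiting on the user server after this long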
How can I enable debug logging on the single user pods?
e.g. in z2jh, set the top-level:
debug:
  enabled: true
to enable debug logging everywhere (it’s a lot of logs).
Or for just single-user servers:
hub:
  config:
    Spawner:
      args:
        - "--debug"
We’ve made a lot of progress based on your feedback, thank you so much for all the help!
A few solutions for any person or LLM that stumbles upon this issue:
We had the simple proxy[https][enabled]: true implementation in Z2JH, which creates a “Classic Load Balancer” in AWS (previous generation). The timeout on this was extremely low by default, and we increased it. We will be migrating to the Application Load Balancer over the coming days by configuring an ingress in our Helm chart (a rough sketch of both knobs is included after this list).
We also increased ephemeral-storage limits and guarantees. We use in-house Dockerfiles that allow people to install packages onto their local storage volumes via pip or the associated R installation commands. Unsure if this was related, but it might be. We will be migrating to the official JupyterHub Docker images eventually.
extraResource:
  limits:
    ephemeral-storage: "5Gi"
  guarantees:
    ephemeral-storage: "5Gi"
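For reference, the two pieces of configuration involved here look roughly like the following. This is a hedged sketch rather than our exact values: the annotation names come from the AWS load balancer documentation and the Z2JH ingress block, and the timeout numbers are placeholders.
Raising the idle timeout on the Classic ELB created by the simple proxy.https setup:
proxy:
  service:
    annotations:
      # Classic ELB idle timeout defaults to 60 seconds
      service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "3600"
And the planned ingress-based ALB setup via the AWS Load Balancer Controller, roughly:
proxy:
  service:
    type: ClusterIP  # avoid keeping a second LoadBalancer once the ingress is in place
ingress:
  enabled: true
  ingressClassName: alb
  hosts:
    - ...
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=3600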
The stability of our system has increased, but we still have one pending issue. It follows this pattern:
Pod logs disappear
An XSRF cookie refresh event appears on the hub
A 503 is returned
Single-user pod (the logs end and do not continue past the point included here):
2025-02-26 12:59:58.181 [I 2025-02-26 17:59:58.181 ServerApp] 200 GET /user/sam/api/metrics/v1?1740592797801 (sam@10.0.128.119) 105.43ms
2025-02-26 12:59:59.077 [I 2025-02-26 17:59:59.077 ServerApp] 200 GET /user/sam/api/metrics/v1?1740592798699 (sam@10.0.128.119) 105.71ms
2025-02-26 12:59:59.101 [I 2025-02-26 17:59:59.101 ServerApp] 200 GET /user/sam/api/kernels?1740592798827 (sam@10.0.128.119) 1.07ms
2025-02-26 13:00:00.642 [I 2025-02-26 18:00:00.642 ServerApp] 200 GET /user/sam/lab/api/workspaces?1740592800366 (sam@10.0.128.119) 1.82ms
Hub:
2025-02-26 13:00:00.625 [D 2025-02-26 18:00:00.625 JupyterHub log:192] 200 GET /hub/health (@10.0.0.109) 0.99ms
2025-02-26 13:00:01.671 [D 2025-02-26 18:00:01.671 JupyterHub proxy:925] Proxy: Fetching GET http://proxy-api:8001/api/routes
2025-02-26 13:00:01.708 [D 2025-02-26 18:00:01.707 JupyterHub proxy:392] Checking routes
2025-02-26 13:00:02.327 [I 2025-02-26 18:00:02.327 JupyterHub _xsrf_utils:125] Setting new xsrf cookie for b'None:IHg1Scov4tqZ5wRgdEnQwFGWPfQpWLtAHH-9uD4obT0=' {'path': '/hub/', 'max_age': 3600}
2025-02-26 13:00:02.328 [D 2025-02-26 18:00:02.327 JupyterHub pages:698] Using default error template for 503
2025-02-26 13:00:02.329 [I 2025-02-26 18:00:02.329 JupyterHub log:192] 200 GET /hub/error/503?url=%2Fuser%2Fsam%2Fapi%2Fsessions%3F1740592802041 (@10.0.64.212) 2.12ms
2025-02-26 13:00:02.357 [I 2025-02-26 18:00:02.357 JupyterHub _xsrf_utils:125] Setting new xsrf cookie for b'None:IHg1Scov4tqZ5wRgdEnQwFGWPfQpWLtAHH-9uD4obT0=' {'path': '/hub/', 'max_age': 3600}
2025-02-26 13:00:02.357 [D 2025-02-26 18:00:02.357 JupyterHub pages:698] Using default error template for 503
Proxy:
2025-02-26 13:00:02.354 18:00:02.353 [ConfigProxy] debug: PROXY WS /user/sam/api/kernels/859b23cf-fb26-4ef2-94ed-2d833ccd264f/channels to http://10.0.128.201:8888
2025-02-26 13:00:02.355 18:00:02.355 [ConfigProxy] error: 503 GET /user/sam/api/kernels/859b23cf-fb26-4ef2-94ed-2d833ccd264f/channels connect ECONNREFUSED 10.0.128.201:8888
2025-02-26 13:00:02.355 18:00:02.355 [ConfigProxy] debug: Requesting custom error page: http://hub:8081/hub/error/503?url=%2Fuser%2Fsam%2Fapi%2Fkernels%2F859b23cf-fb26-4ef2-94ed-2d833ccd264f%2Fchannels%3Fsession_id%3D0f0f6dc2-a2e3-4bac-a8f7-b0a23611ef60
2025-02-26 13:00:02.404 18:00:02.404 [ConfigProxy] debug: PROXY WEB /user/sam/api/terminals to http://10.0.128.201:8888
2025-02-26 13:00:02.405 18:00:02.405 [ConfigProxy] error: 503 GET /user/sam/api/terminals connect ECONNREFUSED 10.0.128.201:8888
We’ve covered this in our earlier discussion, i.e. pod disruption. We’ve done the following:
Some logging on our kube-scheduler, which showed that nodes do not shut down, so it is likely a pod issue, as you previously mentioned
Increased log levels to debug on our single-user pods; there’s no apparent error, the logs just stop at a certain point
Any ideas on how to further debug this?
manics
February 26, 2025, 7:25pm
It sounds like your pod has either frozen or been abruptly killed. Can you share your hub logs starting from a few minutes before it freezes?
You can also try running kubectl describe pod <podname> a few times after it’s frozen; it may take a few minutes before you get a useful warning or error from Kubernetes. And try running kubectl get events --watch from before you start the pod until after it fails.