Hello everyone,
I was hoping someone could help with an issue I’m having with a multi-node k3s deployment of JupyterHub, installed through Helm. I have tried chart version 1.2.0 and the latest 2.2.0 (1.1.3-n470.h217c7977). Essentially, the hub and some single-user containers complete the startup process, but others do not.
System and deployment:
- Ubuntu 20
- Each single-user container is provisioned onto a different k3s worker/agent node – this is a side effect of the NVIDIA GPU drivers and VM setup I’m using
- Every k3s node is provisioned the same way through automation and has the same specs, so it’s highly unlikely that there are system/configuration differences at the node level
- the http timeout is set to 600
- the start timeout is set to 600
- ufw is not active (but there is an external firewall that shouldn’t be limiting port access within the cluster)
Things I’ve tried:
- both docker and containerd as the container runtime
- with and without the traefik ingress (when using the traefik ingress, I set the enable ingress setting in the JupyterHub chart config)
- setting a different private ip address space in k3s
- increasing memory for the hub, proxy, and single-user containers
- ensuring the jupyterhub core containers are matched to a dedicated node
For the single-user containers that complete the startup process:
[I 2022-05-16 15:53:52.368 SingleUserLabApp serverapp:2672] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 2022-05-16 15:53:52.371 SingleUserLabApp mixins:596] Updating Hub with activity every 300 seconds
[I 2022-05-16 15:53:55.648 SingleUserLabApp log:189] 302 GET /user/user1/ -> /user/user1/lab? (@192.168.0.16) 0.74ms
[I 2022-05-16 15:53:55.805 SingleUserLabApp log:189] 302 GET /user/user1/ -> /user/user1/lab? (@192.168.0.1) 0.57ms
The single-user containers that don’t complete the startup process just hang here:
[I 2022-05-16 15:54:15.436 SingleUserLabApp serverapp:2672] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 2022-05-16 15:54:15.439 SingleUserLabApp mixins:596] Updating Hub with activity every 300 seconds
From the hub container, this is what I see:
[I 2022-05-16 15:54:03.240 JupyterHub log:189] 200 GET /hub/spawn/user5 (edwins@192.168.0.1) 9.03ms
[I 2022-05-16 15:54:05.352 JupyterHub provider:574] Creating oauth client jupyterhub-user-user5
[I 2022-05-16 15:54:05.368 JupyterHub spawner:2344] Attempting to create pvc claim-user5, with timeout 3
[I 2022-05-16 15:54:05.370 JupyterHub log:189] 302 POST /hub/spawn/user5 -> /hub/spawn-pending/user5 (edwins@192.168.0.1) 40.70ms
[I 2022-05-16 15:54:05.382 JupyterHub spawner:2302] Attempting to create pod jupyter-user5, with timeout 3
[I 2022-05-16 15:54:05.447 JupyterHub pages:402] user5 is pending spawn
[I 2022-05-16 15:54:05.448 JupyterHub log:189] 200 GET /hub/spawn-pending/user5 (edwins@192.168.0.1) 3.31ms
[I 2022-05-16 15:54:08.284 JupyterHub proxy:347] Checking routes
[I 2022-05-16 15:54:15.436 JupyterHub log:189] 200 GET /hub/api (@192.168.5.9) 0.56ms
[I 2022-05-16 15:54:15.460 JupyterHub log:189] 200 POST /hub/api/users/user5/activity (user5@192.168.5.9) 18.35ms
...
[W 2022-05-16 16:03:43.622 JupyterHub user:767] user5's server never showed up at http://192.168.5.9:8888/user/user5/ after 600 seconds. Giving up
[I 2022-05-16 16:03:43.623 JupyterHub spawner:2620] Deleting pod default/jupyter-user5
[E 2022-05-16 16:03:45.730 JupyterHub gen:623] Exception in Future <Task finished name='Task-348' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py:900> exception=TimeoutError("Server at http://192.168.5.9:8888/user/user5/ didn't respond in 600 seconds")> after timeout
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 618, in error_callback
future.result()
File "/usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py", line 907, in finish_user_spawn
await spawn_future
File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 748, in spawn
await self._wait_up(spawner)
File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 795, in _wait_up
raise e
File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 762, in _wait_up
resp = await server.wait_up(
File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 236, in wait_for_http_server
re = await exponential_backoff(
File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 184, in exponential_backoff
raise TimeoutError(fail_message)
TimeoutError: Server at http://192.168.5.9:8888/user/user5/ didn't respond in 600 seconds
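One way to narrow this down would be to check, from inside the hub pod, whether the address in the timeout message is reachable at all. The following is a minimal sketch using only the Python standard library; it is not taken from JupyterHub itself. The function names (tcp_reachable, http_status, wait_for_http) are made up for illustration, and the host, port, and URL are the ones from the log above and would need to match the pod being tested.

```python
import http.client
import socket
import time
import urllib.error
import urllib.request
from urllib.parse import urlsplit

# Opener that ignores any proxy environment variables, so the request
# goes straight to the pod IP (assumption: no proxy is wanted in-cluster).
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_status(url, timeout=3.0):
    """Return the HTTP status code for url, or None if nothing answers."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a 2xx/3xx status.
        return err.code
    except (OSError, http.client.HTTPException):
        # Connection refused, timed out, unreachable, or a malformed reply.
        return None


def wait_for_http(url, total_timeout=30.0, interval=1.0):
    """Poll url until it answers or total_timeout elapses; return status or None."""
    deadline = time.monotonic() + total_timeout
    while True:
        status = http_status(url)
        if status is not None:
            return status
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


if __name__ == "__main__":
    # URL copied from the hub's timeout message; adjust per pod.
    pod_url = "http://192.168.5.9:8888/user/user5/"
    parts = urlsplit(pod_url)
    print("TCP reachable:", tcp_reachable(parts.hostname, parts.port))
    print("HTTP status:", wait_for_http(pod_url))
```

If the TCP check fails from the hub pod while the single-user server itself logs that it has started, that would point toward pod-to-pod networking between nodes rather than the notebook server. If TCP succeeds but no HTTP status comes back, the problem is more likely in the server or in something filtering HTTP traffic.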
Any help would be greatly appreciated. This is for an AI/ML workshop that uses JupyterHub.
Thank you,
Edwin