Zero to JupyterHub on k3s: some user containers spawn but not others

Hello everyone,

I was hoping someone could help with a multi-node k3s deployment issue I’m having with JupyterHub, deployed through Helm. I have tried chart version 1.2.0 and the latest development build (1.1.3-n470.h217c7977). Essentially, the hub and some single-user containers complete the start process, but others do not.

System and deployment

  • Ubuntu 20
  • Each single-user container is scheduled onto a different k3s worker/agent node – a side effect of the NVIDIA GPU driver and VM setup I’m using
  • Each k3s node is provisioned exactly the same way, with the same specs, through automation, so it’s highly unlikely that there are system/configuration differences at the node level
  • the HTTP timeout is set to 600 seconds
  • the start timeout is set to 600 seconds (see the values sketch after this list)
  • ufw is not active (and the external firewall shouldn’t be limiting port access within the cluster)
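
For reference, here is how those two timeouts are typically expressed in the chart’s values.yaml (a sketch of the documented Z2JH keys, not my exact config):

```yaml
# values.yaml sketch: raise both spawn-related timeouts to 600 seconds
singleuser:
  startTimeout: 600      # how long KubeSpawner waits for the pod to start
hub:
  config:
    KubeSpawner:
      http_timeout: 600  # how long the hub polls the server's HTTP endpoint
```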

Things I’ve tried:

  • Docker and containerd as the container runtime
  • with and without the Traefik ingress (when using it, I set `ingress.enabled: true` in the JupyterHub chart config – see the sketch after this list)
  • setting a different private IP address space in k3s
  • increasing memory for the hub, proxy, and single-user containers
  • pinning the JupyterHub core pods to a dedicated node
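
For the ingress and node-pinning bullets, the chart settings I mean are roughly these (a values.yaml sketch; the hostname is hypothetical):

```yaml
# expose the hub through the cluster's Traefik ingress
ingress:
  enabled: true
  hosts:
    - hub.example.com  # hypothetical hostname

# require hub/proxy pods to land on a node labelled
# hub.jupyter.org/node-purpose=core
scheduling:
  corePods:
    nodeAffinity:
      matchNodePurpose: require
```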

For the single-user containers that complete the startup process:

[I 2022-05-16 15:53:52.368 SingleUserLabApp serverapp:2672] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 2022-05-16 15:53:52.371 SingleUserLabApp mixins:596] Updating Hub with activity every 300 seconds
[I 2022-05-16 15:53:55.648 SingleUserLabApp log:189] 302 GET /user/user1/ -> /user/user1/lab? (@192.168.0.16) 0.74ms
[I 2022-05-16 15:53:55.805 SingleUserLabApp log:189] 302 GET /user/user1/ -> /user/user1/lab? (@192.168.0.1) 0.57ms

For the single-user containers that don’t complete the startup process, the logs just hang here:

[I 2022-05-16 15:54:15.436 SingleUserLabApp serverapp:2672] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 2022-05-16 15:54:15.439 SingleUserLabApp mixins:596] Updating Hub with activity every 300 seconds

From the hub container, this is what I see:

[I 2022-05-16 15:54:03.240 JupyterHub log:189] 200 GET /hub/spawn/user5 (edwins@192.168.0.1) 9.03ms
[I 2022-05-16 15:54:05.352 JupyterHub provider:574] Creating oauth client jupyterhub-user-user5
[I 2022-05-16 15:54:05.368 JupyterHub spawner:2344] Attempting to create pvc claim-user5, with timeout 3
[I 2022-05-16 15:54:05.370 JupyterHub log:189] 302 POST /hub/spawn/user5 -> /hub/spawn-pending/user5 (edwins@192.168.0.1) 40.70ms
[I 2022-05-16 15:54:05.382 JupyterHub spawner:2302] Attempting to create pod jupyter-user5, with timeout 3
[I 2022-05-16 15:54:05.447 JupyterHub pages:402] user5 is pending spawn
[I 2022-05-16 15:54:05.448 JupyterHub log:189] 200 GET /hub/spawn-pending/user5 (edwins@192.168.0.1) 3.31ms
[I 2022-05-16 15:54:08.284 JupyterHub proxy:347] Checking routes
[I 2022-05-16 15:54:15.436 JupyterHub log:189] 200 GET /hub/api (@192.168.5.9) 0.56ms
[I 2022-05-16 15:54:15.460 JupyterHub log:189] 200 POST /hub/api/users/user5/activity (user5@192.168.5.9) 18.35ms
...
[W 2022-05-16 16:03:43.622 JupyterHub user:767] user5's server never showed up at http://192.168.5.9:8888/user/user5/ after 600 seconds. Giving up
[I 2022-05-16 16:03:43.623 JupyterHub spawner:2620] Deleting pod default/jupyter-user5
[E 2022-05-16 16:03:45.730 JupyterHub gen:623] Exception in Future <Task finished name='Task-348' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py:900> exception=TimeoutError("Server at http://192.168.5.9:8888/user/user5/ didn't respond in 600 seconds")> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 618, in error_callback
        future.result()
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py", line 907, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 748, in spawn
        await self._wait_up(spawner)
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 795, in _wait_up
        raise e
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 762, in _wait_up
        resp = await server.wait_up(
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 236, in wait_for_http_server
        re = await exponential_backoff(
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 184, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: Server at http://192.168.5.9:8888/user/user5/ didn't respond in 600 seconds
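
Worth noting in the hub log above: the user pod can clearly reach the hub (the POST to /hub/api/users/user5/activity at 15:54:15 succeeds), yet the hub never gets a response from the user server at 192.168.5.9:8888 – traffic appears to be blocked in one direction only. That direction can be tested by hand with something like this (a debugging sketch; the namespace, pod IP, and username come from the logs, and it assumes the chart’s default hub deployment name):

```sh
# list any network policies that could be filtering pod-to-pod traffic
kubectl get networkpolicy -n default

# replay the hub's health-check request from inside the hub pod
# (the hub image ships Python, so no extra tools are needed; any HTTP
#  response, even an error page, proves connectivity – only a timeout
#  means the traffic is being dropped)
kubectl exec -n default deploy/hub -- python3 -c "
import urllib.request
try:
    r = urllib.request.urlopen('http://192.168.5.9:8888/user/user5/', timeout=5)
    print('reachable, HTTP', r.status)
except Exception as e:
    print('not reachable:', e)
"
```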

Any help would be greatly appreciated. This is for a workshop doing AI/ML with JupyterHub.

Thank you,
Edwin

In case anyone is interested in the cause of this issue: it looks like k3s ships an embedded network policy controller, even though flannel – the default k3s CNI plugin – doesn’t support network policies itself, AFAIK. Perhaps the network policies created by the JupyterHub chart are not compatible with the k3s network policy controller.
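
One way to test that hypothesis is to temporarily disable the network policies the chart creates (a values.yaml sketch using the chart’s documented networkPolicy toggles; note this removes the isolation those policies provide, so it is a diagnostic step rather than a fix):

```yaml
# values.yaml sketch: turn off chart-managed network policies for testing
hub:
  networkPolicy:
    enabled: false
proxy:
  chp:
    networkPolicy:
      enabled: false
singleuser:
  networkPolicy:
    enabled: false
```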

I haven’t tested an unmodified k3s for a while. We use k3s for Z2JH CI testing, but with Calico: https://github.com/jupyterhub/action-k3s-helm (a GitHub action to install K3s, Calico, and Helm).
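
For anyone setting that up by hand rather than through the action, the key part is installing k3s with flannel and its embedded network policy controller disabled, then adding a CNI that fully supports network policies (a sketch using documented k3s flags; the Calico manifest URL is illustrative and should be checked against the current Calico docs):

```sh
# install k3s without flannel and without the embedded network policy controller
curl -sfL https://get.k3s.io | sh -s - server \
  --flannel-backend=none \
  --disable-network-policy

# install Calico so that network policies are actually enforced
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml
```

Alternatively, keeping flannel but passing only --disable-network-policy sidesteps the embedded controller too, at the cost of network policies not being enforced at all.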