Zero to JupyterHub on k3s: some user containers spawn but not others

Hello everyone,

I was hoping someone could help with a multi-node k3s deployment issue I’m having with JupyterHub, deployed through Helm. I have tried chart version 1.2.0 and the latest 2.2.0 (1.1.3-n470.h217c7977). Essentially, the hub and some single-user containers complete the startup process, but others do not.

System and deployment

  • Ubuntu 20
  • Every single-user container is provisioned onto a different k3s worker/agent node – this is a side effect of the NVIDIA GPU drivers and VM setup I’m using
  • Each k3s node is provisioned exactly the same way and has the same specs via automation, so it’s highly unlikely that there are system/configuration differences at the node level
  • the HTTP timeout is set to 600 seconds
  • the start timeout is set to 600 seconds (both timeouts are set in the chart config; see the sketch after this list)
  • ufw is not active (there is an external firewall, but it shouldn’t be limiting port access within the cluster)
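
For reference, this is roughly how the two timeouts above are set in the chart’s config.yaml. This is a minimal sketch assuming the standard Z2JH keys (singleuser.startTimeout and traitlets passed through hub.config); 600 is the value mentioned above:

# config.yaml (Helm values for the jupyterhub chart)
singleuser:
  # how long KubeSpawner waits for the user pod to start
  startTimeout: 600
hub:
  config:
    Spawner:
      # how long the hub waits for the started server to respond over HTTP
      http_timeout: 600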

Things I’ve tried:

  • Docker and containerd as container runtimes
  • with and without the Traefik ingress (when using the Traefik ingress, I enabled the ingress setting in the JupyterHub chart config; see the sketch after this list)
  • setting a different private IP address space in k3s
  • increasing memory for the hub, proxy, and single-user containers
  • ensuring the JupyterHub core containers are pinned to a dedicated node
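
For the Traefik attempt, the ingress was enabled through the chart config along these lines. A sketch only; the hostname below is a placeholder, not the one from my deployment:

# config.yaml (Helm values): route traffic to the hub through the cluster's ingress controller
ingress:
  enabled: true
  hosts:
    - jupyterhub.example.com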

For the single-user containers that complete the startup process:

[I 2022-05-16 15:53:52.368 SingleUserLabApp serverapp:2672] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 2022-05-16 15:53:52.371 SingleUserLabApp mixins:596] Updating Hub with activity every 300 seconds
[I 2022-05-16 15:53:55.648 SingleUserLabApp log:189] 302 GET /user/user1/ -> /user/user1/lab? (@192.168.0.16) 0.74ms
[I 2022-05-16 15:53:55.805 SingleUserLabApp log:189] 302 GET /user/user1/ -> /user/user1/lab? (@192.168.0.1) 0.57ms

The single-user containers that don’t complete the startup process just hang here:

[I 2022-05-16 15:54:15.436 SingleUserLabApp serverapp:2672] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 2022-05-16 15:54:15.439 SingleUserLabApp mixins:596] Updating Hub with activity every 300 seconds

From the hub container, this is what I see:

[I 2022-05-16 15:54:03.240 JupyterHub log:189] 200 GET /hub/spawn/user5 (edwins@192.168.0.1) 9.03ms
[I 2022-05-16 15:54:05.352 JupyterHub provider:574] Creating oauth client jupyterhub-user-user5
[I 2022-05-16 15:54:05.368 JupyterHub spawner:2344] Attempting to create pvc claim-user5, with timeout 3
[I 2022-05-16 15:54:05.370 JupyterHub log:189] 302 POST /hub/spawn/user5 -> /hub/spawn-pending/user5 (edwins@192.168.0.1) 40.70ms
[I 2022-05-16 15:54:05.382 JupyterHub spawner:2302] Attempting to create pod jupyter-user5, with timeout 3
[I 2022-05-16 15:54:05.447 JupyterHub pages:402] user5 is pending spawn
[I 2022-05-16 15:54:05.448 JupyterHub log:189] 200 GET /hub/spawn-pending/user5 (edwins@192.168.0.1) 3.31ms
[I 2022-05-16 15:54:08.284 JupyterHub proxy:347] Checking routes
[I 2022-05-16 15:54:15.436 JupyterHub log:189] 200 GET /hub/api (@192.168.5.9) 0.56ms
[I 2022-05-16 15:54:15.460 JupyterHub log:189] 200 POST /hub/api/users/user5/activity (user5@192.168.5.9) 18.35ms
...
[W 2022-05-16 16:03:43.622 JupyterHub user:767] user5's server never showed up at http://192.168.5.9:8888/user/user5/ after 600 seconds. Giving up
[I 2022-05-16 16:03:43.623 JupyterHub spawner:2620] Deleting pod default/jupyter-user5
[E 2022-05-16 16:03:45.730 JupyterHub gen:623] Exception in Future <Task finished name='Task-348' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py:900> exception=TimeoutError("Server at http://192.168.5.9:8888/user/user5/ didn't respond in 600 seconds")> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 618, in error_callback
        future.result()
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py", line 907, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 748, in spawn
        await self._wait_up(spawner)
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 795, in _wait_up
        raise e
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 762, in _wait_up
        resp = await server.wait_up(
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 236, in wait_for_http_server
        re = await exponential_backoff(
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 184, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: Server at http://192.168.5.9:8888/user/user5/ didn't respond in 600 seconds

Any help would be greatly appreciated. This is for an AI/ML workshop that uses JupyterHub.

Thank you,
Edwin

In case anyone is interested in the cause of this issue: it turns out k3s ships an embedded network policy controller, even though Flannel – the default k3s CNI plugin – doesn’t support network policies itself, AFAIK. Perhaps JupyterHub’s network policies are not compatible with the k3s network policy controller.
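
If you want to check whether those policies are involved, one quick test is to switch off the NetworkPolicies created by the chart and see whether spawns start succeeding. A sketch using the standard Z2JH toggles, meant as a temporary diagnostic rather than a production setting:

# config.yaml: temporarily disable the NetworkPolicies created by the chart
hub:
  networkPolicy:
    enabled: false
proxy:
  chp:
    networkPolicy:
      enabled: false
singleuser:
  networkPolicy:
    enabled: false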

I haven’t tested an unmodified k3s for a while. We use k3s for Z2JH CI testing, but with Calico (see jupyterhub/action-k3s-helm, a GitHub Action that installs K3s, Calico, and Helm).

Was this resolved? I am having a similar issue with k3s and a basic, default-config deployment.
Pod-to-pod communication works in my cluster (I tested it following the Kubernetes docs page “Job with Pod-to-Pod Communication”).
Should I read up on the k3s network policies that might be blocking the spawn process?
The issue is even stranger when you consider some scenarios:

  • some users can spawn while others are blocked
  • some servers spawn successfully, but when another user logs in they get evicted

My cluster has one master node and one worker node (bare metal).

Example log of a new user failing to get a pod, while another user’s pod runs in the background…

[I 2023-02-22 20:30:38.956 JupyterHub roles:238] Adding role user for User: badboy
[I 2023-02-22 20:30:39.020 JupyterHub base:810] User logged in: badboy
[I 2023-02-22 20:30:39.043 JupyterHub log:186] 302 POST /hub/login?next= -> /hub/spawn (badboy@::ffff:10.42.0.0) 107.34ms
[I 2023-02-22 20:30:39.174 JupyterHub provider:651] Creating oauth client jupyterhub-user-badboy
[I 2023-02-22 20:30:39.236 JupyterHub spawner:2509] Attempting to create pvc claim-badboy, with timeout 3
[I 2023-02-22 20:30:39.245 JupyterHub log:186] 302 GET /hub/spawn -> /hub/spawn-pending/badboy (badboy@::ffff:10.42.0.0) 154.45ms
[I 2023-02-22 20:30:39.318 JupyterHub pages:394] badboy is pending spawn
[I 2023-02-22 20:30:39.324 JupyterHub log:186] 200 GET /hub/spawn-pending/badboy (badboy@::ffff:10.42.0.0) 16.29ms
[I 2023-02-22 20:30:39.353 JupyterHub spawner:2469] Attempting to create pod jupyter-badboy, with timeout 3
[I 2023-02-22 20:30:49.292 JupyterHub log:186] 200 GET /hub/api (@10.42.1.40) 3.11ms
[I 2023-02-22 20:30:49.409 JupyterHub log:186] 200 POST /hub/api/users/badboy/activity (badboy@10.42.1.40) 60.02ms
[W 2023-02-22 20:31:16.237 JupyterHub user:881] badboy's server never showed up at http://10.42.1.40:8888/user/badboy/ after 30 seconds. Giving up.

    Common causes of this timeout, and debugging tips:

    1. The server didn't finish starting,
       or it crashed due to a configuration issue.
       Check the single-user server's logs for hints at what needs fixing.
    2. The server started, but is not accessible at the specified URL.
       This may be a configuration issue specific to your chosen Spawner.
       Check the single-user server logs and resource to make sure the URL
       is correct and accessible from the Hub.
    3. (unlikely) Everything is working, but the server took too long to respond.
       To fix: increase `Spawner.http_timeout` configuration
       to a number of seconds that is enough for servers to become responsive.

[I 2023-02-22 20:31:16.240 JupyterHub spawner:2780] Deleting pod jhub/jupyter-badboy

I’ve tried disabling the k3s network policy controller via /etc/rancher/k3s/config.yaml (I added the line disable-network-policy: true; see the sketch below), and it seemed to fix the issue, but after adding LDAP to the configuration the problem started happening again.
I will now try Calico instead.
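
For anyone else who wants to try the same thing, the k3s change was roughly this, applied on the server node and followed by a restart of the k3s service:

# /etc/rancher/k3s/config.yaml (on the k3s server node)
# equivalent to starting the k3s server with --disable-network-policy
disable-network-policy: true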

I’m trying to deploy JupyterHub on k3s and am running into some problems. I was wondering if your implementation is publicly available?