JupyterHub spawns not actually launching

I’m at a loss, hoping someone can point me to a resource. I had to reinstall my JupyterHub (bare metal, followed the Zero to JupyterHub guide mostly, with some custom tweaks) due to some issues outside of the Hub itself. I managed to get it up and running this morning, I thought, with the same setup as before: NFS for storage, dynamically provisioned; OAuth2 working correctly. The only remaining issue is that the servers just … don’t spawn. For 300 seconds nothing happens, and then the spawn times out.

There are no logs I can find that explain what’s happening. The hub logs are clean. The proxy logs are clean. The user pods have no log output. They never get assigned to a node, and I can’t find any details on where/why/what is preventing the server from spawning.

kubectl describe pods jupyter-USER

gives the basics, but no indication of problems that I can see. Anyone have thoughts I could follow up on to try to determine what is preventing the hub from spawning servers for the users?

Further notes: journalctl has nothing; no errors being thrown that I can see. kubectl logs [various pods] shows nothing useful.

The contents of a

kubectl describe pods -n jhub jupyter-wburr

command, run against a spinning-up (but never ready) pod containing a server for user wburr (myself), who is logged in to the JupyterHub front end via the web:

Name:             jupyter-wburr
Namespace:        jhub
Priority:         0
Service Account:  default
Node:
Labels:           app=jupyterhub
                  chart=jupyterhub-1.2.0
                  component=singleuser-server
                  heritage=jupyterhub
                  hub.jupyter.org/network-access-hub=true
                  hub.jupyter.org/servername=
                  hub.jupyter.org/username=wburr
                  release=jhub1.5.0
Annotations:      hub.jupyter.org/username: wburr
Status:           Pending
IP:
IPs:
Init Containers:
  block-cloud-metadata:
    Image:        jupyterhub/k8s-network-tools:1.2.0
    Port:         <none>
    Host Port:    <none>
    Command:
      iptables
      -A
      OUTPUT
      -d
      169.254.169.254
      -j
      DROP
    Environment:  <none>
    Mounts:       <none>
Containers:
  notebook:
    Image:      jupyter/datascience-notebook:latest
    Port:       8888/TCP
    Host Port:  0/TCP
    Args:
      jupyterhub-singleuser
      --ip=0.0.0.0
      --port=8888
    Limits:
      cpu:     1
      memory:  2147483648
    Requests:
      cpu:     500m
      memory:  1073741824
    Environment:
      CPU_GUARANTEE:                  0.5
      CPU_LIMIT:                      1.0
      JPY_API_TOKEN:
      JUPYTERHUB_ACTIVITY_URL:        http://hub:8081/hub/api/users/wburr/activity
      JUPYTERHUB_ADMIN_ACCESS:        1
      JUPYTERHUB_API_TOKEN:
      JUPYTERHUB_API_URL:             http://hub:8081/hub/api
      JUPYTERHUB_BASE_URL:            /
      JUPYTERHUB_CLIENT_ID:           jupyterhub-user-wburr
      JUPYTERHUB_HOST:
      JUPYTERHUB_OAUTH_CALLBACK_URL:  /user/wburr/oauth_callback
      JUPYTERHUB_SERVER_NAME:
      JUPYTERHUB_SERVICE_PREFIX:      /user/wburr/
      JUPYTERHUB_USER:                wburr
      JUPYTER_IMAGE:                  jupyter/datascience-notebook:latest
      JUPYTER_IMAGE_SPEC:             jupyter/datascience-notebook:latest
      MEM_GUARANTEE:                  1073741824
      MEM_LIMIT:                      2147483648
    Mounts:
      /home/jovyan from volume-wburr (rw)
Volumes:
  volume-wburr:
    Type:        PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:   claim-wburr
    ReadOnly:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     hub.jupyter.org/dedicated=user:NoSchedule
                 hub.jupyter.org_dedicated=user:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

Hi @wesleyburr

I seem to be having the same issue today on my cluster, on a new node that was just started. To my knowledge this was not an issue yesterday, only starting today. I can get a user pod to start, but the user is stuck at the “requesting server” page for 300 seconds, and is then moved on to a “Your server is stopping” page where they are stuck forever.

I can see that the pod started on the node, and getting the status using kubectl as you did yields similar results (redacted a few items):

Name:                 jupyter-gibson
Namespace:            datahub-namespace
Priority:             0
Priority Class Name:  datahub-release-default-priority
Node:                 gke-datahub-cluster-user-pool-b8e3d238-47nx/10.150.0.45
Start Time:           Wed, 07 Sep 2022 18:25:37 +0000
Labels:               app=jupyterhub
                      chart=jupyterhub-1.1.3-n743.h730f014f
                      component=singleuser-server
                      heritage=jupyterhub
                      hub.jupyter.org/network-access-hub=true
                      hub.jupyter.org/servername=
                      hub.jupyter.org/username=gibson
                      release=datahub-release
Annotations:          hub.jupyter.org/username: gibson
Status:               Running
IP:                   <redacted>
IPs:
  IP:  <redacted>
Init Containers:
  block-cloud-metadata:
    Container ID:  containerd://<redacted>
    Image:         jupyterhub/k8s-network-tools:1.1.3-n740.h009d6525
    Image ID:      docker.io/jupyterhub/k8s-network-tools@sha256:7a0e52ccc95b18e047a9c42aa5f896fdf23f0a05405e7b3b5c5032903779d008
    Port:          <none>
    Host Port:     <none>
    Command:
      iptables
      -A
      OUTPUT
      -d
      169.254.169.254
      -j
      DROP
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 07 Sep 2022 18:25:55 +0000
      Finished:     Wed, 07 Sep 2022 18:25:55 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:         <none>
Containers:
  notebook:
    Container ID:   containerd://<redacted>
    Image:          taylorgibson/ncssm-datascience-notebook:fa22v1
    Image ID:       docker.io/taylorgibson/ncssm-datascience-notebook@sha256:0e079dcdd3e8e362665d54aedaa57f895a87edeecf9b08cb0afe0a5b00eb8c1c
    Port:           8888/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 07 Sep 2022 18:25:56 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1073741824
    Requests:
      cpu:     200m
      memory:  805306368
    Environment:
      CPU_GUARANTEE:                  0.2
      CPU_LIMIT:                      1.0
      JPY_API_TOKEN:                  <redacted>
      JUPYTERHUB_ACTIVITY_URL:        http://hub:8081/hub/api/users/gibson/activity
      JUPYTERHUB_ADMIN_ACCESS:        1
      JUPYTERHUB_API_TOKEN:           <redacted>
      JUPYTERHUB_API_URL:             http://hub:8081/hub/api
      JUPYTERHUB_BASE_URL:            /
      JUPYTERHUB_CLIENT_ID:           jupyterhub-user-gibson
      JUPYTERHUB_HOST:
      JUPYTERHUB_OAUTH_CALLBACK_URL:  /user/gibson/oauth_callback
      JUPYTERHUB_OAUTH_SCOPES:        ["access:servers!server=gibson/", "access:servers!user=gibson"]
      JUPYTERHUB_SERVER_NAME:
      JUPYTERHUB_SERVICE_PREFIX:      /user/gibson/
      JUPYTERHUB_SERVICE_URL:         http://0.0.0.0:8888/user/gibson/
      JUPYTERHUB_USER:                gibson
      JUPYTER_IMAGE:                  taylorgibson/ncssm-datascience-notebook:fa22v1
      JUPYTER_IMAGE_SPEC:             taylorgibson/ncssm-datascience-notebook:fa22v1
      MEM_GUARANTEE:                  805306368
      MEM_LIMIT:                      1073741824
      NOTEBOOK_ARGS:                  --collaborative
    Mounts:
      /home/jovyan from volume-gibson (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  volume-gibson:
    Type:        PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:   claim-gibson
    ReadOnly:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     hub.jupyter.org/dedicated=user:NoSchedule
                 hub.jupyter.org_dedicated=user:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

Seems odd that we’d both be having the same issue at the same time.

My setup is running on

  • Google Cloud Platform
  • Kubernetes
  • chart=jupyterhub-1.1.3-n743.h730f014f
  • user image is based on jupyterstacks/datascience

and has been running smoothly without any major issues for several weeks. No changes were made to the server or image.

So mine is bare metal, and I’m not even getting a node assigned. After many hours of trying EVERYTHING I can think of, I can confirm it’s not storage that’s the limitation: switching between different kinds of storage eventually showed that a user requesting a server gets a PVC → PV that works fine, but the pod still just … hangs there. It never gets scheduled. This happens on both my older install and a brand-new microk8s ‘follow the guide’ install from Zero to JupyterHub.

If there were an error, or something from the scheduler, I could try to track it down, but the pod just sits in Pending, with no errors that I can find, for 300 seconds … then gets culled. Very frustrating.
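(For anyone following along, checking the scheduler directly looks roughly like this - the node name is a placeholder, and on microk8s the scheduler runs as a snap service, e.g. snap.microk8s.daemon-kubelite, rather than as a pod:)

$ kubectl get pods -n kube-system | grep scheduler              # is kube-scheduler running at all?
$ kubectl logs -n kube-system kube-scheduler-NODENAME           # its logs, on a kubeadm cluster
$ kubectl get events -A --sort-by=.lastTimestamp | tail -n 30   # recent events across all namespaces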

I suspect you may be running into the issue recently discovered and fixed in kubespawner here, which manifests as KubeSpawner.stop never returning. If that’s the case, restarting the Hub when you see this should get it working until you can pull a kubespawner update.
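For concreteness, “restarting the Hub” here just means bouncing the hub deployment - a sketch, assuming the chart’s usual resource names and a namespace called jhub:

$ kubectl rollout restart deployment/hub -n jhub
# or simply delete the hub pod and let the deployment recreate it:
$ kubectl delete pod -n jhub -l component=hub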

That may be true for Taylor. I don’t believe it’s true for me.

KubeSpawner is just never actually spawning user pods, and I don’t know why. The logs from the hub and the scheduler have no information beyond Pending - no errors are thrown, even with debug turned on. I thought at first it was because the PVC/PV weren’t initializing correctly, but that doesn’t seem to actually matter: whether the PVC is waiting for first consumer (e.g., using OpenEBS) or is provisioned with a PV (e.g., using nfs-provisioner), the same thing happens to the user pod … it just sits there.
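(The binding-mode difference I mean is visible on the StorageClass and the claim themselves - a quick sketch, using the claim name from my describe output above; the class name is whatever your claim references:)

$ kubectl get storageclass                 # newer kubectl shows a VOLUMEBINDINGMODE column: Immediate vs WaitForFirstConsumer
$ kubectl describe storageclass NAME       # provisioner and parameters for the class the claim uses
$ kubectl get pvc claim-wburr -n jhub      # STATUS should go Pending -> Bound once a PV exists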

Note that I followed the microk8s guide (mostly; there are bugs in the current docs), and it doesn’t currently work, at least on Ubuntu 20.04 or 18.04 variants. Does anyone know if that guide has been checked lately to make sure it actually gives a functional environment at the end? After running through it, the hub, load balancer, etc. all seem to be working, but user logins don’t actually spawn running pods and eventually time out. I tried formatting a machine and following the guide exactly, to be sure it wasn’t down to my own weird configuration.

An update, just because I’ve been slamming my head into this for four days. I’ve now done five separate installations of z2jh, and none have worked. In reverse chronological order (the bottom one is the previously working system as of last week … which I broke by doing something unrelated):

  • fresh install of Ubuntu 18.04. kubeadm, with a custom twist to have static NFS sliced storage. Set it all up, get to the end. Exact same issue.
  • fresh format, Ubuntu 18.04. microk8s, just to be sure.
  • fresh format, Ubuntu 20.04, follow the microk8s instructions in the guide. Fix some of the missing instructions, get to the end. User pods hang, wait 300 seconds, get culled. Tried three different storage options in case that was the limitation (e.g., the spawner is waiting for a PV and can’t proceed until it has one).
  • Ubuntu 18.04, previously had a working system with kubeadm. kubeadm reset, init, rebuild. Same hanging issue at the end.
  • previous install of Ubuntu 18.04. kubeadm. Set it all up, get to the end, exact same hanging issue mentioned above. What started me down this path.

I cannot see anything I’m missing from the usual guide … what is the pathway to debugging the spawner? The pods just sit in pending mode and it’s driving me up the wall. I cannot see why they aren’t being scheduled … what am I missing?

$ kubectl get pods

jupyter-testing 0/1 Pending 0 4s

$ kubectl describe pods jupyter-testing

Name:             jupyter-testing
Namespace:        default
Priority:         0
Service Account:  default
Node:
Labels:           app=jupyterhub
                  chart=jupyterhub-1.2.0
                  component=singleuser-server
                  heritage=jupyterhub
                  hub.jupyter.org/network-access-hub=true
                  hub.jupyter.org/servername=
                  hub.jupyter.org/username=testing
                  release=jhub1.5.0
Annotations:      hub.jupyter.org/username: testing
Status:           Pending
IP:
IPs:
Init Containers:
  block-cloud-metadata:

$ kubectl logs hub-XXX-YYY

[I 2022-09-08 17:59:32.781 JupyterHub base:762] User logged in: testing
[I 2022-09-08 17:59:32.781 JupyterHub log:189] 302 POST /hub/login?next=%2Fhub%2F -> /hub/ (testing@::ffff:10.244.0.1) 4.79ms
[I 2022-09-08 17:59:32.810 JupyterHub log:189] 302 GET /hub/ -> /hub/spawn (testing@::ffff:10.244.0.1) 11.13ms
[D 2022-09-08 17:59:32.827 JupyterHub pages:217] Triggering spawn with default options for testing
[D 2022-09-08 17:59:32.827 JupyterHub base:880] Initiating spawn for testing
[D 2022-09-08 17:59:32.827 JupyterHub base:884] 0/64 concurrent spawns
[D 2022-09-08 17:59:32.827 JupyterHub base:889] 0 active servers
[I 2022-09-08 17:59:32.841 JupyterHub provider:574] Creating oauth client jupyterhub-user-testing
[D 2022-09-08 17:59:32.862 JupyterHub user:631] Calling Spawner.start for testing
[I 2022-09-08 17:59:32.865 JupyterHub spawner:2302] Attempting to create pod jupyter-testing, with timeout 3
[I 2022-09-08 17:59:32.867 JupyterHub log:189] 302 GET /hub/spawn -> /hub/spawn-pending/testing (testing@::ffff:10.244.0.1) 41.77ms
[I 2022-09-08 17:59:32.882 JupyterHub pages:402] testing is pending spawn
[I 2022-09-08 17:59:32.883 JupyterHub log:189] 200 GET /hub/spawn-pending/testing (testing@::ffff:10.244.0.1) 2.56ms

[I 2022-09-08 18:01:17.154 JupyterHub log:189] 200 GET /hub/api/users (cull-idle@::1) 13.89ms
[W 220908 18:01:17 init:137] Not culling server testing with pending spawn

[E 2022-09-08 18:04:31.407 JupyterHub user:718] Unhandled error starting testing’s server: pod default/jupyter-testing did not start in 300 seconds!
[D 2022-09-08 18:04:31.407 JupyterHub user:819] Stopping testing
[I 2022-09-08 18:04:31.407 JupyterHub spawner:2620] Deleting pod default/jupyter-testing
[D 2022-09-08 18:04:31.421 JupyterHub user:845] Deleting oauth client jupyterhub-user-testing
[D 2022-09-08 18:04:31.438 JupyterHub user:848] Finished stopping testing
[W 2022-09-08 18:04:31.449 JupyterHub base:976] 2 consecutive spawns failed. Hub will exit if failure count reaches 5 before succeeding
[E 2022-09-08 18:04:31.449 JupyterHub gen:623] Exception in Future <Task finished name='Task-646' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py:900> exception=TimeoutError('pod default/jupyter-testing did not start in 300 seconds!')> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 618, in error_callback
        future.result()
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/handlers/base.py", line 907, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 736, in spawn
        raise e
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/user.py", line 635, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
      File "/usr/local/lib/python3.8/dist-packages/kubespawner/spawner.py", line 2572, in _start
        await exponential_backoff(
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 184, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: pod default/jupyter-testing did not start in 300 seconds!

K3S is probably the most tested self-installed k8s distribution - it’s used in the Z2JH CI tests, so we know it works. It might be worth trying?

If you prefer to keep your existing k8s installation, can you verify that everything is working correctly before you try to use Z2JH? Kubernetes is a great abstraction layer, but if you’re maintaining your own installation you need to be sure everything is working before deploying apps that rely on those features. There are too many variations in servers, storage and networking for any guide to cover every possibility.
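As one concrete example - this is just a generic sanity check, not something from the Z2JH docs - you can exercise scheduling and dynamic provisioning with a throwaway claim and pod before involving the Hub at all (all names below are arbitrary, and the claim uses whatever your default StorageClass is):

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: smoke-test-claim
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: smoke-test
spec:
  containers:
    - name: shell
      image: busybox
      command: ["sh", "-c", "echo ok > /data/ok && sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: smoke-test-claim
EOF
$ kubectl get pod smoke-test --watch     # should go Pending -> ContainerCreating -> Running
$ kubectl describe pod smoke-test        # if it sticks at Pending, the events here should say why

If that pod never schedules or never gets its volume, the problem is in the cluster itself rather than anything JupyterHub-specific.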

Thanks for the suggestion. I’ve verified each step as I’ve gone: networking (either Calico or Flannel); load balancing with MetalLB; the standard Kubernetes pods; and storage in 5 or 6 variations (because I thought that was where the issues were coming from). And then JupyterHub would install fine on each variation, up to and including the hub pod working ‘ok’ - in the sense that the website would load, users could log in, etc. Just the user pods would stay Pending forever, and I can’t figure out how I’m supposed to debug that when the logs of the accessible pods show nothing …

I’ll try k3s as the back-bone and report back.

I’m not sure the suggestion of k3s was a good one - the documentation for JupyterHub on k3s is almost nonexistent. I tried it and had to back away.

However, I wiped both nodes clean, started again with microk8s, and used their suggested nfs-csi multi-node storage option. I carefully set up the storage class and made it the default, and once I worked through the standard setup with the NFS 4.1 server, everything works fine. So it was storage all along, as I suspected.

Specific solution: microk8s + the “MicroK8s - Use NFS for Persistent Volumes” guide, steps 1-3.
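For anyone repeating this, the Kubernetes side of those steps boils down to roughly the following. This is written from memory, so treat it as a sketch and double-check against the linked guide; the NFS server address and export path are placeholders, and the NFS 4.1 server itself must already exist:

# install the NFS CSI driver into MicroK8s
$ microk8s enable helm3
$ microk8s helm3 repo add csi-driver-nfs https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts
$ microk8s helm3 repo update
$ microk8s helm3 install csi-driver-nfs csi-driver-nfs/csi-driver-nfs \
    --namespace kube-system \
    --set kubeletDir=/var/snap/microk8s/common/var/lib/kubelet

# create a StorageClass backed by the NFS 4.1 export
$ microk8s kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io
parameters:
  server: 10.0.0.1        # placeholder: your NFS server
  share: /srv/nfs         # placeholder: your exported path
reclaimPolicy: Delete
volumeBindingMode: Immediate
mountOptions:
  - hard
  - nfsvers=4.1
EOF

# make it the default class so dynamically created claims (like the hub's user claims) use it
$ microk8s kubectl patch storageclass nfs-csi \
    -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'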

If anyone sees this and can pass on how the debugging should have happened, I would very much appreciate knowing for the future. I was able to use the NFS previously via a PVC, and other applications I tested on the cluster allocated space just fine. Something is unique or weird about how the spawner grabs space via the StorageClass/PVC which wasn’t working until I switched to the specific nfs-csi driver that microk8s suggests. I’d like to know how to see the appropriate logs to have been able to tell WHY the spawns were failing, instead of flailing in the dark.

Glad you worked it out! It would have taken me a long time to come up with that.

the documentation for jupyterhub on k3s is almost nonexistent. I tried it, and had to back away.

To me, bare metal kubernetes might as well not exist because it’s so inconsistently behaved, problem-prone, and hard to debug. It’s almost impossible to document for, though we could share our CI setups of k8s and k3d that we happen to use (though I definitely wouldn’t say “support”).

I’d like to know how to see the appropriate logs to have been able to tell WHY the spawners were failing, versus flailing in the dark.

My guesses:

  • kubectl get events, or
  • kubectl describe on pvs and/or pvcs since it was storage related

but that’s a stab in the dark.
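Concretely, those would look something like the following (namespace and claim name borrowed from the first post in the thread):

$ kubectl get events -n jhub --sort-by=.lastTimestamp
$ kubectl get events -A --field-selector involvedObject.kind=PersistentVolumeClaim
$ kubectl describe pvc claim-wburr -n jhub    # provisioning events tend to live on the claim
$ kubectl describe pv                          # and on the volumes themselves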

It’s very unlikely to be a JupyterHub issue - it’s all in Kubernetes itself, and thus only accessible via Kubernetes’ own logs/events/status. It seems likely that no JupyterHub logs could have revealed any more information, and only inspecting Kubernetes objects would do. If storage is preventing a pod from starting and kubectl describe pod shows no events related to that, that seems like a Kubernetes problem. As far as JupyterHub is concerned, it successfully created a PVC and a Pod, and then Kubernetes failed to start the pod, presumably due to some unmeetable condition or a bug in the volume provider. Kubernetes should be expected to report this in the pod’s status or events; otherwise there’s not a lot we can log about it.

If there are informative events related to the PV or PVC that don’t ever get associated to the pod, we could probably try to fetch those when a launch fails.

I’m going to write up the complete ‘here’s how I got it working on Ubuntu’ as a secondary guide for the main z2jh set of installation guides. I think it’s reproducible - bare metal → Ubuntu → microk8s → specific tweaks for single/small numbers of nodes → working installation. Might help someone!

I’ll do a PR next week with the mods. It’s half for myself, so when I have to do this again, I can remember what I did … :wink:

Thanks for the suggestions on possible debugging steps. I was running describe on the PVCs, PVs, pods, etc. for days, and nothing was coming up. The kubectl logs were basically uninformative, and nothing was being dumped to the regular system journals/error logs. My understanding is that most Jupyter errors are supposed to go to stdout/stderr, so they should have been picked up by kubectl logs.

Oh well, at least it’s working, and it’s a complete bare-metal solution for a small class/program/department/university, which requires very little aside from one beefy server (although I have two). So I’ll definitely write it up - might be useful to the next me.

@wesleyburr just wondering if you got around to doing the install writeup as we are starting along the same path soon :hugs:

I have not, I’ve been incredibly swamped for the last 4-5 months. Happy to help you out though, if you start down that path. At this point, it’ll probably be May-June before I have time to revisit this and write out my thoughts - I just got a new server delivered, so in the summer term I’m going to reprovision the whole cluster, and that’ll force me to think about this again. Feel free to send me a message, though - happy to give any guidance I can on my experiences!

I’m having this same issue deploying to a bare-metal kubernetes cluster. We are using NFS as the default storage type, and it works for other deployments (neo4j, weaviate, etc.).
The claims are created OK, but the user pod fails to start with:
Back-off restarting failed container block-cloud-metadata in pod jupyter-test_jupyterhub(08ce080b-d5a9-4182-9952-5d21aa92337c)
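(For reference, the init container’s own state and logs are where that back-off error gets explained - pod and namespace names below are taken from the message above:)

$ kubectl describe pod jupyter-test -n jupyterhub                    # init container State/Reason plus events
$ kubectl logs jupyter-test -n jupyterhub -c block-cloud-metadata    # output of the failing iptables command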

I’ve installed the CSI driver as a different storage class using a separate NFS server, but that didn’t fix the issue. I’ve now tried:
singleuser:
  storage:
    type: none
    extraVolumes:
      - name: jupyterhub-shared
        persistentVolumeClaim:
          claimName: jupyterhub-shared-volume
    extraVolumeMounts:
      - name: jupyterhub-shared
        mountPath: /home/shared

That also didn’t fix the issue. I’m outta ideas. Help?
Thank you!

-Joe

I found a fix:
singleuser:
  cloudMetadata:
    blockWithIptables: false
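As I understand it, that setting just skips the block-cloud-metadata init container entirely (on bare metal there’s no cloud metadata endpoint to block anyway), so it can no longer crash-loop. Applied with the usual helm upgrade - the release name, namespace, and values file below are just examples:

$ helm upgrade --cleanup-on-fail jhub jupyterhub/jupyterhub \
    --namespace jupyterhub \
    --values config.yaml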
