I’m at a loss, and hoping someone can point me to a resource. I had to reinstall my JupyterHub (bare metal, following the zero-to-jupyterhub guide mostly, with some custom tweaks) due to issues outside the Hub itself. I thought I got it up and running this morning in the same setup as before: NFS for storage, dynamically provisioned; OAuth2 working correctly. The only remaining issue is that the user servers just … don’t spawn. 300 seconds, nothing happens, and then it times out.
There are no logs I can find that explain what’s happening. The hub logs are clean. The proxy logs are clean. The user pods have no log output. They never get assigned to a node, and I can’t find any detail on where/why/what is preventing the server from spawning.
`kubectl describe pods jupyter-USER`
gives the basics, but no indication of problems that I can see. Anyone have ideas I could follow up on to determine wtf is happening to prevent the hub from spawning servers for the users?
Further notes: journalctl shows nothing (no errors thrown that I can see), and `kubectl logs` on the various pods is empty.
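For reference, here’s roughly what I’ve been running to look for clues (the `jhub` namespace and the deployment/pod names are placeholders for my setup):

```shell
# Hub and proxy logs (deployment names are from my z2jh install)
kubectl -n jhub logs deploy/hub
kubectl -n jhub logs deploy/proxy

# Details on the stuck user pod, including its (empty) event list
kubectl -n jhub describe pod jupyter-USER

# All recent events in the namespace, oldest first
kubectl -n jhub get events --sort-by=.metadata.creationTimestamp
```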
I seem to be having the same issue today on my cluster, on a new node that was just started. To my knowledge this was not an issue yesterday; it only started today. I can get a user pod to start, but the user is stuck at the “requesting server” page for 300 s, and is then moved to a “Your server is stopping” page where they are stuck forever.
I can see that the pod started on the node, and getting the status using kubectl as you did yields similar results (redacted a few items):
So mine is bare metal, and I’m not even getting a node assignment. After many hours of trying EVERYTHING I can think of, I can confirm storage is not the limitation: switching between storage backends eventually showed that a user requesting a server gets a PVC → PV that binds fine, but the pod still just … hangs there. It never gets scheduled. This happens on both my older install and a brand-new microk8s ‘follow the guide’ install from zero-to-jupyterhub.
If there were an error, or something from the scheduler, I could try to track it down, but the pod just sits in Pending, with no errors I can find, for 300 seconds … then gets culled. Very frustrating.
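In case it helps anyone else chasing a Pending pod, these are the scheduler-side checks I’ve been trying (node name is a placeholder):

```shell
# Is the node Ready, and does it carry taints that would block scheduling?
kubectl get nodes
kubectl describe node NODE-NAME | grep -i -A3 taint

# Scheduler logs; the component label matches kubeadm static pods.
# microk8s runs the scheduler as a snap service instead, so on older
# microk8s you would check journalctl -u snap.microk8s.daemon-scheduler
kubectl -n kube-system logs -l component=kube-scheduler --tail=100
```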
I suspect you may be running into the issue recently discovered and fixed in kubespawner here, which manifests as KubeSpawner.stop never returning. If that’s the case, restarting the Hub when you see this should get it working until you can pull a kubespawner update.
That may be true for Taylor. I don’t believe it’s true for me.
KubeSpawner just never actually spawns the user pods, and I don’t know why. The logs from the hub and the scheduler have no information beyond Pending; no errors are thrown, even with debug turned on. I thought at first the PVC/PV weren’t initializing correctly, but that doesn’t seem to actually matter: whether the PVC is waiting for first consumer (e.g., using OpenEBS) or is bound to a provisioned PV (e.g., using nfs-provisioner), the same thing happens to the user pod … it just sits there.
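For completeness, the storage objects can be inspected directly; this is roughly what I’ve been checking (the claim name is whatever KubeSpawner generated, `claim-USER` in my case):

```shell
# Claim/volume status: a claim stuck in Pending vs. Bound tells you
# whether provisioning itself is the problem
kubectl -n jhub get pvc,pv
kubectl -n jhub describe pvc claim-USER

# Which storage class is the default, and which provisioner backs it
kubectl get storageclass
```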
Note that I followed the microk8s guide (mostly; there are bugs in the current docs), and it doesn’t currently work, at least on the Ubuntu 20.04 or 18.04 variants. Does anyone know whether that guide has been checked lately to ensure it actually produces a functional environment at the end? After running through it, the hub, load balancer, etc. all seem to be working, but user logins don’t actually spawn running pods, and eventually time out. I tried a clean format on a machine → follow the guide exactly, to be sure it wasn’t my weird configuration.
Update, just because I’ve been slamming my head against this for four days. I’ve now done five separate installations of z2jh, and none have worked. In reverse chronological order (the bottom one is the previously working system as of last week … which I messed up doing something unrelated):
1. Fresh install of Ubuntu 18.04; kubeadm, with a custom twist for static NFS sliced storage. Set it all up, got to the end. Exact same issue.
2. Fresh format, Ubuntu 18.04; microk8s, just to be sure.
3. Fresh format, Ubuntu 20.04; followed the microk8s instructions in the guide, fixed some of the missing instructions, got to the end. User pods hang, wait 300 seconds, get culled. Tried three different storage options in case that was the limitation (e.g., the spawner waiting on a PV and unable to proceed until it has one).
4. Ubuntu 18.04, previously a working system with kubeadm; kubeadm reset, init, rebuild. Same hanging issue at the end.
5. Previous install of Ubuntu 18.04 with kubeadm. Set it all up, got to the end, exact same hanging issue mentioned above. This is what started me down this path.
I cannot see anything I’m missing from the usual guide … what is the pathway to debugging the spawner? The pods just sit in Pending and it’s driving me up the wall. I cannot see why they aren’t being scheduled … what am I missing?
K3s is probably the most-tested self-installed k8s distribution; it’s used in the Z2JH CI tests, so we know it works. It might be worth trying?
If you prefer to keep your existing k8s installation, can you verify that everything is working correctly before you try to use Z2JH? Kubernetes is a great abstraction layer, but if you’re maintaining your own installation you need to be sure everything is working before deploying apps which rely on those features. There are too many variations in servers, storage and networking for any guide to cover every possibility.
Thanks for the suggestion. I’ve verified each step as I’ve gone: networking (either Calico or Flannel); load balancing with MetalLB; the standard Kubernetes pods; and storage in five or six variations (because I thought that was where the issues were coming from). And then JupyterHub would install fine on each variation, up to and including the hub pod working ‘ok’, in the sense that the website would load, users could log in, etc. Just the user pods would stay Pending forever, and I can’t figure out how I’m supposed to debug that when the logs of the accessible pods show nothing …
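To rule out JupyterHub entirely, a stand-alone smoke test like the sketch below (all names are made up) creates a PVC and a pod that mounts it. If this pod also sticks in Pending, the problem is in the storage or scheduling layer, not the spawner:

```shell
# Create a throwaway PVC plus a busybox pod that writes to it
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: smoke-test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: smoke-test-pod
spec:
  containers:
    - name: writer
      image: busybox
      command: ["sh", "-c", "echo ok > /data/ok && sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: smoke-test-pvc
EOF

# If this times out, the storage/scheduling layer is broken on its own
kubectl wait --for=condition=Ready pod/smoke-test-pod --timeout=120s
```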
I’m not sure the suggestion of k3s was a good one: the documentation for JupyterHub on k3s is almost nonexistent. I tried it and had to back away.
However, I wiped both nodes clean, started again with microk8s, and used their suggested nfs-csi multi-node storage option, carefully setting up the storage class and making it the default. Once I worked through the standard setup, with an NFS 4.1 server, everything works fine. So it was storage all along, as I suspected.
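Roughly, the working setup looks like this. The chart repo and the `kubeletDir` override follow the upstream csi-driver-nfs and microk8s docs; the NFS server address and export path are placeholders for my box:

```shell
# Install the NFS CSI driver via helm (microk8s ships helm as an addon)
microk8s enable helm3
microk8s helm3 repo add csi-driver-nfs \
    https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts
microk8s helm3 install csi-driver-nfs csi-driver-nfs/csi-driver-nfs \
    --namespace kube-system \
    --set kubeletDir=/var/snap/microk8s/common/var/lib/kubelet

# StorageClass pointing at the NFS server, marked as the cluster default
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: nfs.csi.k8s.io
parameters:
  server: NFS-SERVER-IP
  share: /srv/nfs
mountOptions:
  - nfsvers=4.1
EOF
```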
If anyone sees this and can pass on how the debugging should have happened, I would very much appreciate knowing it for the future. I was able to use NFS previously via PVC, and other applications I tested on the cluster allocated space fine. Something is unique or weird about how the spawner grabs space via the SC/PVC path, which wasn’t working until I switched to the specific nfs-csi driver that microk8s suggests. I’d like to know how to see the appropriate logs to have been able to tell WHY the spawns were failing, versus flailing in the dark.
Glad you worked it out! It would have taken me a long time to come up with that.
> the documentation for jupyterhub on k3s is almost nonexistent. I tried it, and had to back away.
To me, bare-metal Kubernetes might as well not exist: it’s so inconsistently behaved, problem-prone, and hard to debug. It’s almost impossible to document for, though we could share the CI setups of k8s and k3d that we happen to use (though I definitely wouldn’t say “support” them).
> I’d like to know how to see the appropriate logs to have been able to tell WHY the spawners were failing, versus flailing in the dark.
My guesses:
- `kubectl get events`, or
- `kubectl describe` on the PVs and/or PVCs, since it was storage related
but that’s a stab in the dark.
It’s very likely not a JupyterHub issue at all, but something in Kubernetes itself, and thus only visible via Kubernetes’ own logs/events/status. No JupyterHub log could have revealed more information; only inspecting the Kubernetes objects would. If storage is preventing a pod from starting and `kubectl describe pod` shows no events related to that, that seems like a Kubernetes problem. As far as JupyterHub is concerned, it successfully created a PVC and a Pod, and then Kubernetes failed to start the pod, presumably due to some unmeetable condition or a bug in the volume provider. Kubernetes should be expected to report this in the pod’s status or events; otherwise there’s not a lot we can log about it.
If there are informative events related to the PV or PVC that never get associated with the pod, we could probably try to fetch those when a launch fails.
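Something like this would pull the PVC-scoped events; the namespace and claim name here are hypothetical:

```shell
# Events attached to the claim itself, rather than to the pod
kubectl -n jhub get events \
  --field-selector involvedObject.kind=PersistentVolumeClaim,involvedObject.name=claim-USER
```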
I’m going to write up the complete ‘here’s how I got it working on Ubuntu’ as a secondary guide for the main z2jh set of installation guides. I think it’s reproducible - bare metal → Ubuntu → microk8s → specific tweaks for single/small numbers of nodes → working installation. Might help someone!
I’ll do a PR next week with the mods. It’s half for myself, so when I have to do this again, I can remember what I did …
Thanks for the suggestions on the possible debugging routes. I was running describe on the PVCs, PVs, pods, etc. for days, and nothing was coming up. The kubectl logs were basically uninformative. And nothing was being dumped to the regular system journals/error logs. My understanding is that most of the Jupyter errors are supposed to go to stdout/stderr, so they should have been picked up by `kubectl logs`.
Oh well, at least it’s working, and it’s a complete bare metal solution for a small class/program/department/university. Which requires very little aside from one beefy server (although I have two). So I’ll definitely write it up - might be useful to the next me.
I have not, I’ve been incredibly swamped for the last 4-5 months. Happy to help you out though, if you start down that path. At this point, it’ll probably be May-June before I have time to revisit this and write out my thoughts - I just got a new server delivered, so in the summer term I’m going to reprovision the whole cluster, and that’ll force me to think about this again. Feel free to send me a message, though - happy to give any guidance I can on my experiences!
I’m having this same issue deploying to a bare-metal kubernetes cluster. We are using NFS as the default storage type, and it works for other deployments (neo4j, weaviate, etc.).
The claims are created OK, but the user pod fails to start with:
Back-off restarting failed container block-cloud-metadata in pod jupyter-test_jupyterhub(08ce080b-d5a9-4182-9952-5d21aa92337c)
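Since block-cloud-metadata is an init container in the z2jh user pod, its logs and exit status can be fetched directly; the namespace and pod name below are from my deployment:

```shell
# Logs of the crashing init container (-c selects the container)
kubectl -n jupyterhub logs jupyter-test -c block-cloud-metadata

# Exit codes and events for the pod, including init container statuses
kubectl -n jupyterhub describe pod jupyter-test
```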
I’ve installed the CSI driver as a different storage class backed by an NFS server, but that didn’t fix the issue. I’ve since tried:
```yaml
singleuser:
  storage:
    type: none
    extraVolumes:
      - name: jupyterhub-shared
        persistentVolumeClaim:
          claimName: jupyterhub-shared-volume
    extraVolumeMounts:
      - name: jupyterhub-shared
        mountPath: /home/shared
```
That also didn’t fix the issue. I’m outta ideas. Help?
Thank you!