We have set up a bare-metal Kubernetes cluster and deployed JupyterHub on it for students here at UC Davis to use. There has been a recurring issue affecting our JupyterHub service: user pods sometimes get stuck in the Terminating state, and that seems to affect other pods running on the same node as well. Our workaround so far has been to run `kubectl delete pod <pod-name> --grace-period=0 --force` and to drain the affected node. We are not sure what the root cause of this problem is. Has anyone else run into a similar problem? Hi @choldgraf, have you run into this with the deployments at Berkeley? Thank you in advance.
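For reference, here is roughly what our workaround looks like, with `<pod-name>`, `<node-name>`, and `<namespace>` as placeholders for our actual values:

```shell
# Force-delete the stuck pod, skipping graceful termination
kubectl delete pod <pod-name> --namespace <namespace> --grace-period=0 --force

# Cordon and drain the affected node so no new user pods land on it
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```

We know force deletion only removes the pod from the API server and doesn't guarantee the container or its storage attachment is actually cleaned up on the node, which is partly why we suspect something deeper is going on.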
Hi @lux12337, I would guess this is a general Kubernetes issue.
Here are some questions I'd raise to figure out the issue:
- Run `kubectl describe pod <pod-name>` and inspect its status and events.
- Try to figure out if the pod’s container(s) are still running.
- Try to figure out if there are issues with detaching storage for the pod. I don’t know specifically how to check that, but my hunch is that it could be an issue.
- If PVC resources are created for the user pod because dynamic storage is configured (the default) in the zero-to-jupyterhub-k8s helm chart, I’d inspect that specific PVC and its associated PV. Does Kubernetes currently consider the storage in use by a pod? Is the PVC bound to the PV? Those kinds of questions.
- What is the state of the node that the pod resides on? Is the node healthy?
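A rough sketch of the commands I'd use for the checks above (placeholders in angle brackets; I'm assuming the z2jh default where user PVCs are named after the user, e.g. `claim-<username>` — adjust if you've customized `pvcNameTemplate`):

```shell
# 1. Inspect the stuck pod's status and recent events
kubectl describe pod <pod-name> --namespace <namespace>

# 2. Check whether its containers are still reported as running
kubectl get pod <pod-name> --namespace <namespace> \
  -o jsonpath='{.status.containerStatuses[*].state}'

# 3. Inspect the user's PVC and the PV it is bound to
kubectl get pvc --namespace <namespace>
kubectl describe pvc claim-<username> --namespace <namespace>
kubectl describe pv <pv-name>

# 4. Check the health of the node the pod is scheduled on
kubectl describe node <node-name>
kubectl get node <node-name> -o jsonpath='{.status.conditions}'
```

The node's `Conditions` section (step 4) and the pod's `Events` section (step 1) are usually the quickest way to spot kubelet, disk-pressure, or volume-detach problems.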