How to clean up orphaned user pods after a bug in z2jh 3.0 and KubeSpawner 6.0

A bug in z2jh 3.0 (the JupyterHub Helm chart), which uses KubeSpawner 6.0, disrupts running users' servers whenever the hub pod starts up, for example when upgrading to z2jh 3.0 or when re-configuring the chart causes the pod to restart.

Deployments impacted

This bug was introduced in the 3.0.0 release and is patched in the 3.1.0 release. The bug was also present in the 3.0.0-alpha.1 pre-release and later development releases.


Users’ perspective

From a JupyterHub user's perspective, this disruption looks like being redirected from /user/<username>, where they typically work, back to /hub, where, depending on the JupyterHub's configuration, they are either prompted to start a server again or have one started automatically.

Admins’ perspective

Whenever the hub pod restarts with a bug-affected version, you may see that the /hub/admin panel reports user servers as stopped even though running user server pods are visible in Kubernetes.

When user servers are inactive, they are typically stopped automatically by jupyterhub-idle-culler, which is enabled by default in z2jh (cull.enabled), but that won't happen in this case because JupyterHub already considers the servers stopped.
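For reference, the culler is controlled by the chart's cull settings; the values below are z2jh's defaults:

```yaml
cull:
  enabled: true   # run jupyterhub-idle-culler as a hub-managed service
  timeout: 3600   # seconds of inactivity before a server is culled
  every: 600      # seconds between checks for inactive servers
```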

Cleanup orphaned user server pods

If you have been using z2jh 3.0.0-alpha.1 to 3.0.3, you should check for orphaned user server pods, i.e. running pods whose servers JupyterHub doesn't consider running.

Using @minrk’s Python script

@minrk has written a Python script, found in this gist, to clean up orphaned user servers. It can be run by a user with administrative access to both Kubernetes and JupyterHub itself.
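The comparison at the heart of the script can be sketched as follows: JupyterHub's API reports which servers it considers active, Kubernetes reports which user server pods actually exist, and any pod without a matching active server is orphaned. This sketch assumes KubeSpawner's default jupyter-<username> pod naming; the real script handles details like username escaping.

```python
def find_orphaned_pods(active_usernames, pod_names):
    """Return user server pods that JupyterHub no longer tracks.

    active_usernames: usernames JupyterHub reports as having active servers.
    pod_names: names of user server pods actually present in Kubernetes.
    """
    # Assumes KubeSpawner's default pod naming scheme, jupyter-<username>.
    expected = {f"jupyter-{name}" for name in active_usernames}
    return sorted(set(pod_names) - expected)
```

For example, if JupyterHub only knows about alice but pods exist for alice and bob, jupyter-bob is orphaned.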

From a computer with Python and kubectl configured to access the Kubernetes cluster where the JupyterHub is installed, do the following:

  1. Download Min’s script from the gist on GitHub
  2. Visit /hub/token and request a token with a short-lived access duration
  3. Set the environment variable JUPYTERHUB_API_TOKEN to the token from the previous step. Note that the permission needed is to read information about all users, which admin users have but non-admin users don’t.
  4. Configure and verify access to the Kubernetes cluster where the JupyterHub is running
  5. Run the script, passing it the URL of your hub and the k8s namespace via the --namespace flag

In practice, on a Mac or Linux computer, this can look like the following:

# 1. Download script
# 2. Request an API token from /hub/token
# 3. Set environment variable for use by script
export JUPYTERHUB_API_TOKEN=1234567890abcdef1234567890abcdef
# 4. Verify you can work against the k8s cluster and it seems to be the right namespace
kubectl get all --namespace <namespace>
# 5. Run the script
python --namespace <namespace>

The script should now have printed diagnostic information and a kubectl delete pod command listing all orphaned pods. Copy it, add --namespace <namespace> to it, and then run it to delete all the orphaned pods the script detected.
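Under the hood, querying JupyterHub for its active servers boils down to a call to the hub's REST API. A minimal sketch using only the standard library (the hub URL below is a placeholder for your own deployment's address):

```python
import json
import os
import urllib.request

# Placeholder: substitute your own hub's URL.
hub_url = "https://hub.example.org"
token = os.environ.get("JUPYTERHUB_API_TOKEN", "")

# JupyterHub's REST API lists all users; each user record includes a
# "servers" dict of the servers JupyterHub currently considers running.
req = urllib.request.Request(
    f"{hub_url}/hub/api/users",
    headers={"Authorization": f"token {token}"},
)
# users = json.load(urllib.request.urlopen(req))  # requires network access
```

Comparing those server records against the pods in the namespace is what lets the script print the kubectl delete pod command.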

Using a helm config

I’ve adjusted Min’s script to run inside the hub pod via a JupyterHub chart config file, and to delete the detected orphaned servers on startup without asking. This can be useful if, for example, you manage several JupyterHubs with shared configuration files.

# 1. Download JupyterHub chart config addition

# 2. Verify that the chart config file is nested correctly; it's made to work
#    assuming the jupyterhub chart isn't a chart dependency. If you have a
#    Helm chart that in turn depends on the jupyterhub chart, you would need
#    to nest the configuration accordingly.

# 3. Perform a chart upgrade referencing the chart config addition
helm upgrade <...> --values cleanup-service.values.yaml

# 4. Get the hub pod's logs
kubectl logs deploy/hub

# 5. Look for log lines like these
# INFO:/tmp/ 1 active user servers according to JupyterHub
# INFO:/tmp/ 1 active user server pods according to Kubernetes
# INFO:/tmp/ user server pods are orphaned
# INFO:/tmp/ of orphaned pods complete.

# 6. Perform a chart upgrade without the cleanup service
helm upgrade <...>
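The contents of cleanup-service.values.yaml aren't reproduced here, but as an illustration of the mechanism, z2jh's hub.extraFiles option can mount a script into the hub pod, and hub.services can register it as a hub-managed service that runs at startup. The service name and mount path below are hypothetical, not the actual file's values:

```yaml
hub:
  extraFiles:
    cleanup-script:                # arbitrary key for this mounted file
      mountPath: /usr/local/etc/jupyterhub/cleanup.py  # hypothetical path
      stringData: |
        # ... the cleanup script's contents go here ...
  services:
    pod-cleanup:                   # hypothetical service name
      command:
        - python3
        - /usr/local/etc/jupyterhub/cleanup.py
```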

This was a pretty big head scratcher for me when I saw that user API payloads were declaring that some servers were stopped/not ready even though the pods were clearly still running.

I ran into this issue when upgrading a re-configured Helm chart, specifically the user placeholder configuration. My biggest concern was that the culler wouldn't register these servers/pods at all, leaving them running indefinitely.

Then I saw the changelog notes regarding this issue for 3.1.0 right before I started really investigating.

Great triage and solution, super clear. Thanks for the help
