This is a z2jh 0.9.0 deployment on Google Cloud.
One user continually gets a dialog saying "Server Not Running":
Your server at /user/xxxxxxxxxxxxxx/ is not running. Would you like to restart it? (with Restart/Dismiss buttons)
In the hub pod's container logs there are a bunch of entries like this:
[W 2020-05-27 13:26:49.448 JupyterHub proxy:355] Updating route for /user/xxxxxxxxxxx/ (http://10.4.14.44:8888 → Server(url=http://10.4.14.43:8888/user/xxxxxxxxxxxxxx/, bind_url=http://10.4.14.43:8888/user/xxxxxxxxxxxx/))
Followed by:
[I 2020-05-27 13:26:49.449 JupyterHub proxy:262] Adding user xxxxxxxxxxxx to proxy /user/xxxxxxxxxxxxxx/ => http://10.4.14.43:8888
But the user’s pod is on 10.4.14.44, not 10.4.14.43:
kubectl get pod -o wide --namespace jhub
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
...
jupyter-xxxxxxxxxxxxxxxx 1/1 Running 0 8h 10.4.14.44 gke-jhub-big-user-pool-xxxxxxxxxx <none> <none>
…
Indeed, there are previous messages in the hub logs showing the route updated to 10.4.14.44 and the user added at that correct IP address, but they are followed a second later by messages of the form above, pointing to the wrong IP address.
Any ideas on how to fix this? Grateful for any suggestions.
You could run:
kubectl get pod -o json | jq -r '.items[] | "\(.status.podIP): \(.metadata.name)"' | sort
to show all pods by their IP. Is it possible that a renegade pod that shouldn't be running is still left over? Does restarting the Hub pod get this back into the right state?
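One way to check for a leftover pod would be to see whether anything in the cluster currently owns the address the proxy is routing to; the IP here is the wrong one from the logs above:
# does any pod currently hold the IP the proxy is routing to?
kubectl get pods --all-namespaces -o wide | grep 10.4.14.43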
I'm not sure how this could occur, but maybe pods can change their IP over time? If that's the case, KubeSpawner.poll should check status.pod_ip and update the stored value when a change is noticed. Restarting the Hub should ultimately do the same thing for all pods.
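In the meantime, the IP Kubernetes currently reports for the user's pod can be read directly; the pod name and namespace here are taken from the kubectl output in the question:
# print the podIP Kubernetes currently has for the user's pod
kubectl get pod jupyter-xxxxxxxxxxxxxxxx -n jhub -o jsonpath='{.status.podIP}'
If that doesn't match the address the proxy is routing to, the hub's recorded value is stale.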
Turns out this was resolved, though I’m afraid I’m not entirely sure how. But for reference, here was the situation:
- Doing a helm upgrade didn't fix it.
- Manually killing the user pod didn't fix it (it just came back at the same IP address, and the proxy was still trying to find it at the wrong one).
- Manually killing the hub deployment and then doing a helm upgrade to get it all back appears to have been what worked (rough commands below).
All has been stable for a few days now.
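For anyone hitting the same thing, that recovery path looks roughly like this; the release name (jhub), chart version, and values file are illustrative rather than copied from this deployment:
# delete the hub deployment so it comes back with fresh state
kubectl delete deployment hub -n jhub
# re-create it from the chart (release name, version, and config file are examples)
helm upgrade jhub jupyterhub/jupyterhub --namespace jhub --version 0.9.0 --values config.yaml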
For additional context, this was a setup where the version of JupyterHub in the single-user image and the version on the hub had gotten out of sync, which was causing messages about redirect loops. I moved to 0.9.0 in order to bring those versions back in sync. I thought the proxying-to-the-wrong-IP problem would be solved by that upgrade, but it wasn't.
I am a bit flummoxed about how this eventually fixed itself.
Glad it worked out, but feel free to post back debug info if you see it again. Some more detail about the relevant behavior of kubespawner and jupyterhub that might help iron out which assumptions are wrong:
- kubespawner returns pod.status.podIP after start completes, which is how the hub connects to the pod
- this value is never updated or validated again, so if a pod's IP somehow changes, JupyterHub will not notice (fixed in this PR)
- jupyterhub validates that something is running at the given IP only twice:
  - after start of the pod
  - at hub startup

Notes about this check:
- it only verifies that something is there, so if the IP is wrong and something else is using that ip:port, the hub will think everything is okay
- if the IP is wrong and nothing else is running an HTTP server on that ip:port, the hub will notice on restart and force the pod to shut down. This is why restarting the hub is often a way to get everything back into a consistent state. (A quick way to perform the same kind of check by hand is sketched below.)
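As a rough illustration (not the hub's actual code, just the same idea of "does anything answer HTTP at that ip:port"), the recorded address can be probed by hand; the IP below is the wrong one from the logs above:
# prints an HTTP status code if anything is listening there, fails otherwise
curl -sS -o /dev/null -w '%{http_code}\n' http://10.4.14.43:8888/
If that succeeds even though it isn't the user's pod, the hub's check would be fooled in exactly the way described.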