When debugging reports of users not being able to start their servers, I discovered the following interesting snippet in the logs:
[I 2018-11-07 03:12:37.822 JupyterHub proxy:242] Adding user user1 to proxy /user/user1/ => http://hub-7d7bc8659b-wbswv:35349
[I 2018-11-07 03:12:37.822 JupyterHub proxy:242] Adding user user2 to proxy /user/user2/ => http://hub-7d7bc8659b-wbswv:49653
[I 2018-11-07 03:12:37.823 JupyterHub proxy:242] Adding user user3 to proxy /user/user3/ => http://hub-7d7bc8659b-wbswv:40787
[I 2018-11-07 03:12:37.824 JupyterHub proxy:242] Adding user user4 to proxy /user/user4/ => http://hub-7d7bc8659b-wbswv:48817
[I 2018-11-07 03:12:37.824 JupyterHub proxy:242] Adding user user5 to proxy /user/user5/ => http://hub-7d7bc8659b-wbswv:35191
[I 2018-11-07 03:12:37.825 JupyterHub proxy:242] Adding user user6 to proxy /user/user6/ => http://hub-7d7bc8659b-wbswv:32827
[I 2018-11-07 03:12:37.826 JupyterHub proxy:242] Adding user user7 to proxy /user/user7/ => http://hub-7d7bc8659b-wbswv:34835
[I 2018-11-07 03:12:37.826 JupyterHub proxy:242] Adding user user8 to proxy /user/user8/ => http://hub-7d7bc8659b-wbswv:58385
[I 2018-11-07 03:12:37.827 JupyterHub proxy:242] Adding user user9 to proxy /user/user9/ => http://hub-7d7bc8659b-wbswv:33395
[I 2018-11-07 03:12:37.827 JupyterHub proxy:242] Adding user user10 to proxy /user/user10/ => http://hub-7d7bc8659b-wbswv:41343
[I 2018-11-07 03:12:37.828 JupyterHub proxy:242] Adding user user11 to proxy /user/user11/ => http://hub-7d7bc8659b-wbswv:33805
[I 2018-11-07 03:12:37.828 JupyterHub proxy:242] Adding user user12 to proxy /user/user12/ => http://hub-7d7bc8659b-wbswv:37561
The hub is asking the proxy to route these users to itself (the hub pod) on random ports, when it should be asking it to route them to each user pod's IP on a static port. Here is a log line that is correct:
[I 2018-11-07 17:46:45.784 JupyterHub proxy:242] Adding user user13 to proxy /user/user13/ => http://10.240.1.9:8888
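For reference, "Adding user ... to proxy" boils down to the hub POSTing the route target it has on record to configurable-http-proxy's REST API. A rough Python sketch of that call (the proxy URL and token below are placeholders, not values from this deployment):

import requests

proxy_api = "http://127.0.0.1:8001"   # placeholder proxy API endpoint
token = "PROXY_AUTH_TOKEN"            # placeholder auth token

def add_route(path, target):
    """Ask the proxy to send traffic for `path` to `target`."""
    requests.post(
        f"{proxy_api}/api/routes{path}",
        headers={"Authorization": f"token {token}"},
        json={"target": target},
    )

# A bad route points back at the hub pod on a random port:
add_route("/user/user1/", "http://hub-7d7bc8659b-wbswv:35349")
# A good route points at the user pod's IP on the notebook port:
add_route("/user/user13/", "http://10.240.1.9:8888")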
With the wrong routes, these users can never actually reach their servers until the servers are explicitly stopped and started again.
All of the users whose servers got wrong routes also had a log entry like:
[W 2018-11-07 03:12:37.820 JupyterHub proxy:324] Adding missing route for /user/<username>/ (<jupyterhub.objects.Server object at 0x7f639b29ac18>)
There were also users who were missing routes but had a correct route added, though.
I believe these are Server objects where the ip was never set (default ip unset -> bind on 0.0.0.0 -> connect ip falls back to the hostname, i.e. the hub pod's own hostname).
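To make that fallback concrete, here is a tiny sketch of the chain; it is an illustration of the behavior described above, not the actual jupyterhub.objects.Server code:

import socket

def connect_url(ip, port):
    # If the ip was never filled in, the bind address defaults to 0.0.0.0 and
    # the connect address falls back to the local hostname -- which, since the
    # Server object lives in the hub process, is the hub pod's own hostname.
    if not ip or ip == "0.0.0.0":
        ip = socket.gethostname()   # e.g. hub-7d7bc8659b-wbswv
    return f"http://{ip}:{port}"

print(connect_url("", 35349))           # http://<hub-hostname>:35349  (bad route)
print(connect_url("10.240.1.9", 8888))  # http://10.240.1.9:8888       (good route)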
So what we are seeing is a collection of users being added to the proxy, likely when they shouldn’t be.
There are two probable causes:
1. the servers are running, but failed to set their ip for some reason
2. the servers aren't running, but a Server object exists in the database that shouldn't.
Since it was determined on Gitter that these messages occur during jupyterhub startup, my hunch is that we have inconsistent state resulting from jupyterhub shutting down while several user servers had outstanding spawn requests. Some of those spawns may have completed, others may not have.
Restoring the list of running servers at startup happens here:
if spawner.server:
    # spawner.server is set as soon as a spawn is requested, so this can be
    # True even if Spawner.start never finished
    try:
        # poll() returns None if the Spawner thinks the server is still running
        status = await spawner.poll()
    except Exception:
        self.log.exception("Failed to poll spawner for %s, assuming the spawner is not running.",
            spawner._log_name)
        status = -1
The first check is whether spawner.server is defined. The Server object is created immediately when a spawn is requested, prior to calling Spawner.start, so even if Spawner.start never ran to completion, this will be True. The second check is meant to recover from that: poll() asks the Spawner itself whether it thinks the server is running. Hopefully that reports the server as not running when it really isn't; otherwise we are out of luck.
Here’s where I think we have the problem:
1. spawn is requested (a Server object is instantiated with uninitialized ip and port; the pod is created)
2. the hub shuts down before spawn gets a chance to return the url where the server is running
3. the hub restarts: the spawner is running now, the Server object exists but is still uninitialized
So what we have is:
- the Hub knows the server is running
- the Spawner knows the server is running
- the Hub doesn't know where the server is!
Some Spawners (kubespawner) can retrieve the url from the object they manage (the pod), so adding the ability to refresh the URL from the spawner at this stage should solve the problem (if this is indeed what's happening). Other Spawners (e.g. LocalProcessSpawner) don't have this information and rely on the jupyterhub database as the url storage. So, in general, we could add a step after the existing two checks that does the following (see the sketch below the list):
1. give the spawner a chance to refresh the url (a no-op by default for spawners that can't), and
2. add an is_up check on the server before assuming the spawner is running, shutting it down if it isn't.
This should allow spawners like kubespawner to resolve the correct url and recover, and all spawners to kill server instances that may have only partially started during hub restart.
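A rough sketch of what that could look like in the startup path, assuming a spawner hook named get_url() (returning None by default) and a reachability helper server_is_up(); both names are placeholders, not the actual JupyterHub API:

from urllib.parse import urlparse

async def check_spawner(self, user, spawner):
    """Sketch: recover or clean up a spawner found at hub startup."""
    status = None
    if spawner.server:
        try:
            status = await spawner.poll()
        except Exception:
            self.log.exception("Failed to poll spawner for %s, assuming the spawner is not running.",
                               spawner._log_name)
            status = -1

    if status is None:
        # 1. the spawner claims to be running: let it refresh its url, in case
        #    the hub died before the spawn ever recorded ip/port (no-op -> None
        #    for spawners that can't, e.g. LocalProcessSpawner)
        url = await spawner.get_url()
        if url:
            parsed = urlparse(url)
            spawner.server.ip = parsed.hostname
            spawner.server.port = parsed.port

        # 2. only keep the route if the server actually answers; otherwise
        #    treat it as a half-started spawn and shut it down
        if not await self.server_is_up(spawner.server):
            self.log.warning("Server for %s is unreachable, stopping it", spawner._log_name)
            await spawner.stop()
            status = -1

    return status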
I am now seeing this reproducibly on every restart, and I think Azure's longer startup times might be a factor. Maybe we can repro this if we simulate slow starts and restart the hub in the middle of a start.
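One possible way to simulate that, sketched as a jupyterhub_config.py snippet (the SlowSpawner class and the sleep duration are made up for illustration):

import asyncio
from jupyterhub.spawner import LocalProcessSpawner

class SlowSpawner(LocalProcessSpawner):
    """Simulate a server that takes a long time to report where it is running."""
    async def start(self):
        result = await super().start()  # the server process comes up as usual...
        await asyncio.sleep(120)        # ...but the url isn't reported back for a while;
                                        # kill the hub during this window
        return result

c.JupyterHub.spawner_class = SlowSpawner
c.Spawner.start_timeout = 300           # don't let the hub time the spawn out first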
A follow-up data point that would be useful: when this happens, are the pods with bad URLs running or not? If they are not, then Spawner.poll() should have caught this. If they are, then I think the added checks above should be all we need.
We also root-caused this whole issue to the hub running extremely slowly because the sqlite db was accidentally on an SMB share, so this only manifests in cases where the db is extremely slow.