We’re currently using z2jh 0.9.0-n212.hdd4fd8bd with kubespawner 0.13.0 and jupyterhub at git hash 8a3790b01ff944c453ffcc0486149e2a58ffabea (we’re in the process of upgrading to z2jh 0.10.6, kubespawner 0.15.0 and jupyterhub 1.2.2). We had a user report that they couldn’t start their notebook server and I noticed it was because the hub DB thought the server was pending a stop:
Mar 22 09:13:57 hub-7b66dc7c9c-n8vn2 hub WARNING WARNING 2021-03-22T14:13:57.741Z [JupyterHub web:1786] 400 POST /hub/api/users/603ba6b21b93e5086eec8eb0/server (10.241.6.8): 603ba6b21b93e5086eec8eb0 is pending stop
However, the pod was already gone, so the hub DB was out of sync. I tried deleting the server using the REST API and got a 202, but the server wasn’t deleted:
Mar 22 10:28:23 hub-7b66dc7c9c-n8vn2 hub INFO INFO 2021-03-22T15:28:23.299Z [JupyterHub log:178] 202 DELETE /hub/api/users/603ba6b21b93e5086eec8eb0/server (5e1895ecbbc00e0011fbba1d@10.187.252.153) 40.59ms
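For reference, the call was just the standard stop-server endpoint, roughly the sketch below (hub URL and token are placeholders); per the REST API docs a 204 would have meant the server actually stopped, while the 202 we got only means the stop request was accepted and is still pending:

```python
import requests

HUB_API = "http://127.0.0.1:8081/hub/api"  # placeholder hub API URL
TOKEN = "<admin-api-token>"                # placeholder admin token

resp = requests.delete(
    f"{HUB_API}/users/603ba6b21b93e5086eec8eb0/server",
    headers={"Authorization": f"token {TOKEN}"},
)
# 204 No Content: server fully stopped
# 202 Accepted: stop accepted but still pending (all we ever got back here)
print(resp.status_code)
```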
Ultimately we had to restart the hub to clear the stuck state; this was in the logs after the restart:
Mar 22 10:34:20 hub-79c4dc8d66-lsts2 hub WARNING WARNING 2021-03-22T15:34:20.590Z [JupyterHub app:2042] 603ba6b21b93e5086eec8eb0 appears to have stopped while the Hub was down
While looking at the delete server API code I noticed the remove flag, which was added for removing named servers.
In this situation I’m not even sure it would help; it might actually result in a 500 error if spawner._stop_future isn’t set. I was essentially hoping for some kind of force delete, since I know the pod is gone but for whatever reason the hub and kubespawner never get that back in sync in any periodic task; only a hub restart fixed it. Maybe whatever our issue was is fixed in newer kubespawner/jupyterhub, but I figured I’d ask about documenting that remove option in both APIs.
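From my reading of the handler, the flag would be passed in the JSON body of the same DELETE request, something like the sketch below (again with placeholder URL/token, and with no guarantee the default-server endpoint handles it gracefully in this state):

```python
import requests

HUB_API = "http://127.0.0.1:8081/hub/api"  # placeholder hub API URL
TOKEN = "<admin-api-token>"                # placeholder admin token

# Pass the remove flag in the JSON body of the DELETE request
# (my reading of the handler code; whether it helps here is the question).
resp = requests.delete(
    f"{HUB_API}/users/603ba6b21b93e5086eec8eb0/server",
    headers={"Authorization": f"token {TOKEN}"},
    json={"remove": True},
)
print(resp.status_code, resp.text)
```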
I don’t think removing the default server makes a lot of sense at this point. Note that the remove option just removes the Spawner record from the database; it does not take additional actions to clean up resources, and there is no hook (yet) for that. The main effect it has is on editing the named server list on the Hub landing page, where the default server gets special treatment and is always treated as existing, regardless of whether it’s in the db or not.
This could change with hooks like Spawner.delete_forever, which will allow deleting persistent resources on user deletion or (still to be implemented) named server deletion.
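To sketch that direction (the hook name matches what’s discussed above, but the body is purely illustrative, not an existing implementation), a custom Spawner might look roughly like:

```python
from jupyterhub.spawner import Spawner


class ExampleSpawner(Spawner):
    """Illustrative subclass only; a real spawner also implements
    start/stop/poll."""

    async def delete_forever(self):
        # Called when the user (and eventually a named server) is being
        # deleted for good, not merely stopped: this is where persistent
        # resources (volumes, external DB rows, DNS entries, ...) would
        # be cleaned up.
        await self._delete_persistent_storage()  # hypothetical helper

    async def _delete_persistent_storage(self):
        # Placeholder for whatever external cleanup applies to this spawner.
        pass
```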
I was essentially hoping for some kind of force delete
This we don’t have yet.
For this particular issue, I think it’s a bug in jupyterhub and/or KubeSpawner that the pending stop was allowed to get stuck.
The most likely scenarios, I think:
1. kubespawner actually finished stopping and jupyterhub failed to clear the pending state (we used to have many of these issues, but they have become rarer over time)
2. kubespawner got stuck forever attempting the delete, never noticing that the delete had finished, and JupyterHub doesn’t handle the inconsistent, undefined state of stop never returning.
We can easily enough clear the pending state when stop never returns by adding a timeout here, but then what state are we in? Is stop still trying to clean things up? That’s why there isn’t a timeout there now: it’s unclear what unsafe thing jupyterhub should do (since every option is unsafe) in that situation. We can define a stop timeout and follow-up actions, but any time that timeout is hit we are in a pretty dangerous situation regarding pending transactions and unclean state. We could potentially even define a post_failed_stop() cleanup method on Spawners to call in this case, but ultimately I think that logic really belongs in try/except blocks in the Spawner.stop implementation itself.
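To make that last point concrete, this is roughly the shape I mean, with the cleanup living in the Spawner’s own stop path; the timeout knob and helper names here are made up for illustration, not existing JupyterHub or KubeSpawner options:

```python
import asyncio

from jupyterhub.spawner import Spawner


class ExampleSpawner(Spawner):
    # Illustrative only: not an existing config option.
    stop_timeout = 120

    async def stop(self, now=False):
        try:
            # Bound the teardown so a hung API call can't leave the Hub
            # stuck in "pending stop" forever.
            await asyncio.wait_for(self._teardown(now=now), self.stop_timeout)
        except asyncio.TimeoutError:
            # We are now in an unclean state: the resource may or may not
            # still exist. Do whatever best-effort, idempotent cleanup is
            # safe here, then let the failure propagate.
            await self._cleanup_after_failed_stop()
            raise

    async def _teardown(self, now=False):
        """The real teardown (e.g. deleting the pod) would live here."""

    async def _cleanup_after_failed_stop(self):
        """Best-effort cleanup for the timed-out case."""
```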
Restarting the Hub is the way to clear pending transactions and re-establish database consistency, so you did the right thing to recover from this state, lacking a more forceful stop option. With only one resource (the pod), it’s not too hard to recover from, but if there were other resources to clean up, it’s entirely possible they would get orphaned and need manual cleanup afterward.
Yeah, I think we’re going to add a script as a hub-managed service that will serve as our livenessProbe. It will hit the GET /health endpoint like z2jh does today and also check for pending servers whose last_activity is obviously old/wrong. For example, on the problem server we had earlier in the week, this was part of the user record:
The last_activity on the user was current (as of March 22) but the last_activity on the server was obviously stale. Given that, we can fail the liveness probe and automatically restart the hub rather than wait for a user to notify us of this issue.
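A rough sketch of what that extra check might look like (the staleness threshold, env vars, and packaging as a hub-managed service are all our choices, not anything JupyterHub prescribes):

```python
"""Extra liveness check: fail if any server has been pending with an
obviously stale last_activity, in addition to the normal GET /health."""
import os
import sys
from datetime import datetime, timedelta, timezone

import requests

HUB_URL = os.environ.get("HUB_URL", "http://127.0.0.1:8081/hub")
TOKEN = os.environ["HUB_API_TOKEN"]   # token for the hub-managed service
STALE = timedelta(hours=1)            # our threshold for "obviously stale"

headers = {"Authorization": f"token {TOKEN}"}

# 1. The normal health check z2jh already performs.
requests.get(f"{HUB_URL}/health", timeout=5).raise_for_status()

# 2. Look for servers stuck in a pending state with stale activity.
now = datetime.now(timezone.utc)
users = requests.get(f"{HUB_URL}/api/users", headers=headers, timeout=10).json()
for user in users:
    for server in (user.get("servers") or {}).values():
        pending = server.get("pending")
        last_activity = server.get("last_activity")
        if not (pending and last_activity):
            continue
        age = now - datetime.fromisoformat(last_activity.replace("Z", "+00:00"))
        if age > STALE:
            print(f"{user['name']}: pending {pending}, last_activity {age} ago")
            sys.exit(1)  # fail the livenessProbe so kubernetes restarts the hub
```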