On gitter, @betatim mentioned:
> which made me realise I don’t actually know what happens if the hub pod restarts: does it reconstruct a view of the world from the running pods and proxy routes or does everyone get disconnected or …?
so I thought I would sketch out what’s going on when the Hub restarts, to help folks have a good understanding of what it means to restart the Hub.
tl;dr: JupyterHub is designed so that restarting the Hub causes only a minor disruption of service. The question was asked in the context of Binder, which happens to be even more robust to Hub restarts: the only consequence for Binder users should be a possible increase in launch retries for launches that are pending while the Hub restarts.
Why would the Hub restart?
One of the reasons for making Hub restarts minimally disruptive is that updating any configuration requires restarting the Hub process. So if restarts are going to happen often, we don’t want them to be too disruptive! This also leads to the mantra of the z2jh deployment maintainer: when in doubt, delete the hub pod.
But how is that disruption minimized? Two things happen when the Hub is restarted:
- any state of the Hub will be reset, and
- the Hub is not running for a (hopefully brief) period
Hub state across restarts
Let’s start with the state. This is why JupyterHub stores its state in a database! Most of JupyterHub startup is reconstructing its state from the database and ensuring everything’s in order. This includes running servers, the proxy routing table, valid authentication cookies and tokens, etc. In an ideal case, restarting JupyterHub will result in no changes in this state at all.
note: the routing state of the proxy lives in the proxy itself. So while the Hub restarts (if the proxy is not also restarted), the routing table will remain the same as when the Hub stopped, and will be reconciled with running servers when the Hub comes back (picking up any changes that happened while the Hub was down, typically user servers stopping themselves). Conversely, since proxy state lives in the proxy, restarting the proxy will result in all connections being severed until the Hub notices and repopulates the routing table, typically on an interval of between 30 seconds and five minutes. Alternative proxy implementations such as traefik allow the routing table to persist across proxy restarts, eliminating this disruption as well.
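The reconciliation step can be sketched roughly like this. This is an illustrative simplification, not JupyterHub’s actual code: routes and servers are modeled as plain dicts mapping a URL prefix to a target, standing in for the proxy’s routing table and the servers the Hub finds in its database at startup.

```python
def reconcile(routes: dict, running_servers: dict) -> tuple[dict, dict]:
    """Return (routes_to_add, routes_to_delete).

    routes: what the proxy currently has (survives a Hub restart)
    running_servers: what the Hub reconstructed from its database
    """
    # Add or update routes for servers the proxy doesn't know about
    to_add = {
        path: target
        for path, target in running_servers.items()
        if routes.get(path) != target
    }
    # Delete routes whose server stopped while the Hub was down
    to_delete = {
        path: target
        for path, target in routes.items()
        if path not in running_servers
    }
    return to_add, to_delete

# Example: one server stopped itself while the Hub was down
proxy_routes = {"/user/ada": "http://10.0.0.5:8888", "/user/bob": "http://10.0.0.6:8888"}
db_servers = {"/user/ada": "http://10.0.0.5:8888"}
add, delete = reconcile(proxy_routes, db_servers)
print(add)     # {}
print(delete)  # {'/user/bob': 'http://10.0.0.6:8888'}
```

In the ideal case, as above, nothing needs to be added and only stale routes are pruned, so running users never notice the restart.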
What about already running servers?
The more pressing issue is what happens while the Hub is (briefly!) down, during which time things that the Hub does are not going to happen until it comes back. So what is it that the Hub does?
Let’s look at a diagram of the pieces of JupyterHub:
The first thing that’s definitely not going to work is anything where browsers talk directly to the Hub. That’s any request with /hub/ in the URL, so things like:
- new server launches
- stopping servers
- logging in/out
- accessing the admin interface, etc.
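The split between what breaks and what keeps working comes down to the URL prefix. Here is a hypothetical helper (the function name is mine, not JupyterHub’s) showing the distinction: anything under /hub/ needs the Hub process, while /user/&lt;name&gt;/… is routed by the proxy straight to a running server.

```python
def needs_hub(path: str, base_url: str = "/") -> bool:
    """Return True if this request path is handled by the Hub process itself."""
    hub_prefix = base_url.rstrip("/") + "/hub/"
    return path.startswith(hub_prefix)

print(needs_hub("/hub/spawn"))     # True  - fails while the Hub is down
print(needs_hub("/hub/login"))     # True  - fails while the Hub is down
print(needs_hub("/user/ada/lab"))  # False - the proxy routes it to the server
```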
Hub actions are designed to be infrequent: users should spend a relatively small fraction of the time starting servers, and most of the time using them once they are running.
Now, let’s consider those users with servers already running when the Hub goes down. When a browser makes a request to a running server, the proxy routes that request directly to the notebook server. The Hub is not involved! If the Hub is gone, the request will proceed completely uninterrupted. The Hub comes into play when the notebook server authenticates the request, which it does by initiating OAuth with the Hub. If this is a new browser session, that OAuth process will fail while the Hub is down. However, completed OAuth is stored in a cookie, so if the browser already authenticated before the Hub went down, the Hub is not consulted and the request completes uninterrupted.
So that’s the main interruption:
- requests to single-user servers that need new authentication will fail during OAuth
- requests to single-user servers from already-authenticated sessions will be fine
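That decision can be sketched as follows. This is a deliberately simplified model of the single-user server’s authentication flow (the names here are illustrative, not jupyterhub’s actual API): a valid session cookie short-circuits any contact with the Hub.

```python
class HubDown(Exception):
    """Raised when authentication requires the Hub but it is unreachable."""

def authenticate(request_cookies: dict, hub_is_up: bool) -> str:
    if "jupyterhub-session" in request_cookies:
        # Already-completed OAuth: the Hub is never consulted.
        return "ok: cached credentials"
    if not hub_is_up:
        # New browser session: the OAuth redirect to the Hub fails.
        raise HubDown("cannot complete OAuth while the Hub is down")
    return "ok: completed OAuth with the Hub"

print(authenticate({"jupyterhub-session": "abc"}, hub_is_up=False))  # fine even with the Hub down
print(authenticate({}, hub_is_up=True))   # new login succeeds
try:
    authenticate({}, hub_is_up=False)     # new login while Hub is down
except HubDown as e:
    print("failed:", e)
```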
What about Binder?
The original question was in the context of Binder, which adds a layer on top of JupyterHub. Critically, Binder users never make requests directly to the Hub, only to running servers. The only communication with the Hub comes from BinderHub itself, which makes API requests:
- create a user
- start their server
- wait for the server to be ready (ready means: running and with an active proxy route)
All of these requests will fail if the Hub goes down before they complete. So how does BinderHub handle this? BinderHub assumes that requests will sometimes fail, so each action is retried up to a few times before giving up. If the launch process as a whole fails, it starts again from the beginning, and the user sees the message:
> Launch attempt 1 failed, retrying…
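The retry-each-step strategy looks roughly like this. Again a sketch rather than BinderHub’s actual code: a simulated Hub that fails twice while restarting, then recovers, stands in for the real API calls.

```python
import time

def with_retries(action, attempts=3, delay=0.0):
    """Call action(), retrying on connection errors up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except ConnectionError:
            if attempt == attempts:
                raise  # give up; the launch as a whole will be retried
            time.sleep(delay)  # back off before the next attempt

# Simulate a Hub that is briefly down: the first two calls fail.
calls = {"n": 0}
def start_server():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("Hub is restarting")
    return "server ready"

print(with_retries(start_server))  # server ready
```

Because the retries absorb the Hub’s downtime, the user sees a slightly longer launch rather than an error.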
Additionally, since BinderHub bypasses JupyterHub authentication, there will be no disruptions at all in connecting to running notebook servers. So in practical terms, the only consequence of the Hub restarting for Binder users should be a (hopefully small) delay in launching new servers, but no errors or need for users themselves to retry actions.
The future: high(er) availability
You may have heard about “high availability” or HA. JupyterHub is not HA because it has two single points of failure (SPOFs): the Hub and the Proxy. The main relevant factor here is that restarting the Hub and/or proxy results in some full or partial downtime of the JupyterHub service. This problem has essentially been solved for the proxy by @GeorgianaElena. Switching to traefik allows the proxy itself to be highly available, eliminating it as a SPOF:
- multiple clones of the proxy can be running, so one failing does not disrupt service
- new instances of the proxy can be started before the previous one is retired, allowing zero-downtime upgrades
- proxy state is stored outside the proxy process (e.g. in a key-value store), so restarting the proxy does not lose the routing table
Giving the Hub process similar features is much more complicated, since there is a lot of in-memory state and assumptions that the Hub process “owns” the state database. Making the Hub itself able to be cloned or allowing “hot spares” is a significant undertaking, so our current target is minimizing disruption caused by Hub downtime, but eliminating it may be achievable in the future.