What happens when the Hub restarts?

On gitter, @betatim mentioned:

which made me realise i don’t actually know what happens if the hub pod restarts: does it reconstruct a view of the world from the running pods and proxy routes or does everyone get disconnected or …?

So I thought I would sketch out what goes on when the Hub restarts, to help folks understand what restarting the Hub actually means.

tl;dr: JupyterHub is designed so that restarting the Hub results in only a minor disruption of service. The question was asked in the context of Binder, which happens to be even more robust to Hub restarts: the only consequence for Binder users should be a possible increase in launch retries for launches pending while the Hub restarts.

Why would the Hub restart?

One of the reasons for making hub restarts minimally disruptive is that to update any configuration, the Hub process must be restarted. So if restarts are going to happen often, we don’t want them to be too disruptive! This also leads to the mantra of the z2jh deployment maintainer: When in doubt, delete the hub pod.

But how is that disruption minimized? Two things happen when the Hub is restarted:

  1. any state of the Hub will be reset, and
  2. the Hub is not running for a (hopefully brief) period

Hub state across restarts

Let’s start with the state. This is why JupyterHub stores its state in a database! Most of JupyterHub startup is reconstructing its state from the database and ensuring everything’s in order. This includes running servers, the proxy routing table, valid authentication cookies and tokens, etc. In an ideal case, restarting JupyterHub will result in no changes in this state at all.
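For example, in a plain jupyterhub_config.py (z2jh wires this up for you), pointing the Hub at a persistent database and a cookie secret on disk is roughly what keeps that state intact across restarts. A minimal sketch; the URL and path below are placeholders:

    # jupyterhub_config.py -- a minimal sketch; the URL and path are placeholders.

    # Persist Hub state (users, servers, tokens) in a database that survives
    # restarts. The default is a local sqlite file; an external PostgreSQL
    # database works too.
    c.JupyterHub.db_url = "postgresql://jupyterhub:PASSWORD@db.example.org:5432/jupyterhub"

    # Keep the cookie secret on disk so authentication cookies stay valid across restarts.
    c.JupyterHub.cookie_secret_file = "/srv/jupyterhub/jupyterhub_cookie_secret"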

Note: the routing state of the proxy lives in the proxy itself. So while the Hub restarts (assuming the proxy is not also restarted), the routing table remains the same as when the Hub stopped, and is then reconciled with the running servers (in case anything changed while the Hub was down, typically user servers stopping themselves). Conversely, since proxy state lives in the proxy, restarting the proxy severs all connections until the Hub notices and repopulates the routing table, typically on an interval of between 30 seconds and five minutes. Alternative proxy implementations such as traefik allow the routing table to persist across proxy restarts, eliminating this disruption as well.
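You can see that the routing table really lives in the proxy by querying CHP’s own API while the Hub is down. This is only a sketch: the API URL and where the token comes from are deployment-specific assumptions.

    import os

    import requests

    # CHP exposes its own REST API (separate from the Hub), authenticated with
    # the CONFIGPROXY_AUTH_TOKEN it shares with the Hub. The default API port
    # of 8001 is an assumption; adjust for your deployment.
    proxy_api = os.environ.get("CHP_API_URL", "http://127.0.0.1:8001")
    token = os.environ["CONFIGPROXY_AUTH_TOKEN"]

    resp = requests.get(
        f"{proxy_api}/api/routes",
        headers={"Authorization": f"token {token}"},
    )
    resp.raise_for_status()

    # Each key is a route spec such as /user/alice/, mapping to a target server.
    for route, info in resp.json().items():
        print(route, "->", info.get("target"))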

What about already running servers?

The more pressing issue is what happens while the Hub is (briefly!) down, during which the things the Hub normally does simply don't happen until it comes back. So what is it that the Hub does?

Let’s look at a diagram of the pieces of JupyterHub:

The first thing that’s definitely not going to work is anything where browsers talk directly to the Hub. That’s any request with /hub/ in the URL, so things like:

  1. new server launches
  2. stopping servers
  3. logging in/out
  4. accessing the admin interface, etc.

Hub actions are designed to be infrequent: users should spend a relatively small fraction of the time starting servers, and most of the time using them once they are running.

Now, let’s consider those users with servers already running when the Hub goes down. When a browser makes a request to JupyterHub for a running server, that request is routed directly to the notebook server by the proxy. The Hub is not involved! If the Hub is gone, that request will proceed completely uninterrupted. The Hub comes into play when the notebook server authenticates the request, which it does by initiating OAuth with the Hub. If this is a new browser, the OAuth process will fail while the Hub is down. However, a completed OAuth handshake is stored in a cookie, so if the browser already authenticated before the Hub went down, the Hub is not consulted and the request completes uninterrupted.

So that’s the main interruption:

  • requests to single-user servers that need new authentication will fail during OAuth
  • requests to single-user servers from already-authenticated sessions will be fine (see the sketch below)
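To make the two cases concrete, here is a rough sketch using requests; the hostname, cookie name and value, and endpoints are illustrative, not a real deployment:

    import requests

    # Illustrative URL of a running single-user server behind the proxy.
    server = "https://hub.example.org/user/alice"

    # Case 1: a browser that already completed OAuth carries a cookie set by the
    # single-user server (named along the lines of "jupyterhub-user-<name>"), so
    # its requests are served by the notebook server without consulting the Hub.
    authed = requests.Session()
    authed.cookies.set("jupyterhub-user-alice", "<existing oauth cookie>")
    r = authed.get(f"{server}/api/status")
    print(r.status_code)  # succeeds while the Hub is down, as long as the cookie is valid

    # Case 2: a fresh browser has no cookie, so the single-user server redirects it
    # to the Hub's OAuth endpoint under /hub/api/oauth2/authorize. While the Hub is
    # down, that redirect leads nowhere and the request fails.
    r = requests.get(f"{server}/api/status", allow_redirects=False)
    print(r.status_code, r.headers.get("Location"))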

What about Binder?

The original question was in the context of Binder, which adds a layer on top of JupyterHub. Critically, Binder users never make requests directly to the Hub, only to running servers. The only communication with the Hub comes from BinderHub itself, which makes API requests (sketched below):

  1. create a user
  2. start their server
  3. wait for the server to be ready (ready means: running and with an active proxy route)
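In terms of the JupyterHub REST API, those three steps look roughly like this. This is a sketch, not BinderHub’s actual code; the Hub URL, token, and username are placeholders, and the “ready” field is as reported by recent JupyterHub versions:

    import time

    import requests

    # Placeholders: BinderHub holds an admin-scoped API token for the Hub.
    hub_api = "http://hub:8081/hub/api"
    headers = {"Authorization": "token <JUPYTERHUB_API_TOKEN>"}
    user = "binder-abc123"

    # 1. create a (throwaway) user
    requests.post(f"{hub_api}/users/{user}", headers=headers).raise_for_status()

    # 2. start their server
    requests.post(f"{hub_api}/users/{user}/server", headers=headers).raise_for_status()

    # 3. poll the user model until the server is ready (running and routed by the proxy)
    while True:
        model = requests.get(f"{hub_api}/users/{user}", headers=headers).json()
        if model.get("servers", {}).get("", {}).get("ready"):
            break
        time.sleep(1)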

All of these requests will fail if the Hub goes down before they complete. So how does BinderHub handle this? It assumes that requests will sometimes fail, so each action is retried up to a few times before giving up! If the launch process as a whole fails, it starts again from the beginning, and the user will see the message:

Launch attempt 1 failed, retrying…
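The shape of that retry logic is roughly the following; launch_once and the attempt count are hypothetical stand-ins, not BinderHub’s real internals:

    def launch_with_retries(launch_once, attempts=3):
        """Retry a whole launch (create user, start server, wait for ready)."""
        for attempt in range(1, attempts + 1):
            try:
                return launch_once()
            except Exception:
                if attempt == attempts:
                    raise
                print(f"Launch attempt {attempt} failed, retrying…")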

Additionally, since BinderHub bypasses JupyterHub authentication, there will be no disruptions at all in connecting to running notebook servers. So in practical terms, the only consequence of the Hub restarting for Binder users should be a (hopefully small) delay in launching new servers, but no errors or need for users themselves to retry actions.

The future: high(er) availability

You may have heard about “high availability” or HA. JupyterHub is not HA because it has two single points of failure (SPOFs): the Hub and the proxy. The main relevant factor here is that restarting the Hub and/or the proxy results in some full or partial downtime of the JupyterHub service. This problem has essentially been solved for the proxy by @GeorgianaElena: switching to traefik allows the proxy itself to be highly available, eliminating it as a SPOF (see the configuration sketch after this list):

  1. multiple clones of the proxy can be running, so one failing does not disrupt service
  2. new instances of the proxy can be started before the previous one is retired, allowing zero-downtime upgrades
  3. proxy state is stored out-of-memory, so restarting the proxy does not lose the routing table
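With jupyterhub-traefik-proxy installed, opting in is a small configuration change along these lines. Treat it as a sketch: the entry-point name and the decision to run traefik externally are assumptions that may differ between releases and deployments.

    # jupyterhub_config.py -- sketch of switching to the traefik/etcd proxy.

    # Routes are stored in etcd rather than in the proxy process's memory, so
    # neither a proxy restart nor a Hub restart loses the routing table.
    c.JupyterHub.proxy_class = "traefik_etcd"

    # Run traefik as its own (possibly replicated) service instead of having the
    # Hub start and stop it.
    c.Proxy.should_start = False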

Giving the Hub process similar features is much more complicated, since there is a lot of in-memory state and assumptions that the Hub process “owns” the state database. Making the Hub itself able to be cloned or allowing “hot spares” is a significant undertaking, so our current target is minimizing disruption caused by Hub downtime, but eliminating it may be achievable in the future.


Great write-up, Min - thank you. We’ve been tackling HA/DR capabilities in Enterprise Gateway as well.

One of the reasons for making hub restarts minimally disruptive is that to update any configuration, the Hub process must be restarted.

You might be interested in this “dynamic configurations” PR. It was retracted in favor of the more minimal PR (which simply exposes the currently in-use config files and enables their reload), but it demonstrates what an application could do to avoid restarts on configuration changes.


This is a super write-up!

To transport a discussion from gitter to here: we pondered whether it would be possible, and not a huge amount of work, for the CHP to trigger a “restore my routing table” action from the Hub instead of having to wait 30 seconds to 5 minutes.

There is a REST API call for that (http://petstore.swagger.io/?url=https://raw.githubusercontent.com/jupyterhub/jupyterhub/master/docs/rest-api.yml#/default/post_proxy), but we’d have to have a shared secret between the Hub and the CHP, so it needs a bit of work to auto-create it in a deployment like Zero2JupyterHub. It might be worth instead helping speed along the transition to using traefik. But maybe someone would like to spend a few cycles on this.
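For reference, if I read the API right, the call in question would look something like this from the CHP’s side, assuming it somehow held a Hub API token (which is exactly the shared secret that’s missing today); the URL and token are placeholders:

    import requests

    # POST /hub/api/proxy asks the Hub to re-check its routes against the proxy
    # right away, instead of waiting for the next periodic check.
    hub_api = "http://hub:8081/hub/api"
    headers = {"Authorization": "token <JUPYTERHUB_API_TOKEN>"}

    requests.post(f"{hub_api}/proxy", headers=headers).raise_for_status()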

1 Like

That’s very cool! JupyterHub is so heavily configurable that I’m not confident we could hot-load configuration without restarting the process; many components are not designed to stop what they are doing and start again, and instead rely on a process restart. We’d need to track the effects of configuration much more rigorously (not a bad thing to do in general) in order to allow that.

I agree that new/experimental functionality like that should be facilitated by traitlets, but not merged in until it’s proven a bit more in the wild.

Yeah, I agree that retrofitting something like this is challenging compared to knowing there are dynamic configuration capabilities from the start and designing with that in mind. Configuration pieces can be like herding cats, so carefully managing those pieces, knowing they can be updated, is helpful.

I’m curious about this comment since it touches on one of my frustrations:

I agree that new/experimental functionality like that should be facilitated by traitlets, but not merged in until it’s proven a bit more in the wild.

How can new functionality be “proven in the wild” prior to its merge? If there’s the ability to opt in to a new feature, such that the new functionality is not exposed by default, why can’t it be merged so that the new functionality can be “proven in the wild”?

I mean implementing a feature in a package that wants to use it (e.g. gateway) and working out the kinks, proving that it’s valuable, then some day promoting it up to traitlets so that it can be shared by other packages that use traitlets. This has happened several times with widgets, the heaviest user of traitlets, where features are developed in widgets and eventually promoted to traitlets. The same has been true of configuration patterns in notebook and jupyterhub, but they are always developed first in the packages that want to use them. Very few features of traitlets can only be implemented in the core of traitlets; they can generally be implemented downstream in subclasses, etc.

why can’t it be merged so that the new functionality can be “proven in the wild”?

It can; it’s just that traitlets is not the place. Every new feature has a maintenance cost. It doesn’t matter whether it’s opt-in or not: once it’s released, there are new backward-compatibility and maintenance concerns for the rest of time. Introducing a feature in traitlets before folks have had a chance to try it out means that real-world use can’t refine the API first. And once it has been released, the API cannot be refined except in a backward-compatible way for a long time, to meet traitlets’ compatibility commitments. It makes a lot more sense for folks who want the feature to develop it in their own package as a private implementation detail that can change quickly while it’s being figured out, before proposing that the already-proven functionality belongs in such a fundamental package with extremely strict compatibility requirements and a long time between releases.

Thanks for the response, Min, and I apologize for the digression. I completely agree about the API and compatibility costs and commitments. Private implementations make sense for direct relationships, but when the target change is multiple layers deep in a class hierarchy spanning projects, or the functionality is systemic (like, for example, asynchronous kernel management support), the desired functionality can’t necessarily be implemented privately. Would you say that changes of this magnitude would then happen at a major release boundary (for all projects involved in the change)?

Yes, absolutely! Not everything can be done in subclasses, but I think most things can and should be if and when it is possible. I think this is one such case, but I’m not certain. With every new feature to traitlets, I start with: can this be done outside traitlets? And if not, what minimal change is missing from traitlets to enable it?

The goal of the Jupyter split is that there will never be coordinated releases, so hopefully there won’t ever need to be major releases of multiple projects at once. No release of traitlets, even a major one, should be allowed to break a supported release of another Jupyter project. The same goes for jupyter-client, etc. So we’d need to release the new feature in a non-breaking way, then adopt it as a requirement downstream when we can or get to it. This might constitute a big enough change to warrant a major version bump for each project; I’m not sure.
