Core component resilience/reliability

I’m setting up JupyterHub on K8s in a large-scale enterprise environment and anticipate thousands of concurrent users.

Is there any documentation on improving the resilience/reliability of the core components (proxy, hub, and spawner)?

I’d like to make sure that, at a bare minimum, I have at least one backup pod for each core component.

I’m also curious about what the common failure points are when traffic exceeds a certain threshold.

Thank you!

One day I hope to write up a doc about this, specifically for using zero-to-jupyterhub-k8s, but until then there are some recent(ish) related threads that might help you get started [1][2][3][4][5].

Since the hub does not (natively) support HA [6], you can’t run multiple replicas of it and scale horizontally that way. And because it’s a single Python process, the hub (with KubeSpawner running in the same process) effectively gets one CPU, so keep an eye on CPU usage. To keep CPU usage down and API response times low, you will likely need to tune the various config options related to reporting notebook activity, so that your thousands of users and notebook pods aren’t storming the hub API with activity updates and DB writes, which consume CPU and starve the hub.
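To give a feel for the scale involved, here is a back-of-the-envelope sketch with illustrative numbers (not measurements from any deployment): each running single-user server reports activity to the hub API on an interval (JUPYTERHUB_ACTIVITY_INTERVAL, 5 minutes by default if I remember right), so the steady-state request rate grows linearly with the number of active servers and shrinks as you lengthen that interval.

```python
# Back-of-envelope only; the numbers are illustrative, not measurements.
def activity_requests_per_second(active_servers: int, interval_seconds: int) -> float:
    """Approximate steady-state rate of activity reports hitting the hub API."""
    return active_servers / interval_seconds

# Each report is an API request plus potential DB writes on the hub side.
print(activity_requests_per_second(3000, 300))   # ~10 req/s at a 5-minute interval
print(activity_requests_per_second(3000, 3600))  # <1 req/s at a 1-hour interval
```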

You will also want to keep an eye on the cull-idle script if you have thousands of users on a single hub. In our case we reduced its concurrency to 1 to lessen its load on the API, set the timeout to 5 days, and run it every hour; the notebooks cull themselves (and delete the pod) after an hour of inactivity anyway. We tuned the cull-idle that way because we also have it configured to cull users. A GET /users request with thousands of users can currently take a while because of the lack of paging and server-side filtering in the hub DB [7].
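For reference, here is a minimal sketch of that cull tuning, written against the standalone jupyterhub-idle-culler package as a hub-managed service in jupyterhub_config.py; in zero-to-jupyterhub the same knobs live under the cull: section of the Helm values (cull.timeout, cull.every, cull.concurrency, cull.users). The values just mirror the settings described above, they aren’t recommended defaults.

```python
# Sketch of the cull-idle settings described above (illustrative, not a recommendation).
import sys

c.JupyterHub.services = [
    {
        "name": "idle-culler",
        "command": [
            sys.executable, "-m", "jupyterhub_idle_culler",
            "--timeout=432000",   # cull after 5 days of inactivity
            "--cull-every=3600",  # run the culler once per hour
            "--concurrency=1",    # one API request at a time to limit load on the hub
            "--cull-users",       # also remove idle users, not just their servers
        ],
    },
]
# Depending on your JupyterHub version the culler also needs admin rights or an
# idle-culler role so it can list and delete users/servers via the API.
```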

[1] Identifying JupyterHub api performance bottleneck
[2] Scheduler "insufficient memory.; waiting" errors - any suggestions?
[3] Minimum specs for JupyterHub infrastructure VMs?
[4] Background for JupyterHub / Kubernetes cost calculations?
[5] Confusion of the db instance
[6] https://github.com/jupyterhub/jupyterhub/issues/1932
[7] https://github.com/jupyterhub/jupyterhub/issues/2954

There are a couple of specific things I can point out here if you’re using zero-to-jupyterhub-k8s:

  1. The hub API will return a 429 response with a Retry-After header if you’ve hit the concurrentSpawnLimit. We see that happening at the start of a large user event, so make sure client-side tooling can handle that 429 response and retry appropriately (see the sketch after this list).
  2. If you hit the consecutiveFailureLimit, the hub will crash. Kubernetes should restart the hub pod, but depending on how many users you have in the database and how your cull-idle service is set up (it runs on hub restart), the restart could take longer than you want. In our experience, as long as we have the notebook images pre-pulled on the user nodes and enough idle placeholders pre-created for a large user event, we don’t hit the consecutive failure limit. See [1] for more details.
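
For item 1, here is a minimal sketch of what handling the 429 on the client side can look like, talking to the hub REST API directly. The hub URL, token, and retry policy are placeholders, not anything from a real deployment.

```python
# Sketch: start a user's server via the hub REST API, backing off politely when
# the hub returns 429 because concurrentSpawnLimit has been reached.
import time
import requests

HUB_API = "https://hub.example.com/hub/api"  # placeholder
API_TOKEN = "..."                            # placeholder hub API token

def start_server(username: str, max_attempts: int = 5) -> requests.Response:
    headers = {"Authorization": f"token {API_TOKEN}"}
    resp = None
    for attempt in range(max_attempts):
        resp = requests.post(f"{HUB_API}/users/{username}/server", headers=headers)
        if resp.status_code != 429:
            return resp
        # The hub's Retry-After header says how long to wait before retrying.
        wait = int(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    return resp
```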

[1] Optimizations — Zero to JupyterHub with Kubernetes documentation

@mriedem, thank you so much for the incredibly in-depth response. I support your decision to create consolidated documentation on this topic and am happy to help in any way that I can.

Would you feel comfortable providing an approximate upper limit (or range) of concurrent pods before performance degrades?

The upper limit depends on a few things that Matt mentioned. Activity tracking and KubeSpawner were the biggest performance issues we’ve seen so far. Increasing hub_activity_interval and activity_resolution helped. We also saw a good improvement from changing last_activity_interval [1]. If you’re using zero-to-jupyterhub, it sets that value to 1 minute, which is far too frequent; it had a noticeable effect on performance until we changed it back to the default of 5 minutes.
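For concreteness, here is roughly how those settings look in jupyterhub_config.py; in zero-to-jupyterhub they can be set through the hub’s extra configuration and the single-user environment in the Helm values. The specific numbers are only examples, not a recommendation.

```python
# Activity-related settings discussed above (illustrative values).

# How often the hub refreshes last-activity from the proxy; the JupyterHub
# default is 300s (5 minutes), which z2jh overrode to 60s at the time.
c.JupyterHub.last_activity_interval = 300

# Activity timestamps closer together than this many seconds are treated as
# equal, which avoids a DB write for every tiny update.
c.JupyterHub.activity_resolution = 600

# How often each single-user server reports its activity back to the hub
# (the singleuser-side hub_activity_interval, set via an environment variable).
c.Spawner.environment = {"JUPYTERHUB_ACTIVITY_INTERVAL": "3600"}
```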

We also saw great improvements by making some changes to the kubespawner that are detailed here [2].

All of that is a long way of saying I don’t know exactly what the upper limit is. With the stock kubespawner we saw performance problems at ~1000 pods. With those issues fixed we’ve scaled up to 3000 pods without any issue, and we could likely go higher with more Kubernetes nodes. Steady-state performance seems to be dominated by the various activity interval settings: the less often you update that information, the more concurrent pods you can support.

Posting the rest of my links as it wouldn’t let me add them all to the previous comment.

[1] https://jupyterhub.readthedocs.io/en/stable/api/app.html#jupyterhub.app.JupyterHub.last_activity_interval
[2] https://github.com/jupyterhub/kubespawner/issues/423

@rmoe 3000 pods without issue is music to my ears. That will cover us for a long time, and gives us more than enough runway to implement the two-hub + router solution mentioned in one of the GitHub issues.

I can’t help but wonder if there’s an opportunity to follow the paradigm that Dask Gateway adopted and support two implementations: one that interfaces with a standalone database for backends that require it, and one that uses a backend-native database. Instead of managing state on its own, Dask Gateway extends the Kubernetes API with a CRD and relies on etcd for state persistence.

This encapsulates much of those changes.

Obviously gargantuan in scope, but figured it was worth mentioning.

Thanks a ton for the follow-up. I’m really looking forward to contributing upstream in the not-so-distant future.

I’m very sorry to barge in on this thread, but I’m working through your excellent suggestions and am now trying to work out the canonical way to upgrade kubespawner to the development version with your fix from https://github.com/jupyterhub/kubespawner/issues/423. Do I have to make my own Helm chart for that, and my own Helm chart repository?

Sorry to punish you for all your helpful advice.

You can use the latest dev version of zero-to-jupyterhub (from https://jupyterhub.github.io/helm-chart/#development-releases-jupyterhub). It includes kubespawner 0.13, which has the PR mentioned here.

@rmoe’s PR is beautiful, look at the CPU usage reduction here:

The change was deployed on 09/05 sometime, and you can see the big difference it makes.

More importantly, you can see the change in response latencies.

We were encountering many, many requests with 1s+ latencies! This basically made the hub unavailable: it was dropping requests on the floor, so many requests didn’t even make it to the hub.

UC Berkeley’s infra is now stable thanks to @rmoe’s work. THANK YOU

Thanks - that’s very helpful.

Just to add, since I explored a bit more: I could have found the kubespawner version in the latest Helm chart by looking at https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/cd1eff7a5f093453de4d5868d1c4148044c0db23/images/hub/requirements.txt

Wow thank you for sharing this @yuvipanda and thank you @rmoe for your work! :heart: :tada:!