One day I hope to write up a doc about this, specifically for using zero-to-jupyterhub-k8s, but until then there are some recent(ish) related threads that might help you get started [1][2][3][4][5].
Since the hub does not (natively) support HA [6], you can’t run multiple replicas of it and scale horizontally that way. And since it’s a single Python process, you get one CPU for the hub (and KubeSpawner, which runs in the same process), so keep an eye on CPU usage. To keep CPU usage down and API response times low, you will likely need to tune various config options related to reporting notebook activity so that your thousands of users and notebook pods aren’t storming the hub API with activity updates and DB writes, which will consume CPU and starve the hub.
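As a hedged sketch of the notebook-side half of that tuning: newer single-user servers read the JUPYTERHUB_ACTIVITY_INTERVAL environment variable to decide how often to report activity back to the hub, and you can set it from the spawner. The 600-second value below is illustrative, not a recommendation.

```python
# jupyterhub_config.py -- illustrative sketch, not our exact settings.
# Make each notebook pod report "I'm still active" to the hub less often,
# so thousands of pods don't hammer the hub API with activity updates.
c.Spawner.environment = {
    # Seconds between activity reports from the single-user server to the hub.
    # The JupyterHub default is 300; larger means fewer API calls and DB writes.
    "JUPYTERHUB_ACTIVITY_INTERVAL": "600",
}
```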
You will also want to keep an eye on the cull-idle script if you have thousands of users on a single hub. In our case we changed its concurrency to 1 to reduce its load on the API, set the timeout to 5 days, and run it every hour, though the notebooks will cull themselves (and delete the pod) after an hour of inactivity. We keep the cull-idle timeout that much higher because we also have it configured to cull users. Doing a GET /users request with thousands of users can currently take a while because of the lack of paging and server-side filtering in the hub DB [7].
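For reference, here’s roughly what that culler setup looks like if you register jupyterhub-idle-culler as a hub-managed service yourself (a sketch mirroring the numbers above; in zero-to-jupyterhub-k8s the same knobs are exposed through the chart’s cull options):

```python
# jupyterhub_config.py -- sketch of the cull-idle settings described above.
import sys

c.JupyterHub.services = [
    {
        "name": "idle-culler",
        "admin": True,  # pre-2.0 way to let the culler list and delete users
        "command": [
            sys.executable, "-m", "jupyterhub_idle_culler",
            "--timeout=432000",   # 5 days of inactivity before culling
            "--cull-every=3600",  # check once an hour
            "--concurrency=1",    # serialize API calls to reduce hub load
            "--cull-users",       # also remove idle users, not just their servers
        ],
    },
]
```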
There are a couple of specific things I can point out here if you’re using zero-to-jupyterhub-k8s:
The hub API will return 429 responses with a Retry-After header if you’ve hit the concurrentSpawnLimit. We see that happening at the start of a large user event, so just make sure client-side tooling can handle that 429 response and retry appropriately.
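If you have tooling that drives the hub REST API directly (e.g. pre-spawning servers for an event), handling that can be as simple as the following. This is a hypothetical sketch rather than our actual tooling; the function name, token handling, and retry count are made up for illustration.

```python
# Hypothetical client-side helper that honors 429 + Retry-After from the hub
# when the concurrentSpawnLimit is hit.
import time
import requests

def hub_post(url, token, json=None, max_retries=10):
    """POST to the hub API, backing off when the hub says it's too busy."""
    headers = {"Authorization": f"token {token}"}
    for _ in range(max_retries):
        resp = requests.post(url, headers=headers, json=json)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Hub is throttling spawns; wait as long as it asks (or a default).
        wait = int(resp.headers.get("Retry-After", 30))
        time.sleep(wait)
    raise RuntimeError(f"Gave up after {max_retries} retries on {url}")
```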
If you hit the consecutiveFailureLimit, the hub will crash. Kubernetes should restart the hub pod, but that still means a hub restart, and depending on how many users you have in the database and how your cull-idle service (which runs on hub restart) is set up, the restart could take longer than you want. In our experience, as long as we have notebook images pre-pulled on the user nodes and enough idle placeholders pre-created for a large user event, we don’t suffer from the consecutive failure limit issue. See [1] for more details.
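For context, these are (as far as I know) the hub traitlets behind those two chart values; the numbers here are purely illustrative:

```python
# jupyterhub_config.py -- illustrative values only, not recommendations.

# How many users may be mid-spawn at once; beyond this the hub API
# answers spawn requests with 429 + Retry-After.
c.JupyterHub.concurrent_spawn_limit = 64

# How many spawns may fail in a row before the hub gives up and exits
# (Kubernetes then restarts the hub pod). 0 disables the check.
c.Spawner.consecutive_failure_limit = 5
```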
@mriedem, thank you so much for the incredibly in-depth response. I support your decision to create consolidated documentation on this topic and am happy to help in any way that I can.
Would you feel comfortable providing an approximate upper limit (or range) of concurrent pods before performance degrades?
The upper limit depends on a few things that Matt mentioned. Activity tracking and KubeSpawner were the biggest performance issues we’ve seen so far. Increasing hub_activity_interval and activity_resolution helped. We also saw a good improvement from changing last_activity_interval [1]. If you’re using zero-to-jupyterhub, it sets that value to 1 minute, which is way too frequent; it had a noticeable effect on performance until we changed it back to the default of 5 minutes.
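Concretely, the hub-side settings look something like this (a sketch with illustrative values; hub_activity_interval itself is a single-user-server setting, set per notebook pod as in the JUPYTERHUB_ACTIVITY_INTERVAL example earlier in the thread):

```python
# jupyterhub_config.py -- hub-side activity knobs (illustrative values).

# Only persist an activity update if it advances the stored timestamp by at
# least this many seconds; coarser resolution means far fewer DB writes.
c.JupyterHub.activity_resolution = 600

# How often the hub asks the proxy for last-activity info on its routes.
# zero-to-jupyterhub was setting this to 60s; the JupyterHub default is 300s.
c.JupyterHub.last_activity_interval = 300
```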
We also saw great improvements by making some changes to the kubespawner that are detailed here [2].
All of that is a long way of saying I don’t know exactly what the upper limit is. With the stock KubeSpawner we saw performance problems at ~1000 pods. With those issues fixed we’ve scaled up to 3000 pods without any issue, and we could likely go higher with more Kubernetes nodes. Steady-state performance seems to be dominated by the various activity interval settings: the less often you update that information, the more concurrent pods you can support.
@rmoe 3000 pods without issue is music to my ears. That will cover us for a long time, more than enough to implement the two-hub + router solution mentioned in one of the GH issues.
I can’t help but wonder if there’s an opportunity to follow the paradigm that Dask Gateway adopted and provide support for two implementations: one that interfaces with a standalone database for backends that require it, and one that uses a backend-native database. Instead of managing state on its own, Dask Gateway extends the Kubernetes API with a CRD and relies on etcd for state persistence.
I’m very sorry to barge in on this thread, but I’m working through your excellent suggestions and am now trying to work out the canonical way to upgrade kubespawner to the development version with your fix from https://github.com/jupyterhub/kubespawner/issues/423. Do I have to make my own Helm chart for that? And my own Helm chart repository?
We were encountering many, many requests at 1s+ latencies! This basically made the hub unavailable; it was dropping requests on the floor, so many requests didn’t even make it to the hub.
UC Berkeley’s infra is now stable thanks to @rmoe’s work. THANK YOU