I’ve been working on a script that does just this against the hub in our testing environment. If I can get the script scrubbed of internal details I could post it on GitHub.
I’m mostly interested in pushing the limits of how many users/pods we can have running at a time before the hub crashes. For my scale testing I’m leveraging the profile_list feature in KubeSpawner and z2jh [1] with a "micro" profile [2] for tiny pods. What I found after doing this is that in our case the size of the pods doesn’t matter as much, because there is a hard limit of 110 pods per user node in the Kubernetes service in our (IBM) cloud (I thought I could pack ~500 micro pods onto a 32GB RAM node, but nope!). So I have to scale up the user-placeholder replicas before running the load test so that we have enough user worker nodes ready to go; otherwise the hub starts tipping over on the consecutive_failure_limit.
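For reference, scaling the placeholders just means resizing the user-placeholder StatefulSet that z2jh creates. Here's a minimal sketch using the kubernetes Python client; the namespace and replica count are placeholders, not our real values, and a plain kubectl scale does the same thing:

# Sketch: pre-warm user nodes by scaling the z2jh user-placeholder StatefulSet
# before starting the load test. Namespace and replica count are assumptions.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
apps = client.AppsV1Api()
apps.patch_namespaced_stateful_set_scale(
    name="user-placeholder",
    namespace="jhub",                  # assumption: whatever namespace z2jh is deployed in
    body={"spec": {"replicas": 200}},  # assumption: enough placeholders to hold the target pod count
)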
We’re using a PostgreSQL 12 database and the default configurable-http-proxy setup from z2jh.
One thing I’ve noted elsewhere is that we needed to coarsen c.JupyterHub.activity_resolution (i.e. raise the value so last-activity updates are written less often) to keep CPU usage down on the hub once we get several hundred notebook pods.
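For concreteness, that setting looks roughly like the line below (in jupyterhub_config.py, or hub.extraConfig in the z2jh values). The default is 30 seconds; the 10-minute value here is just an illustrative assumption, not our exact number:

# Coarser activity resolution: the hub skips last_activity writes that are less than
# this many seconds newer than the stored value, which cuts database/CPU load.
c.JupyterHub.activity_resolution = 600  # assumption: example value; the default is 30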
Today I got up to 1700 active pods with only 2 failed server starts (they hit the 5 minute start_timeout from z2jh). That’s not our limit yet; I’ve just run out of time to keep running this thing today.
Some other notes on my testing (the settings below are pulled together into a config sketch after this list):
- Set c.JupyterHub.init_spawners_timeout = 1 so we’re not waiting for that on hub restarts.
- Set c.KubeSpawner.k8s_api_threadpool_workers = c.JupyterHub.concurrent_spawn_limit, though this doesn’t seem to make much difference [3].
- I’m setting c.NotebookApp.shutdown_no_activity_timeout and c.MappingKernelManager.cull_idle_timeout to 8 hours (up from our normal 1 hour) so that notebook pods aren’t killing themselves off while I’m scaling up in bursts; I want to see the hub’s response rate and CPU usage at steady state.
- We’ve overridden the cull.concurrency value from 10 to 1 so we’re not hammering the hub API with large GET /users requests when we have hundreds of users [4].
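Taken together, the settings above look roughly like the sketch below. This is illustrative, not our literal config: the concurrent_spawn_limit value is an assumption, the notebook settings live in the single-user image rather than the hub, and cull.concurrency is a z2jh chart value rather than a Python setting at all.

# --- hub side (jupyterhub_config.py, or hub.extraConfig in the z2jh values) ---
c.JupyterHub.init_spawners_timeout = 1  # don't block hub restarts on checking existing spawners

# Match KubeSpawner's API threadpool to the spawn limit (see [3]);
# the spawn limit value here is an assumption, not our exact number.
c.JupyterHub.concurrent_spawn_limit = 64
c.KubeSpawner.k8s_api_threadpool_workers = c.JupyterHub.concurrent_spawn_limit

# --- single-user side (jupyter_notebook_config.py in the user image) ---
c.NotebookApp.shutdown_no_activity_timeout = 8 * 60 * 60   # 8 hours, in seconds
c.MappingKernelManager.cull_idle_timeout = 8 * 60 * 60     # 8 hours, in seconds

# cull.concurrency is set in the z2jh chart values (cull: concurrency: 1) and is
# passed through to the jupyterhub-idle-culler service; noted here only for completeness.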
Something I need to do next is run py-spy while the hub is in steady state with a large number of pods running (1K+) to see where time is being spent and look for optimizations.
The script has to handle retrying on 429 responses because I hit concurrent_spawn_limit frequently during scale-up, but the script is Python and that’s easy to configure with the requests/urllib3 libraries.
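For anyone who wants to do the same, this is roughly what that retry wiring looks like. It’s a sketch, not my actual script; the hub URL, token, and retry numbers are placeholders:

# A requests session that automatically retries on 429 (Too Many Requests),
# which the hub returns when concurrent_spawn_limit is hit. Values are placeholders.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=10,
    backoff_factor=1,                 # 1s, 2s, 4s, ... between attempts
    status_forcelist=[429],           # retry only on "slow down" responses
    allowed_methods=["GET", "POST", "DELETE"],  # spawn/stop calls are POST/DELETE
    respect_retry_after_header=True,  # honor the hub's Retry-After hint
)

session = requests.Session()
session.headers["Authorization"] = "token <api-token>"   # placeholder
session.mount("https://", HTTPAdapter(max_retries=retry))
session.mount("http://", HTTPAdapter(max_retries=retry))

# Example: request a server start for a user via the hub REST API (placeholders).
resp = session.post("https://<hub-host>/hub/api/users/<name>/server")
resp.raise_for_status()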
I was hoping to do a more thoughtful write-up on this at some point with findings and results, but I’m still in the process of doing that; I couldn’t resist replying to your question given the overlap with what I’m currently doing. Again, if I can get this script scrubbed a bit I’m happy to share it on GitHub, but it might be a while.
[1] https://zero-to-jupyterhub.readthedocs.io/en/latest/customizing/user-environment.html#using-multiple-profiles-to-let-users-select-their-environment
[2] my micro profile:
{
    "display_name": "micro",
    "slug": "micro",
    "description": "Useful for scale testing a lot of pods",
    "default": true,
    "kubespawner_override": {
        "cpu_guarantee": 0.015,
        "cpu_limit": 1,
        "mem_guarantee": "64M",
        "mem_limit": "1G"
    }
}
[3] Consider defaulting k8s_api_threadpool_workers to c.JupyterHub.concurrent_spawn_limit (jupyterhub/kubespawner#419): https://github.com/jupyterhub/kubespawner/issues/419
[4] cull_idle_servers causes hub to go unresponsive in environments with 50,000 users (jupyterhub/jupyterhub#2954): https://github.com/jupyterhub/jupyterhub/issues/2954