I’ve been working on a script that does just this against the hub in our testing environment. If I can get the script scrubbed of internal details I could post it on GitHub.
I’m mostly interested in pushing the limits of how many users/pods we can have running at a time before the hub crashes. So for my scale testing I’m leveraging the profile_list feature in KubeSpawner and z2jh with a micro profile for tiny pods. What I found after doing this is that in our case the size of the pods doesn’t matter as much, because there is a hard limit of 110 pods per user node for the K8S service in our (IBM) cloud (I thought I could pack ~500 micro pods onto a 32GB RAM node, but nope!). So I have to scale up the user-placeholder replicas before running the load test so that we have enough user worker nodes ready to go; otherwise the hub starts tipping over on the consecutive_failure_limit.
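For reference, a micro profile via KubeSpawner’s profile_list looks roughly like this — a minimal sketch, where the resource numbers are illustrative rather than our exact values:

```python
# jupyterhub_config.py fragment (sketch): a tiny "micro" profile so each
# user pod requests/limits almost nothing. Resource values are illustrative.
c.KubeSpawner.profile_list = [
    {
        "display_name": "micro",
        "description": "Useful for scale testing a lot of pods",
        "kubespawner_override": {
            "cpu_guarantee": 0.015,
            "cpu_limit": 0.25,
            "mem_guarantee": "64M",
            "mem_limit": "128M",
        },
    },
]
```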
We’re using a PostgreSQL 12 database and the default configurable-http-proxy setup from z2jh.
One thing I’ve noted elsewhere is that we needed to turn down c.JupyterHub.activity_resolution to keep CPU usage down on the hub once we get several hundred notebook pods.
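That setting is the resolution (in seconds) at which user last-activity timestamps get persisted to the database; a coarser resolution means fewer writes. A sketch, with an illustrative value:

```python
# jupyterhub_config.py fragment (illustrative value): only persist a user's
# last-activity timestamp when it has moved by more than this many seconds,
# which cuts down on database writes with hundreds of active servers.
c.JupyterHub.activity_resolution = 600  # default is 30
```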
Today I got up to 1700 active pods with only 2 failed server starts (they hit the 5 minute start_timeout from z2jh). That’s not our limit yet; I’ve just run out of time to keep running this thing today.
Some other notes on my testing:
- c.JupyterHub.init_spawners_timeout = 1 so we’re not waiting for that on hub restarts.
- c.KubeSpawner.k8s_api_threadpool_workers = c.JupyterHub.concurrent_spawn_limit, though this doesn’t seem to make much difference.
- I’m setting c.MappingKernelManager.cull_idle_timeout to 8 hours (up from our normal 1 hour) just so that notebook pods aren’t killing themselves while I’m scaling up in bursts, because I want to see the hub response rate and CPU resource usage in steady state.
- We’ve overridden the cull.concurrency value from 10 to 1 just so we’re not hammering the hub API with large GET /users requests when we have hundreds of users.
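Pulling those tweaks together, the config looks roughly like this. Note the settings live in different places — a sketch, not our exact files:

```python
# Sketch of the settings above. The first two go in jupyterhub_config.py
# (hub side); cull_idle_timeout is single-user notebook server config;
# cull.concurrency is a z2jh helm chart value, not a traitlet.
c.JupyterHub.init_spawners_timeout = 1  # don't block hub restarts checking spawners
c.KubeSpawner.k8s_api_threadpool_workers = c.JupyterHub.concurrent_spawn_limit
c.MappingKernelManager.cull_idle_timeout = 8 * 60 * 60  # seconds (8 hours)
```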
Something I need to do next is run py-spy while the hub is in steady state with a large number of pods running (1K+) to see where time is being spent and look for optimizations.
The script has to handle retrying on 429 responses because I hit concurrent_spawn_limit frequently during scale-up, but the script is Python and that’s easy to configure with the requests/urllib3 libraries.
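The retry setup can be sketched like this — the names here are illustrative, not the actual load-test script:

```python
# Sketch: a requests session that transparently retries on 429, which is
# what the hub returns when concurrent_spawn_limit is hit.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(token):
    retry = Retry(
        total=10,
        backoff_factor=1,                 # exponential backoff between attempts
        status_forcelist=[429],           # retry on "too many spawns in flight"
        allowed_methods=["GET", "POST", "DELETE"],  # POST isn't retried by default
        respect_retry_after_header=True,  # honor the hub's Retry-After header
    )
    session = requests.Session()
    session.headers["Authorization"] = f"token {token}"
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


session = make_session("dummy-token")
```

With this in place the rest of the script can just call session.post(...) against the hub REST API and let urllib3 absorb the 429s.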
I was hoping to do a more thoughtful write-up on this at some point with findings and results, but I’m still in the process of doing it — though I couldn’t resist replying to your question given the overlap with what I’m currently doing. Again, if I can get this script scrubbed a bit I’m happy to share it on GitHub, but it might be a bit.
"description": "Useful for scale testing a lot of pods",