Scheduler "insufficient memory.; waiting" errors - any suggestions?

I’ve been working on a script that does just this against the hub in our testing environment. If I can get the script scrubbed of internal details, I could post it on GitHub.

I’m mostly interested in pushing the limits of how many users/pods we can have running at a time before the hub crashes. So for my scale testing I’m leveraging the profile_list feature in KubeSpawner and z2jh [1] with a micro profile [2] for tiny pods. What I found is that in our case the size of the pods doesn’t matter as much, because there is a hard limit of 110 pods per user node in the Kubernetes service in our (IBM) cloud (I thought I could pack ~500 micro pods onto a 32GB RAM node, but nope!). So I have to scale up the user-placeholder replicas before running the load test so that enough user worker nodes are ready to go; otherwise the hub starts tipping over on the consecutive_failure_limit.

We’re using a PostgreSQL 12 database and the default configurable-http-proxy setup from z2jh.

One thing I’ve noted elsewhere is that we needed to tune c.JupyterHub.activity_resolution (raising it so activity updates are written less often) to keep CPU usage down on the hub once we get several hundred notebook pods.
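For reference, that’s a one-line change in jupyterhub_config.py; the value below is illustrative, not necessarily what we settled on:

  # Coarser activity resolution means the hub records last-activity updates
  # (and writes to the database) less often. The default is 30 seconds.
  c.JupyterHub.activity_resolution = 600  # illustrative value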

Today I got up to 1700 active pods with only 2 failed server starts (they hit the 5 minute start_timeout from z2jh). That’s not our limit yet; I’ve just run out of time to keep running this thing today. :slight_smile:

Some other notes on my testing (a config sketch pulling these together follows the list):

  • Set c.JupyterHub.init_spawners_timeout = 1 so we’re not waiting for that on hub restarts.
  • Set c.KubeSpawner.k8s_api_threadpool_workers = c.JupyterHub.concurrent_spawn_limit though this doesn’t seem to make much difference [3].
  • I’m setting c.NotebookApp.shutdown_no_activity_timeout and c.MappingKernelManager.cull_idle_timeout to 8 hours (up from our normal 1-hour timeout) just so notebook pods aren’t culling themselves while I’m scaling up in bursts, because I want to see the hub’s response rate and CPU usage in steady state.
  • We’ve overridden the cull.concurrency value from 10 to 1 just so we’re not hammering the hub API with large GET /users requests when we have hundreds of users [4].
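
To make the list above concrete, here is roughly what those settings look like as config. Treat it as a sketch: the concurrent_spawn_limit value is illustrative, the notebook-side timeouts live in the single-user image’s config rather than the hub’s, and cull.concurrency is a z2jh Helm value rather than Python.

  # jupyterhub_config.py (hub side) -- scale-test tweaks from the list above
  c.JupyterHub.init_spawners_timeout = 1  # don't block hub restarts checking existing spawners
  c.JupyterHub.concurrent_spawn_limit = 64  # illustrative; not necessarily our real value
  c.KubeSpawner.k8s_api_threadpool_workers = c.JupyterHub.concurrent_spawn_limit

  # jupyter_notebook_config.py (single-user image) -- relax self-culling during burst scale-up
  EIGHT_HOURS = 8 * 60 * 60
  c.NotebookApp.shutdown_no_activity_timeout = EIGHT_HOURS
  c.MappingKernelManager.cull_idle_timeout = EIGHT_HOURS

  # cull.concurrency: 1 is set in our z2jh Helm values (the chart default is 10), not here.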

Something I need to do next is run py-spy while the hub is in steady state with a large number of pods running (1K+) to see where time is being spent and look for optimizations.

The script has to handle retrying on 429 responses because I hit concurrent_spawn_limit frequently during scale-up, but the script is Python and that’s easy to configure with the requests/urllib3 libraries.
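
For anyone writing a similar script, the retry handling is just stock requests/urllib3 machinery. A minimal sketch, where the hub URL, token, test user name, and retry counts are placeholders rather than what our script actually uses:

  import requests
  from requests.adapters import HTTPAdapter
  from urllib3.util.retry import Retry

  HUB_API = "https://hub.example.com/hub/api"  # placeholder hub URL
  TOKEN = "REPLACE_ME"                         # placeholder API token

  retry = Retry(
      total=10,                  # placeholder; tune for your burst size
      backoff_factor=1,          # exponential backoff between attempts
      status_forcelist=[429],    # retry when the hub enforces concurrent_spawn_limit
      allowed_methods=["GET", "POST", "DELETE"],  # the spawn POSTs are what get 429s
      respect_retry_after_header=True,  # honor the hub's Retry-After header
  )
  # Note: on older urllib3 releases the allowed_methods argument is named method_whitelist.

  session = requests.Session()
  session.headers["Authorization"] = f"token {TOKEN}"
  session.mount("https://", HTTPAdapter(max_retries=retry))

  # e.g. start a server for one test user
  resp = session.post(f"{HUB_API}/users/scale-test-001/server")
  resp.raise_for_status()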

I was hoping to do a more thoughtful write-up on this at some point with findings and results, but I’m still in the process of doing that; I couldn’t resist replying to your question, though, given the overlap with what I’m currently doing. Again, if I can get this script scrubbed a bit I’m happy to share it on GitHub, but it might be a while.

[1] https://zero-to-jupyterhub.readthedocs.io/en/latest/customizing/user-environment.html#using-multiple-profiles-to-let-users-select-their-environment
[2] my micro profile:

  {
    "display_name": "micro",
    "slug": "micro",
    "description": "Useful for scale testing a lot of pods",
    "default": true,
    "kubespawner_override": {
      "cpu_guarantee": 0.015,
      "cpu_limit": 1,
      "mem_guarantee": "64M",
      "mem_limit": "1G"
    }
  }
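
For completeness, one way to wire that profile in, e.g. via z2jh’s hub.extraConfig (setting singleuser.profileList in the Helm values works too); only the micro entry is shown, and note the Python True vs. JSON true:

  # hub.extraConfig sketch: register the micro profile with KubeSpawner
  c.KubeSpawner.profile_list = [
      {
          "display_name": "micro",
          "slug": "micro",
          "description": "Useful for scale testing a lot of pods",
          "default": True,
          "kubespawner_override": {
              "cpu_guarantee": 0.015,
              "cpu_limit": 1,
              "mem_guarantee": "64M",
              "mem_limit": "1G",
          },
      },
      # ...plus our regular profiles
  ]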

[3] Consider defaulting k8s_api_threadpool_workers to c.JupyterHub.concurrent_spawn_limit: https://github.com/jupyterhub/kubespawner/issues/419
[4] cull_idle_servers causes hub to go unresponsive in environments with 50,000 users: https://github.com/jupyterhub/jupyterhub/issues/2954
