Scheduler "insufficient memory.; waiting" errors - any suggestions?

Hi,

Sorry to keep posting - but

I rather bravely just tried a live demo for 150 people or so, and it failed for many of them (most?), with messages in the scheduler log of the form:

Unable to schedule jhub-testing/user-placeholder-0: no fit: 0/2 nodes are available: 2 Insufficient memory.; waiting

The relevant parts of my setup are (I believe) - config.yaml:

# https://zero-to-jupyterhub.readthedocs.io/en/latest/administrator/optimization.html
scheduling:
  userScheduler:
    enabled: true
  podPriority:
    enabled: true
  userPlaceholder:
    enabled: true
    replicas: 2

For creating the cluster I did:

gcloud beta container node-pools create user-pool \
  --machine-type n1-highmem-32 \
  --num-nodes 0 \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 20 \
  --node-labels hub.jupyter.org/node-purpose=user \
  --node-taints hub.jupyter.org_dedicated=user:NoSchedule \
  --node-locations europe-west2-b \
  --region europe-west2 \
  --cluster jhub-cluster-testing

The hub has entries like this in the log:

[W 2020-07-22 14:00:49.434 JupyterHub user:692] a_user@strath.ac.uk's server never showed up at http://hub-7669ccf594-xzndd:38211/user/a_user@strath.ac.uk/ after 30 seconds. Giving up
[I 2020-07-22 14:00:49.434 JupyterHub spawner:1866] Deleting pod jupyter-a_user-40strath-2eac-2euk
[E 2020-07-22 14:00:49.483 JupyterHub app:1982] a_user@strath.ac.uk does not appear to be running at http://hub-7669ccf594-xzndd:38211/user/a_user@strath.ac.uk/, shutting it down.

Is it possible I have failed to enable autoscaling correctly? Any suggestions as to how I could debug?

Thanks much,

Matthew

My first guess is that your quotas prevented requesting a third node because your CPU quota in the zone was lower than 96. This link might take you to your CPU quota (not sure about console links for other users). You might see an informative error about the failure to autoscale on the kubernetes page in the console, but I’m not sure.
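If the console link doesn't work, a hedged sketch of checking this from the command line (the region name is taken from the gcloud command above; the namespace is assumed from the scheduler log):

```shell
# Show the CPUS quota limit and current usage for the region.
gcloud compute regions describe europe-west2 --format=yaml | grep -B1 -A1 'metric: CPUS'

# Cluster-autoscaler failures (e.g. quota exceeded) also surface as Kubernetes events.
kubectl get events -n jhub-testing --sort-by=.lastTimestamp | grep -i -e scale -e quota
```

These require gcloud/kubectl to be authenticated against the cluster, so treat them as a starting point rather than a recipe.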

Thanks - yes - that’s extremely helpful. Indeed, I had a 24-CPU limit.

I know that the default memory guarantee is 1GB - but is there also a default per-user CPU guarantee? If there is - say, 1 CPU - then I guess node creation would have failed at 24 users, but the error about memory is a bit confusing in that case.
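For reference, guarantees can be set explicitly in the z2jh config rather than relying on the defaults (the CPU value below is illustrative, not a chart default - as far as I know the chart sets no CPU guarantee out of the box):

```yaml
singleuser:
  memory:
    guarantee: 1G   # matches the documented chart default
  cpu:
    guarantee: 0.5  # illustrative; unset by default, I believe
```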

It would be very useful to have some way to test the cluster scaling to detect problems like this. Can you recommend any method for simulating the effect of - say - 100 users logging on at the same time?

Cheers,

Matthew

I’ve been working on a script that does just this against the hub in our testing environment. If I can get the script scrubbed of internal details I could post it on GitHub.

I’m mostly interested in pushing the limits of how many users/pods we can have running at a time before the hub crashes. So for my scale testing I’m leveraging the profile_list feature in KubeSpawner and z2jh [1] with a micro profile [2] for tiny pods. What I found out after doing this is that in our case the size of the pods doesn’t matter as much, because there is a hard limit of 110 pods per user node for the K8S service in our (IBM) cloud (I thought I could pack ~500 micro pods on a 32GB RAM node, but nope!). So I have to scale up the user-placeholder replicas before running the load test so that we have enough user worker nodes ready to go; otherwise the hub starts tipping over on the consecutive_failure_limit.

We’re using a PostgreSQL 12 database and the default configurable-http-proxy setup from z2jh.

One thing I’ve noted elsewhere is that we needed to adjust c.JupyterHub.activity_resolution to keep CPU usage down on the hub once we got to several hundred notebook pods.

Today I got up to 1700 active pods with only 2 failed server starts (they hit the 5-minute start_timeout from z2jh). That’s not our limit yet; I’ve just run out of time to keep running this thing today. 🙂

Some other notes on my testing:

  • Set c.JupyterHub.init_spawners_timeout = 1 so we’re not waiting for that on hub restarts.
  • Set c.KubeSpawner.k8s_api_threadpool_workers = c.JupyterHub.concurrent_spawn_limit though this doesn’t seem to make much difference [3].
  • I’m setting c.NotebookApp.shutdown_no_activity_timeout and c.MappingKernelManager.cull_idle_timeout to 8 hours, up from our normal 1-hour timeout, just so that notebook pods aren’t killing themselves while I’m scaling up in bursts, because I want to see the hub response rate and CPU resource usage in a steady state.
  • We’ve overridden the cull.concurrency value from 10 to 1 just so we’re not hammering the hub API with large GET /users requests when we have hundreds of users [4].
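Pulled together, the hub-side tweaks above would look roughly like this in jupyterhub_config.py (or z2jh's hub.extraConfig). This is a sketch, not the author's exact config: the concurrent_spawn_limit value is a placeholder, and only the 8-hour timeout comes from the bullets above.

```python
# jupyterhub_config.py -- sketch of the scale-testing tuning described above
c.JupyterHub.init_spawners_timeout = 1          # don't block hub restarts on spawner checks
c.JupyterHub.concurrent_spawn_limit = 64        # placeholder value, set to taste
c.KubeSpawner.k8s_api_threadpool_workers = c.JupyterHub.concurrent_spawn_limit

# Push idle culling out to 8 hours so pods survive bursty scale-up runs.
c.NotebookApp.shutdown_no_activity_timeout = 8 * 60 * 60
c.MappingKernelManager.cull_idle_timeout = 8 * 60 * 60
```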

Something I need to do next is run py-spy while the hub is in steady state with a large number of pods running (1K+) to see where time is being spent and look for optimizations.

The script has to handle retrying on 429 responses because I hit concurrent_spawn_limit frequently during scale-up, but the script is Python, and with the requests/urllib3 libraries that’s easy to configure.
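That retry setup might look something like this - not the author's script, just a sketch of the requests/urllib3 mechanism mentioned (the function name is mine; `allowed_methods` needs urllib3 >= 1.26, older versions call it `method_whitelist`):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(total=5, backoff_factor=1.0):
    """Build a Session that retries 429s (concurrent_spawn_limit) with backoff."""
    retry = Retry(
        total=total,
        backoff_factor=backoff_factor,       # exponential backoff between attempts
        status_forcelist=[429],              # retry only on "too many requests"
        allowed_methods=["GET", "POST", "DELETE"],  # spawn/stop are POST/DELETE
        respect_retry_after_header=True,     # honour the hub's Retry-After hint
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

With this, every request made through the session transparently retries on 429 instead of each caller handling it.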

I was hoping to do a more thoughtful write-up on this at some point with findings and results, but I’m still in the process of doing it. I couldn’t resist replying to your question, though, given the overlap with what I’m currently doing. Again, if I can get this script scrubbed a bit I’m happy to share it on GitHub, but it might be a bit.

[1] https://zero-to-jupyterhub.readthedocs.io/en/latest/customizing/user-environment.html#using-multiple-profiles-to-let-users-select-their-environment
[2] my micro profile:

  {
    "display_name": "micro",
    "slug": "micro",
    "description": "Useful for scale testing a lot of pods",
    "default": true,
    "kubespawner_override": {
      "cpu_guarantee": 0.015,
      "cpu_limit": 1,
      "mem_guarantee": "64M",
      "mem_limit": "1G"
    }
  }

[3] https://github.com/jupyterhub/kubespawner/issues/419
[4] https://github.com/jupyterhub/jupyterhub/issues/2954

@mriedem - thank you, that’s very helpful. As you could imagine, I would love to try out your script. Would you consider sharing in relatively raw state? I’m very happy to help work on it to make it more general.

Cheers,

Matthew

Here is a generalized version of the script [1]. I’m working on creating a repo in GitHub to post the code and some minimal docs and Travis test setup.

[1] https://gist.github.com/mriedem/aaec3d4c209f032657e1bf8b143124b4

It would be very useful to have some way to test the cluster scaling to detect problems like this

The easiest way is to scale up the placeholder pod replicas. You can set this in the Helm chart, or kubectl scale the placeholder StatefulSet. The placeholders request the same CPU/memory resources as real users, so they are a good and quick approximation of adding that number of users to the deployment. They don’t request volumes, though, so it’s not perfect unless you do a real simulation of users. But it is very simple and gets you 90% of the way there.
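For example (the namespace, release name, and replica count below are placeholders for your own deployment):

```shell
# Simulate ~100 users' worth of resource requests via the placeholder StatefulSet.
kubectl -n jhub-testing scale statefulset user-placeholder --replicas=100

# Or persist the change through the chart instead:
helm upgrade jhub jupyterhub/jupyterhub --reuse-values \
  --set scheduling.userPlaceholder.replicas=100

# Watch nodes get added by the autoscaler.
kubectl get nodes -w
```

These need a live cluster and the right kube context, so treat them as a sketch of the approach rather than copy-paste commands.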