Debugging a rather strange autoscaling problem

I’m having trouble understanding how to debug a cluster that refuses to schedule user pods on more than 3 nodes. It’s 3 all the time. At first I thought it was because I asked for 3 placeholders, but asking for 5 didn’t seem to help. I created the user pool (this is on us-east1-c) with the command below, followed by the relevant (I think…) parts of config.yaml:

gcloud beta container node-pools create jhub-cpu-pool \
  --machine-type n1-standard-2 \
  --num-nodes 1 \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 34 \
  --node-labels hub.jupyter.org/node-purpose=user \
  --node-taints hub.jupyter.org_dedicated=user:NoSchedule \
  --zone us-east1-c \
  --cluster univai-jhub

prePuller:
  continuous:
    enabled: true
scheduling:
  userScheduler:
    enabled: true
  podPriority:
    enabled: true
  userPlaceholder:
    enabled: true
    replicas: 5
  userPods:
    nodeAffinity:
      matchNodePurpose: require

singleuser:
  image:
    # Get the latest image tag at:
    # https://hub.docker.com/r/jupyter/datascience-notebook/tags/
    # Inspect the Dockerfile at:
    # https://github.com/jupyter/docker-stacks/tree/master/datascience-notebook/Dockerfile
    name: gcr.io/univai-jupyterhub/tensorflow_pytorch_cpu
    tag: ae6c90f96f5c025cc66317b27bac8722db7a4097
  defaultUrl: "/lab"
  storage:
    capacity: 4Gi
  memory:
    guarantee: 1G
    limit: 3.5G
  cpu:
    guarantee: 0.5
    limit: 2
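
(For reference, the autoscaling limits that GKE actually recorded for this pool can be double-checked with the command below, using the same pool/cluster names as in the create command above, and looking at the autoscaling: block in the output.)

gcloud container node-pools describe jhub-cpu-pool --cluster univai-jhub --zone us-east1-c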

Who ordered the number 3? It’s quite peculiar. I thought it was CPU quotas, but my quota for CPU (all zones) is 64!

I would appreciate any thoughts on how I might figure out what’s happening. The error message is

which makes perfect sense, since 2 nodes belong to the non-user part of the cluster and won’t be scheduled on. The other message seems to reflect the inability of user pods to be scheduled on more than 3 nodes…

EDIT: I should add that I have one more cluster running in the same GCP project, but it’s not occupying more than 4 nodes…

Also, how could I artificially add “fake” user pods to debug?

I would guess it’s a quota of some sort being hit. You can check the cluster logs in the Google Cloud Console, or check the autoscaler status with:

kubectl get configmap -n kube-system cluster-autoscaler-status -o yaml
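
If the configmap follows the upstream cluster-autoscaler layout, the readable report sits under its status data key, so something like this prints it directly:

kubectl get configmap -n kube-system cluster-autoscaler-status -o jsonpath='{.data.status}'

In that report, the node group’s Health line shows registered nodes versus cloudProviderTarget, along with minSize/maxSize, which is usually where a stuck scale-up shows up.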

You can scale up the user placeholders to trigger autoscaling:

kubectl scale statefulset user-placeholder --replicas=10
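
Since a quota is the main suspect, quota usage can also be checked from the command line (the project ID below is a placeholder):

# Regional quotas for the zone's region: CPUS, IN_USE_ADDRESSES, etc.
gcloud compute regions describe us-east1

# Project-wide quotas, e.g. CPUS_ALL_REGIONS and GPUS_ALL_REGIONS
gcloud compute project-info describe --project <your-project-id>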

I tried adding 15 (!) replicas, after trying 10. No move beyond 3 nodes in the user pool.

Then I looked at quotas. Good on all of them except IP addresses. Could that be hurting me?

Here is the log output. I don’t seem to see anything wacky in it, but the node group finds only 3 nodes, whereas it is set to autoscale from 1 to 33…

You were absolutely right… it turned out that I did not have enough IP addresses in us-east in the quota, and then only 24 virtual CPUs. And it turns out I have 64 GPUs in us-east but not a single one in the all-zones quota, so I can’t provision even a single GPU machine.

It’s very non-intuitive, this GCP quota business. The failure was quite silent; I just couldn’t autoscale.

Thanks also for the autoscaling advice. It’s an incredible way to test!