Backoff after failed scale up

Hello! I am using hubploy + Pangeo to deploy a JupyterHub on GCP. After one user is logged into the hub and is using substantial resources, I try to log in but keep running into this message:

[screenshot of the spawn event log, showing “1 max node group size reached” followed by “Backoff after failed scale up”]

and then the spawn times out and fails. During the attempted spawn, sometimes an additional machine will be added to the cluster, but it apparently can’t be used, because I never manage to get in there. Trying `kubectl logs jupyter-arokem -n l2lhub-prod` gives me nothing back. Might this have something to do with the pod affinities? How do I go about mucking with that to fix it? Thanks!
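In case it helps others debugging the same thing: the pod’s events, rather than its logs, are usually where scheduling and scale-up failures show up. A minimal sketch, using the pod and namespace names above:

```bash
# Describe the pending pod; the Events section at the bottom should show
# FailedScheduling / NotTriggerScaleUp style messages from the autoscaler.
kubectl describe pod jupyter-arokem -n l2lhub-prod

# Or list recent events in the namespace, newest last:
kubectl get events -n l2lhub-prod --sort-by='.lastTimestamp'
```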

Can you point us to the helm charts + config you are using?

“Scale up” and auto-scaling also depend on how you configured the node pools in your GKE cluster, so that config is also needed.
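For reference, the node-pool side of autoscaling is set when the pool is created; a minimal sketch of what that typically looks like (the cluster, pool, zone, and machine-type names here are placeholders, not taken from the deployment above):

```bash
# Hypothetical example: create a user node pool that can scale
# between 0 and 2 nodes.
gcloud container node-pools create user-pool \
  --cluster my-cluster \
  --zone us-central1-b \
  --machine-type n1-standard-4 \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 2
```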

Off the top of my head I don’t know when this error occurs :-/

The configuration for this hub is all in this repo: https://github.com/learning-2-learn/l2lhub-deployment

How do I get the GKE config?

Thanks!

After a quick look I can’t spot something obviously wrong in the config.

Which commands did you run to set up your GKE cluster and node pools? I don’t think there is a way to export that, so those commands are probably the best we have.

As you are using hubploy, it is probably worth getting @yuvipanda involved.

In the node config for your GKE cluster, there should be an ‘autoscaling -> max number of nodes’ setting. Can you check what that is?
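If it is easier than digging through the console, that setting can also be read back from the live cluster with gcloud (pool, cluster, and zone names are placeholders):

```bash
# The autoscaling block in the output shows minNodeCount / maxNodeCount;
# maxNodeCount is the ‘max number of nodes’ in question.
gcloud container node-pools describe user-pool \
  --cluster my-cluster \
  --zone us-central1-b
```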

I think the ‘1 max node group size reached’ is the important part, not the backoff.

I am not sure where to find that particular thing, but does this answer your question?

My first hunch is to check your quotas. It’s possible that you are using up your CPU or memory quota so scale-up is failing because the next node would exceed some quota.
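A quick way to check, assuming the gcloud SDK is configured for the project (the region name is a placeholder):

```bash
# The quotas section of the output lists usage and limit per metric
# (CPUS, IN_USE_ADDRESSES, ...); if CPUS usage is already at the limit,
# the autoscaler cannot add another node.
gcloud compute regions describe us-central1
```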

Thanks! That is a very good hunch. Indeed, this cluster used to be in another zone, which had the CPU quota set much higher. It is now in a zone with a rather limited quota, which might explain it.