Backoff after failed scale up

Hello! I am using hubploy + pangeo to deploy a JupyterHub on GCP. After one user is logged into the hub and is using substantial resources, I try to log in, but keep running into this message:

and then the spawn times out and fails. During the attempted spawn, sometimes an additional machine will be added to the cluster, but it apparently can’t be used, because I never manage to get in there. Trying kubectl logs jupyter-arokem -n l2lhub-prod gives me nothing back. Might this have something to do with the pod affinities? How do I go about mucking with that to fix it? Thanks!

Can you point us to the helm charts + config you are using?

“Scale up” and auto-scaling also depend on how you configured the node pools in your GKE cluster, so that config is also needed.

Off the top of my head I don’t know when this error occurs :-/

The configuration for this hub is all in this repo: https://github.com/learning-2-learn/l2lhub-deployment

How do I get the GKE config?

Thanks!

After a quick look I can’t spot anything obviously wrong in the config.

Which commands did you run to set up your GKE cluster and node pools? I don’t think there is a way to export the config, so those commands are probably the best we have.

As you are using hubploy it is probably worth getting @yuvipanda involved.

In the node pool config for your GKE cluster, there should be an ‘autoscaling -> max number of nodes’ setting. Can you check what that is?
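In case it helps, here is a minimal sketch of how that value could be read with the gcloud CLI instead of the web console. The cluster name, zone, and pool name below are placeholders (assumptions, not taken from your deployment), so substitute your own. For a regional cluster you would pass --region instead of --zone.

```shell
# Placeholder values (hypothetical): replace with your own cluster, zone, and node pool.
CLUSTER="my-cluster"
ZONE="us-central1-b"
POOL="my-node-pool"

# List the node pools in the cluster to find the pool names.
list_pools() {
  gcloud container node-pools list --cluster="$CLUSTER" --zone="$ZONE"
}

# Print only the autoscaling settings of one pool
# (whether it is enabled, plus the min and max node counts).
show_autoscaling() {
  gcloud container node-pools describe "$POOL" \
    --cluster="$CLUSTER" --zone="$ZONE" \
    --format="yaml(autoscaling)"
}
```

Run list_pools first to get the pool names, then show_autoscaling for the pool that user pods land on. If the max node count it reports equals the number of nodes already in that pool, the autoscaler has no room to add another one.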

I think the ‘1 max node group size reached’ is the important part, not the backoff.
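As far as I know, messages like that come from the pod’s events rather than its logs, which would also explain why kubectl logs returned nothing: a pod that was never scheduled has no container output yet. A rough sketch for looking at the events, using the pod and namespace names from the kubectl command earlier in the thread:

```shell
# Names taken from the kubectl logs command earlier in this thread.
NAMESPACE="l2lhub-prod"
POD="jupyter-arokem"

# Show the pod's status; scheduler and autoscaler events are listed at the bottom.
show_pod_events() {
  kubectl describe pod "$POD" -n "$NAMESPACE"
}

# Show all recent events in the namespace, oldest first.
show_namespace_events() {
  kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp
}
```

Calling show_pod_events while the spawn is still pending should show why the pod could not be scheduled and whether the autoscaler tried to add a node.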