Backoff after failed scale up

Hello! I am using hubploy + Pangeo to deploy a JupyterHub on GCP. After one user is logged into the hub and is using substantial resources, I try to log in but keep running into this message:

[screenshot of the spawn event log, showing “1 max node group size reached” followed by “Backoff after failed scale up”]

and then the spawn times out and fails. During the attempted spawn, sometimes an additional machine will be added to the cluster, but it apparently can’t be used, because I never manage to get in there. Trying `kubectl logs jupyter-arokem -n l2lhub-prod` gives me nothing back. Might this have something to do with the pod affinities? How do I go about mucking with that to fix it? Thanks!
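In case it helps others debugging the same thing: the pod’s events, rather than its logs, are usually where scheduling and scale-up failures show up. A minimal sketch, using the pod and namespace names above:

```bash
# Describe the pending pod; the Events section at the bottom should show
# FailedScheduling / NotTriggerScaleUp style messages from the autoscaler.
kubectl describe pod jupyter-arokem -n l2lhub-prod

# Or list recent events in the namespace, newest last:
kubectl get events -n l2lhub-prod --sort-by='.lastTimestamp'
```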

Can you point us to the helm charts + config you are using?

“Scale up” and auto-scaling also depend on how you configured the node pools in your GKE cluster, so that config is also needed.
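For reference, the node-pool side of autoscaling is set when the pool is created; a minimal sketch of what that typically looks like (the cluster, pool, zone, and machine-type names here are placeholders, not taken from the deployment above):

```bash
# Hypothetical example: create a user node pool that can scale
# between 0 and 2 nodes.
gcloud container node-pools create user-pool \
  --cluster my-cluster \
  --zone us-central1-b \
  --machine-type n1-standard-4 \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 2
```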

Off the top of my head I don’t know when this error occurs :-/

The configuration for this hub is all in this repo: https://github.com/learning-2-learn/l2lhub-deployment

How do I get the GKE config?

Thanks!

After a quick look I can’t spot something obviously wrong in the config.

Which commands did you run to set up your GKE cluster and node pools? I don’t think there is a way to export that, so those commands are probably the best we have.

As you are using hubploy, it is probably worth getting @yuvipanda involved.

In the node config for your GKE cluster, there should be an ‘autoscaling -> max number of nodes’ setting. Can you check what that is?
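If it is easier than digging through the console, that setting can also be read back from the live cluster with gcloud (pool, cluster, and zone names are placeholders):

```bash
# The autoscaling block in the output shows minNodeCount / maxNodeCount;
# maxNodeCount is the ‘max number of nodes’ in question.
gcloud container node-pools describe user-pool \
  --cluster my-cluster \
  --zone us-central1-b
```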

I think the ‘1 max node group size reached’ is the important part, not the backoff.

I am not sure where to find that particular thing, but does this answer your question?

My first hunch is to check your quotas. It’s possible that you are using up your CPU or memory quota so scale-up is failing because the next node would exceed some quota.
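A quick way to check, assuming the gcloud SDK is configured for the project (the region name is a placeholder):

```bash
# The quotas section of the output lists usage and limit per metric
# (CPUS, IN_USE_ADDRESSES, ...); if CPUS usage is already at the limit,
# the autoscaler cannot add another node.
gcloud compute regions describe us-central1
```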

Thanks! That is a very good hunch. Indeed, this cluster used to be in another zone, which had the CPU quota set much higher. It is now in a zone with a rather limited quota, which might explain it.