GKE autoscale test failure, cluster RECONCILING

I’ve been having a lot of trouble getting my GKE cluster to autoscale without breaking down.

I’m testing auto-scaling by

  • starting the cluster with no user-placeholders
  • changing config.yaml to include these lines:
  scheduling:
    userPlaceholder:
      enabled: true
      replicas: 110
  • doing a helm upgrade with this change in configuration (roughly the command shown below).
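
For reference, the upgrade step is roughly this (the chart repo alias and exact flags may differ a little from my actual command; release and namespace are both jhub-testing):

  helm upgrade jhub-testing jupyterhub/jupyterhub \
    --namespace jhub-testing \
    --version 0.10.3 \
    --values config.yaml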

I can see the cluster scaling up and the 110 user-placeholder pods arriving in the output of kubectl get pod, but at around the point they have all gone from Pending to Running, kubectl get pod starts returning:

The connection to the server 35.197.204.116 was refused - did you specify the right host or port?
The connection to the server 35.197.204.116 was refused - did you specify the right host or port?

At that point the output of gcloud container clusters list is:

jhub-cluster-testing  europe-west2-b  1.17.14-gke.1600  35.197.204.116  n1-standard-2  1.17.14-gke.1600  23         RECONCILING
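
To keep an eye on the status while this is happening, something like this works (cluster name and zone as in the listing above):

  gcloud container clusters describe jhub-cluster-testing \
    --zone europe-west2-b --format="value(status)"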

The cluster returned to RUNNING status after about 5 minutes, at which point I could run kubectl get pod again, and found that one of my user-scheduler pods had restarted.
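
I pulled that pod's previous log with (pod name shortened here; the real name has the generated suffix):

  kubectl --namespace jhub-testing logs --previous user-scheduler-<pod-id>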

That --previous log has this before a long Go traceback:

I1228 18:27:33.449506       1 httplog.go:89] "HTTP" verb="GET" URI="/healthz" latency="64.811µs" userAgent="kube-probe/1.17+" srcIP="10.48.0.1:54504" resp=200
E1228 18:27:33.537801       1 leaderelection.go:321] error retrieving resource lock jhub-testing/user-scheduler-lock: Get "https://10.52.0.1:443/api/v1/namespaces/jhub-testing/endpoints/user-scheduler-lock?timeout=10s": dial tcp 10.52.0.1:443: connect: connection refused
E1228 18:27:35.539202       1 leaderelection.go:321] error retrieving resource lock jhub-testing/user-scheduler-lock: Get "https://10.52.0.1:443/api/v1/namespaces/jhub-testing/endpoints/user-scheduler-lock?timeout=10s": dial tcp 10.52.0.1:443: connect: connection refused
E1228 18:27:37.539418       1 leaderelection.go:321] error retrieving resource lock jhub-testing/user-scheduler-lock: Get "https://10.52.0.1:443/api/v1/namespaces/jhub-testing/endpoints/user-scheduler-lock?timeout=10s": dial tcp 10.52.0.1:443: connect: connection refused
E1228 18:27:39.539003       1 leaderelection.go:321] error retrieving resource lock jhub-testing/user-scheduler-lock: Get "https://10.52.0.1:443/api/v1/namespaces/jhub-testing/endpoints/user-scheduler-lock?timeout=10s": dial tcp 10.52.0.1:443: connect: connection refused
E1228 18:27:41.539383       1 leaderelection.go:321] error retrieving resource lock jhub-testing/user-scheduler-lock: Get "https://10.52.0.1:443/api/v1/namespaces/jhub-testing/endpoints/user-scheduler-lock?timeout=10s": dial tcp 10.52.0.1:443: connect: connection refused
I1228 18:27:41.986672       1 httplog.go:89] "HTTP" verb="GET" URI="/healthz" latency="75.856µs" userAgent="kube-probe/1.17+" srcIP="10.48.0.1:55752" resp=200
I1228 18:27:43.449248       1 httplog.go:89] "HTTP" verb="GET" URI="/healthz" latency="81.653µs" userAgent="kube-probe/1.17+" srcIP="10.48.0.1:56056" resp=200
I1228 18:27:43.536640       1 leaderelection.go:278] failed to renew lease jhub-testing/user-scheduler-lock: timed out waiting for the condition
F1228 18:27:43.536679       1 server.go:199] leaderelection lost

I’m using the 0.10.3 z2jh chart. If y’all are interested, the configuration is at https://github.com/matthew-brett/uob-jupyterhub/tree/uob-testing; my config.yaml is the contents of the config.yaml.cleaned file in that repo (minus some secrets).

Have y’all got any suggestions about where I can look to find out more about what is going on here?

In my current experiments (details below), spinning up 250 user-placeholders, and therefore 32 nodes, the cluster again gets locked into the “RECONCILING” state. With a regional (rather than zonal) cluster I can still communicate with the cluster while this happens, but 95 of the 250 placeholders got stuck in “Pending” and stayed there for 14 minutes, with the cluster in “RECONCILING” the whole time.

The current experiments were to move from a zonal cluster to a regional cluster (this costs more, because I get one zonal cluster for free), and to increase the default-pool machine type from n1-standard-2 (7.5GB memory) to n1-highmem-4 (26GB memory).
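
The regional cluster creation was roughly along these lines (the node counts and autoscaling limits here are illustrative rather than my exact numbers; the user-pool gets added as a separate node pool afterwards):

  gcloud container clusters create jhub-cluster-testing \
    --region europe-west2 \
    --machine-type n1-highmem-4 \
    --num-nodes 1 \
    --enable-autoscaling --min-nodes 1 --max-nodes 40 \
    --no-enable-autoupgrade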

Are your quotas doing ok? Sometimes the cluster tries to scale up, but hits a quota limit and fails. There’s a gcloud container operations list command that might help.
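
You can also dump the per-region quotas from the command line (the region here matches your europe-west2-b zone):

  gcloud compute regions describe europe-west2 --format="yaml(quotas)"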

Thanks for the reply.

I certainly did have problems with my quotas in the past, but I have just replicated the problem again, with no quota issues visible on the Google quotas page for my project.

Thanks for the pointer to the gcloud command.

Replicating now, I get this sequence from gcloud container operations list when I initialize the cluster without placeholders:

operation-1611860461503-9257a29a  CREATE_CLUSTER    europe-west2-b  jhub-cluster-testing                  DONE    2021-01-28T19:01:01.503209451Z  2021-01-28T19:04:11.760193424Z
operation-1611860655270-4ff8e1df  CREATE_NODE_POOL  europe-west2-b  user-pool                             DONE    2021-01-28T19:04:15.270402178Z  2021-01-28T19:04:30.257392795Z

Followed by this, when I do helm upgrade with the large number of user placeholders:

operation-1611861245593-588e4c8b  REPAIR_CLUSTER    europe-west2-b  jhub-cluster-testing                  DONE    2021-01-28T19:14:05.593471107Z  2021-01-28T19:18:54.600008372Z
operation-1611861246891-35870352  UPDATE_CLUSTER    europe-west2-b  jhub-cluster-testing                  DONE    2021-01-28T19:14:06.891691515Z  2021-01-28T19:14:07.234573603Z

gcloud container operations describe <operation-id> for the REPAIR_CLUSTER operation doesn’t give much useful information.
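
For reference, that was (operation ID and zone from the listing above):

  gcloud container operations describe operation-1611861245593-588e4c8b \
    --zone europe-west2-b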

The helm upgrade fails with messages like:

Error: UPGRADE FAILED: an error occurred while finding last successful release. original upgrade error: Get "https://35.242.156.98/api/v1/namespaces/jhub-testing/persistentvolumeclaims/hub-db-dir": dial tcp 35.242.156.98:443: connect: connection refused: Kubernetes cluster unreachable: Get "https://35.242.156.98/version?timeout=32s": dial tcp 35.242.156.98:443: connect: connection refused

I think this is just a function of the cluster becoming unresponsive. The message from the previous failure an hour or so ago, with the cluster in the same configuration, was:

Error: UPGRADE FAILED: an error occurred while finding last successful release. original upgrade error: Get "https://35.246.100.151/apis/apps/v1/namespaces/jhub-testing/replicasets?labelSelector=app%3Djupyterhub%2Ccomponent%3Dproxy%2Crelease%3Djhub-testing": dial tcp 35.246.100.151:443: connect: connection refused: Kubernetes cluster unreachable: Get "https://35.246.100.151/version?timeout=32s": dial tcp 35.246.100.151:443: connect: connection refused

I added --no-enable-autoupgrade to the cluster creation command, with no change to the outcome.

Changing the cluster from zonal to regional allows the upgrade to succeed without error. The cluster still goes into RECONCILING state, but it remains responsive, and the pods all come up.

Do you think having a GKE regional cluster is a requirement for autoscaling? Or is there a way to mitigate the issue for zonal clusters?

Not sure about regional vs zonal. For mybinder.org we’ve been using both and haven’t seen this problem in either case. However, the maximum number of nodes in the cluster is rarely (if ever) over 15.

Is there a magic number of pods/new nodes where it works? Say if you only add 50 pods instead of 250.

Thanks again for the reply. For my setup, 250 pods is about 21 nodes. I suppose I may also have larger images than a classic mybinder setup - my images have a “virtual size” of 1.1GB on eu.gcr.io. I haven’t experimented with fewer nodes yet - I’ll try to do that over the next few days.
