I’ve been having a lot of trouble getting my GKE cluster to autoscale without breaking down.
I’m testing auto-scaling by:

- starting the cluster with no user-placeholders
- changing `config.yaml` to include these lines:

  ```yaml
  userPlaceholder:
    enabled: true
    replicas: 110
  ```

- doing a `helm upgrade` with this change in configuration (sketched below).
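For reference, the upgrade step is just the usual z2jh Helm invocation; this is a sketch, assuming the release and namespace are both called `jhub-testing` (the namespace that appears in the logs below):

```bash
# Sketch of the upgrade step; release and namespace names are assumptions
# based on the namespace visible in the logs further down.
helm upgrade jhub-testing jupyterhub/jupyterhub \
  --namespace jhub-testing \
  --version 0.10.3 \
  --values config.yaml
```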
I see the cluster scaling up, and the 110 user-placeholder pods arriving, as seen from `kubectl get pod`, but at around the point they have all gone from Pending to Running, I get this from `kubectl get pod`:

```
The connection to the server 35.197.204.116 was refused - did you specify the right host or port?
The connection to the server 35.197.204.116 was refused - did you specify the right host or port?
```
At that point the output of `gcloud container clusters list` is:

```
jhub-cluster-testing europe-west2-b 1.17.14-gke.1600 35.197.204.116 n1-standard-2 1.17.14-gke.1600 23 RECONCILING
```
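To pin down when the control plane drops out and comes back, a polling loop along these lines (a sketch, not exactly what I ran) timestamps the transitions:

```bash
# Poll cluster status and API-server reachability once a minute while
# the placeholders are scheduling (sketch only).
while true; do
  date
  gcloud container clusters list \
    --filter="name=jhub-cluster-testing" --format="value(name,status)"
  kubectl get --raw /healthz > /dev/null && echo "API server: ok" \
    || echo "API server: unreachable"
  sleep 60
done
```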
It returned to status RUNNING after about 5 minutes, at which point I could run `kubectl get pod` again, and I found that one of my `user-scheduler` pods had restarted. The `--previous` log from that pod has this before a long Go traceback:
```
I1228 18:27:33.449506 1 httplog.go:89] "HTTP" verb="GET" URI="/healthz" latency="64.811µs" userAgent="kube-probe/1.17+" srcIP="10.48.0.1:54504" resp=200
E1228 18:27:33.537801 1 leaderelection.go:321] error retrieving resource lock jhub-testing/user-scheduler-lock: Get "https://10.52.0.1:443/api/v1/namespaces/jhub-testing/endpoints/user-scheduler-lock?timeout=10s": dial tcp 10.52.0.1:443: connect: connection refused
E1228 18:27:35.539202 1 leaderelection.go:321] error retrieving resource lock jhub-testing/user-scheduler-lock: Get "https://10.52.0.1:443/api/v1/namespaces/jhub-testing/endpoints/user-scheduler-lock?timeout=10s": dial tcp 10.52.0.1:443: connect: connection refused
E1228 18:27:37.539418 1 leaderelection.go:321] error retrieving resource lock jhub-testing/user-scheduler-lock: Get "https://10.52.0.1:443/api/v1/namespaces/jhub-testing/endpoints/user-scheduler-lock?timeout=10s": dial tcp 10.52.0.1:443: connect: connection refused
E1228 18:27:39.539003 1 leaderelection.go:321] error retrieving resource lock jhub-testing/user-scheduler-lock: Get "https://10.52.0.1:443/api/v1/namespaces/jhub-testing/endpoints/user-scheduler-lock?timeout=10s": dial tcp 10.52.0.1:443: connect: connection refused
E1228 18:27:41.539383 1 leaderelection.go:321] error retrieving resource lock jhub-testing/user-scheduler-lock: Get "https://10.52.0.1:443/api/v1/namespaces/jhub-testing/endpoints/user-scheduler-lock?timeout=10s": dial tcp 10.52.0.1:443: connect: connection refused
I1228 18:27:41.986672 1 httplog.go:89] "HTTP" verb="GET" URI="/healthz" latency="75.856µs" userAgent="kube-probe/1.17+" srcIP="10.48.0.1:55752" resp=200
I1228 18:27:43.449248 1 httplog.go:89] "HTTP" verb="GET" URI="/healthz" latency="81.653µs" userAgent="kube-probe/1.17+" srcIP="10.48.0.1:56056" resp=200
I1228 18:27:43.536640 1 leaderelection.go:278] failed to renew lease jhub-testing/user-scheduler-lock: timed out waiting for the condition
F1228 18:27:43.536679 1 server.go:199] leaderelection lost
```
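For anyone trying to reproduce this, the restart count and the pre-crash log came from the usual kubectl commands; a sketch, assuming z2jh's standard `component=user-scheduler` label and using a placeholder for the pod name:

```bash
# List the user-scheduler pods and their restart counts (label is the
# standard z2jh one; adjust if your chart values differ).
kubectl get pod -n jhub-testing -l component=user-scheduler

# Pull the log from the container instance that crashed.
kubectl logs -n jhub-testing <user-scheduler-pod-name> --previous
```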
I’m using the 0.10.3 z2jh chart. If y’all are interested, the configuration is at https://github.com/matthew-brett/uob-jupyterhub/tree/uob-testing, where the `config.yaml` file is the contents of `config.yaml.cleaned` in that repo (minus some secrets).
Have y’all got any suggestions about where I can look to find out more about what is going on here?
In my current experiments (details below), spinning up 250 user-placeholders, and therefore 32 nodes, locks the cluster into “RECONCILING” state. With a regional (rather than zonal) cluster, I can still communicate with the cluster, but 95 of the 250 placeholders got jammed in “Pending” state and stayed there for 14 minutes, while the cluster remained in “RECONCILING” state.
The current experiments were to move from a zonal cluster to a regional cluster (this costs more, because I get one zonal cluster for free), and to increase the default-pool machine type from n1-standard-2 (7.5 GB memory) to n1-highmem-4 (26 GB memory).
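Concretely, the regional experiment amounts to creating the cluster with something like this (a sketch; the real creation commands are in the repo linked above):

```bash
# Regional control plane and a bigger default-pool machine type
# (sketch; for a regional cluster --num-nodes is per zone, so this
# gives three default-pool nodes).
gcloud container clusters create jhub-cluster-testing \
  --region europe-west2 \
  --machine-type n1-highmem-4 \
  --num-nodes 1
```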
Are your quotas doing ok? Sometimes the cluster tries to scale up, but hits a quota limit and fails. There’s a `gcloud container operations list` command that might help.
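If the web console is awkward to check, roughly the same information is available from the command line; a sketch:

```bash
# Region-level quota usage appears in the describe output (sketch).
gcloud compute regions describe europe-west2

# Recent cluster-level operations in the zone, including failed resizes.
gcloud container operations list --zone europe-west2-b
```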
Thanks for the reply.
I certainly did have problems with my quotas in the past, but I have just replicated the problem again, with no quota issues visible via the Google quotas page for my project.
Thanks for the pointer to the gcloud command.
Replicating now, I get this sequence from `gcloud container operations list` when I initialize the cluster without placeholders:

```
operation-1611860461503-9257a29a CREATE_CLUSTER europe-west2-b jhub-cluster-testing DONE 2021-01-28T19:01:01.503209451Z 2021-01-28T19:04:11.760193424Z
operation-1611860655270-4ff8e1df CREATE_NODE_POOL europe-west2-b user-pool DONE 2021-01-28T19:04:15.270402178Z 2021-01-28T19:04:30.257392795Z
```
Followed by this, when I do `helm upgrade` with the large number of user placeholders:

```
operation-1611861245593-588e4c8b REPAIR_CLUSTER europe-west2-b jhub-cluster-testing DONE 2021-01-28T19:14:05.593471107Z 2021-01-28T19:18:54.600008372Z
operation-1611861246891-35870352 UPDATE_CLUSTER europe-west2-b jhub-cluster-testing DONE 2021-01-28T19:14:06.891691515Z 2021-01-28T19:14:07.234573603Z
```
`gcloud container operations describe <operation-id>` for the `REPAIR_CLUSTER` operation doesn’t give much useful information.
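For completeness, this is the kind of describe call I mean, plus a Cloud Logging query I might try next; a sketch (the log filter may well need tweaking):

```bash
# Describe the repair operation listed above (zone flag needed because
# the cluster is zonal).
gcloud container operations describe operation-1611861245593-588e4c8b \
  --zone europe-west2-b

# Recent cluster-level log entries for this cluster (sketch; the filter
# is an assumption about where GKE puts these entries).
gcloud logging read \
  'resource.type="gke_cluster" AND resource.labels.cluster_name="jhub-cluster-testing"' \
  --limit 20 --freshness 1h
```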
The `helm upgrade` fails with messages like:

```
Error: UPGRADE FAILED: an error occurred while finding last successful release. original upgrade error: Get "https://35.242.156.98/api/v1/namespaces/jhub-testing/persistentvolumeclaims/hub-db-dir": dial tcp 35.242.156.98:443: connect: connection refused: Kubernetes cluster unreachable: Get "https://35.242.156.98/version?timeout=32s": dial tcp 35.242.156.98:443: connect: connection refused
```
I think this is just a function of the cluster becoming unresponsive. The message from the previous failure, with the cluster having the same configuration, an hour or so ago, was:

```
Error: UPGRADE FAILED: an error occurred while finding last successful release. original upgrade error: Get "https://35.246.100.151/apis/apps/v1/namespaces/jhub-testing/replicasets?labelSelector=app%3Djupyterhub%2Ccomponent%3Dproxy%2Crelease%3Djhub-testing": dial tcp 35.246.100.151:443: connect: connection refused: Kubernetes cluster unreachable: Get "https://35.246.100.151/version?timeout=32s": dial tcp 35.246.100.151:443: connect: connection refused
```
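Once the API server comes back, `helm history` shows whether the failed upgrade left the release in a pending state; a sketch, with the release and namespace names assumed as above:

```bash
# Check the release history after the API server is reachable again.
helm history jhub-testing --namespace jhub-testing

# If the last revision is stuck in pending-upgrade, roll back to the
# previous deployed revision before retrying the upgrade.
helm rollback jhub-testing --namespace jhub-testing
```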
I added `--no-enable-autoupgrade` to the cluster creation command, with no change to the outcome.
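Concretely, the flag went into the zonal creation command roughly like this (a sketch; the other flags and the autoscaling limits here are illustrative, not my real values):

```bash
# Zonal cluster creation with node auto-upgrade disabled (sketch only).
gcloud container clusters create jhub-cluster-testing \
  --zone europe-west2-b \
  --machine-type n1-standard-2 \
  --enable-autoscaling --min-nodes 1 --max-nodes 25 \
  --no-enable-autoupgrade
```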
Changing the cluster from zonal to regional allows the upgrade to succeed without error. The cluster still goes into RECONCILING state, but it remains responsive, and the pods all come up.
Do you think having a GKE regional cluster is a requirement for autoscaling? Or is there a way to mitigate the issue for zonal clusters?
Not sure about regional vs zonal. For mybinder.org we’ve been using both and haven’t seen a problem in either case. However, the maximum number of nodes in the cluster is rarely (if ever) over 15.
Is there a magic number of pods/new nodes where it works? Say if you only add 50 pods instead of 250.
Thanks again for the reply. For my setup, 250 pods is about 21 nodes. I suppose too that I may have larger images than for a classic mybinder setup - my images have a “virtual size” on eu.gcr.io of 1.1GB. I haven’t experimented with fewer nodes yet - I’ll try to do that over the next few days.
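When I do try the smaller test, it should just be a matter of the replica count; a sketch using `--set` instead of editing `config.yaml`, assuming the z2jh values path is `scheduling.userPlaceholder.replicas` and the same release/namespace names as above:

```bash
# Re-run the upgrade with a smaller placeholder count (sketch; the
# values path and release/namespace names are assumptions).
helm upgrade jhub-testing jupyterhub/jupyterhub \
  --namespace jhub-testing \
  --version 0.10.3 \
  --values config.yaml \
  --set scheduling.userPlaceholder.replicas=50
```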