I have a JupyterHub running in Google Cloud that I created a few years ago following the Z2JH tutorial. It’s been running very well and I’m managing to keep it more-or-less up-to-date, but one of my recent updates has introduced a new error, which has gone unnoticed until now.
At login, my users can choose from multiple profiles, each with different levels of resources (CPU, memory, etc.). For the smaller, commonly used profiles I have some user placeholders to ensure login is fast, but the larger machines trigger cluster autoscaling, which takes a few minutes.
My users also have access to a Postgres database. Connections between user pods and this database go via the Cloud SQL proxy, which is a separate container running alongside the user containers (I think this is called a “sidecar” in Kubernetes terminology?).
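So each user pod ends up with (at least) two containers. Simplified, the rendered spec looks roughly like this — a sketch for illustration, not the actual manifest (KubeSpawner also adds init containers such as block-cloud-metadata, which show up in the event log further down):

```yaml
# Sketch of the pod KubeSpawner renders for each user (simplified)
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-<username>
  namespace: jhub
spec:
  containers:
    - name: notebook           # the single-user Jupyter server
      image: <my-image-name>:v0.3.12
    - name: cloudsql-proxy     # the sidecar, from singleuser.extraContainers
      image: gcr.io/cloudsql-docker/gce-proxy:1.11
```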
To get this working, my config.yaml includes the following:
singleuser:
  extraContainers:
    - name: cloudsql-proxy
      image: gcr.io/cloudsql-docker/gce-proxy:1.11
      command: ["/cloud_sql_proxy",
                "-instances=<redacted-database-instance>=tcp:5432",
                "-credential_file=/secrets/cloudsql/credentials.json"]
      volumeMounts:
        - name: my-secrets-volume
          mountPath: /secrets/cloudsql
          readOnly: true
  storage:
    extraVolumes:
      - name: my-secrets-volume
        secret:
          secretName: cloudsql-instance-credentials
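For reference, the cloudsql-instance-credentials secret was created from the Cloud SQL service-account key and lives in the jhub namespace (the same namespace as the user pods in the event log below). Reconstructed from memory, it looks something like this, with the key content redacted:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cloudsql-instance-credentials
  namespace: jhub
type: Opaque
data:
  credentials.json: <base64-encoded service-account key>
```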
This has worked well for a long time, but now logging in on one of the larger machines/profiles gives the following log:
Spawn failed: Timeout
Event log
Server requested
2021-10-20T08:12:15.064488Z [Warning] 0/4 nodes are available: 4 Insufficient cpu, 4 Insufficient memory.
2021-10-20T08:12:15.085806Z [Warning] 0/4 nodes are available: 4 Insufficient cpu, 4 Insufficient memory.
2021-10-20T08:12:19Z [Normal] pod triggered scale-up: [{<redacted-url>/-user-pool-hi-mem-f5e541d7-grp 0->1 (max: 10)}]
2021-10-20T08:13:03.591034Z [Warning] 0/5 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 4 Insufficient cpu, 4 Insufficient memory.
2021-10-20T08:13:12.598326Z [Normal] Successfully assigned jhub/jupyter-jamessample to <redacted>-user-pool-hi-mem-f5e541d7-d3bj
2021-10-20T08:13:17Z [Normal] AttachVolume.Attach succeeded for volume "pvc-dae6d331-7238-11e9-aba5-42010a840fe3"
2021-10-20T08:13:47Z [Normal] Pulling image "jupyterhub/k8s-network-tools:0.11.1"
2021-10-20T08:13:49Z [Normal] Successfully pulled image "jupyterhub/k8s-network-tools:0.11.1" in 2.031672126s
2021-10-20T08:13:49Z [Normal] Created container block-cloud-metadata
2021-10-20T08:13:49Z [Normal] Started container block-cloud-metadata
2021-10-20T08:13:50Z [Normal] Pulling image "<my-image-name>:v0.3.12"
2021-10-20T08:17:32Z [Normal] Successfully pulled image "<my-image-name>:v0.3.12" in 3m42.70700124s
2021-10-20T08:17:32Z [Warning] Error: cannot find volume "volume-jamessample" to mount into container "notebook"
2021-10-20T08:17:32Z [Normal] Pulling image "gcr.io/cloudsql-docker/gce-proxy:1.11"
2021-10-20T08:17:40Z [Normal] Successfully pulled image "gcr.io/cloudsql-docker/gce-proxy:1.11" in 7.553934454s
2021-10-20T08:17:40Z [Warning] Error: cannot find volume "my-secrets-volume" to mount into container "cloudsql-proxy"
Spawn failed: Timeout
Strangely, if the user simply logs in again on the same machine/profile immediately after this error (i.e. before the cluster downscales), everything will work fine.
The errors
[Warning] Error: cannot find volume "volume-jamessample" to mount into container "notebook"
and
[Warning] Error: cannot find volume "my-secrets-volume" to mount into container "cloudsql-proxy"
are new. There is certainly a PVC named claim-jamessample that contains all my data, and it’s mounted correctly when I log in either using the “default” profile (with the user placeholders), or when logging in for the second time after seeing the error above. Similarly, all the secrets seem to be correctly created, and they’re found OK on the second attempt.
Does anyone have any suggestions, please? It’s not a massive problem, but it’s frustrating for users to have to log in twice every time they want to start a more powerful machine. I already have
kubespawner_override:
  start_timeout: 1200
in my config.yaml, but the timeout error above happens long before this limit is reached, i.e. I don’t think I’m hitting the “general” spawner timeout. Are there any more specific timeout settings for extraVolumes or extraContainers, for example?
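For completeness, that override sits inside one of my profileList entries. Schematically it looks like this — the profile names and resource values here are placeholders, not my real ones:

```yaml
singleuser:
  profileList:
    - display_name: "Default (small)"
      default: true
    - display_name: "Hi-mem"          # placeholder name for the large profile
      kubespawner_override:
        start_timeout: 1200           # the setting mentioned above
        mem_guarantee: "26G"          # placeholder values
        cpu_guarantee: 4
```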
Thank you!