When running helm upgrade, I intermittently get the following error — some of the time, but not every time:
Error: UPGRADE FAILED: release jhub failed, and has been rolled back due to atomic being set: pre-upgrade hooks failed: timed out waiting for the condition
The helm upgrade command is:
helm upgrade jhub jupyterhub/jupyterhub --version 0.10.2 --values config.yaml --timeout 30m0s --atomic
This same issue has come up before. Since the z2jh documentation here suggests increasing the timeout when the pre image-puller hook is enabled, and the image being pulled is quite large (11 GB), I set the upgrade timeout to 30 minutes. That seemed to fix things for a month or so, and deploys succeeded every time.
Within the last week the issue has come back, but inconsistently. Sometimes the deploy succeeds and takes just a few minutes; other times it runs until it hits the 30-minute timeout. The fact that it takes only a few minutes when it does succeed makes me think increasing the timeout further will not help (and 30 minutes is already a long time to pull an image…).
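For scale, here is a back-of-envelope estimate of how long the 11 GB pull should take. The 100 MB/s sustained registry-to-node throughput figure is an assumption, not a measurement:

```shell
# Back-of-envelope: how long should an 11 GB image pull take?
# 100 MB/s sustained registry-to-node throughput is an assumption.
IMAGE_GB=11
MBPS=100
EST_SECS=$(awk -v gb="$IMAGE_GB" -v mbps="$MBPS" 'BEGIN { printf "%.0f", gb * 1024 / mbps }')
echo "approx pull time: ${EST_SECS}s ($((EST_SECS / 60)) min)"
```

Even at fairly modest throughput this is minutes, not half an hour, which is why I suspect the hook is stuck on something other than the pull itself.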
I have inspected nodes individually and have reason to believe they have sufficient disk space to accommodate the new image (old images are periodically removed).
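For the disk check, a sketch of what I have been running via the Kubernetes API rather than SSHing to each node (these are standard kubectl commands; the grep pattern is just my own filter):

```shell
# Check node disk pressure and ephemeral-storage capacity from the API
# instead of inspecting each node by hand.
kubectl get nodes
kubectl describe nodes | grep -iE 'name:|diskpressure|ephemeral-storage'
```

A node with the DiskPressure condition set to True would explain a stuck image pull even when df output looks fine at first glance.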
Any thoughts on where I can keep investigating? Are there more verbose logs describing what goes on during the “pre-upgrade” step?
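In case it is useful context, here is roughly what I plan to try next time it hangs, to see what the hook is actually doing. The jhub namespace and the component=hook-image-puller label are assumptions based on the default z2jh chart; adjust for your release:

```shell
# Sketch: watch the pre-upgrade hook while the upgrade is hanging.
# Namespace and label selector are assumptions from the default z2jh chart.
NS=jhub

# The pre-puller hook is a DaemonSet that pre-pulls the singleuser image on
# every node; see which of its pods are stuck and on which nodes.
kubectl get daemonset,pods -n "$NS" -l component=hook-image-puller -o wide

# Events usually say why a pod is stuck (ImagePullBackOff, disk pressure, ...).
kubectl get events -n "$NS" --sort-by=.lastTimestamp | tail -n 30

# Per-container pull status for the stuck puller pods.
kubectl describe pod -n "$NS" -l component=hook-image-puller

# Revision history shows the rolled-back releases; --debug on the upgrade
# itself also makes helm much more verbose.
helm history jhub -n "$NS"
```

If one node's puller pod never becomes ready, the hook (and therefore the whole atomic upgrade) waits on it until the timeout, which would match the intermittent behavior.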
This is on AWS. Here is the (redacted) config file:
hub:
  allowNamedServers: true
proxy:
  secretToken: [REDACTED]
  https:
    enabled: true
    type: offload
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-ssl-cert: [REDACTED]
      service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
      service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "https"
      service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "3600"
auth:
  admin:
    users: [REDACTED]
  type: google
  google:
    clientId: [REDACTED]
    clientSecret: [REDACTED]
    callbackUrl: [REDACTED]
    hostedDomain: [REDACTED]
    loginService: [REDACTED]
  whitelist:
    users: [REDACTED]
scheduling:
  userScheduler:
    enabled: true
  podPriority:
    enabled: true
  userPlaceholder:
    enabled: true
    replicas: 1
  userPods:
    nodeAffinity:
      matchNodePurpose: require
cull:
  enabled: false
  timeout: 3600
  every: 300
singleuser:
  defaultUrl: "/lab"
  lifecycleHooks:
    postStart:
      exec:
        command: [REDACTED]
  cpu:
    limit: 7.8
    guarantee: 7.4
  memory:
    limit: 58G
    guarantee: 56G
  storage:
    capacity: 64Gi
    extraVolumes:
      - name: jupyterhub-shared
        persistentVolumeClaim:
          claimName: jupyterhub-shared-efs-claim
      - name: shm-volume
        emptyDir:
          medium: Memory
    extraVolumeMounts:
      - name: jupyterhub-shared
        mountPath: /home/shared
      - name: shm-volume
        mountPath: /dev/shm
  image:
    name: [REDACTED]
    tag: [REDACTED]
  cmd:
    - "/bin/bash"
    - "-c"
    - >
      jupyterhub-singleuser
      --SingleUserNotebookApp.default_url=/lab
      --SingleUserNotebookApp.ResourceUseDisplay.track_cpu_percent=True