Helm upgrade intermittently times out

When running helm upgrade, I intermittently get the following error:

Error: UPGRADE FAILED: release jhub failed, and has been rolled back due to atomic being set: pre-upgrade hooks failed: timed out waiting for the condition

The helm upgrade command is:

helm upgrade jhub jupyterhub/jupyterhub --version 0.10.2 --values config.yaml --timeout 30m0s --atomic
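
For reference, because --atomic rolls the release back on failure, the state Helm records can be checked afterwards with standard Helm commands along these lines (add -n &lt;namespace&gt; if the release is not installed in the default namespace):

# list the revisions Helm knows about, including failed and rolled-back ones
helm history jhub
# current status of the deployed release
helm status jhub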

This same issue has come up before. Since the z2jh documentation here suggests increasing the timeout, and since the pre-puller hook is enabled, I set the upgrade timeout to 30 minutes. The image being pulled is quite large (11 GB). That seemed to improve things for a month or so, and deploys worked every time.

Within the last week the issue has come back, but it is not consistent. Sometimes the deploy succeeds and takes just a few minutes; other times it runs until it hits the 30-minute timeout. The fact that it takes only a few minutes when it does succeed makes me think increasing the timeout further will not help (and 30 minutes is a long time to pull an image…).
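
If the hook itself turns out to be the problem, one option I am aware of (if I remember the chart values correctly, the z2jh prePuller hook is enabled by default) would be to turn the pre-upgrade pull off in config.yaml, at the cost of users waiting for the image pull on first spawn:

prePuller:
  hook:
    enabled: false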

I have inspected nodes individually and have reason to believe they have sufficient disk space to accommodate the new image (old images are periodically removed).

Any thoughts on where I can keep investigating? Are there more verbose logs describing what goes on during the “pre-upgrade” step?
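
For reference, I assume something along these lines would show what the hook pods are doing next time an upgrade hangs (the component label value is a guess based on the z2jh chart's defaults; replace &lt;namespace&gt; with the namespace the release is installed in):

# pods created by the pre-upgrade hook (image-puller DaemonSet plus the awaiter Job)
kubectl get pods -n <namespace> -l component=hook-image-puller -o wide
# why a specific puller pod is stuck (image pull progress, scheduling events)
kubectl describe pod -n <namespace> <pod-name>
# recent events across the namespace, newest last
kubectl get events -n <namespace> --sort-by=.lastTimestamp
# the same upgrade with Helm's client-side verbose output
helm upgrade jhub jupyterhub/jupyterhub --version 0.10.2 --values config.yaml --timeout 30m0s --atomic --debug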

This is on AWS. Here is the (redacted) config file:

hub:
  allowNamedServers: true


proxy:
  secretToken: [REDACTED]
  https:
    enabled: true
    type: offload
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-ssl-cert: [REDACTED]
      service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
      service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "https"
      service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "3600"


auth:
  admin:
    users: [REDACTED]
  type: google
  google:
    clientId: [REDACTED]
    clientSecret: [REDACTED]
    callbackUrl: [REDACTED]
    hostedDomain: [REDACTED]
    loginService: [REDACTED]
  whitelist:
    users: [REDACTED]


scheduling:
  userScheduler:
    enabled: true
  podPriority:
    enabled: true
  userPlaceholder:
    enabled: true
    replicas: 1
  userPods:
    nodeAffinity:
      matchNodePurpose: require


cull:
  enabled: false
  timeout: 3600
  every: 300


singleuser:
  defaultUrl: "/lab"
  lifecycleHooks:
    postStart:
      exec:
        command: [REDACTED]
  cpu:
    limit: 7.8
    guarantee: 7.4
  memory:
    limit: 58G
    guarantee: 56G
  storage:
    capacity: 64Gi
    extraVolumes:
      - name: jupyterhub-shared
        persistentVolumeClaim:
          claimName: jupyterhub-shared-efs-claim
      - name: shm-volume
        emptyDir:
          medium: Memory
    extraVolumeMounts:
      - name: jupyterhub-shared
        mountPath: /home/shared
      - name: shm-volume
        mountPath: /dev/shm
  image:
    name: [REDACTED]
    tag: [REDACTED]
  cmd:
    - "/bin/bash"
    - "-c"
    - >
      jupyterhub-singleuser
      --SingleUserNotebookApp.default_url=/lab
      --SingleUserNotebookApp.ResourceUseDisplay.track_cpu_percent=True

Ah, I should also mention that the Helm upgrade command is being run automatically from a GitHub Action. Likely not relevant, but it's one more piece of information.

Can you try running something like kubectl describe pods whilst the upgrade is running? This may tell you whether one of the pods is having scheduling problems, and if so why.
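
For example, assuming the release is installed in &lt;namespace&gt;, something like:

# watch pod phases change live while the upgrade runs
kubectl get pods -n <namespace> --watch
# dump per-pod events (look for FailedScheduling, or slow/failed image pulls)
kubectl describe pods -n <namespace>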