Helm upgrade intermittently times out

When running helm upgrade, I get the following error some of the time but not all of the time:

Error: UPGRADE FAILED: release jhub failed, and has been rolled back due to atomic being set: pre-upgrade hooks failed: timed out waiting for the condition

The helm upgrade command is:

helm upgrade jhub jupyterhub/jupyterhub --version 0.10.2 --values config.yaml --timeout 30m0s --atomic

This same issue has come up historically. Noting that the z2jh documentation here suggests increasing the timeout and that the pre-upgrade image-puller hook is enabled, I set the upgrade timeout to 30 minutes. The image being pulled is quite large (11 GB). This seemed to improve things, and for a month or so deploys worked every time.

Within the last week the issue has come back, but it is not consistent. Sometimes the deploy is successful and takes just a few minutes; other times it runs until it hits the 30-minute timeout. The fact that it takes only a few minutes when it does succeed makes me think increasing the timeout further will not help (and 30 minutes is already a long time to pull an image…).

I have inspected the nodes individually and have reason to believe they have sufficient disk space to accommodate the new image (old images are periodically removed).
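
Roughly, the checks look like this (a sketch; the node name is a placeholder and the df path assumes the Docker container runtime on these nodes):

kubectl describe node <node-name> | grep -i diskpressure
df -h /var/lib/docker    # run on the node itself, over SSH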

Any thoughts on where I can keep investigating? Are there more verbose logs describing what goes on during the “pre-upgrade” step?
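
(The most verbose option I am aware of is helm’s own --debug flag, sketched below on the same command, though I am not sure how much it reveals about the hook itself:)

helm upgrade jhub jupyterhub/jupyterhub --version 0.10.2 --values config.yaml --timeout 30m0s --atomic --debug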

This is on AWS. Here is the (redacted) config file:

hub:
  allowNamedServers: true


proxy:
  secretToken: [REDACTED]
  https:
    enabled: true
    type: offload
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-ssl-cert: [REDACTED]
      service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
      service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "https"
      service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "3600"


auth:
  admin:
    users: [REDACTED]
  type: google
  google:
    clientId: [REDACTED]
    clientSecret: [REDACTED]
    callbackUrl: [REDACTED]
    hostedDomain: [REDACTED]
    loginService: [REDACTED]
  whitelist:
    users: [REDACTED]


scheduling:
  userScheduler:
    enabled: true
  podPriority:
    enabled: true
  userPlaceholder:
    enabled: true
    replicas: 1
  userPods:
    nodeAffinity:
      matchNodePurpose: require


cull:
  enabled: false
  timeout: 3600
  every: 300


singleuser:
  defaultUrl: "/lab"
  lifecycleHooks:
    postStart:
      exec:
        command: [REDACTED]
  cpu:
    limit: 7.8
    guarantee: 7.4
  memory:
    limit: 58G
    guarantee: 56G
  storage:
    capacity: 64Gi
    extraVolumes:
      - name: jupyterhub-shared
        persistentVolumeClaim:
          claimName: jupyterhub-shared-efs-claim
      - name: shm-volume
        emptyDir:
          medium: Memory
    extraVolumeMounts:
      - name: jupyterhub-shared
        mountPath: /home/shared
      - name: shm-volume
        mountPath: /dev/shm
  image:
    name: [REDACTED]
    tag: [REDACTED]
  cmd:
    - "/bin/bash"
    - "-c"
    - >
      jupyterhub-singleuser
      --SingleUserNotebookApp.default_url=/lab
      --SingleUserNotebookApp.ResourceUseDisplay.track_cpu_percent=True

Ah, I should also mention that the helm upgrade command is being run automatically from a GitHub Action. Likely not relevant, but it is one more piece of information.

Can you try running something like kubectl describe pods whilst the upgrade is running? This may tell you whether one of the pods is having scheduling problems, and if so why.
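
Something like this, for example (a sketch; it assumes the default namespace and the component=hook-image-puller label the chart puts on the hook pods):

kubectl get pods --namespace default --watch
kubectl describe pods --namespace default -l component=hook-image-puller
kubectl get events --namespace default --sort-by=.lastTimestamp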

Thanks for the suggestion. I ran kubectl describe pods during an upgrade and looked through the output. That was helpful for understanding more of what is going on, although I’m still not sure of the underlying issue.

All of the pods except one had a “Running” status. The one that did not was a hook-image-puller-xxxx pod with a “Pending” status. Its events looked like:

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  30m   default-scheduler  Successfully assigned default/hook-image-puller-xhjxx to {REDACTED}
  Normal  Pulled     30m   kubelet            Container image "jupyterhub/k8s-network-tools:0.11.1" already present on machine
  Normal  Created    30m   kubelet            Created container image-pull-metadata-block
  Normal  Started    30m   kubelet            Started container image-pull-metadata-block
  Normal  Pulling    30m   kubelet            Pulling image "{REDACTED}"

and it included a container with this state:

  image-pull-singleuser:
    Container ID:
    Image:         {REDACTED}
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      echo "Pulling complete"
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        0
      memory:     0
    Environment:  <none>
    Mounts:       <none>

I’m not quite sure where else to look for a clue as to why this one particular image-pulling pod is having trouble. I checked the node and it has enough disk space.

Running the helm upgrade ... command a second time succeeds (this is a pattern: when it fails, a retry has always worked).

One other thing I noted is that several pods (of various types and across various nodes) reported a lot of events like:

W0125 22:48:38.673520   16604 exec.go:271] constructing many client instances from the same exec auth config can cause performance problems during cert rotation and can exhaust available network connections; 1001 clients constructed calling "aws"

This may just be a warning and unrelated to the issue with upgrading, but it stood out.

All of the pods except one had a “Running” status. The one that did not was a hook-image-puller-xxxx pod with a “Pending” status. Its events looked like:

It had a “ContainerCreating” state rather than “pending”, right? A pod is pending before it has been scheduled.

I think what is happening is related to…

W0125 22:48:38.673520   16604 exec.go:271] constructing many client instances from the same exec auth config can cause performance problems during cert rotation and can exhaust available network connections; 1001 clients constructed calling "aws"

And that, in turn, influences the ability to pull the image.


If you experience unreliable behavior, you can disable the pre-pulling logic. It is meant to make sure that all nodes have the new image, so they are ready to start user pods quickly, before the hub pod is upgraded and starts making user pods that use the new image rather than the old.

prePuller:
  hook:
    enabled: false

On a separate note, please do upgrade to 0.10.6 or later, as 0.10.0 - 0.10.5 have a security vulnerability affecting you. See https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/b676b255b4dd22e557857a7559cfd30d3b0ed519/CHANGELOG.md.
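
For reference, that would just mean bumping the --version flag on the command you already run, e.g. (a sketch reusing the release name and values file from above; 0.11.1 is shown only as an example of a patched version):

helm repo update
helm upgrade jhub jupyterhub/jupyterhub --version 0.11.1 --values config.yaml --timeout 30m0s --atomic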

Thanks for your reply as well!

It had a “ContainerCreating” state rather than “pending”, right? A pod is pending before it has been scheduled.

The “Running”/“Pending” statuses I am referring to are the values of the top-level “Status” key in this output:

Name:         hook-image-puller-xhjxx
Namespace:    default
Priority:     0
Node:         {REDACTED}
Start Time:   Mon, 25 Jan 2021 22:17:34 -0500
Labels:       app=jupyterhub
              component=hook-image-puller
              controller-revision-hash=b5cc67664
              pod-template-generation=1
              release=jhub
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
IP:           172.31.24.93
IPs:
  IP:           172.31.24.93
Controlled By:  DaemonSet/hook-image-puller
Init Containers:
  image-pull-metadata-block:
    Container ID:  docker://3b4c5552a2f7806181c3ce94ede207adbef468ae9768e09c345a9e1e56a59eb8
    Image:         jupyterhub/k8s-network-tools:0.11.1
    Image ID:      docker-pullable://jupyterhub/k8s-network-tools@sha256:e562b1f91469d71e843aeece0500be249344e82154183f082492c7c51a636b52
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      echo "Pulling complete"
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 25 Jan 2021 22:17:40 -0500
      Finished:     Mon, 25 Jan 2021 22:17:40 -0500
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        0
      memory:     0
    Environment:  <none>
    Mounts:       <none>
  image-pull-singleuser:
    Container ID:
    Image:         {REDACTED}
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      echo "Pulling complete"
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        0
      memory:     0
    Environment:  <none>
    Mounts:       <none>
Containers:
  pause:
    Container ID:
    Image:          k8s.gcr.io/pause:3.2
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        0
      memory:     0
    Environment:  <none>
    Mounts:       <none>
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:            <none>
QoS Class:          BestEffort
Node-Selectors:     <none>
Tolerations:        hub.jupyter.org/dedicated=user:NoSchedule
                    hub.jupyter.org_dedicated=user:NoSchedule
                    node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                    node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                    node.kubernetes.io/not-ready:NoExecute op=Exists
                    node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                    node.kubernetes.io/unreachable:NoExecute op=Exists
                    node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  30m   default-scheduler  Successfully assigned default/hook-image-puller-xhjxx to ip-172-31-20-204.ec2.internal
  Normal  Pulled     30m   kubelet            Container image "jupyterhub/k8s-network-tools:0.11.1" already present on machine
  Normal  Created    30m   kubelet            Created container image-pull-metadata-block
  Normal  Started    30m   kubelet            Started container image-pull-metadata-block
  Normal  Pulling    30m   kubelet            Pulling image {REDACTED}

I’ll claim no expertise on how to read this output. I just noted that this pod has an event for “Pulling image …” but no following event like “Successfully pulled image …”


It certainly makes sense that the issue in pulling the image could be related to the warnings about too many clients connecting to aws. I’ll see if I can learn more about that.


With regard to disabling the pre-upgrade image-pulling logic: can you help me understand in what situations a user will have to wait for a new image to be pulled if this is disabled? Would it be the first time (and only the first time) a new user pod is requested on a given node? Or does the continuous image puller prompt a node to pull the latest image sometime after the upgrade finishes? Is the logic for pulling a new image to the placeholder node (we have 1 placeholder enabled) independent of the prePuller.hook.enabled setting?
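
For reference, this is my current understanding of the two settings involved, as a sketch (the comments are my own interpretation, not authoritative):

prePuller:
  hook:
    enabled: true       # the pre-upgrade/pre-install hook puller discussed above
  continuous:
    enabled: true       # daemonset that pulls images onto nodes outside of upgrades, e.g. newly added nodes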


Thanks for the heads-up about the security vulnerability. We upgraded to Helm chart 0.11.1 on the 22nd. Is there something in my post that suggested otherwise? (Just checking to make sure I am not missing something about our upgrade.)