When running helm upgrade, I sometimes, but not always, get the following error:
Error: UPGRADE FAILED: release jhub failed, and has been rolled back due to atomic being set: pre-upgrade hooks failed: timed out waiting for the condition
This same issue has come up for us before. Noting that the z2jh documentation here suggests increasing the timeout and that we have the pre-upgrade image-puller hook enabled, I set the upgrade timeout to 30 minutes. The image being pulled is quite large (11 GB). That seemed to improve things for a month or so, and deploys worked every time.
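For reference, the upgrade command looks roughly like this (the chart reference and values file name are stand-ins for our redacted setup; the --atomic and --timeout flags are the relevant parts):

helm upgrade jhub jupyterhub/jupyterhub --values config.yaml --atomic --timeout 30m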
Within the last week the issue has come back, but it is not consistent. Sometimes the deploy succeeds and takes just a few minutes; other times it runs until it hits the 30-minute timeout. The fact that it takes only a few minutes when it does succeed makes me think increasing the timeout further will not help (and 30 minutes is already a long time to pull an image…).
I have inspected nodes individually and have reason to believe they have sufficient disk space to accommodate the new image (old images are periodically removed).
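The disk check was roughly along these lines (node name is a placeholder, and the path depends on the container runtime):

kubectl describe node NODE_NAME | grep -A 8 Conditions   # look for DiskPressure = False
df -h /var/lib/docker                                    # run on the node itself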
Any thoughts on where I can keep investigating? Are there more verbose logs describing what goes on during the “pre-upgrade” step?
This is on AWS. Here is the (redacted) config file:
Ah, I should also mention that the helm upgrade command is being run automatically from a GitHub Action. Likely not relevant, but it's one more piece of information.
Can you try running something like kubectl describe pods whilst the upgrade is running? This may tell you whether one of the pods is having scheduling problems, and if so why.
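For example, something like this in a second terminal while the upgrade runs (the namespace is a guess; adjust it to wherever the chart is installed):

kubectl get pods --namespace jhub --watch
kubectl describe pods --namespace jhub > pods-during-upgrade.txt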
Thanks for the suggestion. I ran kubectl describe pods during an upgrade and looked through the output. That helped me understand more of what is going on, although I'm still not sure of the underlying issue.
All of the pods except for one had a “Running” status. The one that did not was a hook-image-puller-xxxx pod and had a status “Pending”. Its events looked like
Events:
Type    Reason     Age  From               Message
----    ------     ---  ----               -------
Normal  Scheduled  30m  default-scheduler  Successfully assigned default/hook-image-puller-xhjxx to {REDACTED}
Normal  Pulled     30m  kubelet            Container image "jupyterhub/k8s-network-tools:0.11.1" already present on machine
Normal  Created    30m  kubelet            Created container image-pull-metadata-block
Normal  Started    30m  kubelet            Started container image-pull-metadata-block
Normal  Pulling    30m  kubelet            Pulling image "{REDACTED}"
I’m not quite sure where else to look here for a clue as to why this one particular image pulling pod is having trouble. I checked the node and it has enough disk space.
Running the helm upgrade ... command a second time succeeds (this has been a consistent pattern: when it fails, a retry has always worked).
One other thing I noted is that several pods (of various types and across various nodes) reported a lot of events like:
W0125 22:48:38.673520 16604 exec.go:271] constructing many client instances from the same exec auth config can cause performance problems during cert rotation and can exhaust available network connections; 1001 clients constructed calling "aws"
This may just be a warning and unrelated to the upgrade issue, but it stood out.
All of the pods except for one had a “Running” status. The one that did not was a hook-image-puller-xxxx pod and had a status “Pending”. Its events looked like
It had a “ContainerCreating” state rather than “pending”, right? A pod is pending before it has been scheduled.
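If it helps, the pod phase and the displayed status can be checked separately (pod name taken from your events above):

kubectl get pod hook-image-puller-xhjxx -o jsonpath='{.status.phase}'
kubectl get pod hook-image-puller-xhjxx   # STATUS column shows the more detailed reason, e.g. ContainerCreating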
I think what is happening is related to…
W0125 22:48:38.673520 16604 exec.go:271] constructing many client instances from the same exec auth config can cause performance problems during cert rotation and can exhaust available network connections; 1001 clients constructed calling "aws"
And that, in turn, influences the ability to pull the image.
If you experience unreliable behavior, you can disable the pre-pulling logic. It is meant to make sure that all nodes already have the new image before the hub pod is upgraded, so that new user pods start quickly using the new image rather than the old one.
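In the chart configuration that is roughly the following (a sketch against the prePuller settings in chart 0.11.1):

prePuller:
  hook:
    enabled: false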
I’ll claim no expertise on how to read this output. I just noted that this pod has an event for “Pulling image …” but no following event like “Successfully pulled image …”
It certainly makes sense that the image-pull issue could be related to the warnings about too many clients calling "aws". I'll see if I can learn more about that.
With regard to disabling the pre-upgrade image-pulling logic: can you help me understand in what situations a user will have to wait for a new image to be pulled if this is disabled? Would it be the first time (and only the first time) a new user pod is requested on a given node? Or does the continuous image puller prompt a node to get the latest image sometime after the upgrade finishes? Is the logic for pulling a new image to the placeholder node (we have 1 placeholder enabled) independent of the prePuller → hook → enabled setting?
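For concreteness, these are the settings I am asking about, as I understand them (values reflect what I believe we currently have set):

prePuller:
  hook:
    enabled: true      # the pre-upgrade hook-image-puller pods
  continuous:
    enabled: true      # the continuous-image-puller daemonset
scheduling:
  userPlaceholder:
    enabled: true
    replicas: 1        # the single placeholder mentioned above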
Thanks for the heads-up about the security vulnerability. We upgraded to Helm chart 0.11.1 on the 22nd. Is there something in my post that suggested otherwise? (Just checking to make sure I'm not missing something about our upgrade.)