Pending hook-image-puller pods

We have recently run into a problem that keeps getting worse.
We run our own custom JupyterHub (singleuser) image in an Azure AKS Kubernetes cluster with autoscaling enabled. We deploy JupyterHub with Helm, using a new image tag every time we build a new custom image.

The problem we have started running into recently when we deploy with Helm is that one or more hook-image-puller pods get stuck in the Pending state and are never scheduled. They should schedule, run, and then disappear.

kubectl describe pod:

The nodes don’t have any taints.
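For reference, we checked the taints with something like this (a generic check, nothing Z2JH-specific):

```
# Confirm that none of the nodes in the pool carry taints
kubectl describe nodes | grep -E "^Name:|^Taints:"
```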

Should these pods trigger a scale-up?

The Kubernetes version is 1.24.6, we are using Helm 3.11.1 and the Z2JH JupyterHub Helm chart 2.0.0.
“helm upgrade --install” is used during the deploy.
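For context, the deploy step looks roughly like this (the release name, namespace and values file name are placeholders, and it assumes the chart repo is added as `jupyterhub`):

```
helm upgrade --install jhub jupyterhub/jupyterhub \
  --namespace jhub \
  --version 2.0.0 \
  --values config.yaml \
  --set singleuser.image.tag=<new-image-tag>
```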

I am bumping this post

Do you only have this problem when the limitation is “Too many pods”, or do you also have it when you hit the CPU/memory limit for a node?

@manics

Yes, this is the standard event message from “kubectl describe pod” on the pending pod. It seems like the Kubernetes scheduler is looking for pods to evict but can’t find any, and instead of triggering a scale-up so the pods could be scheduled on a new, fresh node, it just leaves them pending?

Note:
There are other apps/services besides JupyterHub running on this node pool.

I meant: can you reproduce this problem by hitting the memory/CPU limit on a node while the pod limit hasn’t been hit? This will help narrow down whether it’s a general problem that occurs with all resources, or only a problem when you hit the number-of-pods limit.
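For example, something like this (an untested sketch; the pod name, pause image and request sizes are arbitrary, so adjust them to your node size) should fill a node’s memory while adding only a single pod:

```
# A single pod that requests most of a node's memory, so the node fills up
# on memory while staying far below the per-node pod limit.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: memory-filler
spec:
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          memory: "24Gi"
          cpu: "100m"
EOF
```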

Could you also try to reproduce this on a separate cluster with no other applications running, in case there’s some interaction with the non-Z2JH pods?


@manics

That would be something to try out, but I am not sure this is a lack-of-resources problem. What is clear, though, is that the only node without a hook-image-puller pod scheduled is maxed out at 30 pods, which is the per-node maximum in this node pool. It seems that this prevents the pod from being scheduled on that node without triggering a scale-up?
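This is roughly the check I did (the node name is a placeholder):

```
# Number of pods currently on the full node vs. its allocatable pod count
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> --no-headers | wc -l
kubectl get node <node-name> -o jsonpath='{.status.allocatable.pods}'
```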

It is clearly a number-of-pods problem. I just deleted a pod from a non-JupyterHub deployment on the node the hook-image-puller couldn’t be scheduled on, and voilà, it could now be scheduled. It seems like the hook-image-puller DaemonSet relies on there always being free pod slots on the nodes and doesn’t trigger a scale-up?
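If so, one workaround we are considering (not yet verified on our cluster) is to disable the hook-based pre-puller via the chart’s prePuller.hook.enabled value, so the upgrade no longer needs a free pod slot on every node (release and values file names as in the earlier sketch):

```
# Disable the pre-upgrade hook image puller; images are then pulled by the
# continuous puller or on first user server start instead.
helm upgrade --install jhub jupyterhub/jupyterhub \
  --version 2.0.0 \
  --values config.yaml \
  --set prePuller.hook.enabled=false
```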