Fixed: node affinity mismatch stopping some pods from starting

Posting how I solved this in case others run into something similar.

I set up a JupyterHub with Kubernetes on Azure and had been using it with a small team of 3-4 people for a year. Then I ran a workshop to test it with more people. It worked great during the workshop. Afterwards, I crashed my server (ran out of RAM). No problem. That happens often, and I just restart. This time, I got a volume / node affinity error and my pod was stuck in Pending. Some other people could still launch their pods, but I could not.

Turns out it was a mismatch between the zone my user PVC was in and the zone of the node. As the cluster scaled up during the workshop, new nodes were created in westus2-1, westus2-2, and westus2-3, because I hadn’t specified a zone when setting up the Kubernetes nodes. I had only set the region: westus2. As the cluster auto-scaled back down, it just so happened that the ‘last node standing’ was in westus2-2. My user PVC is in westus2-1, so there was a PVC / node mismatch.

Debugging
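
If you want to confirm the error on the stuck pod first, describe it. In a zero-to-jupyterhub setup the user pod is usually named jupyter-<username> (the name below is a placeholder for yours). The Events section at the bottom should show the scheduler complaining with something like “1 node(s) had volume node affinity conflict”.

kubectl describe pod jupyter-<username> -n dhub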

List the PVs behind the user PVCs. dhub is my namespace (PVs are actually cluster-scoped, so the -n flag is ignored here, but it's harmless).

kubectl get pv -n dhub
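
If you want to map the pvc-… names back to actual users, listing the claims in the namespace helps. In a default zero-to-jupyterhub install each user's claim is named claim-<username> (that naming assumes the default config).

kubectl get pvc -n dhub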

Describe the PV for the user whose pod is not starting. The pvc-… bit is the PV name that was auto-generated from the user's PVC.

kubectl describe pv pvc-b2e50a00-df23-4513-b7ae-17f6cxxxxxx -n dhub

In the describe info, I see this

Node Affinity:     
  Required Terms:  
    Term 0:        topology.disk.csi.azure.com/zone in [westus2-1]
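
If you have a lot of PVs, a custom-columns query can pull the claim and zone for all of them at once instead of describing each one. The field path below assumes the Azure disk CSI driver writes a single required term per PV (which is what the output above looks like); it may need adjusting for your setup.

kubectl get pv -o custom-columns=NAME:.metadata.name,CLAIM:.spec.claimRef.name,ZONE:.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]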

Take a look at any zone labels on the nodes (nodes are cluster-scoped too, so the -n flag isn't needed, but again it's harmless).

kubectl get nodes --show-labels -n dhub

I see this

topology.disk.csi.azure.com/zone=westus2-2

How did that happen? There is only one node, and it is in a different zone (2) than the user PVC (zone 1).
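
A quicker way to see just the zone of each node is the -L flag, which adds a label column. topology.kubernetes.io/zone is the standard zone label and should carry the same value on AKS nodes if zones are in use.

kubectl get nodes -L topology.disk.csi.azure.com/zone -L topology.kubernetes.io/zone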

I went to the Azure portal and looked at my node pool specs. I saw that I had not checked the box to restrict nodes to a specific zone like westus2-1. So during the workshop, nodes were being created in different zones within the same region.

Fixing

Fortunately, the users who had only joined for the workshop could simply be deleted.

I created a new node specification on Azure with the zone 1 box checked, since the 3-4 people who had been on the hub all had PVCs in westus2-1. Then I deleted all the new workshop participants.
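
If you prefer the command line over the portal, az aks nodepool add can pin a pool to a zone. The resource group, cluster, and pool names below are placeholders for whatever yours are called.

az aks nodepool add --resource-group <my-resource-group> --cluster-name <my-cluster> --name usersz1 --node-count 1 --zones 1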

These posts helped


Specifics on how I deleted users. I ran

kubectl describe pv -n dhub | grep -n westus2-2

to get the line number where the bad zone appeared. Then I ran a sed command to print the line 7 lines earlier, which shows the claim (user) for that PV. So if westus2-2 appeared on line 683, I ran

kubectl describe pv -n dhub | sed -n '676p'

This showed me the user that needed to be deleted. I logged into the JupyterHub and went to the admin tab and deleted that user.
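
A slightly shorter way to do the same thing is grep -B, which prints the lines before each match. With -B 7 the Claim line (which contains the username) should be included, though the exact offset could vary with your kubectl version.

kubectl describe pv -n dhub | grep -B 7 westus2-2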

In the end, only hub-db-dir appeared on westus2-3 instead of westus2-1. I didn't delete that one. Hopefully it wasn't related to my troubles.