Fixed: node affinity mismatch stopping some pods from starting

Posting how I solved this in case others run into something similar.

I set up a JupyterHub with Kubernetes on Azure and had been using it with a small team of 3-4 people for a year. Then I ran a workshop to test it with more people. It worked great during the workshop. Afterwards, I crashed my server (ran out of RAM). No problem. That happens often, and I just restart. This time, I got a volume / node affinity error and my pod was stuck in Pending. Some other people could still launch their pods, but I could not.

Turns out it was a mismatch between the zone my user PVC was in and the zone of the node. As the cluster scaled up during the workshop, new nodes were created in westus2-1, westus2-2, and westus2-3, because I hadn’t specified a zone when setting up the Kubernetes nodes. I had only set the region: westus2. As the cluster auto-scaled back down, it just so happened that the ‘last node standing’ was in westus2-2. My user PVC is in westus2-1, so there was a PVC / node mismatch.

Debugging
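
If you want to confirm the error on the stuck pod first, describe it. In a zero-to-jupyterhub setup the user pod is usually named jupyter-<username> (the name below is a placeholder for yours). The Events section at the bottom should show the scheduler complaining with something like “1 node(s) had volume node affinity conflict”.

kubectl describe pod jupyter-<username> -n dhub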

List the PVs behind the user PVCs. dhub is my namespace (PVs are actually cluster-scoped, so the -n flag is ignored here, but it's harmless).

kubectl get pv -n dhub
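
If you want to map the pvc-… names back to actual users, listing the claims in the namespace helps. In a default zero-to-jupyterhub install each user's claim is named claim-<username> (that naming assumes the default config).

kubectl get pvc -n dhub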

Describe the PV for the user whose pod is not starting. The pvc-… bit is the PV name that was auto-generated from the user's PVC.

kubectl describe pv pvc-b2e50a00-df23-4513-b7ae-17f6cxxxxxx -n dhub

In the describe info, I see this

Node Affinity:     
  Required Terms:  
    Term 0:        topology.disk.csi.azure.com/zone in [westus2-1]
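
If you have a lot of PVs, a custom-columns query can pull the claim and zone for all of them at once instead of describing each one. The field path below assumes the Azure disk CSI driver writes a single required term per PV (which is what the output above looks like); it may need adjusting for your setup.

kubectl get pv -o custom-columns=NAME:.metadata.name,CLAIM:.spec.claimRef.name,ZONE:.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]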

Take a look at any zone labels on the nodes (nodes are cluster-scoped too, so the -n flag isn't needed, but again it's harmless).

kubectl get nodes --show-labels -n dhub

I see this

topology.disk.csi.azure.com/zone=westus2-2

How did that happen? There is only one node, and it is in a different zone (2) than the user PVC (zone 1).
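
A quicker way to see just the zone of each node is the -L flag, which adds a label column. topology.kubernetes.io/zone is the standard zone label and should carry the same value on AKS nodes if zones are in use.

kubectl get nodes -L topology.disk.csi.azure.com/zone -L topology.kubernetes.io/zone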

I went to the Azure portal and looked at my node pool specs. I saw that I had not checked the box to restrict nodes to a specific zone like westus2-1. So during the workshop, nodes were being created in different zones within the same region.

Fixing

Fortunately, the users who had only joined for the workshop could simply be deleted.

I created a new node specification on Azure with the zone 1 box checked, since the 3-4 people who had been on the hub all had PVCs in westus2-1. Then I deleted all the new workshop participants.
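
If you prefer the command line over the portal, az aks nodepool add can pin a pool to a zone. The resource group, cluster, and pool names below are placeholders for whatever yours are called.

az aks nodepool add --resource-group <my-resource-group> --cluster-name <my-cluster> --name usersz1 --node-count 1 --zones 1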

These posts helped


Specifics on how I deleted users. I ran

kubectl describe pv -n dhub | grep -n westus2-2

to get the line number where the bad zone appeared. Then I ran a sed command to print the line 7 lines earlier, which shows the claim (user) for that PV. So if westus2-2 appeared on line 683, I ran

kubectl describe pv -n dhub | sed -n '676p'

This showed me the user that needed to be deleted. I logged into the JupyterHub and went to the admin tab and deleted that user.
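
A slightly shorter way to do the same thing is grep -B, which prints the lines before each match. With -B 7 the Claim line (which contains the username) should be included, though the exact offset could vary with your kubectl version.

kubectl describe pv -n dhub | grep -B 7 westus2-2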

In the end, only hub-db-dir appeared on westus2-3 instead of westus2-1. I didn't delete that one. Hopefully it wasn't related to my troubles.