Scheduling errors with z2jh 0.10.x

mriedem · February 5, 2021, 8:51pm

I wanted to bring attention to an issue I ran into yesterday after upgrading our z2jh deployment from 0.9.0 to 0.10.6.

tl;dr

There is a bug in kube-scheduler v1.19.2, used in z2jh 0.10.0, that can crash the user-scheduler pods. The fix is in kube-scheduler v1.19.5, and picking it up is as easy as overriding the image tag in your helm chart config:

scheduling:
  userScheduler:
    enabled: true
    replicas: 2
    image:
      tag: v1.19.5
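
To check whether a given user-scheduler image tag predates the fix, a small sketch like the one below may help. It assumes tags have the form vMAJOR.MINOR.PATCH and flags only 1.19 tags older than v1.19.5. That range is inferred from this post (crash seen on v1.19.2, fix in v1.19.5), not confirmed upstream, and `parse_tag` and `is_affected` are hypothetical helper names.

```python
import re

# First release reported in this post to contain the fix.
FIXED = (1, 19, 5)


def parse_tag(tag):
    """Turn a tag such as 'v1.19.2' into a tuple of ints, e.g. (1, 19, 2)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"unrecognized tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(tag):
    """Assumption: only 1.19 tags older than the fixed release are flagged."""
    version = parse_tag(tag)
    return version[:2] == FIXED[:2] and version < FIXED


if __name__ == "__main__":
    for tag in ("v1.19.2", "v1.19.5"):
        print(tag, "affected" if is_affected(tag) else "not flagged")
```

Tags outside the 1.19 line are deliberately not flagged, since this post says nothing about them.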

Details

The details of how this manifested during a 2000 user pod scale test are documented in this z2jh issue: "z2jh 0.10.6 scale test failure with kube-scheduler v1.19.2", Issue #2025 in jupyterhub/zero-to-jupyterhub-k8s on GitHub.

I’m still trying to figure out some issues with the scheduler in our scale/load tests, but the initial kube-scheduler crash ("fatal error: concurrent map writes") is resolved, so the user-scheduler pods are no longer repeatedly crashing.
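
For anyone wanting to confirm they are hitting the same crash, a minimal sketch for scanning saved scheduler logs for that error string might look like this. The function name `find_crash_lines` is hypothetical, and the only thing it matches on is the message quoted above.

```python
# Error string quoted in this post; nothing else about the log format is assumed.
CRASH_SIGNATURE = "fatal error: concurrent map writes"


def find_crash_lines(log_text):
    """Return (line_number, line) pairs for lines containing the crash signature."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if CRASH_SIGNATURE in line
    ]
```

It could be fed the text captured from `kubectl logs --previous` for a user-scheduler pod that has restarted; an empty result only means this particular message was not in that log.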

This fix may not come out in a z2jh 0.10.x patch release, so I wanted to make others aware of it; it looked like at least one other person in the Gitter chat was hitting a similar issue.

consideRatio · February 5, 2021, 8:53pm

Really excellent work on this @mriedem!
