Scheduling errors with z2jh 0.10.x

I wanted to bring attention to an issue I ran into yesterday after upgrading our z2jh deployment from 0.9.0 to 0.10.6.

tl;dr

There is a bug in kube-scheduler v1.19.2, used in z2jh 0.10.0, which can crash the user-scheduler pods. The fix is in kube-scheduler v1.19.5, and it is easy to pick up by overriding the image tag in your helm chart values:

scheduling:
  userScheduler:
    enabled: true
    replicas: 2
    image:
      tag: v1.19.5
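
To check that the running user-scheduler pods actually picked up the new tag after the helm upgrade, something like the following could work. This is only a sketch under a few assumptions: the pods carry the label component=user-scheduler, you have saved the output of `kubectl get pods -l component=user-scheduler -o json`, and the image name in the sample data is illustrative. The function name `scheduler_image_tags` is made up for this example.

```python
import json
from typing import List

FIXED_TAG = "v1.19.5"  # the tag set in the values override above


def scheduler_image_tags(pod_list_json: str) -> List[str]:
    """Return the image tag of every container in a kubectl pod list (JSON)."""
    pods = json.loads(pod_list_json)
    tags = []
    for pod in pods.get("items", []):
        for container in pod["spec"]["containers"]:
            # Take the last path segment so a registry port is not read as a tag.
            name = container["image"].rsplit("/", 1)[-1]
            # Drop a digest suffix such as "@sha256:..." if one is present.
            name = name.split("@", 1)[0]
            tags.append(name.split(":", 1)[1] if ":" in name else "")
    return tags


# Sample data shaped like `kubectl get pods -o json` output (hypothetical pods).
sample = json.dumps({
    "items": [
        {"spec": {"containers": [
            {"name": "user-scheduler", "image": "k8s.gcr.io/kube-scheduler:v1.19.5"}
        ]}},
        {"spec": {"containers": [
            {"name": "user-scheduler", "image": "k8s.gcr.io/kube-scheduler:v1.19.2"}
        ]}},
    ]
})

# Report any pod that is still running something other than the fixed tag.
stale = [tag for tag in scheduler_image_tags(sample) if tag != FIXED_TAG]
print("pods not on", FIXED_TAG, ":", stale)
```

In practice you would replace `sample` with the saved kubectl output. Any tag other than v1.19.5 in the result suggests the override did not reach that pod yet.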

Details

The details of how this manifested during a scale test with 2000 user pods are documented in z2jh issue #2025 (jupyterhub/zero-to-jupyterhub-k8s on GitHub): "z2jh 0.10.6 scale test failure with kube-scheduler v1.19.2".

I’m still trying to figure out some issues with the scheduler in our scale/load tests, but the initial kube-scheduler issue ("fatal error: concurrent map writes") is resolved, so the user-scheduler pods are no longer repeatedly crashing.

Since this fix may not come out in a z2jh 0.10.x patch release, I wanted to make others aware of it, since it looked like at least one other person in the Gitter chat was hitting a similar issue.

Really excellent work on this @mriedem! :heart: :tada: