I wanted to bring attention to an issue I ran into yesterday after upgrading our z2jh deployment from 0.9.0 to 0.10.6.
There is a bug in kube-scheduler v1.19.2, used in z2jh 0.10.0, which can crash the user-scheduler pods. The fix is in kube-scheduler v1.19.5, and it is easy to pick up by overriding the image tag in your helm chart values:
    scheduling:
      userScheduler:
        enabled: true
        replicas: 2
        image:
          tag: v1.19.5
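In case it helps, here is a rough sketch of how the override could be rolled out and checked. The release name `jhub`, the namespace `jhub`, and the values file name `config.yaml` are placeholders I made up, so substitute your own. I'm also assuming the user-scheduler pods carry the `component=user-scheduler` label, which is worth confirming with `kubectl get pods --show-labels` first.

```shell
# Placeholders (assumptions, not from our deployment): adjust to your setup.
RELEASE=jhub
NAMESPACE=jhub

# Re-apply the chart at the same version with the updated values file
# that contains the scheduling.userScheduler.image.tag override.
helm upgrade --cleanup-on-fail "$RELEASE" jupyterhub/jupyterhub \
  --namespace "$NAMESPACE" \
  --version 0.10.6 \
  --values config.yaml

# Print the image(s) the user-scheduler pods are running;
# the tag should now be v1.19.5.
kubectl get pods --namespace "$NAMESPACE" \
  -l component=user-scheduler \
  -o jsonpath='{.items[*].spec.containers[*].image}'
```

If the second command still shows v1.19.2, the old pods may not have been replaced yet, or the override may not be nested under the right keys in the values file.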
The details of how this manifested during a 2000 user pod scale test are documented in this z2jh issue: "z2jh 0.10.6 scale test failure with kube-scheduler v1.19.2", Issue #2025 in jupyterhub/zero-to-jupyterhub-k8s on GitHub.
I’m still trying to figure out some issues with the scheduler in our scale/load tests, but the initial "fatal error: concurrent map writes" kube-scheduler issue is resolved, so the user-scheduler pods are no longer repeatedly crashing.
Since this fix may not come out in a z2jh 0.10.x patch release, I wanted to make others aware of it, since it looked like at least one other person in the Gitter chat was hitting a similar issue.