There are a couple of specific things I can point out here if you’re using zero-to-jupyterhub-k8s:
- The hub API will return 429 responses with a `retry-after` header if you’ve hit the concurrentSpawnLimit. We see that happening at the start of a large user event, so just make sure client-side tooling can handle that 429 response and retry appropriately.
- If you hit the consecutiveFailureLimit, the hub will crash. Kubernetes should restart the hub pod, but depending on how many users you have in the database and how your `cull-idle` service (which runs on hub restart) is set up, the restart could take longer than you want. In our experience, as long as we have notebook images pre-pulled on the user nodes and enough idle placeholders pre-created for a large user event, we don’t hit the consecutive failure limit. See [1] for more details.
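
For the 429 handling on the client side, here is a rough sketch of what "retry appropriately" could look like. The names `parse_retry_after` and `spawn_with_retry` are made up for illustration, and `send` is a hypothetical callable that you supply to issue the spawn request and return the status code and response headers. The 5 second fallback and the attempt cap are arbitrary assumptions, not hub defaults.

```python
import time
from typing import Callable, Mapping, Tuple

# (status code, response headers) as returned by the caller-supplied request function
Response = Tuple[int, Mapping[str, str]]


def parse_retry_after(headers: Mapping[str, str], default: float = 5.0) -> float:
    """Return the number of seconds to wait, based on a retry-after header.

    Only the delay-in-seconds form is parsed. A missing header or an
    HTTP-date value falls back to `default` (an arbitrary assumption).
    """
    value = None
    for name, raw in headers.items():
        if name.lower() == "retry-after":  # header names are case-insensitive
            value = raw
            break
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        return default


def spawn_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns something other than 429.

    Returns the final status code. If every attempt is throttled,
    429 is returned so the caller can decide what to do next.
    """
    for attempt in range(1, max_attempts + 1):
        status, headers = send()
        if status != 429:
            return status
        if attempt < max_attempts:
            # Wait for as long as the hub asked before trying again.
            sleep(parse_retry_after(headers))
    return 429
```

In practice `send` would wrap whatever HTTP client you already use, for example a POST to the hub's REST API endpoint that starts a user's server, returning the response status and headers. Passing `sleep` in as a parameter just makes it easy to test the loop without actually waiting.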
[1] Optimizations — Zero to JupyterHub with Kubernetes documentation