Core component resilience/reliability

There are a couple of specific things I can point out here if you’re using zero-to-jupyterhub-k8s:

  1. The hub API will return 429 responses with a `Retry-After` header if you’ve hit the concurrentSpawnLimit. We see that happening at the start of a large user event, so just make sure your client-side tooling can handle that 429 response and retry appropriately (see the sketch after this list).
  2. If you hit the consecutiveFailureLimit, the hub will crash. Kubernetes should restart the hub pod, but that still means a hub restart, and depending on how many users are in the database and how your cull-idle service (which runs on hub restart) is set up, the restart could take longer than you want. In our experience, as long as notebook images are pre-pulled on the user nodes and enough idle placeholders are pre-created before a large user event, we don’t run into the consecutive failure limit. See [1] for more details.

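For point 1, here is a minimal sketch of what “handle the 429 and retry” can look like in client-side tooling. It assumes the `requests` library; `HUB_URL`, `API_TOKEN`, and `start_server` are placeholders for whatever your own tooling uses, not part of the chart:

```python
# Minimal sketch: retry a spawn request when the Hub API returns 429
# because concurrentSpawnLimit was hit. HUB_URL and API_TOKEN are
# hypothetical placeholders for your deployment.
import time

import requests

HUB_URL = "https://hub.example.com/hub/api"  # hypothetical Hub API endpoint
API_TOKEN = "..."                            # token allowed to start servers
MAX_ATTEMPTS = 5


def start_server(username: str) -> requests.Response:
    """Request a server spawn, backing off when the Hub answers 429."""
    headers = {"Authorization": f"token {API_TOKEN}"}
    resp = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        resp = requests.post(f"{HUB_URL}/users/{username}/server", headers=headers)
        if resp.status_code != 429:
            return resp
        # Respect the Retry-After header; fall back to a simple backoff if absent.
        delay = int(resp.headers.get("Retry-After", 2 * attempt))
        time.sleep(delay)
    return resp
```
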
[1] Optimizations — Zero to JupyterHub with Kubernetes documentation