Core component resilience/reliability

mriedem · August 17, 2020, 4:48pm

There are a couple of specific things I can point out here if you’re using zero-to-jupyterhub-k8s:

The hub API will return 429 responses with a retry-after header if you’ve hit the concurrentSpawnLimit. We see that happening at the start of a large user event so just make sure client side tooling can handle that 429 response and retry appropriately.
If you hit the consecutiveFailureLimit the hub will crash. Kubernetes should restart the hub pod but it does mean a restart of the hub and depending on how many users you have in the database and how your cull-idle service is setup, which runs on hub restart, the hub restart could take longer than you want. In our experience, as long as we have notebook images pre-pulled on the user nodes and have enough idle placeholders pre-created for a large user event, we don’t suffer from the consecutive failure limit issue. See [1] for more details.

Topic		Replies	Views
Jupyterhub Pod Dies on regular basis Zero to JupyterHub on Kubernetes jupyterhub , help-wanted	5	647	July 25, 2023
Long running kernels without browser activity Zero to JupyterHub on Kubernetes help-wanted	0	403	September 23, 2020
JupyterHub Crashes under high load JupyterHub	3	717	October 5, 2019
Best practices for jupyterhub JupyterHub how-to , help-wanted	9	163	December 12, 2024
Proxy loses track of singleuser servers after k8s restarts them JupyterHub	3	572	February 27, 2019