JupyterHub Crashes under high load

mirestrepo · October 2, 2019, 2:07pm

Hi,

We have been having issues with our Hub crashing when 40+ students log in at once. I am aware that I should migrate deployments to use regional clusters for HA. Since that requires destroying my current cluster, I am wondering if there are other practices that could increase reliability.

For instance, I noticed that the resources allocated by default to the hub pod are

  resources:
    requests:
      cpu: 200m
      memory: 512Mi

Is that enough memory? Would increasing the memory be wise? Can we have more replicas of the hub pod?

In case it is helpful, my cluster setup is:

# create core pool
gcloud beta --project=$PROJECT_NAME container clusters create $CLUSTER_NAME \
    --machine-type=n1-highmem-4 \
    --num-nodes=1 \
    --enable-autoscaling \
    --enable-autorepair \
    --min-nodes=1 \
    --max-nodes=4 \
    --cluster-version latest \
    --node-labels hub.jupyter.org/node-purpose=core

# create a user pool
gcloud beta --project=$PROJECT_NAME container node-pools create user-pool \
    --cluster=$CLUSTER_NAME \
    --machine-type=n1-highmem-8 \
    --num-nodes=1 \
    --enable-autoscaling \
    --enable-autorepair \
    --min-nodes=1 \
    --max-nodes=10 \
    --node-labels hub.jupyter.org/node-purpose=user \
    --node-taints hub.jupyter.org/dedicated=user:NoSchedule

My HUB_VERSION=v0.9-dcde99a

Any advice is highly appreciated. Thanks!

betatim · October 2, 2019, 4:39pm

What do you see when you kubectl describe pod <nameofthehubpod>? In the “Last State” section it will tell you when and why the hub pod crashed. If it is because it doesn’t have enough RAM it will say something like OOMKilled as reason.

What do the last few lines of the hub pod’s log say? That might also contain hints as to why the hub crashes.
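Both checks can be sketched as two small shell helpers. This is only a sketch: the function names are hypothetical, and the namespace and pod name are placeholders you would replace with your own (the namespace depends on how the chart was installed).

```shell
# Hypothetical helper names; namespace and pod name are placeholders.

hub_last_state() {
    # usage: hub_last_state <namespace> <hub-pod-name>
    # Prints the "Last State" section of the pod description,
    # including the Reason (e.g. OOMKilled or Error) and Exit Code.
    kubectl describe pod --namespace "$1" "$2" | grep -A 6 "Last State"
}

hub_log_tail() {
    # usage: hub_log_tail <namespace> <hub-pod-name>
    # Prints the last 50 lines of the hub container's current log.
    kubectl logs --namespace "$1" "$2" --tail=50
}
```

For example, `hub_last_state jhub hub-xxxxx` followed by `hub_log_tail jhub hub-xxxxx`, where `jhub` and `hub-xxxxx` stand in for your namespace and the name shown by `kubectl get pods`.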

mirestrepo · October 4, 2019, 1:36pm

The last state is not very useful:

Last State: Terminated
Reason: Error
Exit Code: 1
Started: Fri, 04 Oct 2019 09:25:19 -0400
Finished: Fri, 04 Oct 2019 09:25:20 -0400
…

betatim · October 5, 2019, 11:10am

It does tell you that it wasn’t because the pod ran out of memory, so we can exclude that. It also tells you that the hub process crashed or exited because it was unhappy. To find out why, you will need to inspect the pod’s logs from just before it crashed.
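Because the container has already restarted, its current log starts after the crash. A minimal sketch for reading the previous container's log instead, assuming a hypothetical helper name and placeholder namespace and pod name:

```shell
# Hypothetical helper name; namespace and pod name are placeholders.

hub_previous_logs() {
    # usage: hub_previous_logs <namespace> <hub-pod-name> [lines]
    # --previous selects the log of the last terminated container
    # in the pod; --tail limits output to the final lines (default 100).
    kubectl logs --namespace "$1" "$2" --previous --tail="${3:-100}"
}
```

Calling `hub_previous_logs jhub hub-xxxxx` would show the last 100 lines written before the hub exited, which is usually where the traceback or error message ends up.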
