JupyterHub crashes under high load


We have been having issues with our Hub crashing when 40+ students log in at once. I am aware that I should migrate deployments to use regional clusters for HA. Since that requires destroying my current cluster, I am wondering if there are other practices that could increase reliability.

For instance, I noticed that the resources allocated by default to the hub pod are

      cpu: 200m
      memory: 512Mi

Is that enough memory? Would increasing the memory be wise? Can we run more replicas of the hub pod?
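For reference, here is a sketch of how the hub's resource requests could be raised through the Helm chart's values file. The keys below are my understanding of the zero-to-jupyterhub chart's schema and the RELEASE/NAMESPACE names are placeholders, so double-check both against your chart version before applying:

```shell
# Sketch: bump the hub pod's resource requests via a values file
# (zero-to-jupyterhub chart; verify the keys for your chart version).
cat > hub-resources.yaml <<'EOF'
hub:
  resources:
    requests:
      cpu: 500m       # up from the default 200m
      memory: 1Gi     # up from the default 512Mi
EOF

# Apply with a helm upgrade; RELEASE and NAMESPACE are placeholders
# for your release name and deployment namespace.
helm upgrade $RELEASE jupyterhub/jupyterhub \
    --namespace $NAMESPACE \
    -f hub-resources.yaml
```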

In case it is helpful, my cluster setup is:

# create core pool
gcloud beta --project=$PROJECT_NAME container clusters create $CLUSTER_NAME \
    --machine-type=n1-highmem-4 \
    --num-nodes=1 \
    --enable-autoscaling \
    --enable-autorepair \
    --min-nodes=1 \
    --max-nodes=4 \
    --cluster-version latest \
    --node-labels hub.jupyter.org/node-purpose=core

# create a user pool
gcloud beta --project=$PROJECT_NAME container node-pools create user-pool \
    --cluster=$CLUSTER_NAME \
    --machine-type=n1-highmem-8 \
    --num-nodes=1 \
    --enable-autoscaling \
    --enable-autorepair \
    --min-nodes=1 \
    --max-nodes=10 \
    --node-labels hub.jupyter.org/node-purpose=user \
    --node-taints hub.jupyter.org/dedicated=user:NoSchedule
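As a sanity check on a setup like the one above, the labels and taints the pools ended up with can be inspected directly (a diagnostic sketch against a live cluster; kubectl's -L flag adds a column for the given label):

```shell
# Confirm each node carries the expected node-purpose label.
kubectl get nodes -L hub.jupyter.org/node-purpose

# Confirm the user pool's taint was applied.
kubectl describe nodes | grep -A 2 Taints
```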

My HUB_VERSION=v0.9-dcde99a

Any advice is highly appreciated. Thanks!

What do you see when you run kubectl describe pod <nameofthehubpod>? The “Last State” section will tell you when and why the hub pod last crashed. If it is because it didn’t have enough RAM, the Reason will say something like OOMKilled.
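Concretely, that looks something like the following (jhub is an assumed namespace here, adjust it to wherever your chart is deployed):

```shell
# List the pods to find the hub pod's name; "jhub" is an assumed
# namespace, replace it with your deployment's namespace.
kubectl get pods -n jhub

# Inspect the pod; the "Last State" section shows the previous
# termination's Reason, Exit Code, and timestamps.
kubectl describe pod -n jhub <nameofthehubpod>
```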

What do the last few lines of the hub pod’s log say? That might also contain hints as to why the hub crashes.

The last state is not very useful:

Last State: Terminated
Reason: Error
Exit Code: 1
Started: Fri, 04 Oct 2019 09:25:19 -0400
Finished: Fri, 04 Oct 2019 09:25:20 -0400

It does tell you that the pod did not run out of memory, so we can exclude that. It also tells you that the hub process itself crashed or exited with an error. To find out why, you will need to inspect the pod’s logs from just before it crashed.
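A sketch of how to pull those logs (again assuming a jhub namespace): kubectl’s --previous flag returns the logs of the container’s previous run, i.e. the output from just before the crash and restart.

```shell
# Logs of the currently running hub container:
kubectl logs -n jhub <nameofthehubpod>

# Logs from the previous (crashed) run of the container:
kubectl logs -n jhub <nameofthehubpod> --previous
```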