We have been having issues with our Hub crashing when 40+ students log in at once. I am aware that I should migrate deployments to use regional clusters for HA. Since that requires destroying my current cluster, I am wondering if there are other practices that could increase reliability.
For instance, I noticed that the resources allocated by default to the hub pod are:

```yaml
resources:
  requests:
    cpu: 200m
    memory: 512Mi
```
Is that enough memory? Would increasing the memory be wise? Can we have more replicas of the hub pod?
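To make the question concrete, this is roughly the override I have in mind for my `config.yaml`, assuming the chart's `hub.resources` key is the right place for it. The values are placeholders I picked for illustration, not tested recommendations:

```yaml
hub:
  resources:
    requests:
      cpu: 500m      # placeholder value; the default shown above is 200m
      memory: 1Gi    # placeholder value; the default shown above is 512Mi
```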
In case it is helpful, my cluster setup is:
```bash
# create core pool
gcloud beta --project=$PROJECT_NAME container clusters create $CLUSTER_NAME \
  --machine-type=n1-highmem-4 \
  --num-nodes=1 \
  --enable-autoscaling \
  --enable-autorepair \
  --min-nodes=1 \
  --max-nodes=4 \
  --cluster-version latest \
  --node-labels hub.jupyter.org/node-purpose=core

# create a user pool
gcloud beta --project=$PROJECT_NAME container node-pools create user-pool \
  --cluster=$CLUSTER_NAME \
  --machine-type=n1-highmem-8 \
  --num-nodes=1 \
  --enable-autoscaling \
  --enable-autorepair \
  --min-nodes=1 \
  --max-nodes=10 \
  --node-labels hub.jupyter.org/node-purpose=user \
  --node-taints hub.jupyter.org/dedicated=user:NoSchedule
```
Any advice is highly appreciated. Thanks!