Big thank you to everyone who has contributed to the ZeroToJupyterHub documentation/project and this forum. We recently completed a transition from docker swarm to Kubernetes and these resources were invaluable. As a result our 1st year medical & dental graduate students are getting a great introduction to data science!
Although we were able to use the Playwright web-testing framework to prove that our pre-existing Docker Swarm-based JupyterHub deployment could scale beyond 255 concurrent users, doing so required artificially introducing multiple Docker networks, and that ultimately pushed us to make the long-contemplated switch to Kubernetes. For us this has meant migrating to a cloud-based Typhoon installation.
In general the transition proceeded fairly smoothly, but here are a few lessons learned and notes from along the way:
- The JH production Helm chart is fairly old at this point, so it may be appropriate to select a more recent dev chart. We experienced no issues with the version we chose, but your mileage may vary.
- We are deploying in a CI/CD push model (as opposed to, say, a Flux-based pull model) with separate dev, test and prod environments. To keep configuration code DRY across environments we use Helm + Kustomize overlays, with HelmChartInflationGenerator processing the environment-specific values.yml details (a rough sketch follows this list). This has worked well, but note that HelmChartInflationGenerator is deprecated and has lots of gotchas - does anybody have a better solution yet?
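For anyone curious, here is a minimal sketch of the generator wiring. The paths, namespace and chart version are placeholders rather than our real values, and the HelmChartInflationGenerator field names have shifted between kustomize releases, so treat this as the general shape only:

```yaml
# overlays/dev/kustomization.yaml (illustrative layout)
generators:
  - helm-generator.yaml

# overlays/dev/helm-generator.yaml
apiVersion: builtin
kind: HelmChartInflationGenerator
metadata:
  name: jupyterhub
name: jupyterhub                           # chart name
repo: https://hub.jupyter.org/helm-chart/  # the Z2JH chart repository
version: 4.0.0-0.dev.git.0000.h0000000     # placeholder dev-chart version
releaseName: jupyterhub
namespace: jhub                            # placeholder namespace
valuesFile: values.yml                     # base values; overlays patch environment specifics
includeCRDs: true

# rendered with something like:
#   kustomize build overlays/dev --enable-helm
# (older kustomize releases also wanted --enable-alpha-plugins)
```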
In terms of cluster prerequisites we installed:
- aws-ebs-csi-driver - the driver for dynamic persistent-volume provisioning (although we subsequently dropped it because EBS volumes are confined to a single AZ and instances have a maximum volume-attachment limit)
- aws-node-termination-handler - we're using spot nodes for JupyterHub user services and want to drain nodes out of the cluster pseudo-gracefully when they are reclaimed.
- traefik - our choice for ingress because we are familiar with it from Docker Swarm (see the IngressRoute sketch after this list).
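For reference, our ingress boils down to a Traefik IngressRoute pointed at the chart's proxy-public Service, roughly like the following - the hostname, namespace, entry point and TLS resolver names are placeholders, not our actual settings:

```yaml
apiVersion: traefik.io/v1alpha1        # traefik.containo.us/v1alpha1 on older Traefik v2 releases
kind: IngressRoute
metadata:
  name: jupyterhub
  namespace: jhub                      # placeholder namespace
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`hub.example.edu`)   # placeholder hostname
      kind: Rule
      services:
        - name: proxy-public           # the public-facing Service created by the Z2JH chart
          port: 80
  tls:
    certResolver: letsencrypt          # placeholder certificate resolver
```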
For the JH deployment itself we abstracted all per-environment differences into environment variables, so that the jupyterhub_config.py snippets are constant and can be kept in the base Kustomize Helm values.yml (a trimmed sketch follows this list). This keeps the configuration DRY while still supporting:
- Dev/test/prod OAuth configs
- User- and deployment-specific KubeSpawner mutations (DFS sub-directory mounts, user-group-based settings such as memory/CPU limits, etc.)
- Commit-specific image references (we rebuild, push and scan our hub container image on each commit, so we need to reference those exact images)
- Environment-specific skinning (templating/colorization/logos)
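To make the pattern concrete, here is a trimmed, hypothetical slice of the base values.yml. The variable names, secret name and limits are invented for illustration (and GenericOAuthenticator stands in for whatever authenticator is in use), but the hub.extraEnv / hub.extraConfig mechanism is the standard Z2JH one:

```yaml
hub:
  extraEnv:
    OAUTH_CLIENT_ID:
      valueFrom:
        secretKeyRef:
          name: hub-oauth              # hypothetical per-environment Secret
          key: client_id
    USER_MEM_LIMIT: "1G"               # overridden per overlay (dev/test/prod)
  extraConfig:
    10-env-driven-tuning: |
      import os
      # Identical Python in every environment; only the injected env vars differ.
      c.KubeSpawner.mem_limit = os.environ.get("USER_MEM_LIMIT", "1G")
      c.GenericOAuthenticator.client_id = os.environ["OAUTH_CLIENT_ID"]
```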
Note: we had originally planned to move user storage from our on-premises DFS host mounts to EBS, and this worked well when prototyped on a single-AZ cluster, but we ended up reverting to the DFS model so that our user worker nodes could source spot capacity from multiple AZs.
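For context, pointing user storage back at the DFS host mounts comes down to the chart's singleuser.storage settings, along these lines (the paths are placeholders, not our real mount points):

```yaml
singleuser:
  storage:
    type: none                   # turn off the chart's dynamic PVC-per-user storage
    extraVolumes:
      - name: dfs-home
        hostPath:
          path: /mnt/dfs/homes   # placeholder: DFS share mounted on every worker node
    extraVolumeMounts:
      - name: dfs-home
        mountPath: /home/jovyan  # default notebook home directory
        subPath: "{username}"    # KubeSpawner expands per-user templates like {username}
```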
Sizing/Scaling
In our situation, with very lightweight usage from students learning to use JH for the first time and operating on small datasets:
- Burstable t-series AWS instances have proven a cost-effective choice.
- A t3.2xlarge provides consistent, stable performance for 50 concurrent users (our load-testing simulations have all 50 students repeatedly recalculating all cells). Memory is over-subscribed in this model (sketched below).
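A hedged sketch of what that over-subscription looks like in values.yml terms - the numbers are illustrative, not our tuned figures:

```yaml
singleuser:
  cpu:
    limit: 2          # each user may burst to 2 vCPUs...
    guarantee: 0.1    # ...but only 0.1 vCPU is reserved, so many users pack onto one t3.2xlarge
  memory:
    limit: 2G         # hard cap per user
    guarantee: 256M   # the scheduler reserves only this much, i.e. memory is over-subscribed
```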
In terms of the finished product:
- HA - Not there yet. Losing the single-instance JH hub pod or the JH proxy pod will interrupt users - new sessions and existing sessions respectively. In the case of node/AZ failure these pods will eventually be rescheduled.
- Pod name prefixing - This is not strictly necessary, but we like our deployed pods to carry an environment prefix (dev-, tst-, prod-) because we believe it guards against accidents. We were unable to achieve this with the JupyterHub Kustomize overlays, unfortunately, and had to resort to a post-processing envsubst step, roughly the CI fragment sketched after this list. There is a values.yml configuration variable intended for this kind of thing, but when we tried it we found in-built assumptions in some of the core JH jupyterhub_config.py scripts regarding the names of the proxy and other services, and these were failing for us. Since we are using a dev Helm chart this may be a known issue that is still being worked out.
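The post-processing step itself is nothing fancy - roughly the CI fragment below. GitHub Actions syntax is shown purely for illustration (our pipeline and flag set differ), and ENV_PREFIX is our own convention, not a chart variable:

```yaml
- name: Render, prefix and apply manifests
  env:
    ENV_PREFIX: dev-    # dev- / tst- / prod- depending on the pipeline
  run: |
    # envsubst fills in the ${ENV_PREFIX} placeholders carried through the rendered manifests
    kustomize build overlays/dev --enable-helm \
      | envsubst '${ENV_PREFIX}' \
      | kubectl apply -f -
```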
Overall, however, a great open-source deployment experience! And one that delivers more consistent, stable performance for end users under periods of heavy load, as measured by our Playwright load testing.