Big thank you to everyone who has contributed to the ZeroToJupyterHub documentation/project and this forum. We recently completed a transition from docker swarm to Kubernetes and these resources were invaluable. As a result our 1st year medical & dental graduate students are getting a great introduction to data science!
Although we were able to successfully use the playwright web testing framework to prove that our pre-existing docker swarm-based JupyterHub deployment could scale beyond 255 concurrent users, this did require us to artificially introduce multiple docker networks, and ultimately pushed us to make the long-contemplated switch to Kubernetes. For us this has meant migration to a cloud-based Typhoon installation.
In general the transition proceeded fairly smoothly but here are a few lessons learned / notes from along the way:
- The JH production helm chart is fairly old at this point so it may be appropriate to elect a more recent dev chart. We experienced no issues with the version we chose but mileage may vary.
- We are deploying in a CI/CD push model (as opposed to, say, a Flux-based pull model) with separate dev, test and prod environments. To keep configuration code DRY across environments we used Helm+Kustomize overlays, with HelmChartInflationGenerator processing the values.yml specifics. This has worked well, but note that HelmChartInflationGenerator is deprecated and has lots of gotchas. Does anybody have a better solution yet?
In terms of cluster pre-requisites we installed:
- aws-ebs-csi-driver - plug-in for dynamic storage allocation (although we subsequently dropped it due to single-AZ and max instance volume mount EBS limitations)
- aws-node-termination-handler - we’re using spot nodes for JupyterHub user services and want to exit nodes from the cluster pseudo-gracefully.
- traefik - our choice for ingress because we are familiar with it from docker swarm.
For the JH deployment itself we abstracted all deployment differences to environment variables so that all jupyterhub_config.py snippets are constant and can be kept in the base kustomize helm values.yml. This allows a DRY configuration supporting different deployment configurations for:
- Dev/test/prod OAUTH configs
- User/deployment-specific KubeSpawner mutations (DFS sub-directory mounts, user-group-based specificities like memory/cpu etc)
- Commit-specific image references (we rebuild/push/scan our hub container image on each commit so need to reference those images)
- Environment-specific skinning (templating/colorization/logos)
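As a rough sketch of the environment-variable pattern described above: the snippet stays identical across dev/test/prod and only the environment changes. Every name here (DEPLOY_ENV, OAUTH_CALLBACK_URL, DFS_ROOT, the group names, the limits and the example.org URL) is a hypothetical placeholder, not the configuration actually used in this deployment.

```python
import os
from types import SimpleNamespace

# Hypothetical per-group resource limits; group names and values are illustrative.
GROUP_LIMITS = {
    "instructors": {"mem_limit": "4G", "cpu_limit": 2.0},
    "students": {"mem_limit": "1G", "cpu_limit": 0.5},
}


def settings_from_env(env=os.environ):
    """Collect deployment-specific values from (assumed) environment variables."""
    deploy_env = env.get("DEPLOY_ENV", "dev")
    return {
        "deploy_env": deploy_env,
        "oauth_callback_url": env.get(
            "OAUTH_CALLBACK_URL",
            f"https://{deploy_env}.example.org/hub/oauth_callback",
        ),
        "dfs_root": env.get("DFS_ROOT", "/mnt/dfs/home"),
    }


def apply_user_overrides(spawner, groups, settings):
    """Mutate a spawner-like object based on the user's groups and the settings."""
    # The first matching group in GROUP_LIMITS decides the memory/CPU limits.
    for group, limits in GROUP_LIMITS.items():
        if group in groups:
            spawner.mem_limit = limits["mem_limit"]
            spawner.cpu_limit = limits["cpu_limit"]
            break
    # Expose the DFS host mount, but mount only the user's own sub-directory.
    spawner.volumes = [{"name": "dfs-home", "hostPath": {"path": settings["dfs_root"]}}]
    spawner.volume_mounts = [
        {"name": "dfs-home", "mountPath": "/home/jovyan", "subPath": spawner.user.name}
    ]
    return spawner


# Stand-in objects so the sketch runs outside JupyterHub.
settings = settings_from_env({"DEPLOY_ENV": "tst"})
demo = SimpleNamespace(user=SimpleNamespace(name="alice"))
apply_user_overrides(demo, {"students"}, settings)
print(settings["oauth_callback_url"], demo.mem_limit, demo.volume_mounts[0]["subPath"])
```

Inside a real jupyterhub_config.py this could be wired up along the lines of `c.Spawner.pre_spawn_hook = lambda s: apply_user_overrides(s, {g.name for g in s.user.groups}, settings)`. Keeping the logic in plain functions also means it can be checked without a running hub.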
Note we had originally thought we would move user storage from our on-premises DFS host mounts to EBS. This worked well when prototyped with a single-AZ cluster, but we ended up reverting to the DFS model so that our user worker nodes could source spot nodes from multiple AZs.
In our situation with very lightweight usage from students learning how to use JH for the first time and operating on small datasets:
- Burstable t-series AWS instances have proven a cost effective choice.
- t3.2xlarge provide consistent/stable performance for 50 concurrent users (our load testing simulations have all 50 students repeatedly recalculating all cells). Memory is over-subscribed in this model.
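To make the over-subscription point concrete, here is a small back-of-the-envelope calculation. A t3.2xlarge has 32 GiB of memory; the sketch assumes all 50 users share a single node and that each user pod has a 1 GiB memory limit. Neither assumption is stated above, and system and hub overhead are ignored.

```python
def oversubscription(node_mem_gib, users, per_user_limit_gib):
    """Return (even share per user in GiB, summed limits / node memory)."""
    even_share = node_mem_gib / users
    ratio = users * per_user_limit_gib / node_mem_gib
    return even_share, ratio


# 32 GiB node, 50 users, hypothetical 1 GiB limit per user pod.
share, ratio = oversubscription(32, 50, 1.0)
print(f"even share: {share:.2f} GiB per user")
print(f"sum of limits is {ratio:.2f}x node memory")
```

A ratio above 1 means the limits add up to more than the node physically has, which only works while most users stay well below their limit, as is typical for small teaching datasets.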
In terms of the finished product:
- HA - Not there yet. Losing either the single instance JH or JH proxy pods will cause user interruption on new and/or existing sessions respectively. Rescheduling of these will occur eventually in the case of node/AZ failure.
- Pod name prefixing - This is not strictly necessary but we like to have our deployed pods prefixed with an environment designation (dev-, tst-, prod-) because we believe this can guard against accidents. Unfortunately we were unable to achieve this with the JupyterHub kustomize overlays and had to resort to a post-processing envsubst step. There is a values.yml configuration variable intended for this kind of thing, but when we tried it we found that some of the core JH jupyterhub_config.py scripts have in-built assumptions about the names of the proxy and other services, and these were failing for us. Since we are using a dev helm chart this may be a known issue that is still being worked out.
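For readers unfamiliar with this kind of post-processing, the idea can be approximated in Python with string.Template. This is only an illustration: the manifest fragment and the ENV_PREFIX variable name are hypothetical, and the actual step described above uses the envsubst command-line tool on the rendered manifests.

```python
import os
from string import Template

# Hypothetical fragment of a rendered manifest with a placeholder in the name.
MANIFEST = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${ENV_PREFIX}hub
"""


def substitute(text, env=os.environ):
    """Replace $VAR / ${VAR} placeholders with values from env.

    Unlike plain envsubst, unknown placeholders are left untouched
    rather than replaced with an empty string.
    """
    return Template(text).safe_substitute(env)


print(substitute(MANIFEST, {"ENV_PREFIX": "dev-"}))
```

Leaving unknown placeholders in place makes a missing variable easy to spot in the output instead of silently producing an unprefixed name.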
Overall, however, a great open source deployment experience! And one that produces more consistent/stable performance for end users under periods of heavy load as measured by our playwright load testing.