[ANN] JupyterHub on Hadoop Deployment Guide


I just released a new guide for deploying JupyterHub on a Hadoop cluster: https://jcrist.github.io/jupyterhub-on-hadoop/.

The guide is written in the spirit of zero-to-jupyterhub-k8s, but for deploying on Hadoop. It includes instructions on basic installation (currently manual steps only; help is wanted with more pre-packaged options), as well as common customizations you’d want to do.

A benefit of this setup over running JupyterHub on the edge node is that users’ sessions are distributed throughout the cluster, so resources scale with usage and no single node is under high load. It also integrates well with the rest of the Hadoop ecosystem: you can run Dask and Spark directly (no need for Livy), and have full login access to other resources like HDFS. Conda/virtual environments can be installed on every node, or (more typically) centrally managed as archives on HDFS (using conda-pack/venv-pack).
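To make the "archives on HDFS" option a bit more concrete, here is a rough sketch of the two commands involved: packing a conda environment with conda-pack and uploading the archive with the HDFS shell. The helper name `packaging_commands`, the environment name `analytics`, and the HDFS directory are made up for illustration and are not taken from the guide; the helper only builds the command lines and does not run them.

```python
import shlex


def packaging_commands(env_name, hdfs_dir, archive=None):
    """Build the commands to pack a conda environment and upload it to HDFS.

    Hypothetical helper for illustration: it returns argument lists and
    executes nothing.
    """
    archive = archive or f"{env_name}.tar.gz"
    # `conda pack` is provided by the conda-pack package.
    pack = ["conda", "pack", "-n", env_name, "-o", archive]
    # `-f` overwrites an archive that already exists at the destination.
    destination = f"{hdfs_dir.rstrip('/')}/{archive}"
    upload = ["hdfs", "dfs", "-put", "-f", archive, destination]
    return [pack, upload]


if __name__ == "__main__":
    # The environment name and HDFS directory below are assumptions.
    for cmd in packaging_commands("analytics", "hdfs:///jupyterhub/environments"):
        print(shlex.join(cmd))
```

Run the printed commands on a machine that has both conda-pack and the Hadoop client installed. How the uploaded archive then gets attached to user sessions is covered in the guide itself.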

A walkthrough video and a docker-compose demo are both available in the docs. Hopefully this guide proves useful for others. If you have questions or are interested in contributing, feel free to reach out on GitHub: https://github.com/jcrist/jupyterhub-on-hadoop/