I want to set up JupyterHub on a single workstation for all students of our department (~400, though we expect fewer to actually use it). We want to give these students computing resources, including 4 GPUs, so that they can run and test their code. In other words, we want to distribute the computing resources via JupyterHub instead of SSH and a console, because we think this is easier for students who are just starting to learn programming. Since the workstation will function as a “cluster”, it should be available at all times over a timespan of at least several months. I have read a lot of JupyterHub installation guides and sketched out a configuration, but some questions remain, and I am unsure whether this configuration makes sense.
- I would like to run JupyterHub with systemd under a dedicated system user `jupyterhub`, as partially explained in the GitHub wiki and in the detailed installation guide on Read the Docs. The workstation is reachable only from our university's intranet, but I still don't want to run JupyterHub as root.
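  For reference, this is roughly the unit file I have in mind; the virtualenv path and config location are assumptions from my planned setup, not from any guide:

  ```ini
  # /etc/systemd/system/jupyterhub.service (sketch; paths are my assumptions)
  [Unit]
  Description=JupyterHub
  After=network-online.target

  [Service]
  # Dedicated unprivileged system user instead of root
  User=jupyterhub
  ExecStart=/opt/jupyterhub/bin/jupyterhub -f /opt/jupyterhub/etc/jupyterhub_config.py
  Restart=on-failure

  [Install]
  WantedBy=multi-user.target
  ```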
- The user authentication should leverage the university's Shibboleth authentication, which allows us to grant access only to students of our department; for this I want to use the SAMLAuthenticator. When JupyterHub is not run as root, how can this authenticator create new UNIX users on the system? I need UNIX users so that students can save their own files permanently on the workstation. I am unsure whether or not I need to follow the steps outlined in https://jupyterhub.readthedocs.io/en/stable/reference/config-sudo.html. The following sentence confuses me:
  > There are many Authenticators and Spawners available for JupyterHub. Some, such as DockerSpawner or OAuthenticator, do not need any elevated permissions.
  I am not concerned about starting the user servers, but about the creation of the UNIX users themselves.
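  What I have in mind is the LocalAuthenticator machinery (`create_system_users` / `add_user_cmd`); whether the SAMLAuthenticator can actually be combined with these options is exactly what I am unsure about. A sketch, assuming it can:

  ```python
  # jupyterhub_config.py (sketch; whether SAMLAuthenticator honors the
  # LocalAuthenticator options below is precisely my open question)
  c.JupyterHub.authenticator_class = 'samlauthenticator.SAMLAuthenticator'

  # LocalAuthenticator-style system user creation. Running useradd needs
  # elevated permissions, i.e. root or a sudo rule as in the config-sudo guide.
  c.LocalAuthenticator.create_system_users = True
  c.LocalAuthenticator.add_user_cmd = ['useradd', '--create-home']
  ```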
- On this page https://jupyterhub.readthedocs.io/en/stable/reference/separate-proxy.html it is explained that all user servers become inaccessible when the Hub restarts. So I wonder how often this might happen due to configuration changes or updates, and whether it is therefore a good idea to run the configurable-http-proxy (CHP) separately. I think running CHP separately does not require much configuration, but I did not find a tutorial explaining explicitly how to accomplish this. I was thinking of downloading the CHP Docker image and starting it via systemd.
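  On the Hub side, my understanding is that only a few lines are needed to point it at an externally started CHP (e.g. the `jupyterhub/configurable-http-proxy` Docker image run from a systemd unit); the token value here is a placeholder:

  ```python
  # jupyterhub_config.py (sketch): do not let the Hub start its own proxy
  c.ConfigurableHTTPProxy.should_start = False
  # API endpoint of the externally running CHP (8001 is the CHP default)
  c.ConfigurableHTTPProxy.api_url = 'http://127.0.0.1:8001'
  # The same token must be given to CHP via the CONFIGPROXY_AUTH_TOKEN
  # environment variable; value below is a placeholder
  c.ConfigurableHTTPProxy.auth_token = 'REPLACE-WITH-GENERATED-TOKEN'
  ```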
- We want to use the WrapSpawner. We will then create different Docker images (tensorflow-gpu, torch-gpu, plain numpy/scipy on CPU, FEniCS, maybe also some other languages) that can be chosen as servers. With DockerSpawner's SystemUserSpawner it will be possible to make the /home/USERNAME folder available inside the Docker container. Using Docker containers we can also set per-user limits on CPU and memory usage.
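  Concretely, I am picturing wrapspawner's ProfilesSpawner with SystemUserSpawner entries along these lines; the image names and limits are placeholders from my plan:

  ```python
  # jupyterhub_config.py (sketch; image names and limits are placeholders)
  c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
  # Each profile: (display name, key, Spawner class, Spawner configuration)
  c.ProfilesSpawner.profiles = [
      ('TensorFlow (GPU)', 'tf-gpu', 'dockerspawner.SystemUserSpawner',
       dict(image='ourdept/tensorflow-gpu:latest', mem_limit='8G')),
      ('NumPy/SciPy (CPU)', 'scipy-cpu', 'dockerspawner.SystemUserSpawner',
       dict(image='ourdept/scipy:latest', mem_limit='4G')),
  ]
  ```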
- Since the server will be running over a long period of time with possibly many users, I was thinking of using PostgreSQL or MariaDB instead of the default SQLite. Would this be overkill, or is it a good idea?
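  As far as I can tell, switching would only be one config line plus the database driver (credentials and database name below are placeholders):

  ```python
  # jupyterhub_config.py (sketch; user, password and db name are placeholders)
  # Requires a PostgreSQL driver such as psycopg2 in the Hub's environment
  c.JupyterHub.db_url = 'postgresql://jupyterhub:PASSWORD@localhost:5432/jupyterhub'
  ```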
- Since the workstation will only be used for JupyterHub, we do not need a reverse proxy like NGINX that could serve a static website on another subdomain. In that case we would run JupyterHub on port 443 instead of port 8000, and we would do the SSL/HTTPS configuration in JupyterHub itself. On the other hand, NGINX is not much overhead and easy to configure, and it would additionally make it possible to add a separate website in the future.
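  If we do go with NGINX, I would start from a minimal server block like the following; the hostname and certificate paths are placeholders:

  ```nginx
  # /etc/nginx/sites-available/jupyterhub (sketch; hostname/certs are placeholders)
  server {
      listen 443 ssl;
      server_name hub.example.edu;

      ssl_certificate     /etc/ssl/certs/hub.example.edu.crt;
      ssl_certificate_key /etc/ssl/private/hub.example.edu.key;

      location / {
          # JupyterHub listening on its default port behind the proxy
          proxy_pass http://127.0.0.1:8000;
          proxy_set_header Host $host;
          proxy_set_header X-Real-IP $remote_addr;
          # WebSocket upgrade headers, needed for notebook kernel connections
          proxy_http_version 1.1;
          proxy_set_header Upgrade $http_upgrade;
          proxy_set_header Connection "upgrade";
      }
  }
  ```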
I would be very thankful for opinions and thoughts on this setup. Maybe there are other documentation pages or resources that I missed?