Opinions on JupyterHub setup for a department's student "cluster"

Hi,

I want to set up JupyterHub on a single workstation for all students of our department (~400, though we expect only some of them to use it). We want to give these students computing resources, including 4 GPUs, so that they can run and test their code. In other words, we want to distribute the computing resources via JupyterHub instead of SSH and a console, because we think this is easier for students who are just starting to learn programming. Since the workstation will function as a “cluster”, it should be available at all times for at least several months. I have read a lot of installation guides for JupyterHub and have a configuration in mind, but I still have some questions. I am also unsure whether this configuration makes sense.

  1. I would like to run JupyterHub via systemd under a dedicated system user jupyterhub, as partially explained in the GitHub wiki and in the hard installation guide on Read the Docs. The workstation is reachable only via our university's intranet, but I still don't want to run JupyterHub as root.
  2. User authentication should use the university's Shibboleth authentication. This allows us to grant access only to students of our department; for this I want to use the SAMLAuthenticator. When JupyterHub is not run as root, how can this authenticator create new UNIX users on the system? I need UNIX users so that users can save their own files permanently on the workstation. I am unsure whether I need to follow the steps outlined in https://jupyterhub.readthedocs.io/en/stable/reference/config-sudo.html. The following sentence confuses me:
    There are many Authenticators and Spawners available for JupyterHub. Some, such as DockerSpawner or OAuthenticator, do not need any elevated permissions.
    I am not concerned about starting the user servers, but about the creation of UNIX users.
  3. The page https://jupyterhub.readthedocs.io/en/stable/reference/separate-proxy.html explains that all user servers become inaccessible when the Hub restarts. So I wonder how often this might happen due to configuration changes or updates, and whether it is therefore a good idea to run CHP (configurable-http-proxy) separately. I don't think running CHP separately takes much configuration, but I did not find a tutorial that explains explicitly how to do it. I was thinking of downloading the CHP Docker image and starting it via systemd.
  4. We want to use the WrapSpawner. We will then create different Docker images (tensorflow-gpu, torch-gpu, plain numpy/scipy on CPU, fenics, maybe also some other languages) that users can choose as servers. With the SystemUserSpawner from DockerSpawner it will be possible to make the /home/USERNAME folder available inside the Docker container. Using Docker containers, we can also set CPU and memory limits for individual users.
  5. Since the server will run for a long time with possibly many users, I was thinking of using PostgreSQL or MariaDB instead of SQLite. Would this be overkill, or is it a good idea?
  6. Since the workstation will only be used for JupyterHub, we do not need a reverse proxy like NGINX, which could serve a static website on another subdomain. In that case we would run JupyterHub on port 443 instead of port 8000, and we would need to do the SSL/HTTPS configuration in JupyterHub itself. On the other hand, NGINX adds little overhead and is easy to configure, and it would make it possible to add a separate website in the future.
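
To make point 4 more concrete, here is a minimal sketch of what the spawner part of a jupyterhub_config.py could look like, assuming wrapspawner's ProfilesSpawner is used to offer a menu of DockerSpawner-based servers. The image names, profile keys, and limit values are hypothetical placeholders, and GPU access inside the containers is not covered. The guard at the end only exists so that the file can also be run outside JupyterHub.

```python
# jupyterhub_config.py (sketch, not a tested production config)
# Image names, profile keys, and limits are hypothetical placeholders.


def docker_profile(label, key, image, mem_limit="4G", cpu_limit=2.0):
    """Return one ProfilesSpawner entry: (display name, key, spawner class, config)."""
    return (
        label,
        key,
        "dockerspawner.SystemUserSpawner",  # mounts the user's home directory
        {"image": image, "mem_limit": mem_limit, "cpu_limit": cpu_limit},
    )


PROFILES = [
    docker_profile("NumPy/SciPy (CPU)", "scipy", "example/scipy-notebook"),
    docker_profile("TensorFlow (GPU)", "tf-gpu", "example/tensorflow-gpu",
                   mem_limit="16G", cpu_limit=4.0),
    docker_profile("PyTorch (GPU)", "torch-gpu", "example/torch-gpu",
                   mem_limit="16G", cpu_limit=4.0),
]

# JupyterHub injects `get_config` when it loads this file; the guard keeps
# the sketch runnable as a plain Python script as well.
if "get_config" in globals():
    c = get_config()  # noqa: F821
    c.JupyterHub.spawner_class = "wrapspawner.ProfilesSpawner"
    c.ProfilesSpawner.profiles = PROFILES
```

Keeping the profiles in a plain list makes it easy to add another image later without touching the rest of the configuration.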

I would be very thankful for opinions and thoughts on this setup. Maybe there are other documentation pages or resources that I missed?

Hi Nik: I am doing something similar for a department at UNED (Long Distance Spanish University). I can try to help you with PostgreSQL and nginx.

I have been trying to configure PostgreSQL (the version I downloaded is 11.7), but when I try to start JupyterHub with this database, I get an error about the missing “psycopg2” package. To install that package, libpq-dev has to be installed first, but I have been unable to download and install it following the instructions at https://pypi.org/project/libpq-dev/, so for the moment I am staying with SQLite.
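
A small sketch of how the database choice could be made explicit: it checks whether the psycopg2 driver is importable and builds the SQLAlchemy URL for JupyterHub's db_url setting, falling back to the default SQLite file otherwise. The helper names, user, host, and database names are hypothetical. As far as I know, libpq-dev is an operating-system package (Debian/Ubuntu naming) rather than something pip provides, and the psycopg2-binary package on PyPI ships prebuilt wheels that avoid compiling against it; treat both points as assumptions to verify on your system.

```python
import importlib.util
from urllib.parse import quote_plus


def postgres_driver_available():
    """Return True if the psycopg2 module can be imported in this environment."""
    return importlib.util.find_spec("psycopg2") is not None


def hub_db_url(user, password, host, database, driver_available=None):
    """Build a db_url for JupyterHub: PostgreSQL if the driver exists, else SQLite."""
    if driver_available is None:
        driver_available = postgres_driver_available()
    if not driver_available:
        # JupyterHub's default database location
        return "sqlite:///jupyterhub.sqlite"
    # Quote the credentials so that characters like '@' do not break the URL.
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}"
    )
```

In jupyterhub_config.py the result would be assigned to c.JupyterHub.db_url, for example `c.JupyterHub.db_url = hub_db_url("hub", "secret", "localhost", "jupyterhub")`.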

Regarding nginx, if you decide to use it, be aware of the following post: Nginx reverser proxy error in configuration template

All the best

Jose

Hi, thanks for your help. I managed to set some things up. I forgot to mention that I am running CentOS 8, since our IT services restrict me to this OS. This is what my setup does so far:

  1. CHP runs separately in a podman container via systemd under a separate user.
  2. NGINX handles HTTPS and acts as the reverse proxy. It took me 4 hours to figure out that I had to enable an SELinux flag (https://stackoverflow.com/questions/23948527/13-permission-denied-while-connecting-to-upstreamnginx).
  3. JupyterHub uses MariaDB.
  4. JupyterHub is still run as root via systemd.
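
For reference, the Hub side of an externally managed CHP can be sketched roughly as follows. The settings are the ones named on the separate-proxy documentation page; the API URL (CHP's default API port 8001 on localhost) and the helper function are assumptions for illustration, and the token has to match the CONFIGPROXY_AUTH_TOKEN that the proxy container was started with.

```python
import os


def external_proxy_settings(api_url="http://127.0.0.1:8001",
                            token_env="CONFIGPROXY_AUTH_TOKEN"):
    """Collect the settings the Hub needs to talk to an already running CHP."""
    token = os.environ.get(token_env)
    if not token:
        raise RuntimeError(
            f"{token_env} is not set; the Hub and CHP must share this token"
        )
    return {"should_start": False, "api_url": api_url, "auth_token": token}


# JupyterHub injects `get_config` when it loads jupyterhub_config.py; the
# guard keeps this sketch runnable as a plain Python script as well.
if "get_config" in globals():
    c = get_config()  # noqa: F821
    proxy = external_proxy_settings()
    c.ConfigurableHTTPProxy.should_start = proxy["should_start"]
    c.ConfigurableHTTPProxy.api_url = proxy["api_url"]
    c.ConfigurableHTTPProxy.auth_token = proxy["auth_token"]
    # Leave running user servers alone when the Hub itself restarts.
    c.JupyterHub.cleanup_servers = False
```

Reading the token from the environment keeps it out of the config file, which fits a systemd unit with an EnvironmentFile.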

In the next few days I will try to use the DockerSpawner and WrapSpawner with podman. If I understand the Python code of these modules correctly, the containers they start are run not by the Linux user who is logged in, but by the user that runs JupyterHub?

Right now I am thinking of separating the SAML authentication from JupyterHub. I want to use some other web tool (I have not searched for one yet) that creates local Linux users once they have successfully authenticated via SAML (all the students). This includes giving each of these students some disk space to save their own files. After that they can log in to JupyterHub normally via its standard authenticator. In addition, I can create separate Linux users for people who cannot authenticate via SAML (maybe lecturers or tutors who are not students). These people will then be able to log in to JupyterHub as well. This also separates the useradd command from the jupyterhub user.
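
The user-creation step of such a tool could look roughly like the sketch below. It is only an illustration: the function names and the username pattern are hypothetical, the SAML part is left out entirely, and actually running useradd needs root or a matching sudo rule, which is why the sketch defaults to a dry run.

```python
import pwd
import re
import subprocess

# Hypothetical, conservative pattern for acceptable local usernames.
USERNAME_RE = re.compile(r"[a-z_][a-z0-9_-]{0,31}")


def user_exists(username):
    """Return True if a local account with this name already exists."""
    try:
        pwd.getpwnam(username)
        return True
    except KeyError:
        return False


def useradd_command(username):
    """Build the useradd call that also creates the home directory."""
    if not USERNAME_RE.fullmatch(username):
        raise ValueError(f"refusing to create user with name {username!r}")
    return ["useradd", "--create-home", username]


def ensure_local_user(username, dry_run=True):
    """Create the account if it is missing; return True if creation was needed."""
    cmd = useradd_command(username)  # validates the name before any lookup
    if user_exists(username):
        return False
    if not dry_run:
        # Needs root privileges or a sudo rule that allows exactly this command.
        subprocess.run(cmd, check=True)
    return True
```

Passing the command as a list instead of a shell string means the username is never interpreted by a shell. JupyterHub's LocalAuthenticator also has a create_system_users option, but that requires the Hub user itself to be allowed to run useradd.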

When I am done I will post my whole configuration steps somewhere…

Best
Niklas