Hosting JupyterHubs - Any tips for new admins?

Hi folks! :wave:

I’m on a project where the task is to provide a platform for multiple users to access data - to me that says JupyterHub! I only really have experience of JupyterHub through BinderHub though, so I’m looking for some tips on how to set my Hub up to be as flexible as possible for my users without giving the default answer of “install X yourself”. I’ve deployed the Littlest JupyterHub onto a pretty sizeable VM with a large disk attached for data storage (we’re expecting 10 TB/yr for 3 yrs). I’m looking for advice on the following:

  • How can I provide all my users with more than just Python? Maybe a basic install of R and other popular languages?
  • Can I programmatically download data into the shared directory for TLJH users? For instance, where we’re pulling the data from has webhooks.
  • Can I add a whitelist of users to allow access to? Answer: http://tljh.jupyter.org/en/latest/topic/tljh-config.html#user-lists (example commands just below)
  • Anything else those of you experienced in hosting Hubs think might come in handy!
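
For the allow-list question above, the linked docs boil down to a couple of tljh-config commands, something like this (the usernames are placeholders):

    sudo tljh-config add-item users.allowed alice
    sudo tljh-config add-item users.allowed bob
    sudo tljh-config reload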

I probably just need pointing in the direction of the right documentation. Thanks a lot! :tada:

3 Likes

Some thoughts below, and one question: why did you decide to go with TLJH? The data amount sounds pretty big (close to 30TB by the end), which makes me think people will also need a lot of compute to process it (or maybe the compute will happen away from the hub itself via batch jobs?). So I am curious to know what the tradeoffs were.

I would investigate TLJH with dockerspawner. I think I’ve seen someone post a config example on this forum or the TLJH repository.

My thinking is that people who ask for R and other languages will soon also be asking for RStudio or VS Code as UIs in addition to Jupyter. Providing those is easy if you can use a docker container (jupyter-server-proxy :wave:). Without a docker container to provide the separation between users, you end up with people being able to access each other’s RStudio sessions.

Another (potentially) nice thing about using docker would be that you can set up a git repository to represent the user environment, build it with repo2docker, push it to a registry, and update the Docker tag in your TLJH config. Builds would not be on demand, only when there was a change. That way the users of the hub can get involved more directly with installing new software/libraries into the user env, instead of having to email an admin who then has to find time to find the package, add it, deploy it, and not get distracted by some other task while doing all that.
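
A rough sketch of that loop, with a placeholder registry/image name (you’d then point DockerSpawner’s image setting at whatever tag you pushed):

    # Build the user environment image from the git repo (no container started)
    repo2docker --no-run --image-name registry.example.org/user-env:2020-06 https://github.com/your-org/user-env
    # Push it somewhere the hub's Docker daemon can pull from
    docker push registry.example.org/user-env:2020-06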

1 Like

The use case is a small number of researchers in the first year, and most of them are asking “Can I still download the data?” so I think most of the compute will be happening away from the Hub. But at least the Hub provides a relatively simple UI to explore the data before downloading it. The most important aspect is authentication, as the data is currently classified as “commercially sensitive, to be made public after year 1”.

Fantastic! I will try to hunt down that config.

Interesting idea, automating the admin flow of making packages accessible to all :thinking: I like it!

1 Like

This looks like a relevant issue:

1 Like

So I’m guessing the config would look similar to the voila gallery one, but with DockerSpawner instead of GallerySpawner? (I’d be using the GitHub authenticator eventually.)

That is what I’d try first. It looks like GallerySpawner customises a small part (the command) of DockerSpawner but is otherwise the same.
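
For reference, a minimal sketch of the kind of snippet I’d drop into TLJH’s extra-config directory (the file path follows the TLJH docs; the image name is a placeholder, and you may also need to sort out container-to-hub networking, e.g. c.JupyterHub.hub_connect_ip):

    # /opt/tljh/config/jupyterhub_config.d/docker.py
    # Spawn each user's server in its own Docker container
    c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
    # Placeholder image -- point this at whatever repo2docker built
    c.DockerSpawner.image = "registry.example.org/user-env:2020-06"
    # Clean up stopped containers so they don't pile up
    c.DockerSpawner.remove = True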

1 Like

How can I provide all my users with more than just Python?

For the most part, these can be installed following the instructions for the given languages, e.g. apt-get install r-base, or by creating new environments. Checking out what repo2docker does can be a good reference.
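
For example, the TLJH route to an R kernel would be roughly this, run as an admin on the hub (package names are from conda-forge; see the TLJH howtos for the full steps):

    # Install R and the IRkernel into the shared user environment
    sudo -E conda install -c conda-forge r-base r-irkernel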

The advantage of tljh is that you have one shared host environment that you can keep up to date. The disadvantage is that the shared environment is really shared, unlike a container-based approach (DockerSpawner or z2jh). So you likely don’t want users to be able to install things in the shared environment, but that also means they need to know how to do ‘user’ installs, such as pip install --user or conda create ....

Can I programmatically download data into the shared directory for TLJH users?

Yes, definitely. The question is when would you like to do this, and where?

  • On launch? Via a pre-spawn hook (sketch below)
  • Via a button? A notebook extension
  • Just periodically to a shared location? A cron job, perhaps
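
For the pre-spawn option, a minimal sketch for jupyterhub_config.py (the rsync paths are made up; substitute however you actually fetch the data, e.g. triggered by those webhooks):

    import subprocess

    def fetch_shared_data(spawner):
        # Sync the latest data into the shared directory before the
        # user's server starts. Both paths here are placeholders.
        subprocess.run(
            ["rsync", "-a", "/srv/incoming/", "/srv/data/shared/"],
            check=True,
        )

    c.Spawner.pre_spawn_hook = fetch_shared_data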

Anything else those of you experienced in hosting Hubs think might come in handy!

In my experience, the biggest advantages of z2jh over tljh have to do with more elastic resources.

  • z2jh makes it a lot easier to change how many resources each user gets, since it’s just a config file and a helm upgrade (see the sketch after this list)
  • z2jh can save you a bunch of cost if you have a small number of users who work infrequently, since it can scale down to a tiny node running just the Hub
  • a z2jh deployment is usually easier to recreate if you need to, since tljh doesn’t record all the admin tasks you might have done on the VM
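
For the first point, per-user resources in z2jh are just a few lines of helm config, something like this (the values are examples):

    # config.yaml -- example per-user guarantees and limits
    singleuser:
      memory:
        guarantee: 1G
        limit: 4G
      cpu:
        guarantee: 0.5
        limit: 2

and then a plain helm upgrade ... --values config.yaml picks it up.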

If you go with z2jh over tljh, I’d recommend a continuous deployment setup like we use in mybinder.org-deploy, which makes automating stuff pretty nice. I usually don’t do this myself, though, instead maintaining a repo with a Makefile for doing deploys; but tracking your helm config and commands in a repo is still a good plan.

1 Like

Thanks for this, it’s really given me food for thought! I may reconsider the deployment.

So we are open sourcing a package that basically automates z2jh on multiple cloud services, sets up autoscaling Dask clusters, and has shared folders.

It will eventually live here in a few weeks.

I also shared a presentation about it at the Dask Developers Conference yesterday. Slides here show what it can and cannot do.

It’s designed to be low maintenance, and all state is held in a GitHub repo.

4 Likes

Looking forward to seeing something appear in the repository :slight_smile:

Are there things that could be upstreamed to Z2JH?

I am using TLJH on AWS. I have the whole deployment automated, including placing the Hub server on a private subnet behind an elastic load balancer. It is a one-click deployment and includes a separate volume for data that users can share.

Thanks for the feedback everyone! I decided to go with z2jh in the end. A couple of follow-up questions:

  1. Is there a step-by-step guide somewhere on how I can mount an SSD containing data on the Hub, such that all users have read-only access?
  2. Can you point me in the direction of good “Intro to Jupyter Notebooks/Lab for Scientists” resources? We’re going to demo the platform to the project in three weeks’ time and I suspect not all of them will be familiar with the Jupyter environment.

Thanks! :sparkles:

If the data you want to share only has to be read by the users (no write access) then I think you can take an ordinary persistent volume and mount it with access mode ReadOnlyMany in all the user pods.
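
In z2jh config, mounting that into every user pod looks roughly like this (the claim name and mount path are placeholders):

    singleuser:
      storage:
        extraVolumes:
          - name: shared-data
            persistentVolumeClaim:
              claimName: shared-data-claim  # placeholder
        extraVolumeMounts:
          - name: shared-data
            mountPath: /home/jovyan/shared
            readOnly: true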

The ReadWriteMany access mode that allows lots of pods to read and write at the same time is more tricky. On GKE I’d go with https://cloud.google.com/filestore/docs which you can then mount read-write-many in the pods. Advantage: it mostly just works, and someone else has to take care of the NFS server. Disadvantage: the minimum size of a Filestore instance is 1TB, which comes to about $200/month. An alternative is to run your own NFS server and then mount that share in the pods read-write-many. I think there are a few examples of config scattered in this forum or the Z2JH issue tracker, but it seems to generally be a bit fiddly.
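
If you do run your own NFS server, the Kubernetes side is roughly a PersistentVolume like this (server IP and export path are placeholders), which you’d then claim with the ReadWriteMany access mode:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: shared-nfs
    spec:
      capacity:
        storage: 1Ti
      accessModes:
        - ReadWriteMany
      nfs:
        server: 10.0.0.2        # NFS/Filestore server IP -- placeholder
        path: /exports/shared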

If you find something for (2), let me know :smiley:

OK, I’ll look into this https://docs.microsoft.com/en-us/azure/aks/azure-nfs-volume

1 Like

So I got a file share up and running with the following docs: https://docs.microsoft.com/en-us/azure/aks/azure-files-volume :tada:

Where does this lost+found directory come from, anyone know?

It’s a standard directory that appears if you have an ext*-formatted filesystem on Linux (it may appear on some other Unix filesystems too): https://wiki.gentoo.org/wiki/Knowledge_Base:What_is_the_lost%2Bfound_directory%3F

You can safely ignore it :grinning:

1 Like

I cannot wait for this :slight_smile:

Has anybody securely mounted persistent volumes to Kubernetes/JupyterHub? For example, using Kerberos authentication and/or SMB?