Hosting JupyterHubs - Any tips for new admins?

Hi folks! :wave:

I’m on a project where the task is to provide a platform for multiple users to access data - to me that says JupyterHub! I only really have experience of JupyterHub through BinderHub though, so I’m looking for some tips on how to set my Hub up to be as flexible as possible for my users without giving the default answer of “install X yourself”. I’ve deployed the Littlest JupyterHub onto a pretty sizeable VM with a large disk attached for data storage (we’re expecting 10 TB/yr for 3 yrs). I’m looking for advice on the following:

  • How can I provide all my users with more than just Python? Maybe a basic install of R and other popular languages?
  • Can I programmatically download data into the shared directory for TLJH users? For instance, where we’re pulling the data from has webhooks.
  • Can I add a whitelist of users to allow access to? Answer: http://tljh.jupyter.org/en/latest/topic/tljh-config.html#user-lists (example commands just below)
  • Anything else those of you experienced in hosting Hubs think might come in handy!
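
For the allow-list question above, the linked docs boil down to a couple of tljh-config commands, something like this (the usernames are placeholders):

    sudo tljh-config add-item users.allowed alice
    sudo tljh-config add-item users.allowed bob
    sudo tljh-config reload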

I probably just need pointing in the direction of the right documentation. Thanks a lot! :tada:

3 Likes

Some thoughts below, and one question: why did you decide to go with TLJH? The data amount sounds pretty big (close to 30TB by the end), which makes me think people will also need a lot of compute to process it (or maybe the compute will happen away from the hub itself via batch jobs?). So I am curious to know what the tradeoffs were.

I would investigate TLJH with dockerspawner. I think I’ve seen someone post a config example on this forum or the TLJH repository.

My thinking is that people who ask for R and other languages will soon also be asking for RStudio or VS Code as UIs in addition to Jupyter. Providing those is easy if you can use a docker container (jupyter-server-proxy :wave:). Without a docker container to provide the separation between users, you end up with people being able to access each other’s RStudio sessions.

Another (potentially) nice thing about using docker would be that you can set up a git repository to represent the user environment, build it with repo2docker, push it to a registry, and update the Docker tag in your TLJH config. Builds would not be on demand, only when there was a change. That way the users of the hub can get involved more directly with installing new software/libraries into the user env, instead of having to email an admin who then has to find time to find the package, add it, deploy it, and not get distracted by some other task while doing all that.
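
A rough sketch of that loop, with a placeholder registry/image name (you’d then point DockerSpawner’s image setting at whatever tag you pushed):

    # Build the user environment image from the git repo (no container started)
    repo2docker --no-run --image-name registry.example.org/user-env:2020-06 https://github.com/your-org/user-env
    # Push it somewhere the hub's Docker daemon can pull from
    docker push registry.example.org/user-env:2020-06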

1 Like

The use case is a small number of researchers in the first year, and most of them are asking “Can I still download the data?” so I think most of the compute will be happening away from the Hub. But at least the Hub provides a relatively simple UI to explore the data before downloading it. The most important aspect is authentication, as the data is currently classified as “commercially sensitive, to be made public after year 1”.

Fantastic! I will try to hunt down that config.

Interesting idea, automating the admin flow of making packages accessible to all :thinking: I like it!

1 Like

This looks like a relevant issue:

1 Like

So I’m guessing the config would look similar to the voila gallery one, but with DockerSpawner instead of GallerySpawner? (I’d be using the GitHub authenticator eventually.)

That is what I’d try first. It looks like GallerySpawner customises a small part (the command) of DockerSpawner but is otherwise the same.
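
For reference, a minimal sketch of the kind of snippet I’d drop into TLJH’s extra-config directory (the file path follows the TLJH docs; the image name is a placeholder, and you may also need to sort out container-to-hub networking, e.g. c.JupyterHub.hub_connect_ip):

    # /opt/tljh/config/jupyterhub_config.d/docker.py
    # Spawn each user's server in its own Docker container
    c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
    # Placeholder image -- point this at whatever repo2docker built
    c.DockerSpawner.image = "registry.example.org/user-env:2020-06"
    # Clean up stopped containers so they don't pile up
    c.DockerSpawner.remove = True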

1 Like

How can I provide all my users with more than just Python?

For the most part, these can be installed following the instructions for the given languages, e.g. apt-get install r-base, or by creating new environments. Checking out what repo2docker does can be a good reference.
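
For example, the TLJH route to an R kernel would be roughly this, run as an admin on the hub (package names are from conda-forge; see the TLJH howtos for the full steps):

    # Install R and the IRkernel into the shared user environment
    sudo -E conda install -c conda-forge r-base r-irkernel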

The advantage of tljh is that you have one shared host environment that you can keep up to date. The disadvantage is that the shared environment is really shared, unlike a container-based approach (DockerSpawner or z2jh). So you likely don’t want users to be able to install things in the shared environment, but that also means they need to know how to do ‘user’ installs, such as pip install --user or conda create ....

Can I programmatically download data into the shared directory for TLJH users?

Yes, definitely. The question is when would you like to do this, and where?

  • On launch? Via a pre-spawn hook (sketch below)
  • Via a button? A notebook extension
  • Just periodically to a shared location? A cron job, perhaps
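
For the pre-spawn option, a minimal sketch for jupyterhub_config.py (the rsync paths are made up; substitute however you actually fetch the data, e.g. triggered by those webhooks):

    import subprocess

    def fetch_shared_data(spawner):
        # Sync the latest data into the shared directory before the
        # user's server starts. Both paths here are placeholders.
        subprocess.run(
            ["rsync", "-a", "/srv/incoming/", "/srv/data/shared/"],
            check=True,
        )

    c.Spawner.pre_spawn_hook = fetch_shared_data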

Anything else those of you experienced in hosting Hubs think might come in handy!

In my experience, the biggest advantages of z2jh over tljh have to do with more elastic resources.

  • z2jh makes it a lot easier to change how many resources each user gets, since it’s just a config file and a helm upgrade (see the sketch after this list)
  • z2jh can save you a bunch of cost if you have a small number of users who work infrequently, since it can scale down to a tiny node running just the Hub
  • a z2jh deployment is usually easier to recreate if you need to, since tljh doesn’t record all the admin tasks you might have done on the VM
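
For the first point, per-user resources in z2jh are just a few lines of helm config, something like this (the values are examples):

    # config.yaml -- example per-user guarantees and limits
    singleuser:
      memory:
        guarantee: 1G
        limit: 4G
      cpu:
        guarantee: 0.5
        limit: 2

and then a plain helm upgrade ... --values config.yaml picks it up.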

If you go with z2jh over tljh, I’d recommend a continuous deployment setup like we use in mybinder.org-deploy, which makes automating stuff pretty nice. I usually don’t do this myself, though, instead maintaining a repo with a Makefile for doing deploys; but tracking your helm config and commands in a repo is still a good plan.

1 Like

Thanks for this, it’s really given me food for thought! I may reconsider the deployment.

So we are open sourcing a package that basically automates z2jh on multiple cloud services, sets up autoscaling Dask clusters, and has shared folders.

It will eventually live here in a few weeks.

I also shared a presentation about it at the Dask Developers Conference yesterday. Slides here show what it can and cannot do.

It’s designed to be low maintenance, and all state is held in a GitHub repo.

4 Likes

Looking forward to seeing something appear in the repository :slight_smile:

Are there things that could be upstreamed to Z2JH?

I am using TLJH on AWS. I have the whole deployment automated, including placing the Hub server on a private subnet behind an elastic load balancer. It is a one-click deployment and includes a separate volume for data that users can share.

Thanks for the feedback everyone! I decided to go with z2jh in the end. A couple of follow-up questions:

  1. Is there a step-by-step guide somewhere on how I can mount an SSD containing data on the Hub, such that all users have read-only access?
  2. Can you point me in the direction of good “Intro to Jupyter Notebooks/Lab for Scientists” resources? We’re going to demo the platform to the project in three weeks’ time and I suspect not all of them will be familiar with the Jupyter environment.

Thanks! :sparkles:

If the data you want to share only has to be read by the users (no write access) then I think you can take an ordinary persistent volume and mount it with access mode ReadOnlyMany in all the user pods.
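
In z2jh config, mounting that into every user pod looks roughly like this (the claim name and mount path are placeholders):

    singleuser:
      storage:
        extraVolumes:
          - name: shared-data
            persistentVolumeClaim:
              claimName: shared-data-claim  # placeholder
        extraVolumeMounts:
          - name: shared-data
            mountPath: /home/jovyan/shared
            readOnly: true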

The ReadWriteMany access mode that allows lots of pods to read and write at the same time is more tricky. On GKE I’d go with https://cloud.google.com/filestore/docs which you can then mount read-write-many in the pods. Advantage: it mostly just works, and someone else has to take care of the NFS server. Disadvantage: the minimum size of a Filestore instance is 1TB, which comes to about $200/month. An alternative is to run your own NFS server and then mount that share in the pods read-write-many. I think there are a few examples of config scattered in this forum or the Z2JH issue tracker, but it seems to generally be a bit fiddly.
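
If you do run your own NFS server, the Kubernetes side is roughly a PersistentVolume like this (server IP and export path are placeholders), which you’d then claim with the ReadWriteMany access mode:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: shared-nfs
    spec:
      capacity:
        storage: 1Ti
      accessModes:
        - ReadWriteMany
      nfs:
        server: 10.0.0.2        # NFS/Filestore server IP -- placeholder
        path: /exports/shared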

If you find something for (2), let me know :smiley:

OK, I’ll look into this https://docs.microsoft.com/en-us/azure/aks/azure-nfs-volume

1 Like

So I got a file share up and running with the following docs: https://docs.microsoft.com/en-us/azure/aks/azure-files-volume :tada:

Where does this lost+found directory come from, anyone know?

It’s a standard directory that appears if you have an ext*-formatted filesystem on Linux (it may appear on some other Unix filesystems too): https://wiki.gentoo.org/wiki/Knowledge_Base:What_is_the_lost%2Bfound_directory%3F

You can safely ignore it :grinning:

1 Like

I cannot wait for this :slight_smile:

Has anybody securely mounted persistent volumes to Kubernetes/JupyterHub? For example, using Kerberos authentication and/or SMB?