Scalable JupyterHub Deployments in Education (Teaching)

In my higher ed institution, we are going from no hosted JupyterHub servers to at least three groups looking at them: central IT, School, and cottage industry.

All three are using Kubernetes to scale.

In terms of organisation, containers will be defined for the module (course) presentation level. Modules present once or twice a year with 300-1500 students per presentation.

I was wondering what strategies other people are using to deploy module/course based Jupyter environments to students.

Note that the following includes wide ranging questions that can be answered in general or specific technical terms. I’m just trying to make sense of what possible solutions there are, what folk are trying / have tried / use successfully / tried and will never return to again etc.

To try to categorise different sorts of approach, I can imagine the following sorts of deployment (there could well be others). They are likely to require different amounts of management / resource and reflect different institutional models for providing hosted lab services. There may well be very different costing models or consequences for background running costs etc, for different approaches:

  • JupyterHub created for each presentation of each module; this is probably the smallest atomic level: you only need one Docker image defined and users are limited to students (and staff) registered on a specific presentation of the course; at the end of the presentation, you shut it down and throw everything away;

  • JupyterHub created to cover all presentations of a single module: this might be appropriate if a module organiser or module team want to be able to manage the environment for their students, or someone wants to manage resource (or internal billing!) at the module level; this may require one Docker image per presentation, with students perhaps being trusted to select the image relevant to their start date; student user accounts may need to be cleared out between presentations;

  • JupyterHub for several modules, perhaps in the same organisational unit (Faculty, School); this might appear if you have a unit based IT team who look after the IT needs within the school. Users are perhaps everyone who has signed up to a module presented by the unit in a particular academic year; images are required for every module or module-presentation; users may be restricted (how?) in terms of the images they can see / launch;

  • One JupyterHub to rule them all, centrally managed; maybe tens of thousands of registered users with accounts that last the lifetime of the student’s enrolment in the institution. Lots of images for lots of modules and/or module-presentations (so how do you limit which images which users can see, if only so as not to overwhelm them in the UI?).
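On the question of limiting which images a user can see, one possible approach is to filter the list of image choices by the presentations a student is enrolled on. A minimal sketch, assuming a hypothetical enrolment lookup (`ENROLMENTS`), hypothetical image names, and a hypothetical helper `visible_profiles`; the dictionaries follow the shape of KubeSpawner `profile_list` entries, but the wiring into a real hub is left out:

```python
# Hypothetical catalogue: one image per module-presentation.
PROFILES = [
    {"display_name": "A123 (Feb 2023)", "slug": "a123-23b",
     "kubespawner_override": {"image": "registry.example.org/a123:23b"}},
    {"display_name": "B456 (Oct 2023)", "slug": "b456-23j",
     "kubespawner_override": {"image": "registry.example.org/b456:23j"}},
]

# Hypothetical enrolment data; in practice this might come from a
# student records system or from groups supplied by the authenticator.
ENROLMENTS = {
    "alice": {"a123-23b"},
    "bob": {"a123-23b", "b456-23j"},
}


def visible_profiles(username, profiles=PROFILES, enrolments=ENROLMENTS):
    """Return only the profiles for presentations the user is enrolled on."""
    allowed = enrolments.get(username, set())
    return [p for p in profiles if p["slug"] in allowed]
```

A student with a single enrolment would then only ever be offered one image, which avoids asking them to pick the right one.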

In each of the above cases, how do you go about:

  • mounting persistent user volumes; for example:
    • do users have a volume per module-presentation and have to take their files away at the end of the presentation?
    • do users have a personal filestore that is mounted into whatever environment they use?
    • does each module-presentation have its own filepath to stash files to try to help manage them (eg ~/{MODULE_CODE}-{PRESENTATION})?
  • enrolling users / managing permissions?

Inside the image, do you always use the same user account name (e.g. the Jupyter default is jovyan), or another name (user, student etc) or maybe you found a way to dynamically set a user account with a parameterised name when the container is launched (if so, how? And how is persistent volume mounting handled?)

I can answer one bit of your post :smiley:

I’m assuming you’re using an external authenticator such as LDAP, OAuth, etc, where you can obtain a username and perhaps a UID. You can configure your docker image to switch to that user. For example

Persistent volume mounting already defaults to using a template based on the username

or have I misunderstood what you’re asking?
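To illustrate the user-switching idea, here is a minimal sketch. It assumes an image based on the Jupyter Docker Stacks, whose start script renames the default user when the container starts as root with `NB_USER` and `NB_UID` set. The function name `spawn_settings_for` and the returned dictionary are hypothetical; the keys mirror spawner option names but nothing here talks to a real spawner:

```python
def spawn_settings_for(username, uid):
    """Settings for starting a container as `username` rather than jovyan.

    Assumption: the image's start script honours NB_USER / NB_UID /
    CHOWN_HOME when the container is started as root.
    """
    return {
        "uid": 0,  # start as root so the start script can switch user
        "environment": {
            "NB_USER": username,
            "NB_UID": str(uid),
            "CHOWN_HOME": "yes",  # fix ownership of the mounted home dir
        },
        "working_dir": f"/home/{username}",
    }


# Example with a made-up single sign-on username and UID.
settings = spawn_settings_for("ab1234", 5001)
```

The username and UID would come from whatever the authenticator provides.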

Re: the mounting of volumes, my naive understanding of default settings was that you mount onto a specified path in the container (eg /home/jovyan), so I was wondering how you would mount to eg /home/arbitraryuser if arbitraryuser was somehow created in a container when it’s launched based on eg a user’s single sign-on username.

Does that make sense?!

The mountPath is templated, so you can use /home/{username}:

You might have to fiddle with some other Z2JH or Kubernetes options to ensure the permissions are correct; it depends on exactly how you want things set up.

@manics

Thanks, that’s really useful. I’m essentially a customer of IT folk new to Jupyterverse so I need to know whether the claims of “you can have any container you want but you always need to mount to /home/jovyan” and “you can only mount one volume” etc are true or not!

From the above docs you linked to, my take away is:

  • I can specify a path that incorporates a {username} variable;
  • I can mount multiple volumes into different parts of the container.

So presumably I could run a single user container that also incorporates eg a postgres db, and mount one volume on to a literally stated user home or more indirectly via /home/{username}, and another volume for the postgres data directory?
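A rough sketch of what that two-volume layout might look like, written as the Kubernetes-style `volumes` / `volumeMounts` structures that a spawner would be given. The claim names, the volume names and the postgres data path are assumptions for illustration, and `storage_for` is a hypothetical helper:

```python
def storage_for(username):
    """Two persistent volumes for one user: home directory and database."""
    volumes = [
        {"name": "home",
         "persistentVolumeClaim": {"claimName": f"claim-{username}"}},
        {"name": "pgdata",
         "persistentVolumeClaim": {"claimName": f"pgdata-{username}"}},
    ]
    volume_mounts = [
        # Home directory mounted at a per-user path.
        {"name": "home", "mountPath": f"/home/{username}"},
        # Assumed postgres data directory inside the container.
        {"name": "pgdata", "mountPath": "/var/lib/postgresql/data"},
    ]
    return volumes, volume_mounts


volumes, volume_mounts = storage_for("ab1234")
```

Each mount refers to a volume by name, so any number of volumes can be attached at different paths.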

Is there best practice guidance for how to do a first run config of a mounted volume? For example, if I wanted to seed the database on first container launch / first volume mount, or copy some config, such as custom css or js, into /home/{username}/.jupyter ?

The model I have used and am a fan of is “one JupyterHub per group of students all in the same kubernetes cluster”. A group is “all the people taking course A123 starting on 1st Feb 2023”. I think that maps to “one hub per presentation” in your lingo.

The reason I like it is that it decouples all the modules and presentations from each other. Even the same module starting at different times will end up being in tension with previous versions of itself (“want newer pandas now”, “we removed section X because we ran out of time”, or “no one uses tensorflow any more, we rewrote it in pytorch”, “we learnt we need to give people more RAM”, “we’ve updated our nbextensions but they now require jupyter lab”, etc, etc). These kinds of changes are hard to make if you need to coordinate them with previous presentations of the same module that are still running/available to students. They are probably nearly impossible to make if you need to coordinate them with staff from other modules with other priorities, needs, schedules, etc.

An additional benefit from setting up one hub per presentation is that it is now virtually free to set up a hub for that 4 day workshop, or a special lab session, or summer school, or whatever one-off events with special needs you might have at your institution.

By sharing the same kubernetes cluster amongst all hubs you can spread the overhead costs of running a cluster amongst the modules/presentations. There is a bunch of monitoring and infrastructure stuff you need to run (grafana, prometheus, ingress, etc) and the resources used by this infrastructure are a much larger fraction of the cost if you are using it to support one hub that is only used for 4 hours on 2 days of every week compared to having 10 hubs with such usage patterns.
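The overhead argument can be made concrete with a little arithmetic. The cost figures below are entirely made up (arbitrary units per week) and only the ratio matters:

```python
def overhead_fraction(infra_cost, hub_cost, n_hubs):
    """Fraction of the total bill spent on shared cluster infrastructure."""
    return infra_cost / (infra_cost + hub_cost * n_hubs)


# Hypothetical weekly costs in arbitrary units.
INFRA = 100.0   # monitoring, ingress, etc. for the whole cluster
PER_HUB = 50.0  # one hub used for 4 hours on 2 days a week

one_hub = overhead_fraction(INFRA, PER_HUB, 1)    # 100 / 150
ten_hubs = overhead_fraction(INFRA, PER_HUB, 10)  # 100 / 600
```

With these assumed numbers the infrastructure is two thirds of the bill for a single hub and one sixth when ten hubs share the cluster.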

Another benefit from the student’s perspective is that one hub per presentation is much much simpler to use. The more options, forks in the road and questions you ask your students to answer in order to arrive at “the correct place to start studying” the more time you will have to spend figuring out where they went wrong and how to help them recover from it. Reducing the potential for going wrong on the side of the students is especially important if you have more than five students :smiley:

Hope this was helpful

As someone who doesn’t have to run JupyterHubs (I have deliberately avoided it, which has had the effect of repeatedly shooting myself in the foot, effectively! :wink:), I like that approach.

The 2i2c hubs, and things like @yuvipanda’s Hubploy, both look to me like they offer ways to manage the scaling in terms of numbers of hubs; the approach also has the advantage that a hands on instructor could be given their own hub to manage for a particular run of a course if they really wanted to have that level of control.

There are three approaches currently being trialled for us internally: a home brew solution for allowing arbitrary containers on a per student basis, currently testing with a container that runs a single user notebook, but doesn’t make use of JupyterHub, or even things like Enterprise Gateway (which to my mind would be a no brainer because the requirement for some notebook activities is a large GPU but not for other activities…); a JupyterHub per course solution, where different images would be used for different presentations of each course and the infra provider have taken on management of all hubs; and a single JupyterHub with lots of images, one per module-presentation for a handful of light use-case courses, with the admin responsible for the whole hub.

In terms of numbers, in the next couple of years there’ll be of the order of five courses, some running once a year, some running twice a year, for 30 weeks x 4 hours practical study a week (distance ed.) in each case, for between 350 and 1500 students each. If we used Jupyter across the majority of modules in computing, it’d be maybe 3x on that. In the Faculty, maybe 10x on that.
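As a back-of-envelope check on those figures, the practical study time per presentation works out as follows (a simple upper bound that assumes every student uses all of their hours):

```python
WEEKS = 30
HOURS_PER_WEEK = 4


def student_hours(students):
    """Total hours of hosted practical study for one presentation."""
    return students * WEEKS * HOURS_PER_WEEK


low = student_hours(350)    # smallest presentation
high = student_hours(1500)  # largest presentation
```

That is 120 hours per student, so somewhere between 42,000 and 180,000 student-hours per presentation.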

I prefer the model of one infra provider making separate JupyterHubs available on request, not least because this then scales across the organisation (eg if a research group want a JupyterHub, or a business unit, or a residential school, or a workshop, or a demo, etc.). The infra provider can then try to do this efficiently across the institution as part of the more general k8s managed offering. The approach also scales at the group or smaller fleet level; eg a School or Faculty may choose to run their own k8s cluster and deploy the range of JupyterHubs they’re responsible for to that. Or there may be groups/sets of hubs associated with a particular course over multiple presentations, or in a particular department, but still all on the same central IT cluster.

One thing that is not clear to me is managing and organising persistent storage for students who may be on one or more modules at any one time. This is probably really complicated if there are separate internal units running separate hub fleets on separate backends, especially if the same user has accounts across them. One proposal is a single filestore with individual course presentations mounting to /home/{user}/{module}-{presentation} and maybe volumes within that (eg additional mounts to /home/{user}/{module}-{presentation}/.postgres/data etc). Another issue is the extent to which different modules may want to set up notebook environments differently (eg in terms of extensions that are enabled and that teaching materials are in part written around). Managing a single source of branding truth is another issue. (A practical question would be where individual notebook servers look for branding and config across multiple courses. Another is how do we provide a consistent student experience if there are different configs, or at least manage expectations and confusion!)
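The single filestore proposal could be sketched like this: one shared volume, with each module-presentation getting its own sub-path mounted under the student's home. The volume name `filestore`, the layout of the filestore itself and the helper `presentation_mounts` are assumptions for illustration:

```python
import posixpath


def presentation_mounts(user, enrolments, volume_name="filestore"):
    """Build one mount per (module, presentation) from a shared volume."""
    mounts = []
    for module, presentation in enrolments:
        folder = f"{module}-{presentation}"
        mounts.append({
            "name": volume_name,
            # Where the files appear inside the container.
            "mountPath": posixpath.join("/home", user, folder),
            # Assumed location of the same files on the shared filestore.
            "subPath": posixpath.join(user, folder),
        })
    return mounts


mounts = presentation_mounts("ab1234", [("A123", "23B"), ("B456", "23J")])
```

A hub for a single presentation would pass just its own (module, presentation) pair, while a shared hub could pass every current enrolment.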

The lots-of-hubs model probably also fits various internal admin/management/responsibility/billing models. Role wise, I think there are several things that need managing: looking after backend/infra; looking after the set up of one or more JupyterHubs in a fleet; creating images to run on the Hubs; managing an individual hub; managing the help desk for folk inside a container; managing the help desk at the hub level (lost passwords etc).

In a large org, these probably partition to backend, hubs/hub admin, and then managing issues inside each environment. There may be crossover in some responsibilities, eg in terms of who builds an image (a have-a-go instructor, or an image building support unit). User management is another issue (in a set of courses, it would make sense to have a process for auto-enrolling relevant students via auth affinity groups, vs a user with admin privileges setting up accounts for a research group on an ad hoc basis, for example).

It strikes me that there is another model, which is student first, and each student is given their own hub on which images become selectable as the student takes the corresponding course.

Hmm…

That could simplify a lot of things… But I wonder what the downsides are?! I guess the biggest issue is the number of servers that would be required if students have 1 each. For 100k students, that could get expensive!

If each student has their own JupyterHub, that might create more isolation than wished. The integration of nbgrader into a JupyterHub would not be usable, and neither would another idea I have previously played with, which was to allow students to share their notebooks with each other on a JupyterHub. This could be used for e.g. internal code reviews. You will most likely find a way to configure each of the thousands of JupyterHubs to play along with each other, but I guess it will be more effort.

One way to initialise a user’s storage is to configure a hook: Customizing User Environment — Zero to JupyterHub with Kubernetes documentation
Your script could simply check whether the required content is present, and download it if not.
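The check-then-copy idea could look something like the sketch below. It assumes a hypothetical skeleton directory baked into the image and a hypothetical marker file name (`.seeded`); how the script gets invoked from the hook is not shown, and the demo at the end runs in a throwaway directory rather than a real home:

```python
import shutil
import tempfile
from pathlib import Path


def seed_home(home, skeleton, marker=".seeded"):
    """Copy starter files into a freshly mounted home directory, once."""
    home, skeleton = Path(home), Path(skeleton)
    flag = home / marker
    if flag.exists():
        return False  # already initialised on an earlier launch
    # dirs_exist_ok lets us copy into the existing mount point (Python 3.8+).
    # Note: files with the same name in the home directory are overwritten.
    shutil.copytree(skeleton, home, dirs_exist_ok=True)
    flag.touch()
    return True


# Demo: seed an empty "home" from a skeleton holding some custom CSS.
with tempfile.TemporaryDirectory() as tmp:
    skel = Path(tmp, "skel")
    (skel / ".jupyter").mkdir(parents=True)
    (skel / ".jupyter" / "custom.css").write_text("body {}")
    home = Path(tmp, "home")
    home.mkdir()
    first = seed_home(home, skel)   # copies the files
    second = seed_home(home, skel)  # marker present, nothing to do
    seeded_css = (home / ".jupyter" / "custom.css").read_text()
```

The same pattern would apply to seeding a database: test for a marker (or the data directory's contents) before running the one-off initialisation.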

How much isolation do you need for different modules? If isolation at the Jupyter UI level is sufficient you could have a single persistent volume per user that gets mounted as their home directory for all modules, but configure your Docker images to use a subdirectory for each module as the working directory for JupyterLab. The content for all downloaded modules would be present in the user’s home but JupyterLab would present the subdirectory as the root so you could only see the rest of the content using a terminal.
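A small sketch of the subdirectory idea: each image starts the single-user server with its root set to a module-specific folder inside the shared home. `ServerApp.root_dir` is the Jupyter Server setting for the directory shown in the file browser; the folder naming scheme and the helper `lab_command` are assumptions:

```python
def lab_command(module, presentation, home="/home/jovyan"):
    """Build a start command that roots the file browser in a module folder."""
    root = f"{home}/{module}-{presentation}"
    return ["jupyterhub-singleuser", f"--ServerApp.root_dir={root}"]


cmd = lab_command("A123", "23B")
```

The folder would need to exist before the server starts, which the storage initialisation step could take care of.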

JupyterLab can load configuration at both the system and user level, so you can have separate extension configurations built in to your Docker images as long as your users don’t try to override the config in their home directory.

I suspect the model we are likely to go for is for each module to have its own directory, but have the server run from home. This means we can have a common configuration for custom scripts etc. That said, some courses or students may want different settings/extensions for different courses, which would argue for running the single user notebook server from the course directory. That could mean differences in customisation, which would give a variable experience across several modules.

For students who just take one course at a time (which is common in our distance ed setting) they are unlikely to experience more than one environment at any time. But as study patterns change, and uptake of Jupyter warez in courses grows, it becomes more likely that students may be working on two or more courses with their own Jupyter environments at the same time. We won’t really get a feel for what the issues are, or what students might prefer, until we have the first one or two “dual” presentations…

I am quite curious regarding that! Is it possible for students to edit the settings and add/remove extensions in your current setup? This would enable students to remove any existing differences in the JupyterHub experience manually but they might also misconfigure their working space.

It’s still to be proven and the final set up defined. If a server is started against a student’s home dir, and settings are picked up from a config file in the $HOME path, then they will be able to install and run extensions that can be installed into a running server, and may be able to install extensions that require a server restart, eg by quitting a docker session and then launching a new one.

My current take is that we should pre-install a whole set of extensions, and pre-enable some of them, allowing students then to customise the extensions they want to use further as they see fit. For an example of extensions we currently use, see OpenJALE.