BinderHub for HPC


#1

Opening up a conversation topic for deploying BinderHub on HPC. This came up in conversation earlier today with Guillaume Eynard-Bontemps.

  • Are there interested parties?
  • What are their use cases?
  • What would it take to implement such a thing?

cc @yuvipanda @betatim and @guillaumeeb (Guillaume Eynard-Bontemps)


#2

BinderHub as the Python package that exists today will probably never run on traditional HPC systems. However, if you break it down into components:

  1. Dynamic Image Building from a repository
  2. Launching an interactive web application from inside the image

Both of these can be achieved on HPC-style systems. For example, (1) could be accomplished by adding a backend for img to repo2docker. (2) is probably going to vary by installation and will need to be pretty customized, unfortunately.
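To make the two-phase split concrete, here is a minimal sketch of the pipeline on a machine that does have Docker, using the jupyter-repo2docker CLI; the repository URL and image name are placeholders, and on an HPC system phase 2 is exactly the part that would need a site-specific replacement:

    # Minimal sketch of the two phases; assumes the jupyter-repo2docker CLI
    # and a local Docker daemon are available. Repo URL and image name are
    # placeholders.
    import subprocess

    repo = "https://github.com/binder-examples/requirements"
    image = "example-binder-image:latest"

    # Phase 1: dynamically build an image from a repository.
    # --no-run builds the image without starting a container.
    subprocess.run(
        ["jupyter-repo2docker", "--no-run", "--image-name", image, repo],
        check=True,
    )

    # Phase 2: launch the interactive web application from inside the image
    # (repo2docker-built images start a notebook server by default).
    subprocess.run(["docker", "run", "-p", "8888:8888", image], check=True)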

So, I believe the workflows we enable with BinderHub should be portable to HPC environments; I just don’t think BinderHub, the piece of software we use today, is the right way to do it.


#3

Riffing off of what @yuvipanda said, I think in these conversations it’s important to disambiguate what we mean by “HPC”. @jhamman, could you go into a bit more detail about what you mean by HPC?


#4

Typically I mean a big shared cluster with many compute nodes, a shared file system, etc. At NCAR we have a few machines that fit this bill, the largest of which is called Cheyenne.

Some challenges that I see, beyond the lack of Kubernetes, are:

  • Often compute nodes do not have access to the outside network
  • Managing a server attached to an HPC system is not going to be popular with sysadmins
  • Provisioning resources requires waiting in a job queue
  • HPC systems are rarely container-friendly

I want to let Guillaume describe his use case though, because it’s likely closer to reality than mine…


#5

I can see a use case where facilities and universities want to offer a BinderHub instance to support reproducible research/teaching/…, and the resource they have is a ‘traditional HPC system’. Instead of using Kubernetes to run the created containers, they may want to do this on nodes that are allocated on demand via Slurm (because that fits into the existing model). Of course this would require that the allocation of nodes is near-instant, so there would have to be a dedicated queue for such resource requests, but that’s probably a standard operation.


#6

@fangohr described it pretty well! The idea is clearly to offer easy reproducible examples (on using HPC, data analysis with Pangeo, or anything doable within a notebook) and an easy training experience for a lot of people.

I’ve already instantiated a JupyterHub on our cluster. It is started on a virtual machine with direct access to our job-queue system (PBS Pro, but it would be the same with Slurm or any other) and our shared file system. It spawns notebook servers on PBS using batchspawner. Currently, to share notebooks and reproduce executions, we share conda environments on the file system; an interested user will then copy or link an IPython kernel spec pointing to this env, clone the notebook repository, and run it through JupyterHub.
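For reference, the kernel-sharing step described above can be scripted; a small sketch, where the shared environment path and kernel name are hypothetical examples:

    # Sketch of registering a kernel spec that points at a shared conda env;
    # the path and names are hypothetical.
    import subprocess

    # Running ipykernel's installer with the *shared* environment's
    # interpreter writes a kernel spec into the user's home directory
    # that points back at that environment.
    subprocess.run(
        ["/shared/envs/pangeo/bin/python", "-m", "ipykernel", "install",
         "--user", "--name", "pangeo",
         "--display-name", "Python (shared Pangeo env)"],
        check=True,
    )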

As @yuvipanda mentioned, there are two phases: building an image (or environment) from a repo, and spawning a notebook server in this image/environment.
I am under the impression that JupyterHub already has several ways of spawning interactive notebooks in several environments. Is this, or could it be, reused by Binder? At least we can build on top of this experience.
Could we then define an equivalent interface for the image/env-building part? Having an EnvironmentBuilder API or something like this?
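To make the idea concrete, a hypothetical sketch of such an interface; none of this exists in BinderHub or JupyterHub today:

    # Hypothetical EnvironmentBuilder interface; purely a sketch.
    from abc import ABC, abstractmethod
    import subprocess


    class EnvironmentBuilder(ABC):
        """Build a runnable environment from a repository."""

        @abstractmethod
        def build(self, repo_url: str, ref: str = "HEAD") -> str:
            """Build an environment and return an identifier for it
            (a Docker image name, a conda env path, ...)."""


    class DockerImageBuilder(EnvironmentBuilder):
        """Roughly what BinderHub does today, via repo2docker."""

        def build(self, repo_url: str, ref: str = "HEAD") -> str:
            image = "r2d-built-image"  # placeholder naming scheme
            subprocess.run(
                ["jupyter-repo2docker", "--no-run", "--ref", ref,
                 "--image-name", image, repo_url],
                check=True,
            )
            return image


    class CondaEnvBuilder(EnvironmentBuilder):
        """HPC-friendly variant: a conda env on the shared file system."""

        def build(self, repo_url: str, ref: str = "HEAD") -> str:
            raise NotImplementedError(
                "clone the repo, read environment.yml, run conda env create")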

To be clear, I currently have two ideas for deploying a BinderHub on our system:

  1. Use a big VM, install Docker on it, maybe even single-node Kubernetes, and put BinderHub on top of it. I don’t really like this approach, and it will probably be hard to negotiate with our security team.
  2. Make it closer to JupyterHub: use an environment builder (or HPC-compatible container/image builder) which just automatically creates a conda env on the shared file system, and use something like batchspawner for starting a notebook inside this env/image.

#7

Binder does (and will continue to) assume a container-based workflow, I think. I don’t think we are prepared to relax that assumption, so to do this, the deployment must be able to launch docker images somehow.

The only place where the BinderHub software right now technically assumes Kubernetes is in the launching, so we might be able to relax that, or you could run a dedicated, single-node Kubernetes cluster that just serves Binder itself and the build process.

BinderHub talks to JupyterHub via its API, so any JupyterHub installation that can launch specified images can technically be used with BinderHub now.

The alternative to this is to have a traditional JupyterHub installation and implement this environment-builder step in the existing Spawner options form. Then building the environment becomes the Spawner’s responsibility. Something very much like Binder can be built purely as a custom Spawner that takes a repo (and potentially resource requests) as input and does everything from there. If container-based launching is not an option, then I think this is a better approach than trying to use anything from Binder, ~all of which is about containers.
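A minimal sketch of that Spawner shape, assuming LocalProcessSpawner as the base purely for illustration; a real deployment would pick its own base class and do the actual building in start():

    # Sketch of a Spawner whose options form takes a repo; hypothetical.
    from jupyterhub.spawner import LocalProcessSpawner


    class RepoFormSpawner(LocalProcessSpawner):
        # HTML shown on JupyterHub's spawn page.
        options_form = """
        <label for="repo">Git repository:</label>
        <input name="repo" placeholder="https://github.com/user/repo">
        """

        def options_from_form(self, formdata):
            # Form values arrive as lists of strings; the returned dict
            # becomes self.user_options, available in start().
            return {"repo": formdata["repo"][0]}

        async def start(self):
            repo = self.user_options["repo"]
            # Build the environment from `repo` here (conda env, image, ...)
            # and point the single-user server at it before launching.
            return await super().start()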

We could try to have a generic EnvironmentBuilder API, but since it’s tied so tightly to the Spawner implementation (unless both sides are still forced to assume docker images), implementing it outside the Spawner may not offer a lot of benefit at this point.


#8

So what would be the principal steps or components for building this new Spawner (your alternative solution)?
Something like:

  • Build an options form close to the mybinder.org one,
  • Implement an EnvironmentBuilder which takes its specification from a .binder folder inside the input git repo (it could be just a conda env creation),
  • Spawn a notebook from there, as batchspawner does.

Can we rely on some binder code for the first two bullets?

We could call that BinderBatchSpawner!
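A sketch of what that could look like, assuming batchspawner’s SlurmSpawner as the base and a .binder/environment.yml in the repo; every name and path here is hypothetical:

    # Hypothetical BinderBatchSpawner: build a conda env from the repo, then
    # submit the notebook server as a batch job via batchspawner.
    import subprocess
    from batchspawner import SlurmSpawner


    class BinderBatchSpawner(SlurmSpawner):
        async def start(self):
            repo = self.user_options["repo"]  # e.g. from an options form
            env_name = f"binder-{self.user.name}"
            workdir = f"/shared/binder/{env_name}"  # shared file system

            # 1. Clone the repo onto the shared file system.
            subprocess.run(["git", "clone", repo, workdir], check=True)

            # 2. Build the environment from the repo's spec.
            subprocess.run(
                ["conda", "env", "create", "-n", env_name,
                 "-f", f"{workdir}/.binder/environment.yml"],
                check=True,
            )

            # 3. Activate the env in the batch script before the single-user
            #    server starts (req_prologue is prepended to the job script).
            self.req_prologue = f"source activate {env_name}"

            # 4. Submit the job through batchspawner as usual.
            return await super().start()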


#9

For step two you can reuse https://repo2docker.readthedocs.io/en/latest/; it removes the complexity of creating the environment in which a piece of software can be executed. It is the tool BinderHub uses to create the environment, which means that the BinderSpawner would be fully compatible.

The custom-spawner approach is the approach Everware took. Check out the (very old) code: https://github.com/everware/everware/blob/aefa4a993da6ea11b22122c04ebec04d700835ad/everware/spawner.py. This worked well, but if I had to build it again I would create a small web frontend that uses repo2docker, to which you can submit a “build the image” request, and then dynamically populate the spawner options form in JupyterHub from the list of images built by that first web page.

The main reason to split things up is that it was very awkward to have the spawn process last 10, 20, 30 minutes.
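A sketch of the second half of that split: assuming the build frontend labels the images it produces (the “binder.repo” label name here is made up), JupyterHub’s options form can be a callable that lists them:

    # Dynamically populate the spawner options form from locally built
    # images; assumes a hypothetical "binder.repo" label set by the builder.
    import docker


    def image_options_form(spawner):
        client = docker.from_env()
        images = client.images.list(filters={"label": "binder.repo"})
        choices = "\n".join(
            f'<option value="{img.tags[0]}">{img.tags[0]}</option>'
            for img in images
            if img.tags
        )
        return f'<select name="image">{choices}</select>'


    # In jupyterhub_config.py (options_form may be a callable):
    # c.Spawner.options_form = image_options_form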

I think if you can’t spawn a container but only (say) a conda environment, it would be polite not to call the spawner Binder*Spawner, as it would create the impression that a repository that works on BinderHub would also work here, which it won’t.


#10

Do you think it might be possible to have a subset of the repo2docker standard that is not as dependent on a container-based workflow? Most repositories I run across (obviously not a great statistic) seem to just have requirements.txt, environment.yml, and sometimes a postBuild. apt.txt seems to be less common (and is the only one that necessitates a container). Having a repo2docker --no-docker flag to build the environment locally would potentially provide a solution to this HPC issue. Longer term, it might also be a useful approach for hosted Jupyter environments like Colab/Kaggle, where the current standard is just !pip install-ing everything. My (poorly thought out) dream would of course be to just have a magic cell at the top like %repo2docker https://github.com/kmader/n5 and have everything just work.


#11

What benefit does repo2docker --no-docker https://example.com/some/repo give you over conda env create -f environment.yml? (My answer is: convenience.)

I agree that most Python-related repos use one of environment.yml or requirements.txt, which don’t require the use of a container. I also think repo2docker --no-docker ... would be pretty confusing (“this with docker but now without docker???1!!?”). It would also not work when people want R or Nix :-/

An interesting thing to do would be to create repo2conda-env, which works like repo2docker but uses a conda environment instead of a container and only supports a subset of the files repo2docker knows about. It would give you the same convenience but be clearer in terms of what it does (I think).
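For what it’s worth, a sketch of what such a repo2conda-env could do, supporting only the conda/pip subset; all names here are hypothetical:

    # Hypothetical repo2conda-env: build a conda env (not an image) from the
    # subset of repo2docker files that don't need a container.
    import os
    import subprocess
    import tempfile


    def repo2conda_env(repo_url: str, env_name: str) -> None:
        with tempfile.TemporaryDirectory() as workdir:
            subprocess.run(
                ["git", "clone", "--depth", "1", repo_url, workdir],
                check=True,
            )
            env_yml = os.path.join(workdir, "environment.yml")
            req_txt = os.path.join(workdir, "requirements.txt")

            if os.path.exists(env_yml):
                subprocess.run(
                    ["conda", "env", "create", "-n", env_name, "-f", env_yml],
                    check=True,
                )
            elif os.path.exists(req_txt):
                subprocess.run(
                    ["conda", "create", "-y", "-n", env_name, "python", "pip"],
                    check=True,
                )
                subprocess.run(
                    ["conda", "run", "-n", env_name,
                     "pip", "install", "-r", req_txt],
                    check=True,
                )
            else:
                raise ValueError("no supported environment spec found")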


#12

Yes, naming was never my thing; I was just interested in the idea. repo2conda-env is much better. While supporting postBuild for everything would be difficult, it would be nice to support downloading data or pretrained models and organizing folders.


#13

FWIW, @yuvipanda mentioned that eventually we’d like something like this for The Littlest JupyterHub. I’m not sure it makes sense for it to be a part of repo2docker, but what it could do is use the repo2docker config file specification (I know an official specification doesn’t exist yet) to go from repo -> environment.


#14

I think that repo2docker is two things:

  1. a record of standards and best practices for specifying environments (ideally, close to nothing that’s actually specific to repo2docker)
  2. an implementation that automates installing those requirements via a Dockerfile

So while I don’t think it’s appropriate at this point in time to support other build contexts in repo2docker itself, alternate implementations should be able to implement the same installation procedures based on repo2docker’s documentation (or at least the relevant subset).

So if there’s an action item for repo2docker here, it’s to make sure that it is clear and well-documented exactly what files we look for and what we do when we see them (I think we do the first part already, but maybe leave the second a little implicit). That way, another implementation, e.g. repo2conda, can implement its installation based on the spec, and clearly state what subset of the repo2docker environment spec is supported.


#15

https://github.com/jupyter/repo2docker/issues/330 and https://github.com/jupyter/repo2docker/issues/386 have some discussion around specifying an actual standard.

We should start writing up & discussing an actual specification soon.


#16

FWIW, there’s a similar issue on this here: https://github.com/jupyterhub/binderhub/issues/733. They don’t have HPC, but they’re an academic group with their own servers.

I’ve heard from a number of people who are interested in deploying their own academic BinderHubs but don’t have access to Kubernetes. It would be a big help if we could figure out documentation to make this easy.