Opening up a conversation topic for deploying BinderHub on HPC. This came up in conversation earlier today with Guillaume Eynard-Bontemps.
- Are there interested parties?
- What are their use cases?
- What would it take to implement such a thing?
BinderHub, as the Python package that exists today, will probably never run on traditional HPC systems. However, if you break it down into components:
These two can be achieved on HPC-style systems. For example, (1) could be accomplished by adding a backend for img to repo2docker. (2) is probably going to vary by installation and need to be pretty customized, unfortunately.
So, I believe the workflows we enable with BinderHub should be portable to HPC environments. I don’t think BinderHub, the piece of software we use, is the right way to do it.
Typically I mean a big shared cluster with many compute nodes, a shared file system, etc. At NCAR we have a few machines that fit this bill, the largest of which is called Cheyenne.
Some challenges beyond the lack of kubernetes that I see are:
I want to let Guillaume describe his use case though, because it’s likely closer to reality than mine…
I can see a use case where facilities and universities want to offer a BinderHub instance to support reproducible research, teaching, and so on, and the resource they have is a ‘traditional HPC system’. Instead of using Kubernetes to run the created containers, they may want to do this on nodes that are allocated on demand via Slurm (because that fits into the existing model). Of course this would require that the allocation of nodes is near-instant, so there would have to be a dedicated queue for such resource requests, but that’s probably a standard operation.
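To make the "dedicated queue" idea concrete, here is a minimal sketch that renders a Slurm batch script for a short interactive session. The partition name, the resource numbers and the launched command are placeholders, not settings from any real cluster; only the #SBATCH options themselves are standard Slurm.

```python
def slurm_batch_script(partition: str, minutes: int, cpus: int, mem_gb: int, command: str) -> str:
    """Render a batch script for a short interactive session.

    The partition is assumed to be a dedicated, lightly loaded queue so that
    the job starts almost immediately (a site-specific assumption).
    """
    hours, mins = divmod(minutes, 60)
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=interactive-notebook",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --time={hours:02d}:{mins:02d}:00",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        "",
        # Placeholder for whatever starts the notebook server on the node.
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(slurm_batch_script("interactive", 90, 2, 4, "jupyterhub-singleuser"))
```

In practice a tool such as batchspawner generates and submits a script like this on the user's behalf; the function above only shows which knobs the dedicated queue would need.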
@fangohr described it pretty well! The idea is clearly to offer easy reproducible examples (on using HPC, data analysis with Pangeo, or anything doable within a notebook) and an easy training experience for a lot of people.
I’ve already instantiated a JupyterHub on our cluster. It runs on a virtual machine that has direct access to our job queue system (PBS Pro, but it would be the same with Slurm or any other) and our shared file system. It spawns notebook servers on PBS using batchspawner. Currently, to share notebooks and reproduce executions, we share conda environments on the file system: an interested user copies or links an IPython kernel spec pointing to the environment, clones the notebook repository, and runs it through JupyterHub.
As @yuvipanda mentioned there are the two phases of image (or environment) building from a repo, and spawning a notebook server in this image/environment.
I am under the impression that JupyterHub already has several ways of spawning interactive notebooks in several environments. Is this, or could this be, reused by Binder? At least we can build on top of this experience.
Could we then define an equivalent interface for the building image/env part? Having an EnvironmentBuilder API or something like this?
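As a rough illustration of what such an interface might look like, here is a sketch of an abstract builder with a toy implementation. Every name in it (EnvironmentBuilder, BuiltEnvironment, RecordingBuilder) is hypothetical and does not exist in BinderHub, JupyterHub or repo2docker; the toy builder only computes where an environment would live and builds nothing.

```python
import hashlib
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class BuiltEnvironment:
    kind: str       # e.g. "docker-image" or "conda-prefix"
    reference: str  # image name, or path on the shared file system


class EnvironmentBuilder(ABC):
    """Hypothetical interface: turn (repo, ref) into something a Spawner can launch."""

    @abstractmethod
    def build(self, repo_url: str, ref: str) -> BuiltEnvironment:
        ...


class RecordingBuilder(EnvironmentBuilder):
    """Toy implementation: derives a deterministic prefix, performs no real build."""

    def __init__(self, root: str):
        self.root = root
        self.requests = []

    def build(self, repo_url: str, ref: str) -> BuiltEnvironment:
        self.requests.append((repo_url, ref))
        # Same repo and ref always map to the same prefix, so builds can be reused.
        digest = hashlib.sha256(f"{repo_url}@{ref}".encode()).hexdigest()[:12]
        return BuiltEnvironment("conda-prefix", f"{self.root}/{digest}")
```

The point of the sketch is the shape of the contract: the Spawner would only need to understand the `kind` values it can launch, which is also where the tight coupling to the Spawner shows up.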
To be clear, I have currently two ideas to deploy a BinderHub on our system:
Binder does (and will continue to) assume a container-based workflow, I think. I don’t think we are prepared to relax that assumption, so to do this, the deployment must be able to launch docker images somehow.
The only place where BinderHub software right now technically assumes Kubernetes is in the launching, so we might be able to relax that, or you could run a dedicated, single-node Kubernetes that just serves Binder itself and the build process.
BinderHub talks to JupyterHub via its API, so any JupyterHub installation that can launch specified images can technically be used with BinderHub now.
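For illustration, the request that starts a user's server through the JupyterHub REST API can be sketched as below. The endpoint (POST to /users/{name}/server with a JSON body of spawn options) is part of the JupyterHub API; the hub URL, token and user name are placeholders, and whether an "image" key in the body is honoured depends entirely on the configured Spawner, so treat that part as an assumption.

```python
import json
import urllib.request
from urllib.parse import quote


def build_launch_request(hub_api_url: str, api_token: str, username: str, image: str) -> urllib.request.Request:
    """Build (but do not send) the request asking JupyterHub to start a server."""
    url = f"{hub_api_url.rstrip('/')}/users/{quote(username, safe='')}/server"
    # The JSON body is handed to the Spawner as user options.
    body = json.dumps({"image": image}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"token {api_token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_launch_request(
        "http://127.0.0.1:8081/hub/api", "secret", "binder-user", "registry.example.org/demo:abc123"
    )
    print(req.get_method(), req.full_url)
    # Sending it would be: urllib.request.urlopen(req)
```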
The alternative to this is to have a traditional JupyterHub installation and implement this environment-builder step in the existing Spawner options form. Then building the environment becomes the Spawner’s responsibility. Something very much like Binder can be built purely as a custom Spawner that takes a repo (and potentially resource requests) as an input and does everything from there. If container-based launches are not an option, then I think this is a better approach than trying to use anything from Binder, ~all of which is about containers.
We could try to have a generic EnvironmentBuilder API, but since it’s tied so tightly to the Spawner implementation (unless both sides are still forced to assume docker images), implementing it outside the Spawner may not offer a lot of benefit at this point.
So what would be the principal steps or components for building this new Spawner (your alternative solution)?
Can we rely on some binder code for the first two bullets?
We could call that BinderBatchSpawner!
For step two you can reuse https://repo2docker.readthedocs.io/en/latest/. It removes the complexity of creating the environment in which a piece of software can be executed. It is the tool BinderHub uses to create the environment, which means that the BinderSpawner would be fully compatible.
The custom spawner approach is the approach Everware took. Check out the (very old) code: https://github.com/everware/everware/blob/aefa4a993da6ea11b22122c04ebec04d700835ad/everware/spawner.py. This worked well, but if I had to build it again I would create a small web frontend, built on repo2docker, to which you can submit a “build the image” request. Then I would dynamically populate the spawner options form in JupyterHub from the list of images built by that first web page.
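The second half of that design, populating the options form from the list of built images, could look roughly like this. JupyterHub does allow the Spawner's options_form to be a callable that returns HTML; the function name, the (label, reference) tuple format and the image names below are made up for the example.

```python
from html import escape


def render_options_form(images):
    """Render an HTML <select> from (label, image_reference) pairs.

    `images` is assumed to come from whatever service keeps track of
    finished builds; that service is not shown here.
    """
    if not images:
        return "<p>No images have been built yet.</p>"
    options = "\n".join(
        f'  <option value="{escape(ref, quote=True)}">{escape(label)}</option>'
        for label, ref in images
    )
    return (
        '<label for="image">Image</label>\n'
        '<select name="image" id="image">\n'
        f"{options}\n"
        "</select>"
    )


if __name__ == "__main__":
    print(render_options_form([("n5 demo", "registry.example.org/n5:1a2b")]))
```

Because the long-running build happens in the separate frontend, the spawn itself only has to start an image that already exists.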
The main reason to split things up is that it was very awkward to have the spawn process last 10, 20, 30 minutes.
I think if you can’t spawn a container but only (say) a conda environment, it would be polite not to call the spawner Binder*Spawner, as it would create the impression that a repository that works on BinderHub would also work, which it won’t.
Do you think it might be possible to have a subset of the repo2docker standard that was not as dependent on a container-based workflow? Most repositories I run across (obviously not a great statistic) seem to just have requirements.txt, environment.yml and sometimes a postBuild. apt.txt seems to be less common (and is the only one that necessitates a container). Having a repo2docker --no-docker flag to build the environment locally would potentially provide a solution to this HPC issue. Longer term it might also be a useful approach for hosted Jupyter environments like Colab/Kaggle, where the current standard is just !pip install-ing everything. My (poorly thought out) dream would of course be to just have a magic cell at the top like %repo2docker https://github.com/kmader/n5 and have everything just work.
What benefit does repo2docker --no-docker https://example.com/some/repo give you over conda env create -f environment.yml? (My answer is convenience.)
I agree that most Python-related repos use files such as requirements.txt, which don’t require use of a container. I also think repo2docker --no-docker ... would be pretty confusing (“this with docker but now without docker???1!!?”). It would also not work when people want R or nix :-/
An interesting thing to do would be to create repo2conda-env, which works like repo2docker but uses a conda environment instead of a container and only supports a subset of the files repo2docker knows about. It would give you the same convenience but be clearer in terms of what it does (I think).
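A very rough sketch of how such a tool might decide what to do is shown below. No repo2conda-env exists; the function name, the supported/unsupported split and the choice to let environment.yml take precedence over requirements.txt are assumptions made for the example, and files inside a binder/ subdirectory are not handled. The function only plans commands and does not run them.

```python
from pathlib import Path

# Files this hypothetical tool refuses to handle because they need a
# container or another toolchain (the exact split is an assumption).
UNSUPPORTED = ("apt.txt", "Dockerfile", "install.R", "default.nix")


def plan_environment(repo_dir: str, prefix: str) -> list:
    """Return the commands that would build an environment for the repo."""
    repo = Path(repo_dir)
    blockers = [name for name in UNSUPPORTED if (repo / name).exists()]
    if blockers:
        raise ValueError(f"not supported without a container: {', '.join(blockers)}")

    commands = []
    if (repo / "environment.yml").exists():
        commands.append(
            ["conda", "env", "create", "--file", str(repo / "environment.yml"), "--prefix", prefix]
        )
    elif (repo / "requirements.txt").exists():
        commands.append(["python", "-m", "venv", prefix])
        commands.append(
            [f"{prefix}/bin/python", "-m", "pip", "install", "-r", str(repo / "requirements.txt")]
        )
    if (repo / "postBuild").exists():
        # postBuild is an arbitrary script, so running it outside a
        # container deserves extra care.
        commands.append([str(repo / "postBuild")])
    return commands
```

Failing loudly on unsupported files keeps the tool honest about which subset of repositories it can reproduce.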
Yes, naming was never my thing, I was just interested in the idea. repo2conda-env is much better. While postBuild for everything would be difficult to support, it would be nice to have downloading of data or pretrained models and organization of folders supported.
FWIW, @yuvipanda mentioned that eventually we’d like something like this for The Littlest JupyterHub. Not sure if it makes sense to be a part of repo2docker, but what it could do is use the repo2docker config file specification (I know an official specification doesn’t exist yet) to go from repo -> envt
I think that repo2docker is two things:
So while I don’t think it’s appropriate at this point in time to support other build contexts in repo2docker itself, alternate implementations should be able to implement the same installation procedures based on repo2docker’s documentation (or at least the relevant subset).
So if there’s an action item for repo2docker here, it’s to make sure that it is clear and well-documented exactly what files we look for and what we do when we see them (I think we do the first part already, but maybe leave the second a little implicit). That way, another implementation, e.g. repo2conda, can implement its installation based on the spec, and clearly state what subset of the repo2docker environment spec is supported.
We should start writing up & discussing an actual specification soon.
FWIW There’s a similar issue here on this- https://github.com/jupyterhub/binderhub/issues/733 they don’t have HPC but they’re an academic group with their own servers.
I’ve heard from a number of people who are interested in deploying their own academic BinderHubs and don’t have access to Kubernetes. It would be a big help if we could figure out documentation to make this easy.
Hi! I wish to revive this conversation, unless there’s a better place I am not aware of, as I think the topic must still be relevant.
I don’t know the rationale with which BinderHub has been built, but given how inclusively the rest of the Jupyter ecosystem has been built, I find it surprising that BinderHub has been hooked so tightly to specific technologies (Helm and Docker), which according to this thread seems very hard to change.
Is this still the case? Is there any chance we can move forward with maintaining the goals but extending the support to any other technology?
“The primary goal of BinderHub is creating custom computing environments that can be used by many remote users”. For me it naively means running locally on my Linux machine, or in an HPC environment, or on a pre-configured cloud server, and more.
Am I in any way misunderstanding the goals here?
I think this is the right place for this discussion.
Which particular technologies, if any, did you have in mind? Which parts of “Binder” do you feel would benefit from working with those technologies as well as the ones they use right now? In what kind of situations do you end up wanting “Binder” but can’t use it because of the technology it uses or depends on?
I think we need to disentangle some of the tech. There are (at least) two things:
Despite the name repo2docker, we have been trying very hard to make the “docker” part of the name an implementation detail. For example, we say that the intermediate Dockerfile which repo2docker generates (and sometimes prints) is an implementation detail that could go away at any moment. This is meant to keep the door open for us to one day switch from docker to some other container image format or runtime. Or maybe add an additional one (there are PRs aiming at adding podman as an example of this: PRs #848, #806). So with 20/20 hindsight a better name would have been repo2container or some such.
BinderHub is “nothing but” a small web interface that launches repo2docker and then talks to some infrastructure to get the image scheduled, enforce network policies, resource constraints, etc. Most of the heavy work is done by repo2docker (building the container image) and by the infrastructure used to schedule the containers.
I don’t know if it makes sense to implement an alternative “backend” in BinderHub or start a new project “BinderHub-on-some-other-container-scheduler”. The answer depends a bit on what this other-container-scheduler would be I think.
I think that repo2docker will always need some form of container runtime. A simple chroot or BSD jail is probably not enough to provide the separation we need and the ability to base the “image” on Ubuntu. So if people would like to have other container runtimes or formats supported, then it would be worth sending Pull Requests to explore what that would look like.
So right now Binder (as the sum of BinderHub and repo2docker) relies on docker and Kubernetes. These are (seen globally, averaged over all users) the market leaders, or at least have the largest number of users. This is why we started with those first. Growing beyond that could/should be possible, but how much effort that would require depends on what those other technologies are.
One thing I don’t know of is something as powerful as Kubernetes when it comes to scheduling, separating, resource constraining, etc “container like things” on a multi-node cluster of VMs. It would be cool to learn more about what exists, especially if it isn’t based on docker containers.
Approaching the problem from a slightly different angle: there are plugins to the littlest JupyterHub (for example https://github.com/plasmabio/tljh-repo2docker) which allow you to build and launch things based on repo2docker without needing kubernetes. It is aimed at smaller scale (in terms of number of users and images) than a full scale BinderHub. As a result it is more friendly to other use cases like teaching.
In addition to what @betatim said you might find this forum thread interesting:
I think it’s also worth stepping back and asking “What is your ultimate goal?”, as this may affect whether modifying BinderHub is the right choice, or whether we’re looking for a new application inspired by BinderHub but with much bigger aims.
Do you want a way to dynamically spin up and destroy environments with automatic installation of dependencies? Are these environments conda environments, containers, VMs, or something else? Does it have to be fully automated, or is it ok to give the user a blank environment and tell them to run repo2something to install some dependencies?
Two of the big advantages of BinderHub are that it requires very little knowledge on the part of the user to obtain an environment where they can run stuff, and its reproducibility. Is this what HPC users want, or are the limitations of a fixed disposable environment likely to be frustrating, given they probably have above-average coding skills?
Edit: Another interesting thread:
Guys, thanks for explaining your points of view.
Let me talk about my particular case then, hopefully you can suggest how your work can relate to that.
I don’t know the details of your approach, so pardon me if I am saying anything sloppy here.
We have a JupyterHub setup which allows users to spin up notebook servers in an HPC environment. In our setup, kernels are provided in the form of conda environments and are integrated into the backend, so we have no need for containerisation and no need for creating or destroying environments and resources.
In other words, JupyterHub allows this configuration: users select resources, JupyterHub starts the JupyterLab server on the HPC, kernels are available, data are accessible.
Now, we also have an internal git repo, and I am exploring the idea of starting JupyterLab sessions from any repository. IMHO this is exactly what BinderHub is all about.
AFAIU I “only” need a system which clones the repository in question, and spawns a JupyterLab session as JupyterHub would do. BinderHub becomes “a JupyterHub which starts from a repo”.
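As a sketch of the "clone, then spawn as usual" idea, the helper below plans a shallow clone and returns the directory the notebook server would start in. The function name, the scratch-directory layout and the example URL are hypothetical; in a JupyterHub deployment something like this could plausibly be called from a pre-spawn hook, with the returned path used as the server's notebook directory, but that wiring is an assumption and is not shown.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def plan_repo_session(repo_url: str, ref: str, scratch_root: str):
    """Return (git clone command, checkout directory) for a repository session.

    `--branch` accepts branch and tag names but not bare commit hashes,
    so pinning to a commit would need a different approach.
    """
    name = PurePosixPath(urlparse(repo_url).path).stem or "repo"
    # Avoid nested directories for refs such as "feature/x".
    safe_ref = ref.replace("/", "-")
    checkout = str(PurePosixPath(scratch_root) / f"{name}-{safe_ref}")
    clone = ["git", "clone", "--depth", "1", "--branch", ref, repo_url, checkout]
    return clone, checkout


if __name__ == "__main__":
    cmd, path = plan_repo_session("https://git.example.org/group/project.git", "main", "/scratch/alice")
    print(" ".join(cmd))
    print("start the notebook server in:", path)
```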
It’s a service with capabilities in common with JupyterHub (i.e. spawning JupyterLab somewhere), and an additional layer for accessing specific repositories.
In my case, this would mean no need for containers, no need for Kubernetes.
I wish to emphasise that BinderHub sounds to me like the right idea for a certain kind of sharing, and I thank you for that.
But I don’t understand how it differs from JupyterHub, and I would prefer it not to limit the infrastructure choices, the same way JupyterHub doesn’t.
Features such as creating environments on the fly, although interesting in several cases, might be better off not being an integral part of the main architecture.
Does that make any sense?