GitHub Actions + Binder

:wave: Hello friends, I work at GitHub and I am exploring open source projects that can help make the notebook experience better, particularly around sharing & collaboration (aside from the rendering issues, which are being worked on). One idea I have is to promote more usage of Binder.

Concretely, these are the problems I’m trying to solve:

  • Enable public/private sharing of notebooks via binder. I’m not completely sure about how to facilitate the private case with various usernames and logins etc.
  • Figure out how to bind a specific config to a jupyter notebook. I often see repos with a few notebooks that each have a different environment, for example one notebook that shows exploratory data analysis with a specific docker container, followed by a machine learning notebook using a GPU docker container that is different from the first. I don’t want to go down the rabbit hole of proposing a solution to this, but something like having metadata embedded in the notebook somehow like yaml/json in the markdown that binder can use to override/locate the right config for a particular notebook?
  • Figure out how to suggest the type of compute that a notebook should run on (CPU, memory, GPU, etc.) automatically in some kind of config that is bound to the notebook as well. The goal is to help folks land on the right compute footprint when they click the “launch on binder” button. If this is not specified, then I would try to route folks to a default compute footprint. I can appreciate that this is a somewhat complicated problem, as there is no universal way to specify compute hardware across multiple clouds and infrastructure, but this is definitely a problem that users have when viewing each other’s notebooks.
  • A way to sync with Binder’s caching logic so that it looks for a Docker image tagged with the commit SHA before forcing a rebuild when new code has been pushed to GitHub. My goal is to proactively keep the environment fresh with GitHub Actions so that when people click the “launch binder” button, the environment doesn’t have to build.

Once I figure out some of the above things, my plan was to publish GitHub Actions and materials that will:

  • Automatically provide a link to Binder corresponding to the relevant branch when someone opens a PR with a notebook.
  • Automatically detect notebooks without a Binder link and have GitHub Actions open a PR adding a badge to those notebooks (and perhaps to the README as well).
  • Show examples on three major public clouds (AWS, Azure, GCP) on how to host your own binder for your private team, ideally with some thoughts on cost management and dynamic scaling.
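For the PR link and badge ideas in the list above, here is a minimal Python sketch of how an Action could build the launch URL and the badge Markdown. It assumes the public mybinder.org URL scheme (`/v2/gh/<owner>/<repo>/<ref>` with an optional `filepath` query parameter); the function names and the example repo are hypothetical.

```python
from urllib.parse import quote


def binder_url(owner, repo, ref, notebook=None, host="https://mybinder.org"):
    """Build a launch URL for a GitHub repo at a given branch, tag, or commit."""
    # Branch names may contain "/", so escape the ref as a single path segment.
    url = f"{host}/v2/gh/{owner}/{repo}/{quote(ref, safe='')}"
    if notebook:
        # Open a specific notebook instead of the file browser.
        url += f"?filepath={quote(notebook)}"
    return url


def binder_badge(url):
    """Return Markdown for the standard Binder badge pointing at `url`."""
    return f"[![Binder](https://mybinder.org/badge_logo.svg)]({url})"


if __name__ == "__main__":
    # Hypothetical repo and branch, e.g. taken from the pull request event.
    link = binder_url("octo", "demo", "feature/x", "notebooks/eda.ipynb")
    print(binder_badge(link))
```

A workflow could post the resulting Markdown as a PR comment, using the head branch of the pull request as `ref`.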

I’m happy to work on or help with any of these things. However, it would be useful to see if I have any blind spots or if there are solutions to the above that I do not know about. I also realize that I packed a whole bunch of things into this thread, and I’m happy to break these items into separate threads if that is useful. Thanks for your help!


cc: @betatim this is the thread following up from https://twitter.com/betatim/status/1193583670751305730?s=20

Thanks for the interest and posting here!

Some quick replies and links to related discussions. The long answer to each of the points you mention is “it depends…” :smiley:

The word “Binder” has many meanings, so I will use BinderHub to refer to the software you’d use to run a site like mybinder.org, on which users can start “Binders” from “repos”. These “repos” don’t have to be Git repositories; they could be almost anything that can be made to look like a directory (for example, you can start from a zenodo.org deposit instead of a Git repo).

To increase the potential for confusion even further BinderHub itself consists of several pieces of software (repo2docker, JupyterHub, etc) and you can configure all of them in different ways depending on what kind of deployment you have/setup you want.

After all this language lawyering let’s get to it :slight_smile:

You can set up a BinderHub with authentication instead of the fully anonymous mode in which we operate mybinder.org. With auth comes the possibility to access private repos, offer persistent disk to users, and let them push changes back to the source “repo”.

I can’t think of a reason why you couldn’t store a requirements.txt, environment.yml, etc. in the metadata of a notebook. I can’t find it again right now, but I saw someone try this with a prototype many years ago; it seems like it never took off. In the context of repo2docker it would be interesting to try this out, because you could unpack the notebook and its metadata into separate files so that it “looks like a directory”, which is all we need. A content provider that does this would be cool.
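To make the “unpack the notebook so it looks like a directory” idea concrete, here is a rough Python sketch. It is not an existing repo2docker content provider: the metadata key `binder_environment` and the function name are made up for illustration, and the notebook format does not define a standard field for this.

```python
import json
import shutil
from pathlib import Path

# Hypothetical metadata key mapping file names to file contents, e.g.
# {"requirements.txt": "numpy\npandas\n"}. Not part of nbformat.
ENV_KEY = "binder_environment"


def unpack_notebook(notebook_path, target_dir):
    """Write environment files stored in notebook metadata next to the notebook.

    Returns the sorted list of environment file names that were written.
    """
    notebook_path = Path(notebook_path)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    notebook = json.loads(notebook_path.read_text(encoding="utf-8"))
    env_files = notebook.get("metadata", {}).get(ENV_KEY, {})

    written = []
    for name, content in env_files.items():
        # Accept plain file names only, so metadata cannot escape target_dir.
        if name in {"", ".", ".."} or Path(name).name != name:
            raise ValueError(f"unsafe file name in notebook metadata: {name!r}")
        (target_dir / name).write_text(content, encoding="utf-8")
        written.append(name)

    # Copy the notebook itself so the directory is self-contained.
    shutil.copyfile(notebook_path, target_dir / notebook_path.name)
    return sorted(written)
```

The resulting directory could then be handed to repo2docker like any other local repo.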

Another thing to explore is to make some UI tool that takes the notebook (and its context) that the user wants to share and creates a Gist from that. Then share the Gist via BinderHub.

Beyond experiments I think things quickly get tricky in terms of how to make sane links for sharing, deal with ambiguity, etc.

A related issue from a while back https://github.com/jupyterhub/binderhub/issues/555

We have hesitated to add something here because we would prefer to reuse a file format for specifying CPU, RAM and other resources instead of having to invent one. You also quickly get into discussions of how to deal with the case where a repo specifies resources that a particular BinderHub does not have. And how to specify “custom” resources. There is https://github.com/jupyterhub/binderhub/issues/731 with some thoughts.

Related to this is “shouldn’t the operator of the BinderHub decide this?”. For example by giving more resources to repos from the entity that runs the BinderHub (think a university or company giving more CPU to repos on their GitHub org). And if we let the operators decide then how do we make the file format universal?
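One way to picture the tension between what a repo asks for and what the operator allows is the small Python sketch below. Everything in it is hypothetical (the `Resources` fields, the `grant` function, the units); it only illustrates the policy “honour the request up to the operator’s limits, otherwise fall back to a default”, and does not reflect any existing BinderHub feature or file format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource description; not an existing file format."""
    cpu: float        # CPU cores
    memory_gb: float  # RAM in GiB
    gpu: int = 0      # number of GPUs


def grant(requested: Optional[Resources], limits: Resources, default: Resources) -> Resources:
    """Decide what a launch actually gets on a particular hub."""
    if requested is None:
        # The repo did not say anything: use the operator's default footprint.
        return default
    # Never exceed what the operator of this hub is willing to hand out.
    return Resources(
        cpu=min(requested.cpu, limits.cpu),
        memory_gb=min(requested.memory_gb, limits.memory_gb),
        gpu=min(requested.gpu, limits.gpu),
    )
```

A hub without GPUs would simply set `limits.gpu` to 0, so a request for a GPU degrades silently; whether that is the right behaviour (versus refusing to launch) is exactly the kind of ambiguity discussed above.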

We try to be somewhat clever with rebuilds to make them faster if you only changed the contents but not the dependencies. How well this works depends on how old the Docker cache on the nodes is, what changes you made, and how you specify your dependencies (e.g. if a repo uses a setup.py, then all bets are off as to what is safe to reuse once something in the repo has changed). This is a deep end of the pool starting point for the ideas behind the “clever things we try”.

Besides all these “trying to be clever” approaches there is also the “throw money at it” solution: you can (and some people already do) trigger a build on mybinder.org via the API when you merge a PR. The reason this is the “throw money at it” solution is that mybinder.org could probably not afford it if lots of repos started doing this. We’d have to rate limit builds per repo or something like that.
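As an illustration of triggering a build via the API, here is a hedged Python sketch. It assumes the BinderHub build endpoint `/build/gh/<owner>/<repo>/<ref>`, which streams server-sent events as `data: {...}` lines with a `phase` field; the phase names checked below are my understanding of that stream and should be verified against the BinderHub documentation. The function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def build_url(owner, repo, ref, host="https://mybinder.org"):
    """URL of the build endpoint for a GitHub repo at a given ref."""
    return f"{host}/build/gh/{owner}/{repo}/{quote(ref, safe='')}"


def parse_event(line):
    """Return the JSON payload of one 'data: {...}' line, or None for other lines."""
    if not line.startswith("data:"):
        return None
    return json.loads(line[len("data:"):].strip())


def trigger_build(owner, repo, ref):
    """Request a build and return the last interesting phase that was seen."""
    with urlopen(build_url(owner, repo, ref)) as response:
        for raw in response:
            event = parse_event(raw.decode("utf-8").strip())
            if event is None:
                continue
            phase = event.get("phase")
            # Assumed phase names; stop once the image exists or the build failed.
            if phase in {"built", "ready", "failed"}:
                return phase
    return None
```

Given the cost concern above, something like this is better run once per merge to the default branch than on every push.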

Yet another idea that has recently been floated is to check several Docker registries for an existing image. Right now each BinderHub has its own Docker registry which it checks. You could imagine that a BinderHub checks its local registry and then a global one like Docker Hub, or one shared by the clusters that back mybinder.org. You’d need some form of trust between these registries, and often moving a 2GB image across the internet can be as slow/fast as rebuilding it (right now, for a repo that is already built, the majority of the wait time is due to moving the image from the local Docker registry to the right node in the cluster, e.g. between gcr.io and a VM inside Google Cloud).
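A rough Python sketch of the “check several registries in order” idea follows. It uses the manifest endpoint of the Docker Registry HTTP API v2 (`HEAD /v2/<name>/manifests/<reference>`); real registries usually also require a token and specific `Accept` headers, which this sketch leaves out, so treat it as an outline rather than something BinderHub does today. The function names are hypothetical.

```python
from urllib.request import Request, urlopen


def manifest_url(registry, image, tag):
    """Manifest endpoint of the Docker Registry HTTP API v2."""
    return f"https://{registry}/v2/{image}/manifests/{tag}"


def image_exists(registry, image, tag, timeout=10):
    """Return True if the registry answers 200 for the manifest (no auth handling)."""
    request = Request(manifest_url(registry, image, tag), method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers HTTP errors such as 401/404, DNS failures, and timeouts.
        return False


def find_image(registries, image, tag, exists=image_exists):
    """Return the first registry (local first, then shared ones) that has the image."""
    for registry in registries:
        if exists(registry, image, tag):
            return registry
    return None
```

Passing `exists` as a parameter keeps the lookup order testable without network access, and is where a trust check between registries could be added.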

Yet another idea for faster startups (and maybe easier sharing of individual notebooks) is the idea of “binder base boxes” into which you pull an individual notebook. Check out “Tip: embed custom github content in a Binder link with nbgitpuller”, the posts in that thread, and the threads linked from there.
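For reference, a “base box plus nbgitpuller” link can be assembled roughly as in the Python sketch below. It assumes the environment repo has nbgitpuller installed and that the `git-pull` endpoint accepts `repo`, `branch`, and `urlpath` query parameters, which is my understanding of nbgitpuller’s link format; the function name and repos are hypothetical.

```python
from urllib.parse import quote, urlencode


def nbgitpuller_binder_url(env_owner, env_repo, env_ref,
                           content_repo_url, content_branch,
                           open_path=None, host="https://mybinder.org"):
    """Binder link that launches the environment repo, then pulls the content repo."""
    params = {"repo": content_repo_url, "branch": content_branch}
    if open_path:
        # Where to land after the pull, e.g. "tree/<repo>/<notebook>".
        params["urlpath"] = open_path
    # The nbgitpuller URL is itself passed as Binder's `urlpath` value,
    # so it has to be percent-encoded a second time.
    inner = "git-pull?" + urlencode(params)
    base = f"{host}/v2/gh/{env_owner}/{env_repo}/{quote(env_ref, safe='')}"
    return f"{base}?urlpath={quote(inner, safe='')}"
```

Because only the environment repo is built, changing the content repo does not trigger a rebuild, which is where the faster startup comes from.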

Sorry for the very long reply. I hope you can find some reading material to help you get an idea of the ideas floating around. The main thing I wanted to get across is that there are ideas for a lot of the topics you raised, and those that are unsolved or have no good ready-made solution yet “just” need more brains, discussion and work. They aren’t fundamentally impossible, nor have they been rejected as undesirable.

Happy to dig deeper on any of these things or clarify things that I didn’t explain very well/are confusing.


Thanks, @betatim ! This is very helpful! I’ll take a look at all of this and let you know if I have any questions


I love the idea of making the actions for binder.

As someone who has played around with trying to share work with Binder a lot, I’d like to just toss in my $0.02 on something Tim mentioned. I’m a bit impatient, so load times have been something I’ve spent a lot of time optimizing for. In fact, the rabbit hole I fell down with Docker definitely landed me my current architecting role.

“Boxed binders” using nbgitpuller has been the way I’ve sped up most of my launch times. I built a Docker image that works for 90%+ of my use-cases, and Binder seems happy to have it cached. I was inspired by @betatim’s kaggle-binder and what Ines did with spacy (isolating the binder builds to an orphan branch to minimize rebuilds).

So, having a “library” of “boxed binders” that offer a few configs may be a good solution for many users, and can keep resources lighter than the massive kaggle image.

I discovered last weekend that the images in jupyter/docker-stacks have the ability to run a pre-spawn hook. I started playing with this by making it clone a gist of mine and run some commands based on the content. This keeps “user environment settings” (or more) up to date without needing to edit the images themselves. Kind of like extending the nbgitpuller in a hacky way.

I manage a jupyterhub for a math department, so this is something I’m looking into to allow profs to “customize” (downgrade/upgrade/add/remove) the “base images” options I’m providing. I spent a bunch of time creating slimmed-down but feature-rich images based on the jupyter stacks, and am planning to offer them as a dropdown list enabled by Spawner.options (I think that’s what it’s called).
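As a sketch of the dropdown idea: if the hub happens to run on Kubernetes with KubeSpawner (an assumption; other spawners expose this differently, for example through `Spawner.options_form`), the list of images can be expressed as a `profile_list`. The helper name and image tags below are placeholders.

```python
# Placeholder (display name, image) pairs; swap in the slimmed-down images.
IMAGES = [
    ("Minimal Python", "jupyter/minimal-notebook:latest"),
    ("Scientific Python", "jupyter/scipy-notebook:latest"),
]


def build_profile_list(images):
    """Build a KubeSpawner-style profile list from (display_name, image) pairs."""
    profiles = []
    for index, (display_name, image) in enumerate(images):
        profiles.append({
            "display_name": display_name,
            # Settings applied to the spawner when this profile is chosen.
            "kubespawner_override": {"image": image},
            # Preselect the first entry in the dropdown.
            "default": index == 0,
        })
    return profiles


# In jupyterhub_config.py, where `c` is provided by JupyterHub:
# c.KubeSpawner.profile_list = build_profile_list(IMAGES)
```

Keeping the image list as plain data makes it easy to let each professor’s choice live in a small file rather than in the hub config itself.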

So, this isn’t super far off from what Binder can already do, especially if someone tunes the dockerfile, but that may be above what most people want to deal with. Being able to choose a boxed image that determines the hyperlink Binder gives you (as opposed to messing around with nbgitpuller on top of a base image like the kaggle-binder repo) could be more user-friendly.

One of the limitations of nbgitpuller is that I can’t use multiple redirect_urls (to enable lab by default, or to open a notebook, for example). My idea is to allow professors to edit a template of startup scripts, one of which tells it where to get the content, another the environment, etc. That is just a step below editing Dockerfiles themselves, but it keeps my workload smaller since I don’t have to build images for each of them as things come up during the semester.

In any case, sign me up to test out anything/everything you come up with. I pay a lot of attention to how intuitive/easy these things are.

And the GitHub Actions that run automatically for PRs are amazing, but I totally understand how they can overwhelm resources. Can we somehow flag that these environments should not be kept in memory as long? (Closed PR = delete? Stale PR = delete?)


@hamel

There was some early thinking on dependency metadata fields in this Jupyter notebook thread: https://github.com/jupyter/nbformat/pull/60

I’m sure I’ve seen some other similar treatments over last few months but can’t find them offhand… Hmm… [Ah, this is maybe one of them: ipydeps]

One way round the hugely bloated kernel option is to provide multiple kernels with slightly different dependencies in each (although that can still result in a large image). JupyterWith takes that approach, I think?

See also a recent post on this forum which considers a “proxy kernel” that mediates specific kernel requirements: Guix-Jupyter: Towards self-contained, reproducible notebooks

By the by, the repo2docker github action to autobuild a docker image and push it to DockerHub also reminds me of this CircleCI recipe to do a similar thing: binder-examples/continuous-build


@psychemedia Thanks so much for these resources. I will also take a look at these and will follow up on this thread with questions!


Hi everyone! This is a great discussion. Unfortunately I failed to review discourse before starting an exploration of GitHub Actions+Repo2Docker on my own this week, and it seems I would have benefited from @hamel’s pre-built action. Instead I modified the documentation @psychemedia pointed to for circleci. We’ve been using travis+repo2docker to create images used on different hubs and binderhubs here https://github.com/pangeo-data/pangeo-stacks.

I’m really happy with GitHub Actions so far:
https://github.com/scottyhq/repo2docker-githubci

In particular:

  • there is one less link in the chain (no accounts / credentials on circleci or travis)
    • this also makes it easier for others to copy the template repo and build custom images
  • by putting tagged images on dockerhub it is easy to deploy locally or on different binderhubs
  • cached builds seem to be working, the interface is pretty intuitive

I also experimented with hosting the built images themselves as GitHub Packages. This would be great because then everything is in one place and CI would operate without any secrets! But pulling these images requires generating a GitHub token, so for now this seems less intuitive than hosting on DockerHub.
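For anyone wiring this up themselves, the core of such a CI step is small. The Python sketch below assembles a repo2docker command that builds the repo and tags the image with the commit SHA, using repo2docker’s `--no-run`, `--image-name`, and `--push` options and the `GITHUB_SHA` variable that GitHub Actions sets. The image name is a placeholder, and registry login is assumed to happen in an earlier workflow step; this is not the exact content of the linked template repo.

```python
import os
import subprocess


def repo2docker_command(image, sha, repo_dir=".", push=True):
    """Assemble a repo2docker call that builds `repo_dir` and tags it with `sha`."""
    command = [
        "jupyter-repo2docker",
        "--no-run",                         # build only, do not start a server
        "--image-name", f"{image}:{sha}",   # one tag per commit
    ]
    if push:
        command.append("--push")            # push to the registry after building
    command.append(repo_dir)
    return command


if __name__ == "__main__":
    # GITHUB_SHA is set by GitHub Actions; "octo/demo" is a placeholder image name.
    sha = os.environ.get("GITHUB_SHA", "local")
    subprocess.run(repo2docker_command("octo/demo", sha), check=True)
```

Tagging by SHA means a hub can look up the image for an exact commit instead of relying on a moving `latest` tag.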
