GitHub Actions + Binder

betatim · November 10, 2019, 10:45pm

Thanks for the interest and posting here!

Some quick replies and links to related discussions. The long answer to each of the points you mention is “it depends…”

The word "Binder’ has many meaning so I will use BinderHub to refer to the software you’d use to run a site like mybinder.org on which users can start “Binders” from “repos”. These “repos” don’t have to be Git repositories, they could be almost anything that can be made to look like a directory (for example you can start from a zenodo.org deposit instead of a git repo).

To increase the potential for confusion even further BinderHub itself consists of several pieces of software (repo2docker, JupyterHub, etc) and you can configure all of them in different ways depending on what kind of deployment you have/setup you want.

After all this language lawyering let’s get to it

You can setup a BinderHub with authentication instead of the fully anonymous mode we operate it in for mybinder.org. With auth comes the possibility to access private repos, offer persistent disk to users and let them push changes back to the source “repo”.

I can’t think of a reason why you couldn’t store a requirements.txt, environment.yml, etc in the metadata of a notebook. I can’t re-find it right now but I’ve seen someone try this with a prototype many years ago but it seems like it never took off. In the context of repo2docker it would be interesting to try this out because you could unpack the notebook and its metadata into separate files so that it “looks like a directory”, which is all we need. A content provider that does this would be cool.

Another thing to explore is to make some UI tool that takes the notebook (and its context) that the user wants to share and creates a Gist from that. Then share the Gist via BinderHub.

Beyond experiments I think things quickly get tricky in terms of how to make sane links for sharing, deal with ambiguity, etc.

A related issue from a while back Allow for multiple, different dependencies per repository · Issue #555 · jupyterhub/binderhub · GitHub

We have hesitated to add something here because we would prefer to reuse a file format for specifying CPU, RAM and other resources instead of having to invent one. You also quickly get into discussions of how to deal with the case where a repo specifies resources that a particular BinderHub does not have. And how to specify “custom” resources. There is Select pod resources from binder UI · Issue #731 · jupyterhub/binderhub · GitHub with some thoughts.

Related to this is “shouldn’t the operator of the BinderHub decide this?”. For example by giving more resources to repos from the entity that runs the BinderHub (think a university or company giving more CPU to repos on their GitHub org). And if we let the operators decide then how do we make the file format universal?

We try and be somewhat clever with rebuilds to make it faster if you only changed the contents but not the dependencies. How well this works depends on how old the docker cache of the nodes is and what changes you made/how you specify your dependencies (e.g. if a repo uses a setup.py then all bets are off when something in the repo changed as to what is safe to reuse and what isn’t. This is a deep end of the pool starting point for the ideas behind the “clever things we try”.

Besides all these trying to be clever approaches there is also the “throw money at it” solution: you can (and some people already do) trigger a build on mybinder.org via the API when they merge a PR. The reason this is the “throw money at it” solution is that as mybinder.org we could probably not afford it, if lots of repos started doing this. We’d have to rate limit builds per repo or something like that.

Yet another idea that has recently been floated is to check several docker registries for an existing image. Right now each BinderHub has its own docker registry which it checks. You could image that a BinderHub checks its local registry and then a global one like Docker Hub or one shared by the clusters that back mybinder.org. You’d need some form of trust between these registries and often moving a 2GB image across the internet can be as slow/fast as rebuilding it (right now for a repo that is already built the majority of the wait time is due to moving the image from the local docker registry to the right node in the cluster, e.g. between gcr.io and a VM inside the Google cloud.)

Yet another idea for faster startups (and maybe easier sharing of individual notebooks) is the idea of “binder base boxes” into which you pull an individual notebook. Check out Tip: speed up Binder launches by pulling github content in a Binder link with nbgitpuller - #31 by betatim and posts in that thread and threads linked from there.

Sorry for the very long reply. I hope you can find some reading material to help you get an idea of the ideas floating around. the main thing I wanted to get across is that there are ideas for a lot of the topics you raised and those that are unsolved/have no good ready made solution yet “just” need more brains, discussion and work. They aren’t fundamentally impossible or have been rejected as undesirable.

Happy to dig deeper on any of these things or clarify things that I didn’t explain very well/are confusing.

Topic		Replies	Views
How to reduce mybinder.org repository startup time discuss	60	42402	December 1, 2022
Embed binder-related metadata in notebook? Binder	8	1336	August 11, 2021
Repo2Docker: make it easy to start from arbitrary docker image discuss	16	3438	April 27, 2019
Jovian.ml increased usage in Binder General	8	1877	October 3, 2020
Tip: speed up Binder launches by pulling github content in a Binder link with nbgitpuller discuss tip	73	11638	January 7, 2024

GitHub Actions + Binder

Related topics