Hello friends, I work at GitHub and I am exploring open source projects that can improve the notebook experience in areas such as sharing and collaboration (aside from the rendering issues, which are being worked on). One idea I have is to promote more usage of Binder in the following ways:
- Publish a series of GitHub Actions that encourage people to use Binder and repo2docker. Example: repo2docker action.
- Write / contribute docs
Concretely, these are the problems I’m trying to solve:
- Enable public/private sharing of notebooks via Binder. I’m not completely sure how to facilitate the private case, with its various usernames, logins, etc.
- Figure out how to bind a specific config to a Jupyter notebook. I often see repos with a few notebooks that each need a different environment: for example, an exploratory data analysis notebook that uses one Docker container, followed by a machine learning notebook that uses a different, GPU-enabled container. I don’t want to go down the rabbit hole of proposing a solution, but one option might be metadata embedded in the notebook (for example, YAML/JSON in a markdown cell) that Binder can use to locate or override the config for that particular notebook.
- Figure out how to suggest the type of compute that a notebook should run on (CPU, memory, GPU, etc.) automatically, in some kind of config that is also bound to the notebook. The goal is to help folks land on the right compute footprint when they click the “launch on binder” button. If nothing is specified, I would route folks to a default compute footprint. I appreciate that this is a somewhat complicated problem, as there is no universal way to specify compute hardware across multiple clouds and infrastructure, but it is definitely a problem that users have when viewing each other’s notebooks.
- A way for Binder’s caching logic to look for a Docker image tagged with the commit SHA before forcing a rebuild when new code has been pushed to GitHub. My goal is to proactively keep the environment fresh with GitHub Actions so that when people click the “launch binder” button, the environment doesn’t have to build.
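To make the notebook-bound config idea a bit more concrete, here is a minimal sketch in Python. It relies only on the fact that an .ipynb file is JSON with a top-level "metadata" object. Everything else is an assumption for illustration: the "binder" key, its "config_dir" and "resources" fields, and the default footprint are hypothetical names, not something Binder or repo2docker reads today.

```python
import json

# Hypothetical default compute footprint, used when a notebook gives no hints.
DEFAULT_FOOTPRINT = {"cpu": 1, "memory": "2G", "gpu": 0}


def binder_hints(notebook_json: str) -> dict:
    """Read hypothetical Binder hints from a notebook's top-level metadata.

    The "binder" metadata key and its fields are assumptions made for this
    sketch; they are not part of nbformat or of Binder.
    """
    notebook = json.loads(notebook_json)
    hints = notebook.get("metadata", {}).get("binder", {})
    # Per-notebook resource hints override the defaults key by key.
    resources = {**DEFAULT_FOOTPRINT, **hints.get("resources", {})}
    # None means "fall back to whatever config the repository already has".
    return {"config_dir": hints.get("config_dir"), "resources": resources}


# A tiny in-memory notebook that asks for a GPU environment.
example_notebook = json.dumps({
    "cells": [],
    "metadata": {"binder": {"config_dir": "envs/gpu", "resources": {"gpu": 1}}},
    "nbformat": 4,
    "nbformat_minor": 5,
})

hints = binder_hints(example_notebook)
print(hints["config_dir"], hints["resources"])
```

A notebook without the hypothetical key simply gets the default footprint, which matches the "route folks to a default" behavior described above.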
Once I figure out some of the above things, my plan was to publish GitHub Actions and materials that will:
- Automatically provide a link to Binder corresponding to the relevant branch when someone opens a PR with a notebook.
- Automatically detect notebooks without a Binder link and have GitHub Actions open a PR that adds a badge to those notebooks. (Perhaps the same for the README.)
- Show examples for three major public clouds (AWS, Azure, GCP) of how to host your own Binder for a private team, ideally with some thoughts on cost management and dynamic scaling.
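As a rough sketch of what the link and badge automation could generate, the Python below builds a launch URL for a given branch and wraps it in a Markdown badge. It assumes the public mybinder.org URL layout for GitHub repos (/v2/gh/owner/repo/ref), its filepath query parameter, and its badge_logo.svg image; other deployments may use a different host or parameter. The function names and the example owner, repo, and branch are hypothetical.

```python
from urllib.parse import quote

# Assumption: the public mybinder.org deployment; a private BinderHub
# would use its own host.
BINDER_HOST = "https://mybinder.org"


def binder_url(owner: str, repo: str, ref: str, notebook_path: str) -> str:
    """Build a launch URL for one notebook at a given branch, tag, or commit."""
    # Percent-encode each part so slashes in branch names or paths
    # do not get mistaken for URL path separators.
    parts = "/".join(quote(p, safe="") for p in (owner, repo, ref))
    return f"{BINDER_HOST}/v2/gh/{parts}?filepath={quote(notebook_path, safe='')}"


def binder_badge(url: str) -> str:
    """Wrap a launch URL in a Markdown badge."""
    return f"[![Binder]({BINDER_HOST}/badge_logo.svg)]({url})"


# Hypothetical repository and PR branch.
url = binder_url("octocat", "demo", "feature/x", "notebooks/eda.ipynb")
print(binder_badge(url))
```

A GitHub Action could post the resulting Markdown as a PR comment, or insert it into a notebook's first markdown cell or the README.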
I’m happy to work on or help with any of these things. However, it would be useful to know whether I have any blind spots, or whether there are existing solutions to the above that I do not know about. I also realize that I packed a whole bunch of things into this thread, and I’m happy to break these items into separate threads if that is useful. Thanks for your help.