Binder Notebook Builder Bot

This is an idea we came up with at the 2019 Pangeo Community Meeting. The aim of my post is to encourage people to work on something I don’t really have the expertise to do myself, but which I think would have a huge impact.

Many projects have an “example gallery” of notebooks (example: dask). It’s great to have these examples live in a binder, in which case the notebooks are stored with output cleared. But often we also want a fully executed notebook to live in a static documentation site. Where should this execution happen? For many build systems, like dask’s, it happens in CI.

But this is not always ideal. Sometimes the binder environment can be quite complex, involving a big set of dependencies or customized access to resources (as in Pangeo’s binder). This makes it hard to recreate the proper build environment in CI.

I am proposing a tool, and associated bot, that uses the JupyterHub API to execute the notebooks within their own binder. Specifically, this tool would

  • Launch the repo’s binder via the JupyterHub API
  • Use the API to run each notebook
  • Download the executed notebooks out of the running binder
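A minimal sketch of what steps 1 and 3 might look like, assuming the public mybinder.org build API and the Jupyter Server contents API. The endpoint shapes are assumptions, and the hard parts (consuming the build event stream, running each notebook over the kernel websocket) are deliberately omitted:

```python
# Sketch only: constructs the endpoints the proposed tool would hit.
# Assumes mybinder.org's BinderHub build API and the Jupyter Server
# contents API; function names are illustrative, not a real tool.

BINDER_URL = "https://mybinder.org"


def launch_endpoint(owner: str, repo: str, ref: str = "master") -> str:
    """Step 1: the BinderHub endpoint that builds and launches a repo's
    binder.  The response is an event stream whose final event carries
    the running server's URL and an access token."""
    return f"{BINDER_URL}/build/gh/{owner}/{repo}/{ref}"


def notebook_endpoint(server_url: str, path: str, token: str) -> str:
    """Step 3: the Jupyter Server contents-API endpoint for downloading
    a (now executed) notebook out of the running binder."""
    return f"{server_url.rstrip('/')}/api/contents/{path}?token={token}"
```

Step 2 — actually executing each notebook — is the part that needs a kernel started via the REST API and cells fed over the kernel websocket, which is where the async programming comes in.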

In bot form, this service could watch a repo for changes to the master branch and, when it detects a change to a notebook, run this workflow and generate a PR to a rendered branch. That way the repo could contain both blank and rendered notebooks, with the bot keeping them in sync. This sort of continuous integration would also serve as a form of quality control, keeping a binder fresh and functional as its content evolves.
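The bot’s trigger logic is simple to state; a sketch, where the branch-naming convention is made up for illustration:

```python
def notebooks_to_rerender(changed_paths):
    """Given the file paths touched by a push to master, return the
    notebooks the bot should re-execute in the binder."""
    return [p for p in changed_paths if p.endswith(".ipynb")]


def rendered_branch_for(base_branch: str = "master") -> str:
    """Name of the branch the bot opens its PR against (this
    naming convention is hypothetical)."""
    return f"{base_branch}-rendered"
```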

I think such a tool could actually be the foundation of a notebook-based publication service, which has been discussed many times.

First Steps

I made some tiny progress on the API stuff at the Pangeo hackathon, but got stuck because I don’t know how to do async programming:
https://nbviewer.jupyter.org/gist/rabernat/378d0bf2c0896522e256cacdc5ced9ee

@yuvipanda also has some relevant examples of using the hub API in hubtraf:
https://github.com/yuvipanda/hubtraf/blob/6904b596576db8af4f3101ef6896fa81a8ab8e58/hubtraf/user.py

There may be some interest from @jsignell in working on this.

Keen to hear thoughts from the community.

Another starting point could be having something like repo2docker . papermill my-notebook.ipynb in your CI, which would take care of the complex environment part, run the notebooks, and leave you with rendered notebooks in your docs.

Combined with something like this in your postBuild to provide empty notebooks in the binder:

# clean the output from the notebooks
for d in *.ipynb; do
    python clean_notebook.py "$d";
done

(script clean_notebook.py).
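The repo2docker-in-CI idea above could be scripted. A sketch that just assembles the command line, assuming repo2docker’s convention of running a trailing command inside the built image (the helper name is made up):

```python
def repo2docker_papermill(repo_dir, notebook, output=None):
    """Build the repo's binder environment with repo2docker and run
    papermill inside it.  Returns the command list; the caller would
    pass it to subprocess.run(cmd, check=True) in a CI step."""
    # papermill <input> <output>; executing in place is fine for CI
    out = output or notebook
    return ["repo2docker", repo_dir, "papermill", notebook, out]
```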

@betatim - thanks for the suggestion! I agree that this would work for some scenarios.

With Pangeo, we are dealing with binder environments that process very big datasets (requiring lots of memory / CPU) and can do special things like launch dask clusters via kubernetes. So we really can’t use CI (at least freely available services like travis) for this purpose. That’s the use case I have in mind.


That makes sense. I had thought, because you were talking about docs/examples, that these would be things that only take seconds to run.

Sometime last year papermill added support for “execution engines” https://papermill.readthedocs.io/en/latest/extending-entry-points.html#developing-a-new-engine which was motivated by being able to run notebooks on Kaggle or a BinderHub. I made an initial attempt for Kaggle but that didn’t work out (their API was too limited). Maybe someone wants to pick that up again, as I don’t think anyone has tried to add a Binder engine.
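As I recall from the docs linked above, papermill discovers engines through a papermill.engine entry point. A hypothetical Binder engine package might declare something like this (package, module, and class names are all made up; this is a config fragment, not working code):

```python
# setup.py fragment for a hypothetical "papermill-binder" package
setup(
    name="papermill-binder",
    # ...
    entry_points={
        "papermill.engine": [
            # BinderEngine would subclass papermill's Engine class and
            # implement notebook execution against a BinderHub
            "binder = papermill_binder:BinderEngine",
        ],
    },
)
```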

This is also very related to Performing actions on the behalf of users, where the idea is that notebooks/scripts that live in a z2jh/pangeo-like setup can be scheduled to be executed, possibly via a project like papermillhub (still very much a WIP).


I do think that the ask here is probably much simpler than the JupyterHub issue, since we are not trying to know about user secrets or any local data.

Maybe it’d be helpful to try to think about what this would look like from the content-contributor’s perspective. Are you imagining that people will write a notebook on pangeo binder, then download it, check it into a fresh repo, and point to that repo from somewhere? That repo would then depend on the pangeo binder being in the same place and have no record of its own of what was used to run the notebook. Even if there is a record, there would be no validation of its correctness.

The fact that the pangeo binder “can do special things like launch dask clusters via kubernetes” is mildly worrying since this seems like it would make these artifacts less reproducible. I’d rather that things could be run just as well (at least in theory) locally as in binder, or anywhere else.

I’m also interested in what types of notebooks you are imagining people will add. Do you think they’ll look like papers and if so do you think it is common that an entire paper will be runnable without special HPC access and with all the data publicly available?