Embed binder-related metadata in notebook?

Apologies if this question has been asked before. I searched but didn’t come up with anything…

Consider a single notebook ipynb file, not necessarily part of any repo. Can this notebook tell me how to run itself in binder via notebook metadata?

Background

@choldgraf’s famous post:

opened my eyes to the possibilities of decoupling the binder image (which provides the execution environment) from the content, which can be pulled in via nbgitpuller. This concept now underlies Pangeo Gallery and binderbot. These tools launch notebooks in binders from the command line in an automated way. We generally need three pieces of data to specify the binder image:

parameter    description
binder_url   URL for the binder service in which to run the notebooks.
binder_repo  GitHub repository which contains the repo2docker environment configuration.
binder_ref   Branch, tag, or commit within binder_repo which contains the binder environment configuration.

We could eliminate the latter two if we could point directly to an appropriate Docker image tag, e.g. via

Currently we embed these three parameters in an ad-hoc config file next to the notebooks. But what if we could embed them directly in the notebook metadata? Then tools could launch notebooks directly into the specified binder.

Proposal

The notebook JSON specification allows for adding arbitrary JSON metadata to the notebook. All we need to do is standardize a convention for encoding this information. Here is one possibility:

{
  "metadata" : {
    "binder": {
      "binder_url": "https://binder.pangeo.io",
      "binder_repo": "pangeo-gallery/default-binder",
      "binder_ref": "master"
    }
  }
}

There are many details to consider here, and such a convention would need iteration and community input. But at its core, it’s simple enough that it should be doable without too much fuss.
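To make this concrete, here is a minimal sketch of a tool that consumes such metadata and builds an nbgitpuller launch link, assuming the convention above and nbgitpuller's git-pull query format (the content repo and branch arguments are hypothetical placeholders for wherever the notebook actually lives):

import json
import urllib.parse

def binder_launch_url(notebook_path, content_repo, content_branch="master"):
    """Build a Binder launch URL from a notebook's embedded binder metadata,
    pulling the notebook's content repo in via nbgitpuller at launch time."""
    with open(notebook_path) as f:
        meta = json.load(f)["metadata"]["binder"]
    # the execution environment is built from binder_repo at binder_ref
    base = f'{meta["binder_url"]}/v2/gh/{meta["binder_repo"]}/{meta["binder_ref"]}'
    # nbgitpuller's git-pull query string, percent-encoded into urlpath
    urlpath = "git-pull?" + urllib.parse.urlencode(
        {"repo": content_repo, "branch": content_branch}
    )
    return base + "?" + urllib.parse.urlencode({"urlpath": urlpath})

# e.g. binder_launch_url("analysis.ipynb", "https://github.com/some-org/some-content")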

Benefits

Implementing something like this would help the notebook sharing ecosystem. Specifically, a JupyterHub deployment (particularly a cloud-based one that uses repo2docker to build the environments) could be made aware of the repo / ref that were used to generate its environment and use these to automatically populate binder_repo and binder_ref in all notebooks saved by its users. These notebooks could then be stored anywhere: a repo, a gist, Dropbox, or any bespoke notebook-storing solution accessible over HTTP (e.g. @yuvipanda’s https://notebooksharing.space/). A simple tool could examine the notebook, get the parameters, and generate an appropriate nbgitpuller link to open the notebook in binder. This would allow us to move much more freely between hubs and binders, perhaps helping JupyterHub and BinderHub eventually converge, an idea already under discussion:

Downsides

Most users would probably just ignore this metadata, with no downsides. Malformed or incorrect binder metadata would lead to non-functional binders. And by hiding the environment details, we would potentially make users more ignorant about how their environments are put together. There are probably many other downsides I haven’t thought of.


This came up quite a long time ago in repo2docker:

I think Discourse is a better place to discuss this though, given its wider audience, so let’s continue the discussion here. For convenience I’ll paste @betatim’s comment from that issue below, as it has some good points:


Welp, conda-env (now part of conda proper, but separate at the time) has been able to do this for a long time, and despite the deprecation threat it still works as of the time of writing (4.10):

But that’s neither here nor there: having the actual content in the .ipynb, and executing it naively in Docker, is probably the worst of all possible worlds, as “change whitespace in a markdown cell” would mean “rebuild the whole damn container,” which sounds like no fun at all.

But if The Chosen .ipynb had a named set of “virtual” REES files in it, which got exploded in the build context and then applied by “normal means” (e.g. [[Dockerfile || [apt.txt, environment.yml || requirements.txt || ..., postBuild]], [start]]), this wouldn’t be… terrible.
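As a rough sketch of that “exploded in the build context” step (the metadata.rees key and its layout are hypothetical, not an existing repo2docker feature):

import json
import pathlib

def explode_rees(notebook_path, build_dir):
    """Write "virtual" REES files stored in notebook metadata out into a
    build context directory, where the normal buildpack precedence rules
    would pick them up."""
    nb = json.loads(pathlib.Path(notebook_path).read_text())
    # hypothetical convention: metadata.rees maps REES file names to contents
    rees_files = nb.get("metadata", {}).get("rees", {})
    build = pathlib.Path(build_dir)
    build.mkdir(parents=True, exist_ok=True)
    for name, contents in rees_files.items():
        (build / name).write_text(contents)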

From a UX perspective, I could see replicating the repo2docker docs (and the buildpack precedence rules) in an in-UI list of known REES file names, and then offering a helpful, snippet-enabled editor for each file added, rather than trying to shoehorn some knowledge of multiple files outside the notebook. There are a number of in-the-wild Language Servers, linters, grammars, etc. for (most) of these file formats.

The big win for reproducibility/performance: many of these files have techniques which could be used to tighten a “humane” spec, authored any-old-place, into something more reproducible (and therefore, cacheable) for the (default) linux-64 platform, e.g.

  • environment.yml → linux-64.conda-lock
  • FROM random/upstream:latest → FROM random/upstream@sha256:deadb33f

Making these easy to use (either in GUI or CLI) would probably lower the bar to getting performant, reproducible, easy-to-distribute artifacts that would be worth the ipynb JSON vs git headache.


Thanks a lot for pointing me to these existing conversations, and sorry for not doing my research better.

Both the repo2docker issue discussion and conda-env’s feature deal with trying to embed the actual REES inside the notebook. What I am proposing here is slightly different, and therefore possibly more lightweight and easier to implement: embedding a reference to an external binder environment. My thinking is that many notebooks will use the exact same environment, so it would be duplicative and inefficient to build a custom binder image for every notebook.

Do people think that idea is worth pursuing?


I think it is an interesting idea, because it gets around a lot of the struggles and problems of embedding the actual dependencies inside the notebook.

Within BinderHub there has been discussion of having a “quick launch” repo. The motivation for that is the observation that a lot of launches are “just give me a scratch pad”, that most notebooks can execute in “any reasonable environment”, and that if we know the repo ahead of time we can pre-launch a few copies of it, which will result in “instantaneous” launch times.

I see a few places where we could start work in parallel:

  1. Making a UI tool to add this metadata to a notebook. I think we need this to help people actually use this feature.
  2. Adding a URL handler to BinderHub where you provide a link to a notebook, which is then launched in a (preconfigured) repository; something like https://my-binderhub.org/v2/notebook?url=https://notebooksharing.space/some-notebook.ipynb (sketched below).

And once all this works we could extract the actual environment-repo from the notebook on launch.
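For item 2, since BinderHub is Tornado-based, the handler’s shape might look something like the following. This is a sketch only, not BinderHub’s actual code, and it assumes the metadata convention proposed above:

import json
from tornado import web
from tornado.httpclient import AsyncHTTPClient

class NotebookLaunchHandler(web.RequestHandler):
    """Hypothetical handler for /v2/notebook?url=<notebook-url>."""
    async def get(self):
        notebook_url = self.get_argument("url")
        # fetch the raw .ipynb and read its binder metadata
        response = await AsyncHTTPClient().fetch(notebook_url)
        meta = json.loads(response.body)["metadata"]["binder"]
        # hand off to the normal launch flow for the environment repo
        self.redirect(f'/v2/gh/{meta["binder_repo"]}/{meta["binder_ref"]}')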

For the metadata I’d propose something like

{
  "metadata" : {
    "binder": {
      "binder_url": "https://binder.pangeo.io",
      "environment_url": "https://github.com/pangeo-gallery/default-binder/branch/master",
    }
  }
}

The important point being that the repository is specified as a URL, so that you can feed it directly to an image build tool like repo2docker instead of breaking it up into components.
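For example, a build step could then be a thin wrapper over repo2docker’s command line (a sketch; the /branch/<ref> suffix parsing follows the hypothetical URL form above):

import subprocess

def build_environment(environment_url):
    """Split the proposed environment_url into repo + ref and hand it
    straight to repo2docker's CLI."""
    repo_url, _, ref = environment_url.partition("/branch/")
    cmd = ["jupyter-repo2docker", "--no-run"]
    if ref:
        cmd += ["--ref", ref]
    subprocess.run(cmd + [repo_url], check=True)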

What do you think?


I’ve been trying to find the relevant issues/PRs for the idea of “prelaunching images”. So far I can only find [Feature request] Prelaunching specified repositories · Issue #1167 · jupyterhub/binderhub · GitHub and the issues linked in it. I thought we had more :-/

Tim, I agree with everything you wrote! :heart: And I fully endorse the changes to the metadata specification. To me it seems like converging on this spec is an important prerequisite to the “work in parallel” implementation paths you outlined.

A UI tool would be very useful, agreed, and necessary for manually editing the metadata. But I think we should also provide a way for a notebook server to automatically populate this metadata on all the notebooks it saves. This would be important for the use case of making a cloud-based hub (or even a binderhub) create automatically-executable notebooks. It would greatly increase the number of people who would use the feature, since no explicit knowledge of environments would ever be required of the user.

Unfortunately I don’t understand the notebook server architecture well enough to know where to start implementing something like that. But from the perspective of a cloud jupyterhub, it would be nice if this were ultimately something we could configure via the hub helm chart.

In general the notebook server (e.g. JupyterLab) doesn’t know anything about its external environment. It’s just a process running somewhere, whether that’s a container in the case of BinderHub, or a normal user process on your local machine.

This means injecting the dependency information from outside. For example you could customise a spawner to pass the required information; this is what BinderHub already does to enable the buttons linking to the source repository and launch URL you can see on mybinder:

For example, the above environment variables are used here in this extension:

Based on that, I think you could write your own JupyterLab/notebook extension right now to add a button that injects that metadata into your notebooks (or maybe don’t bother with the button and add it automatically when the notebook is saved). If you bundle that extension in your external binder environment, then anyone launching a notebook in it should get the extension as well.
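For the save-time variant you wouldn’t even need a front-end extension: a contents-manager pre-save hook could stamp the metadata server-side. A minimal sketch (the BINDER_* variable names are assumptions about what the spawner injects; adjust to whatever your deployment actually sets):

# in jupyter_notebook_config.py (or jupyter_server_config.py)
import os

def stamp_binder_metadata(model, path, contents_manager, **kwargs):
    """Inject binder metadata into every notebook as it is saved."""
    if model.get("type") != "notebook":
        return
    metadata = model["content"].setdefault("metadata", {})
    metadata.setdefault("binder", {
        # assumed to be set by the spawner, as BinderHub does for its UI buttons
        "binder_url": os.environ.get("BINDER_LAUNCH_HOST", ""),
        "binder_repo": os.environ.get("BINDER_REPO_URL", ""),
        "binder_ref": os.environ.get("BINDER_REF_URL", ""),
    })

c.FileContentsManager.pre_save_hook = stamp_binder_metadata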

To prototype the launch, you could write a proof-of-concept command line utility that extracts the metadata from a notebook passed as a GitHub URL, makes a call to the BinderHub API to launch the image, and then pulls the notebook.
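The core of such a utility could lean on BinderHub’s /build endpoint, which streams build events until the server is ready (a sketch; error handling omitted):

import json
import requests

def wait_for_binder(binder_url, repo, ref):
    """Ask a BinderHub to build/launch an environment, returning the
    running server's URL and token once the build reports ready."""
    endpoint = f"{binder_url}/build/gh/{repo}/{ref}"
    with requests.get(endpoint, stream=True) as response:
        for line in response.iter_lines():
            if not line.startswith(b"data:"):
                continue  # the endpoint speaks Server-Sent Events
            event = json.loads(line[len(b"data:"):])
            if event.get("phase") == "ready":
                return event["url"], event["token"]
    raise RuntimeError("binder build never reached the ready phase")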


Sure, letting repo2docker be a black box and moving the concern up a layer to the launching application seems attractive for this use case, and lets REES stay “pure” in not inventing another config file.

Much like with formalizing the REES precedence rules, this really would just need a versioned spec (e.g. a JSON Schema) for someone (I’m not signing up, per se :blush: ) to start confidently building UX on top of it, as well as some ways to specify “soft” reproducibility requirements (e.g. a git or docker tag) and have them be “frozen” into something (e.g. a hash) that is more reproducible (and therefore cacheable).

Further, since this is operating at the process launching layer, adopting and splitting the nbgitpuller concerns of runtime configuration (e.g. jupyter_config.json) and content (e.g. some-notebook.ipynb) would probably make a lot of sense. While this could be done with a custom start, it seems like something that could be handled more directly.

So a final configuration might look like:

metadata:
  binder:
    version: 0
    launch:
      host: https://binder.pangeo.io
    image: # optionally
      host: https://registry.dockerhub.com
      path: a/built-container
      ref: master
      hash: abcd1234abcd1234abcd1234abcd1234
    build: # if image wasn't found, or has been yanked
      host: https://github.com
      path: foo/a-big-environment
      ref: master
      hash: abcd1234abcd1234abcd1234abcd1234 # added when "freezing"
    config: # optionally
      host: https://github.com
      path: bar/our-configuration
      ref: master
      copy:
         jupyter_config.json: $HOME/.jupyter/etc/
    run:
      host: https://github.com
      path: baz/my-content
      ref: master
      copy:
        "*.ipynb": $HOME/ 