Embed binder-related metadata in notebook?

Apologies if this question has been asked before. I searched but didn’t come up with anything…

Consider a single notebook ipynb file, not necessarily part of any repo. Can this notebook tell me how to run itself in binder via notebook metadata?

Background

@choldgraf’s famous post:

opened my eyes to the possibilities of decoupling the binder image (which provides the execution environment) from the content, which can be pulled in via nbgitpuller. This concept now underlies Pangeo Gallery and binderbot. These tools launch notebooks in binders from the command line in an automated way. We generally need three pieces of data to specify the binder image:

parameter    description
binder_url   URL for the binder service in which to run the notebooks.
binder_repo  GitHub repository which contains the repo2docker environment configuration.
binder_ref   Branch, tag, or commit within binder_repo which contains the binder environment configuration.

We could eliminate the latter two if we could point directly to an appropriate Docker image tag, e.g. via

Currently we embed these three parameters in an ad-hoc config file next to the notebooks. But what if we could embed them directly in the notebook metadata? Then tools could launch notebooks directly into the specified binder.

Proposal

The notebook JSON specification allows for adding arbitrary JSON metadata to the notebook. All we need to do is standardize a convention for encoding this information. Here is one possibility:

{
  "metadata" : {
    "binder": {
      "binder_url": "https://binder.pangeo.io",
      "binder_repo": "pangeo-gallery/default-binder",
      "binder_ref": "master"
    }
  }
}

There are many details to consider here, and such a convention would need iteration and community input. But at its core, it’s simple enough that it should be doable without too much fuss.
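To make this concrete, here is a minimal sketch of a tool that consumes such metadata and builds an nbgitpuller launch link, assuming the convention above and nbgitpuller's git-pull query format (the content repo and branch arguments are hypothetical placeholders for wherever the notebook actually lives):

import json
import urllib.parse

def binder_launch_url(notebook_path, content_repo, content_branch="master"):
    """Build a Binder launch URL from a notebook's embedded binder metadata,
    pulling the notebook's content repo in via nbgitpuller at launch time."""
    with open(notebook_path) as f:
        meta = json.load(f)["metadata"]["binder"]
    # the execution environment is built from binder_repo at binder_ref
    base = f'{meta["binder_url"]}/v2/gh/{meta["binder_repo"]}/{meta["binder_ref"]}'
    # nbgitpuller's git-pull query string, percent-encoded into urlpath
    urlpath = "git-pull?" + urllib.parse.urlencode(
        {"repo": content_repo, "branch": content_branch}
    )
    return base + "?" + urllib.parse.urlencode({"urlpath": urlpath})

# e.g. binder_launch_url("analysis.ipynb", "https://github.com/some-org/some-content")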

Benefits

Implementing something like this would help the notebook sharing ecosystem. Specifically, a JupyterHub deployment (particularly a cloud-based one that uses repo2docker to build the environments) could be made aware of the repo / ref that were used to generate its environment and use these to automatically populate binder_repo and binder_ref in all notebooks saved by its users. These notebooks could then be stored anywhere: a repo, a gist, Dropbox, or any bespoke notebook-storing solution accessible over HTTP (e.g. @yuvipanda’s https://notebooksharing.space/). A simple tool could examine the notebook, get the parameters, and generate an appropriate nbgitpuller link to open the notebook in binder. This would allow us to move much more freely between hubs and binders, perhaps helping JupyterHub and BinderHub eventually converge, an idea already under discussion:

Downsides

Most users would probably just ignore this metadata, with no downsides. Malformed or incorrect binder metadata would lead to non-functional binders. And by hiding the environment details, we would potentially make users more ignorant about how their environments are put together. There are probably many other downsides I haven’t thought of.


This came up quite a long time ago in repo2docker:

I think Discourse is a better place to discuss this though, given its wider audience, so let’s continue the discussion here. For convenience I’ll paste @betatim’s comment from that issue below, as it has some good points:


Welp, conda-env (now part of conda proper, but separate at the time) has been able to do this for a long time, and despite the deprecation threat it still works as of the time of writing (4.10):

But that’s neither here nor there: having the actual content in the .ipynb, and executing it naively in Docker, is probably the worst of all possible worlds, as “change whitespace in a markdown cell” would mean “rebuild the whole damn container,” which sounds like no fun at all.

But if The Chosen .ipynb had a named set of “virtual” REES files in it, which got exploded in the build context and then applied by “normal means” (e.g. [[Dockerfile || [apt.txt, environment.yml || requirements.txt || ..., postBuild]], [start]]), this wouldn’t be… terrible.
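As a rough sketch of that “exploded in the build context” step (the metadata.rees key and its layout are hypothetical, not an existing repo2docker feature):

import json
import pathlib

def explode_rees(notebook_path, build_dir):
    """Write "virtual" REES files stored in notebook metadata out into a
    build context directory, where the normal buildpack precedence rules
    would pick them up."""
    nb = json.loads(pathlib.Path(notebook_path).read_text())
    # hypothetical convention: metadata.rees maps REES file names to contents
    rees_files = nb.get("metadata", {}).get("rees", {})
    build = pathlib.Path(build_dir)
    build.mkdir(parents=True, exist_ok=True)
    for name, contents in rees_files.items():
        (build / name).write_text(contents)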

From a UX perspective, I could see replicating the repo2docker docs (and the buildpack precedence rules) in an in-UI list of known REES file names, and then offering a helpful, snippet-enabled editor for each file added, rather than trying to shoehorn some knowledge of multiple files outside the notebook. There are a number of in-the-wild Language Servers, linters, grammars, etc. for (most) of these file formats.

The big win for reproducibility/performance: many of these files have techniques which could be used to tighten a “humane” spec, authored any-old-place, into something more reproducible (and therefore, cacheable) for the (default) linux-64 platform, e.g.

  • environment.yml → linux-64.conda-lock
  • FROM random/upstream:latest → FROM random/upstream@sha256:deadb33f

Making these easy to use (either in GUI or CLI) would probably lower the bar to getting performant, reproducible, easy-to-distribute artifacts that would be worth the ipynb JSON vs git headache.


Thanks a lot for pointing me to these existing conversations, and sorry for not doing my research better.

Both the repo2docker issue discussion and conda-env’s feature deal with trying to embed the actual REES inside the notebook. What I am proposing here is slightly different, and therefore possibly more lightweight and easier to implement: embedding a reference to an external binder environment. My thinking is that many notebooks will use the exact same environment, so it would be duplicative and inefficient to build a custom binder image for every notebook.

Do people think that idea is worth pursuing?


I think it is an interesting idea, because it gets around a lot of the struggles and problems of embedding the actual dependencies inside the notebook.

Within BinderHub there has been discussion of having a “quick launch” repo. The motivation for that is the observation that a lot of launches are “just give me a scratch pad”, that most notebooks can execute in “any reasonable environment”, and that if we know the repo ahead of time we can pre-launch a few copies of it, which will result in “instantaneous” launch times.

I see a few places where we could start work in parallel:

  1. Making a UI tool to add this metadata to a notebook. I think we need this to help people actually use this feature.
  2. Adding a URL handler to BinderHub where you provide a link to a notebook, which is then launched in a (preconfigured) repository; something like https://my-binderhub.org/v2/notebook?url=https://notebooksharing.space/some-notebook.ipynb (sketched below).

And once all this works we could extract the actual environment-repo from the notebook on launch.
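For item 2, since BinderHub is Tornado-based, the handler’s shape might look something like the following. This is a sketch only, not BinderHub’s actual code, and it assumes the metadata convention proposed above:

import json
from tornado import web
from tornado.httpclient import AsyncHTTPClient

class NotebookLaunchHandler(web.RequestHandler):
    """Hypothetical handler for /v2/notebook?url=<notebook-url>."""
    async def get(self):
        notebook_url = self.get_argument("url")
        # fetch the raw .ipynb and read its binder metadata
        response = await AsyncHTTPClient().fetch(notebook_url)
        meta = json.loads(response.body)["metadata"]["binder"]
        # hand off to the normal launch flow for the environment repo
        self.redirect(f'/v2/gh/{meta["binder_repo"]}/{meta["binder_ref"]}')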

For the metadata I’d propose something like

{
  "metadata" : {
    "binder": {
      "binder_url": "https://binder.pangeo.io",
      "environment_url": "https://github.com/pangeo-gallery/default-binder/branch/master",
    }
  }
}

The important point being that the repository is specified as a URL, so that you can feed it directly to an image build tool like repo2docker instead of breaking it up into components.
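For example, a build step could then be a thin wrapper over repo2docker’s command line (a sketch; the /branch/<ref> suffix parsing follows the hypothetical URL form above):

import subprocess

def build_environment(environment_url):
    """Split the proposed environment_url into repo + ref and hand it
    straight to repo2docker's CLI."""
    repo_url, _, ref = environment_url.partition("/branch/")
    cmd = ["jupyter-repo2docker", "--no-run"]
    if ref:
        cmd += ["--ref", ref]
    subprocess.run(cmd + [repo_url], check=True)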

What do you think?


I’ve been trying to find the relevant issues/PRs for the idea of “prelaunching images”. So far I can only find [Feature request] Prelaunching specified repositories · Issue #1167 · jupyterhub/binderhub · GitHub and the issues linked in it. I thought we had more :-/

Tim, I agree with everything you wrote! :heart: And I fully endorse the changes to the metadata specification. To me it seems like converging on this spec is an important prerequisite to the “work in parallel” implementation paths you outlined.

A UI tool would be very useful, agreed, and necessary for manually editing the metadata. But I think we should also provide a way for a notebook server to automatically populate this metadata on all the notebooks it saves. This would be important for the use case of making a cloud-based hub (or even a binderhub) create automatically-executable notebooks. It would greatly increase the number of people who would use the feature, since no explicit knowledge of environments would ever be required of the user.

Unfortunately I don’t understand the notebook server architecture well enough to know where to start implementing something like that. But from the perspective of a cloud jupyterhub, it would be nice if this were ultimately something we could configure via the hub helm chart.

In general the notebook server (e.g. JupyterLab) doesn’t know anything about its external environment. It’s just a process running somewhere, whether that’s a container in the case of BinderHub, or a normal user process on your local machine.

This means injecting the dependency information from outside. For example you could customise a spawner to pass the required information; this is what BinderHub already does to enable the buttons linking to the source repository and launch URL you can see on mybinder:

For example, the above environment variables are used here in this extension:

Based on that, I think you could write your own JupyterLab/notebook extension right now to add a button that injects that metadata into your notebooks (or maybe don’t bother with the button and add it automatically when the notebook is saved). If you bundle that extension in your external binder environment, then anyone launching a notebook in it should get the extension as well.
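For the save-time variant you wouldn’t even need a front-end extension: a contents-manager pre-save hook could stamp the metadata server-side. A minimal sketch (the BINDER_* variable names are assumptions about what the spawner injects; adjust to whatever your deployment actually sets):

# in jupyter_notebook_config.py (or jupyter_server_config.py)
import os

def stamp_binder_metadata(model, path, contents_manager, **kwargs):
    """Inject binder metadata into every notebook as it is saved."""
    if model.get("type") != "notebook":
        return
    metadata = model["content"].setdefault("metadata", {})
    metadata.setdefault("binder", {
        # assumed to be set by the spawner, as BinderHub does for its UI buttons
        "binder_url": os.environ.get("BINDER_LAUNCH_HOST", ""),
        "binder_repo": os.environ.get("BINDER_REPO_URL", ""),
        "binder_ref": os.environ.get("BINDER_REF_URL", ""),
    })

c.FileContentsManager.pre_save_hook = stamp_binder_metadata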

To prototype the launch, you could write a proof-of-concept command line utility that extracts the metadata from a notebook passed as a GitHub URL, makes a call to the BinderHub API to launch the image, and then pulls the notebook.
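The core of such a utility could lean on BinderHub’s /build endpoint, which streams build events until the server is ready (a sketch; error handling omitted):

import json
import requests

def wait_for_binder(binder_url, repo, ref):
    """Ask a BinderHub to build/launch an environment, returning the
    running server's URL and token once the build reports ready."""
    endpoint = f"{binder_url}/build/gh/{repo}/{ref}"
    with requests.get(endpoint, stream=True) as response:
        for line in response.iter_lines():
            if not line.startswith(b"data:"):
                continue  # the endpoint speaks Server-Sent Events
            event = json.loads(line[len(b"data:"):])
            if event.get("phase") == "ready":
                return event["url"], event["token"]
    raise RuntimeError("binder build never reached the ready phase")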


Sure, letting repo2docker be a black box and moving the concern up a layer to the launching application seems attractive for this use case, and lets REES stay “pure” in not inventing another config file.

Much like with formalizing the REES precedence rules, this really would just need a versioned spec (e.g. a JSON Schema) for someone (I’m not signing up, per se :blush: ) to start confidently building UX on top of it, as well as some ways to specify “soft” reproducibility requirements (e.g. a git or docker tag) and have them be “frozen” into something (e.g. a hash) that is more reproducible (and therefore cacheable).

Further, since this is operating at the process launching layer, adopting and splitting the nbgitpuller concerns of runtime configuration (e.g. jupyter_config.json) and content (e.g. some-notebook.ipynb) would probably make a lot of sense. While this could be done with a custom start, it seems like something that could be handled more directly.

So a final configuration might look like:

metadata:
  binder:
    version: 0
    launch:
      host: https://binder.pangeo.io
    image: # optionally
      host: https://registry.dockerhub.com
      path: a/built-container
      ref: master
      hash: abcd1234abcd1234abcd1234abcd1234
    build: # if image wasn't found, or has been yanked
      host: https://github.com
      path: foo/a-big-environment
      ref: master
      hash: abcd1234abcd1234abcd1234abcd1234 # added when "freezing"
    config: # optionally
      host: https://github.com
      path: bar/our-configuration
      ref: master
      copy:
         jupyter_config.json: $HOME/.jupyter/etc/
    run:
      host: https://github.com
      path: baz/my-content
      ref: master
      copy:
        "*.ipynb": $HOME/ 