Creating a library of notebooks each being individually executable

johnjarmitage · June 12, 2020, 12:50pm

I want to create a sort-of library of notebooks within one GitHub repository. The notebooks will do very different tasks and each have their own Python environment. In fact they might not be exclusively Python based. It would be great if each had their own mybinder.org link, but from my understanding repo2docker works off the whole GitHub repo. I would instead want to have Docker images created from folders within the repository. Is such an idea possible?

manics · June 12, 2020, 2:17pm

It’s not currently supported. Here’s a related thread with some workarounds:

Shall we move the discussion there?

johnjarmitage · June 12, 2020, 3:54pm

With some colleagues, we might have a look at this over the weekend. I am totally not an expert in Docker and Binder, but the workarounds suggested in the previous discussion look like a good place to start. Thanks!

betatim · June 12, 2020, 7:01pm

Taking a step back: Why do you want to have separate environments on mybinder.org?

Depending on why you want to have different container images for the notebooks, the answer might be “you don’t actually want different images”. Let me explain.

Assuming three things:

the environments are not incompatible with each other (all dependencies can be installed in one env)
launching the image is more common than building the image (for every commit you have at least two launches)
you want fast launches for your users

Under those assumptions there are benefits to using the same container image for all of the notebooks on mybinder.org. The combined image might be larger than any individual image but it might still lead to faster launches for your users.

This is because different images can be assigned to different clusters, in which case each cluster will have to build the image. If all your launches use the same image, there is a good chance that all launch attempts get assigned to the same cluster (no rebuilding on first launch) and they might even all get launched on the same node (no transferring of the image from the registry to the node).

When you make a change to your repo and it needs re-building we try very hard to assign the re-build job to the same node in the same cluster that built the original image. This increases your chances of the build process reusing as many layers as possible through the magic of docker caching.

Both launches and re-builds rely on things being cached. We have large caches on the nodes, but eventually we do have to empty them. We try to empty them starting with the least recently used stuff. This is another reason to share images: you get the combined “oomph” of all launches, instead of spreading your launches across N images, which will make each of them look less popular.
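To make the least-recently-used argument a bit more concrete, here is a toy model in Python. It is only a sketch and not how mybinder.org actually manages its node caches: the cache capacity, the image names, and the launch pattern (one launch of ours followed by two launches of unrelated images) are all made up for illustration.

```python
from collections import OrderedDict


class LRUImageCache:
    """Toy least-recently-used cache of image names (hypothetical, not Binder code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._images = OrderedDict()  # oldest entry first, most recent last

    def launch(self, image):
        """Record a launch and return True if the image was already cached."""
        if image in self._images:
            self._images.move_to_end(image)  # mark as most recently used
            return True
        self._images[image] = True
        if len(self._images) > self.capacity:
            self._images.popitem(last=False)  # evict the least recently used
        return False


def simulate(own_images, rounds=12, capacity=3):
    """Count how many of our launches hit the cache.

    Each round launches one of our images (cycling through `own_images`)
    and then two unrelated, never-repeated images that compete for space.
    """
    cache = LRUImageCache(capacity)
    own_hits = 0
    other_id = 0
    for i in range(rounds):
        if cache.launch(own_images[i % len(own_images)]):
            own_hits += 1
        for _ in range(2):
            cache.launch(f"other-{other_id}")
            other_id += 1
    return own_hits


if __name__ == "__main__":
    print("shared image:", simulate(["library"]))
    print("four separate images:", simulate(["nb-a", "nb-b", "nb-c", "nb-d"]))
```

With these made-up numbers the single shared image is served from the cache on every launch after the first, while each of the four separate images has been evicted again by the time its turn comes round. Real traffic is messier, so treat this as intuition rather than a prediction.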

Of course all the caching and re-using is an optimisation. Nodes come and go, there might be other super popular images crowding you out of the caches, etc. However, the worst case is the same whether you share one image or use many, and sharing the image could give you an edge.

(I can construct counter examples in my head but I’d classify those as edge cases. For example, each individual image is super small and fast to build, but the combined one is huge and slow to build, despite consisting of instructions which individually are fast.)

So overall I’d take a step back and ponder why you want to split things. Maybe even do some experimenting to inject some data into all this hypothesising. One of my favourite quotes: “benchmarking gives me a leg up on all those who are too good to benchmark before optimising”.

johnjarmitage · June 12, 2020, 7:38pm

Yes. The group I am working with were divided on this. We want to develop a public library of notebooks where anyone can submit their work. Therefore the list of dependencies is to a degree unknown. But your points look spot on. The solution might be to try to anticipate all dependencies in the repository and then have individual installs within the first cell of the notebooks to catch what is not already there. I gave it a shot right now here https://github.com/johnjarmitage/notebook-library and it works nicely for my one example. The Friday night speculation is because the hackathon is tomorrow…
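For anyone curious what such a first cell could look like, here is a minimal sketch. It is not copied from the repository linked above; the helper names and the dependency list are hypothetical, and it assumes pip is available in the kernel's environment.

```python
import importlib.util
import subprocess
import sys


def missing_packages(requirements):
    """Return the pip names whose import name cannot be found.

    `requirements` maps a top-level import name to its pip name, since the
    two can differ (import name "sklearn", pip name "scikit-learn").
    """
    return [
        pip_name
        for import_name, pip_name in requirements.items()
        if importlib.util.find_spec(import_name) is None
    ]


def ensure_installed(requirements):
    """Install whatever is missing into the Python that runs this kernel."""
    to_install = missing_packages(requirements)
    if to_install:
        # sys.executable makes sure pip targets the kernel's own environment.
        subprocess.check_call([sys.executable, "-m", "pip", "install", *to_install])
    return to_install


# Hypothetical dependency list for one notebook in the library.
NOTEBOOK_REQUIREMENTS = {"numpy": "numpy", "sklearn": "scikit-learn"}

# In the first cell of the notebook you would then call:
# ensure_installed(NOTEBOOK_REQUIREMENTS)
```

Anything already baked into the shared image is skipped, so the cell only costs time for the packages that were not anticipated.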

betatim · June 12, 2020, 8:26pm

The “install it all” approach can take you a very long way. Here is every data-science package ever in one big docker image:

github.com

Kaggle/docker-python/blob/c703d3307cc576069678828693f7572412d81572/Dockerfile

# b/157908450 set to latest once numba 0.49.x fixes performance regression for datashader.
ARG BASE_TAG=m46
ARG TENSORFLOW_VERSION=2.2.0

FROM gcr.io/kaggle-images/python-tensorflow-whl:${TENSORFLOW_VERSION}-py37-2 as tensorflow_whl
FROM gcr.io/deeplearning-platform-release/base-cpu:${BASE_TAG}

ADD clean-layer.sh  /tmp/clean-layer.sh
ADD patches/nbconvert-extensions.tpl /opt/kaggle/nbconvert-extensions.tpl

# This is necessary for apt to access HTTPS sources
RUN apt-get update && \
    apt-get install apt-transport-https && \
    /tmp/clean-layer.sh

    # Use a fixed apt-get repo to stop intermittent failures due to flaky httpredir connections,
    # as described by Lionel Chan at http://stackoverflow.com/a/37426929/5881346
RUN sed -i "s/httpredir.debian.org/debian.uchicago.edu/" /etc/apt/sources.list && \
    apt-get update && \
    # Needed by vowpalwabbit & lightGBM (GPU build).

This file has been truncated.

(not advocating this, more of an item in the cabinet of curiosities)

And you can make it run on Binder if you want to:
