Repo2Docker: make it easy to start from arbitrary docker image

I’ve been asked to start this discussion here after a Gitter chat with @betatim and @jhamman. My apologies if I’m re-opening existing discussions; I’m quite new to BinderHub (and other hubs).

Related GH issue: https://github.com/jupyter/repo2docker/issues/487

TL;DR

I would like it to be easy to start Binder and Repo2Docker from an arbitrary docker image, i.e. not only for “expert users”.

Context

I am a scientist, not a computer scientist.

I’m maintaining a MyBinder repo which runs nicely (oggm-edu). But I’d like to build on this in order to offer a real computing environment, i.e. by setting up our own hub with more resources and customizing it following the recommendations of the Pangeo project.

As a first step, I would like to start MyBinder from an existing docker image (that I am happy to modify in order to fit Repo2Docker’s needs). This docker image (and its daily tags) provides all the packages I need to run my glacier model in a reproducible way. We use these docker images intensively for CI and testing, but most importantly on our cluster via Singularity. We can now provide a DockerHub image tag alongside our scientific publications, which is great.

What I expect from this

There are several things I’d like to improve by making repo2docker start from our own image:

  • currently, our environment build is complex and large (many dependencies). The conda environment became so large and was breaking so often that I now install everything via pip instead (I went from a ~4.5 GB image size to 3.5 GB, which could be reduced further if Repo2Docker didn’t install conda by default). This results in messy build files and is silly because we already have a working environment that we control on DockerHub.
  • the default behavior of MyBinder is to rebuild everything after each commit to the repo. In practice, we change the content (the notebooks) very often, but almost never change the computing environment. Each time we update the notebooks, I anxiously watch the logs, expecting that something is not going to install properly. This is silly, because we already have an environment on DockerHub that we know is working.
  • in order to achieve reproducibility, I would have to pin all the packages in apt, requirements, etc. in the GitHub repository. I know that MyBinder allows opening previously built images via commit hashes, but what if I’d like to run the latest notebook changes on the latest computing environment that worked? Here again, building from a pinned, existing docker image would allow changing the analysis workflow (notebooks) while still keeping a frozen computing environment in the background.
  • since we can build from a fully functional docker image, the installation process of Repo2Docker would be much quicker (I assume).

Isn’t this possible already?

I guess so: (minimal-dockerfile example). This example however does not show how to add the repository files to the image, i.e. I can’t apply it myself (or maybe I missed something). So possibly this requires some documentation changes on the Binder side, or some help for people like me.

Thanks for making Jupyter, and thanks for your help on this!


Thanks for the post! I agree this is an important discussion

One quick idea for your comments regarding separating the content from the environment: a hacky solution is to store your notebook files separately from the repository that defines your environment. The Binder links you create will point to the “environment repository”, and as a part of that repository you include (either via start or a Dockerfile) code to pull in the latest version of the “notebooks repository”. That way you can change the notebooks repository without changing the environment repository, and Binder wouldn’t trigger a re-build. It’s a bit hacky, but I just wanted to throw the idea out there in case it’s helpful.

Thanks! I understand how this could indeed make parts of the process less painful.

The downside is that a couple of things won’t work properly: starting from a notebook path might not work, or at least the notebooks won’t show up in the nbviewer frame, right?


Do you have a link to the Dockerfile you are currently using?

The mybinder.org docs contain a section on preparing your Dockerfile that shows what you need to add to an existing one to make it work with mybinder.org.

The minimal Dockerfile example is more about what is the absolute bare minimum, so not the place to start from for “adjusting an existing dockerfile”.


If you are writing and maintaining a working Dockerfile already because that provides the best trade-off between simplicity and power I think for the purposes of repo2docker we’d consider you an expert. Letting people pick their own base image is technically easy but very quickly leads to situations where it is just as hard to debug why it isn’t working compared to writing a Dockerfile.

I don’t think it’s that hacky at all :grinning: It describes the relationship between a standard environment and potentially multiple notebook repositories. This is useful when you’ve got several people collaborating on a project and you don’t want everyone installing their own custom packages and ending up with several disjoint repos.

You could consider linking the repos in the other direction too. Have your main Dockerfile in its own repo, and tag the built image. Then in your notebook repos have a one-line Dockerfile that just references your main docker image:
FROM dockerhubuser/environment:1.2.3

This means you can have multiple notebook repos using the same base image.


@manics I wonder if this workflow should be a blog post :slight_smile:


I should add that I’m not maintaining our Dockerfile, our HPC engineer is. But I couldn’t convince him to get enthusiastic about MyBinder yet, so here I am trying to get things to run.

Yes: https://github.com/OGGM/OGGM-Docker/blob/1ba50cc8e7293405029f7dac2fab6bfc714180b8/python37/Dockerfile
Hub: Docker

I think I showed this to our IT guy and he said he’s basically not going to do it for security reasons and other stuff I can’t remember. This is why I thought that building on the existing container and letting Repo2Docker handle the nasty details would be the best thing to do.

I’ll try to contact him again about meeting the requirements. But are you saying that the kind of thing I’m asking for (building a Repo2Docker-compatible image from our docker container) is not possible? Because I could convince him to use our minimal build as a base and add the layers MyBinder needs on top of it, and if he can do this then I wonder why it can’t be automated by Repo2Docker (sorry for my ignorance about these things).

Right now repo2docker starts from a fixed base image and adds things to it. It will also build any arbitrary Dockerfile, provided that the resulting docker image does a few things when it is launched and has various things in the places BinderHub expects them to be.

You could create a Dockerfile in the repository that you want to launch on a BinderHub that inherits from an existing docker image and adds the steps needed for the final image to work on a BinderHub. Something like the following sketch of a Dockerfile maybe:

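FROM somewhere/base-image:v1234

# Notebook user, its UID, and its home directory
ENV NB_USER=jovyan
ENV NB_UID=1000
ENV HOME=/home/${NB_USER}

# Create the unprivileged notebook user
RUN adduser --disabled-password \
    --gecos "Default user" \
    --uid ${NB_UID} \
    ${NB_USER}
# Install the notebook server
RUN pip install --no-cache-dir notebook==5.*

# Copy the repository contents into the home directory
# and give the notebook user ownership of them
COPY . ${HOME}
USER root
RUN chown -R ${NB_UID} ${HOME}
USER ${NB_USER}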

However what exactly needs to be written here depends on the base image you are starting from. It needs a human brain to think about it for a few seconds or minutes. For example maybe in one base image it is pip3 and not pip or something else. This is why it is hard to build on top of arbitrary base images.

The automatic steps repo2docker performs make assumptions about what the world (base image) looks like when you execute them. Once we allow arbitrary base images these assumptions might not be true any more and so things will break. To be able to debug that breakage you need a deep understanding of the base image, what repo2docker wants to achieve, etc. At this point you are probably quicker to write your own Dockerfile that conforms to what BinderHub expects. To make matters worse: there aren’t that many people who need to do what you need to do, so there aren’t many people working on enabling it. The fact that it is a hard problem to solve doesn’t help, unfortunately.


As a side note, ignoring for a moment that situations are more complex than just “is it technically possible”: I am pretty sure you could produce a docker image that is equivalent to the one you linked based on current functionality of repo2docker (install APT and pip packages). That image might be larger than what you have right now but would make running on a BinderHub easier. Where the sweet spot in this trade-off lies is something you need to decide. It will always be a trade-off.


Thanks @betatim, I understand. For now I will go for the workaround of separating content from computing environment, which will already be much better. For later and a possible deployment on pangeo.io I will follow their lead and the discussion on https://github.com/jupyter/repo2docker/issues/487

@manics sorry I let this sleep for a while. Now I’m keen on trying this out.

For my understanding: if I want to be able to do FROM dockerhubuser/environment:1.2.3, I’ll have to push my repo2docker image manually, right? Or can I let mybinder build the image for me and find it afterwards? If yes, where do I find it?

Pushing the image to a registry should work. I don’t know whether you could retrieve the image built by mybinder. @betatim?

You can’t access the images that mybinder.org builds as they are in a private registry only accessible from within the mybinder.org cluster.

Just occurred to me: if you are doing this to speed up launch times I think it won’t work because we will have to pull the base image from docker hub (slower than our internal registry). It will however speed up build times.

I think I saw something about a docker hub mirror hosted inside Google’s cloud or some such. Something to look into.

My main goal is to not rebuild our computing environment each time we update the educational notebooks (see my original post above for more motivations).

What I did yesterday:

My experience so far:

  • build time is a bit faster (because we don’t have to install everything), but not much faster (because pulling the image from docker hub is slow, and then binder still pushes a heavy image to the mybinder private registry)
  • I’m still stuck with the problem that the image which is then used by mybinder does not contain the notebooks of the content repository, which is of course a problem
  • it will still trigger a repo2docker build each time we correct a typo in a notebook (it’s not that I care too much about the 10 minutes it takes, it’s more the general impression of wasted resources that bothers me)

So, basically, I’m back to @choldgraf’s original idea (March 18), which is to have a start script in the “environment repo” which pulls the “content repo” with the notebooks in it. I’ll try that out and report back.

You need to add an instruction to the Dockerfile to copy over the contents of the repo.
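
A minimal sketch of what that could look like in the content repo’s Dockerfile. It assumes the environment image (the placeholder name dockerhubuser/environment:1.2.3 used earlier in this thread) already provides the notebook server and a jovyan user with UID 1000 and home directory /home/jovyan; if it does not, the user-creation and pip install steps from the earlier sketch are needed as well.

```dockerfile
# Placeholder image name and tag; replace with your own pinned environment image.
FROM dockerhubuser/environment:1.2.3

# Assumption: the base image already defines this user (UID 1000) and home directory.
ENV NB_USER=jovyan \
    NB_UID=1000 \
    HOME=/home/jovyan

# Copy the contents of this repository (the notebooks) into the image,
# then hand ownership to the notebook user.
USER root
COPY . ${HOME}
RUN chown -R ${NB_UID} ${HOME}

# Run as the unprivileged user, starting in the home directory.
USER ${NB_USER}
WORKDIR ${HOME}
```

Only the COPY and chown layers change when a notebook changes, so the heavy environment layers come from the pinned image and are not rebuilt.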

An alternative to having a start script pull in content is to have a tag in the repo and link to that (instead of master), then move the tag once a day or when you have accumulated “enough changes”.

Would nbgitpuller work for you? https://github.com/jupyterhub/nbgitpuller/blob/30df8d548078c58665ce0ae920308f991122abe3/README.md

If all your content repos use your standard Docker image with no additions you could use your environment-only repo as the mybinder repo but include nbgitpuller parameters in all mybinder links to automatically clone the required repo.
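
For this to work the environment image needs nbgitpuller installed. A rough sketch of the extra line in the environment repo’s Dockerfile, assuming pip is the right installer in your base image (the image name is a placeholder, and the other steps mybinder.org needs are left out here):

```dockerfile
# Placeholder base image; use your own environment image here.
FROM somewhere/base-image:v1234

# Assumption: pip belongs to the Python installation that runs the notebook server.
RUN pip install --no-cache-dir nbgitpuller
```

The Binder link then points at the environment repo and carries a urlpath parameter that starts with git-pull?repo= followed by the (URL-encoded) address of the content repo. Please check the nbgitpuller documentation for the exact link format rather than relying on this description.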
