"reproducible" binder environments with repo2docker, dockerhub and nbgitpuller

Cautionary note: I’m not a Binder expert. Feel free to comment / discuss below, I’m happy to improve this post!

Edits: I took @betatim’s post below into account for certain points.

After I spent the last two days debugging code which broke because of updates in our dependencies (e.g. a bad mpl bug), it occurred to me that our MyBinder tutorials will be broken as well, because each new commit on a binder repository would trigger a new conda install, with broken packages in it.

Solution 1: pin your packages

The recommended way to deal with the issue is to pin packages in your environment.yml file. However, in conda this is easier said than done (we have quite a few dependencies). And, more problematically, package versions and their interdependencies are not guaranteed on conda-forge: a pinned file that worked one day might not work another day.

Solution 2: pin your MyBinder link

Another way to deal with the issue is to make a MyBinder link which points to a commit which works (e.g. before a package update which broke your notebooks). As long as MyBinder remembers that it has built this commit in the past, the image will work. The problem with this pin is that you can’t update your notebook content (a new commit triggers a build), and MyBinder doesn’t make any guarantee that they will store our images “forever” (the MyBinder image registry is a temporary cache, as explained below).
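
As a rough sketch, a pinned link is just the usual mybinder.org launch URL with a commit hash in place of a branch name. The org, repository and hash below are placeholders, not real projects.

```python
from urllib.parse import quote


def pinned_binder_url(org, repo, ref):
    """Return a mybinder.org launch URL for a GitHub repo at a fixed ref."""
    # A commit hash pins the link; a branch name would follow new commits
    # and therefore trigger new builds.
    parts = [quote(part, safe="") for part in (org, repo, ref)]
    return "https://mybinder.org/v2/gh/" + "/".join(parts)


# Placeholder values for illustration only.
print(pinned_binder_url("example-org", "example-binder", "0123abc"))
```

The link keeps working only as long as the image for that commit can still be served or rebuilt, which is the limitation described above.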

Solution 3: repo2docker, dockerhub and nbgitpuller

This solution is inspired by the separation of content and environment described here. We go one step further, with three repositories:

  • an environment description repository (example) with all the config files (env, postBuild, etc). This environment is built on Travis (at each commit and each week with a cron job) with repo2docker and pushed to dockerhub. Building on Travis lets us run tests (either on our dependencies or on our own content) and therefore ensures that the pushed images are “working”.
  • a Binder env repository (example) which does nothing other than pull from dockerhub with a Dockerfile. This way, you can tag the exact version of the image you want to build and don’t rely on any tool other than dockerhub to store your images. This is the repository which we link to on Binder. When Binder builds an image from it, it is going to pull from DockerHub, which is usually faster than a full repo2docker build but can be slow and might have drawbacks, as explained below.
  • one (or more) “content repositories” with the notebooks (or code) you would like to share on Binder (example). This content is pulled into your Binder envs with nbgitpuller (example documentation).
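
To make the three pieces concrete, here is a minimal sketch of how a Binder launch link combining the env repository and a content repository could be assembled. The repository names and the tag are hypothetical, and the double URL-encoding of the inner git-pull query is my understanding of what the nbgitpuller link generator produces, so compare the result against a link from the generator before relying on it.

```python
from urllib.parse import quote, urlencode


def binder_gitpull_url(env_repo, env_ref, content_repo_url, content_branch,
                       open_path=None):
    """Build a mybinder.org link that launches `env_repo` and pulls content.

    env_repo is "org/name" of the Binder env repository on GitHub;
    content_repo_url is the full git URL of the content repository.
    """
    # Query understood by nbgitpuller's git-pull handler.
    params = {"repo": content_repo_url, "branch": content_branch}
    if open_path:
        # Where to land after the pull, e.g. a notebook path (assumption).
        params["urlpath"] = open_path
    inner = "git-pull?" + urlencode(params)
    # The inner query is passed to Binder as one encoded `urlpath` value.
    return "https://mybinder.org/v2/gh/{}/{}?urlpath={}".format(
        env_repo, quote(env_ref, safe=""), quote(inner, safe="")
    )


# Hypothetical repositories, for illustration only.
print(binder_gitpull_url(
    "example-org/binder-env", "v1.0",
    "https://github.com/example-org/notebooks", "main",
))
```

Changing the content repository or branch only changes the query part of the link, so the environment image is not rebuilt.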

This set-up is of course more complex than a single “binder ready” repository. But there are a couple of advantages:

  • you can update the content (notebooks) frequently without triggering an environment build (which you only rarely want to do).
  • other people can use your environment with their own content (for example, Lizz improved and translated our notebooks into Spanish for a class).
  • you can add tests to your travis script, ensuring that the envs you are building work for you (edit: an undocumented - and better - alternative is explained below).
  • as long as dockerhub exists, reproducibility and a proper “time machine” is ensured. You can also store your images elsewhere if you want.
  • the actual reason why we did this in the first place is that we need the repo2docker images available on Dockerhub in order to use them for our jupyterhub.
  • this set-up is more flexible than pinning dependencies to a fixed version. Often, it is very hard to find a combination of packages that work together, and most of the time you actually want to update your dependencies. Pinning packages is a tedious process; downloading from dockerhub is arguably easier.

Thoughts?

Our cache of previously built images is exactly that: a cache. This means it could disappear at any moment. In the past we used to be more aggressive at emptying the cache, forcing every repository to be rebuilt. In practice this causes people a lot of pain because maintaining a setup that will build today, tomorrow, in 3, 6, 12 months is hard. We do have to clean out the registry periodically as it gets very large (last time we looked we deleted 50TB of images).

A cautionary note on using base images on docker hub

There is one definite disadvantage: pulling the image from dockerhub is slower than pulling from the registry hosted as part of each cluster (OVH and GKE). The second disadvantage is a hunch and hence unconfirmed: dockerhub probably throttles the bandwidth based on IP addresses. As all traffic from one of our BinderHub clusters appears to come from one IP, I’d bet docker hub will throttle us, making it even slower to pull the base image, especially if many people use this trick. I am not sure what we could do to help with this.

A note on knowing if your repo will build on mybinder.org:

People have success with running repo2docker . ./verify as part of their CI. The verify script is one that runs tests and/or other checks to decide if the image builds and “works”.

With this you can be pretty confident that building that particular commit within a reasonable window of days will succeed on mybinder.org (but the universe keeps changing so it might work now but in 5min it will have stopped working because someone somewhere did something).
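
For illustration, a verify script could be as small as the following Python sketch, which only checks that key packages import. The module list is a placeholder (standard-library names so the sketch runs anywhere); a real script would list your own dependencies and probably run a notebook or a test suite as well.

```python
#!/usr/bin/env python
"""Hypothetical ./verify script: exit non-zero if the image looks broken."""
import importlib
import sys

# Assumption: replace these with the packages your notebooks need.
REQUIRED_MODULES = ["json", "math"]


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # ImportError, or a crash at import time
            print("FAILED to import {}: {}".format(name, exc), file=sys.stderr)
            missing.append(name)
    return missing


if __name__ == "__main__":
    # A non-zero exit code is what makes the CI job fail.
    if missing_modules(REQUIRED_MODULES):
        sys.exit(1)
```

Catching every exception rather than only ImportError is deliberate: a broken binary dependency can fail in other ways at import time.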


An anecdote

The only way to know if a particular commit will continue to build with repo2docker is to regularly try to build it. As part of the repo2docker CI we build “frozen” versions of repositories to check we didn’t break things. Recently a repository that uses a pinned, ~12-18 months old version of matplotlib and numpy and Python 2 stopped building overnight because there was a new pre-release of numpy. Yes, this is crazy but what can you do. What we learnt is that the build had worked in the past because we were lucky, and with the recent pre-release of numpy our luck ran out and the bug that had been there all along surfaced.

This was a good learning experience. My impression is that 6 months is a time frame you can expect things to keep working (if you did a good job pinning dependencies). For things to still “just work” 12 months later you have to be lucky. 18 months later it is highly unlikely your repository will still build.

If you are happy with having a “big fat binary” then storing the docker image from day X is a good option. If you ever need to recreate or reproduce that big fat binary you need to exercise the build process regularly to make sure it keeps working. I think using travis cron jobs to run repo2docker --ref=<yourspecialref> . ./verify once a week is a good idea.
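
A hedged sketch of what such a weekly job could run when you care about several frozen refs. The helper names are hypothetical; the only repo2docker options used are the ones in the command above, and the `run` parameter exists just so the logic can be tried without Docker.

```python
import subprocess


def repo2docker_command(ref, repo=".", verify="./verify"):
    """Build the argv for: repo2docker --ref=<ref> <repo> <verify>."""
    return ["repo2docker", "--ref={}".format(ref), repo, verify]


def failing_refs(refs, run=subprocess.call):
    """Return the refs whose build-and-verify run exits non-zero.

    `run` takes an argv list and returns an exit code; by default this
    really invokes repo2docker, which must be installed.
    """
    return [ref for ref in refs if run(repo2docker_command(ref)) != 0]


if __name__ == "__main__":
    # Dry run with a fake runner that pretends every build succeeds.
    print(failing_refs(["<yourspecialref>"], run=lambda argv: 0))
```

In a real cron job you would drop the fake runner and fail the job when the returned list is not empty.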


Fast rebuilds when only content changes

We recently added a new feature to repo2docker where a rebuild of a previously built repository should be very fast, if you didn’t change any of the dependencies. This works as long as your environment.yml, requirements.txt or install.R do not refer to anything in the repo, e.g. no -e . in your requirements.txt. The easy way to test this is to run repo2docker --no-run some/local/repo, then edit the README in some/local/repo and run repo2docker --no-run some/local/repo again.
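
One way to check for the -e . case before relying on the fast path is a small heuristic scan of requirements.txt. This is only a sketch under my own assumption about which lines count as referring to the repo (editable installs of local paths, ./ or ../ paths, file: URLs); it is not what repo2docker itself does, and it ignores environment.yml and install.R.

```python
import re

# Heuristic: an optional editable flag followed by ".", "..", "./x",
# "../x" or a file: URL.
_LOCAL_REF = re.compile(r"^(-e\s+|--editable[=\s]+)?(\.{1,2}(/|$)|file:)")


def local_references(requirements_text):
    """Return requirement lines that appear to point into the repo itself."""
    hits = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and _LOCAL_REF.match(line):
            hits.append(line)
    return hits


example = """\
numpy==1.16
-e .
./vendored/pkg
# a comment
matplotlib
"""
print(local_references(example))
```

An empty result suggests, but does not prove, that a content-only change leaves the dependency files free of references into the repo.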

For this feature to be useful on a BinderHub some more work is needed; the approach to take is in this issue. As with everything: help is always welcome and will almost always reduce time-to-being-done :)

Thanks for the clarification @betatim!

I don’t really understand why this is a problem. In our binder repo we do pull from DockerHub, yes, but this is done by Binder only once to build the image, which is then stored in the cache. It needs to be done again when the cache is emptied, yes, or when a new build is required (e.g. when you need to tag a new base image on dockerhub). Or is there something else going on here?

I am not sure if the base image is added to our registry or if the registry only stores the additional layers. I would assume it only stores the additional layers. But I might be wrong. Someone should investigate.

This would make a big difference in terms of number of downloads from DockerHub, yes - an important point to consider.

I’m pretty sure all Docker registries store the entire stack of layers, since each layer is based on the previous one.

If you’ve got direct access to your registry this should be easy to verify: Pull an image locally from Docker Hub, make a small change, then push it to your registry. You should see all the layers of the base image being pushed to your registry, not just the small change you made.

I don’t have access to any other registry than DockerHub, but after some research it makes sense that each registry stores its own layers.

Did you ever discuss a strategy based on an LRU cache for the MyBinder registry?

We use something like LRU for the OVH cache. It is more of an “unpopular images” remover than LRU.
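
To illustrate the difference (this is a toy sketch, not the code that runs on the OVH cluster, and the image names, timestamps and thresholds are made up): LRU evicts whatever was used longest ago, while an “unpopular images” remover evicts by launch count, regardless of recency.

```python
def evict_lru(last_used, keep):
    """Keep the `keep` most recently used images; return the evicted ones.

    `last_used` maps image name -> timestamp of its most recent launch.
    """
    by_recency = sorted(last_used, key=last_used.get, reverse=True)
    return sorted(by_recency[keep:])


def evict_unpopular(launch_counts, min_launches):
    """Return images launched fewer than `min_launches` times."""
    return sorted(
        name for name, count in launch_counts.items() if count < min_launches
    )


# Made-up data: "a" is popular but was last used a while ago.
last_used = {"a": 100, "b": 200, "c": 50}
launch_counts = {"a": 50, "b": 1, "c": 2}

print(evict_lru(last_used, keep=1))        # evicts by recency only
print(evict_unpopular(launch_counts, 5))   # evicts by launch count only
```

With this data the LRU policy would drop the popular image "a" because "b" happened to be launched more recently, while the popularity-based policy keeps it.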

I think any strategy that is more complicated to explain than “it is a cache that might get reset at any moment” will not be practical. For example, you’d have to include a clause for “expire cache if security vulnerabilities happen”. So users would still have to plan for the event that the cache is empty without notice.

What pain point are you trying to solve for?

If this was changed to have special handling for a binder/Dockerfile so that it ignored the repository and only looked at the Dockerfile you no longer need to have separate repos for the env repository and the notebooks, nor do you need the additional complexity of nbgitpuller from

Instead you’d add a binder/Dockerfile containing just FROM example/dependencies:tag to all notebook repos and they’d all take advantage of the shared cached base image.

Alternatively, if the BinderHub registry could in some way also be a caching registry, and the build node pulled Docker Hub images via the BinderHub one, then you’d get the same benefits without special handling for binder/Dockerfile.

No pain point this time! I was asking out of curiosity. I’ve noticed changes recently (e.g. the switch to OVH+GKE, which generated two builds, one for each registry; then this got somehow improved I think, because I haven’t seen this happening recently), and I’m trying to understand how things work. I am learning a lot thanks to you!
