Repo2docker builds don't seem to use docker layers?

When I add a single package to, say, requirements.txt, repo2docker seems to redo all the Docker layers in the build, which takes forever.

I’d expect it to be able to reuse layers before the one that does the pip install with the requirements.txt.

Is that the expected behavior? If so, any ideas what might be causing it to fail and rebuild everything every time?

I’m using the --editable flag, if that’s relevant.

The requirements.txt file (and all other files) need to be copied into the image, which will invalidate the cached layers after that copy.

In theory it would be possible to rewrite repo2docker to break every script into many smaller scripts, and to only copy the relevant file(s) when needed. The downside would be more layers (maybe 100?) and an increased maintenance burden.
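To make the trade-off concrete, here is a toy model of layer caching. It is only a sketch and not repo2docker's or Docker's actual code: it assumes each layer's cache key depends on the previous layer's key, the instruction text, and the contents of any files that step copies in. The two build plans and the file contents are made up for illustration.

```python
import hashlib


def layer_keys(steps, files):
    """Return one cache key per build step.

    steps: list of (instruction, [names of files copied in by this step])
    files: dict mapping file name -> file content
    """
    keys = []
    parent = ""
    for instruction, copied in steps:
        h = hashlib.sha256()
        h.update(parent.encode())        # a layer depends on everything before it
        h.update(instruction.encode())
        for name in sorted(copied):      # ...and on the content of copied files
            h.update(name.encode())
            h.update(files[name].encode())
        parent = h.hexdigest()
        keys.append(parent)
    return keys


def reused_layers(steps, before, after):
    """Count the leading layers whose keys are identical in both builds."""
    count = 0
    for old, new in zip(layer_keys(steps, before), layer_keys(steps, after)):
        if old != new:
            break
        count += 1
    return count


# Hypothetical plan 1: copy the whole repo up front, then run the installs.
copy_all_first = [
    ("COPY . /repo", ["apt.txt", "requirements.txt", "notebook.ipynb"]),
    ("RUN apt-get install (from apt.txt)", []),
    ("RUN pip install -r requirements.txt", []),
]

# Hypothetical plan 2: copy each config file only right before it is used.
copy_when_needed = [
    ("COPY apt.txt", ["apt.txt"]),
    ("RUN apt-get install (from apt.txt)", []),
    ("COPY requirements.txt", ["requirements.txt"]),
    ("RUN pip install -r requirements.txt", []),
    ("COPY . /repo", ["apt.txt", "requirements.txt", "notebook.ipynb"]),
]

before = {"apt.txt": "git\n", "requirements.txt": "numpy\n", "notebook.ipynb": "{}"}
after = {**before, "requirements.txt": "numpy\npandas\n"}

print(reused_layers(copy_all_first, before, after))    # 0
print(reused_layers(copy_when_needed, before, after))  # 2
```

In this model, adding one line to requirements.txt reuses nothing under the first plan, while the second plan keeps the apt layers at the cost of more steps to maintain.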

You’re probably better off developing your environment outside repo2docker, or perhaps running pip install .... inside a terminal after launching a repo with r2d, working out your dependencies, and only updating requirements.txt at the end.

Ok, thanks, that helps. I think I’ve been doing this wrong for years :)

My new approach will be to fire up a basic r2d machine (with --editable) and customize the config files there. I guess that means that I need to know the commands to update the image once I add a package to requirements.txt or apt.txt etc. Not too difficult :). That said, apt-get seems hard to run from inside the container the way that r2d does it. Perhaps some of the logic for using the r2d config files could be extracted and made usable from inside a container.

Today I ran into an issue where the underlying problem was an incompatibility with Python 3.10, so that would require backing out of the container and re-building.

FWIW, I’d imagine that reusable layers would really help with Binder build costs and time, but I’m just waving my hands.

The few bespoke layers for apt, conda, and pip are a bit finicky, but --editable breaks everything, because r2d doesn’t (and really, shouldn’t) try to predict whether what changed in a commit will change how that package-in-repo would behave when installed.

To control the Python version and a bunch of non-editable PyPI installs, the “least special file” is probably a ./.binder/environment.yml that uses the #/dependencies/pip key… but as soon as it uses a -e . for the package-in-repo, the same condition described above must apply. A ./.binder/runtime.txt might work this way as well, but I’ve pretty much avoided messing with that… these get resolved, over time, so you might as well put it all in one file.

Given the above, a ./.binder/postBuild can come along after the “base” environment is set up and run pip install -e ., which will always be executed per-commit, but won’t invalidate the environment.yml/runtime.txt or requirements.txt layers.

A finer-grained concept might be a postBuild.d/00-do-this and 01-do-that, executed in some sane order, but to get any benefit, each script would have to declare the (globs of) files that need to be COPYed, which is painful to generalize without resorting to some new file describing these relationships. But at this point, it’s re-invented things like make and multi-stage FROM builds, in a way that won’t really work anywhere else… runtime.txt and postBuild are already as far off the beaten path as it probably should go, given that relatively closed-form, portable solutions like environment.yml exist.
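For a sense of what that “new file describing these relationships” would have to encode, here is a rough sketch of the bookkeeping. Nothing like this exists in repo2docker as far as this thread says; the step names come from the example above, while the scripts, glob patterns, and file contents are invented. Each step declares the globs it depends on, and a step is rerun only if its script or one of its matching files changed (plus every step after it, since later layers build on earlier ones).

```python
import fnmatch
import hashlib


def step_key(script, patterns, files):
    """Hash a step's script together with only the files it declares."""
    h = hashlib.sha256(script.encode())
    for name in sorted(files):
        # fnmatchcase: "*" also matches "/" here, so "src/*" covers subfolders
        if any(fnmatch.fnmatchcase(name, p) for p in patterns):
            h.update(name.encode())
            h.update(files[name].encode())
    return h.hexdigest()


def steps_to_rerun(steps, before, after):
    """Return the first step whose inputs changed and every step after it."""
    names = [name for name, _, _ in steps]
    for i, (_, script, patterns) in enumerate(steps):
        if step_key(script, patterns, before) != step_key(script, patterns, after):
            return names[i:]
    return []


# Hypothetical declarations: (step name, script, globs of files to COPY)
steps = [
    ("00-do-this", "pip install -r requirements.txt", ["requirements.txt"]),
    ("01-do-that", "pip install -e .", ["setup.py", "src/*"]),
]

before = {
    "requirements.txt": "numpy\n",
    "setup.py": "from setuptools import setup; setup()\n",
    "src/pkg/__init__.py": "",
    "README.md": "hello\n",
}

readme_edit = {**before, "README.md": "hello again\n"}
source_edit = {**before, "src/pkg/__init__.py": "x = 1\n"}
deps_edit = {**before, "requirements.txt": "numpy\npandas\n"}

print(steps_to_rerun(steps, before, readme_edit))  # []
print(steps_to_rerun(steps, before, source_edit))  # ['01-do-that']
print(steps_to_rerun(steps, before, deps_edit))    # ['00-do-this', '01-do-that']
```

Even in this toy form, the declarations are essentially a dependency graph of targets and prerequisites, which is the sense in which it ends up re-inventing make.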
