Generated Dockerfile vs. repo2docker as archival format

I wanted to follow up on part of the discussion in repo2docker#533 about using the generated Dockerfile as a long-term archival format when publishing to a research repository (e.g., the Dataverse network, DataONE member repositories, etc.). In earlier discussions about the Odum CoRe2 project, the Odum team mentioned using repo2docker to generate a Dockerfile that would be published with the dataset in Harvard’s Dataverse. The Whole Tale project was considering adopting a similar approach. @yuvipanda pointed me to the documentation and repo2docker#202, noting that the generated Dockerfile is for debugging purposes only and can’t be used to regenerate the image.

I’d be curious to hear any thoughts on the merits of using a generated Dockerfile versus repo2docker itself as a long-term archival format (for examples of archival formats, see something like NARA). My initial thinking is that the Dockerfile would be preferable, since it is widely used (i.e., a de facto standard) and documented, can be understood without knowing any additional structure, and requires only Docker to regenerate an image. That said, I could be convinced that repo2docker itself is equally acceptable. Instead of including the generated Dockerfile, I would likely include some sort of system-generated script or README instructing users how to run via repo2docker and linking them to documentation to understand the configuration/dependencies.

Thanks for bringing this up, @craig-willis!

The way I think of it is:

  1. repo2docker reads files that are already ‘standard’ in their respective ecosystems - such as requirements.txt or environment.yml (or just a Dockerfile) - wherever possible. This is really the biggest thing you can do for long-term archival: depend on standards that might have multiple implementations.
  2. The few things that are repo2docker-specific (apt.txt, postBuild, start) should be simple to replicate, and documented (see the sketch after this list).
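
To make that concrete, here is a minimal sketch of what such a repository might carry; the package names and file contents below are made up for illustration, not taken from any real project:

```bash
# Sketch of a repo2docker-compatible repository: the whole environment lives in
# a few small text files. Package names/versions below are illustrative only.

# Standard file understood by pip (and plenty of other tooling):
cat > requirements.txt <<'EOF'
numpy==1.15.4
pandas==0.23.4
EOF

# repo2docker-specific, but trivial to document: one Debian package per line.
cat > apt.txt <<'EOF'
libgdal-dev
EOF

# repo2docker-specific: a script run once after the environment is assembled.
cat > postBuild <<'EOF'
#!/bin/bash
jupyter nbextension enable --py widgetsnbextension
EOF
chmod +x postBuild
```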

I think the way forward here is to define a formal repo2docker standard that can be pointed to from an archive. repo2docker itself is then one particular implementation that ties these other standards together.

This is separate from adding a system-generated script, which you can still do independently. I think the best way to do that is to specify a ‘docker run’ command that will use the correct version of repo2docker to produce your final image - this way the user’s only dependency is Docker, so it’s similar to just using a Dockerfile.
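
Something along these lines, for example (a sketch only - the image name, tag, mount paths, and target image name are assumptions, not a documented recipe):

```bash
# Build the archived repository with a pinned repo2docker that itself runs
# inside Docker, so the future user's only dependency is Docker.
# The repo2docker image/tag, mount paths, and image name are illustrative.
docker run --rm \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v "$PWD":/srv/repo \
  jupyter/repo2docker:0.7.0 \
  jupyter-repo2docker --no-run --image-name archived-analysis:v1 /srv/repo
```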

Related to that is jupyterhub/repo2docker#490 (“Add support for pinning repo2docker version”), though the current idea for how to pin versions means you will need access to old builds of repo2docker so that you can run them.

Maybe a question to consider is what the files are being stored for. The Dockerfile that repo2docker outputs currently can’t be fed straight into docker build, but a human can inspect it and make (very) educated guesses as to what was meant to happen or how to transform it. In that situation I would say preserving this Dockerfile (as a Dockerfile.log of sorts?) is cheap, useful, and adds value compared to storing the binary image.
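
(For anyone following along, that debug Dockerfile is the one you can dump without building anything, roughly as sketched below - I haven’t checked which output stream it lands on for any particular release, hence capturing both.)

```bash
# Dump the generated (debug-only) Dockerfile without building an image, then
# keep it in the archive next to the code and data. The output file name is
# arbitrary; both stdout and stderr are captured since the Dockerfile is
# emitted via repo2docker's logging.
jupyter-repo2docker --no-build --debug ./my-archived-repo > Dockerfile.log 2>&1
```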

Even if the Dockerfile would work with docker build today, I don’t expect it would still work in 10 years’ time. It would probably require some fixing up, so why not start with one that is known to need some fixing up?

Thank you both for the great feedback.

I’ve been thinking about how to respond to this and the Odum AJPS use case seems like a good illustration.

A researcher submits a quantitative paper to the American Journal of Political Science. After peer review, the paper is provisionally accepted pending replication of the analysis. Today, the researcher uploads their data and code to Harvard’s Dataverse, then communicates with a curator/statistician at Odum who runs the code and confirms the results are as expected. Once this process is complete, the paper is accepted and the data/code published in Dataverse. After publication, a future researcher re-runs the analysis, either for replication or for education purposes. Today, the system requirements are typically captured in a README file. I’m not part of the Odum CoRe2 project, but my understanding is that they are planning to use repo2docker to improve this workflow: the researcher and curator/statistician would collaborate on the creation of a repo (Git or GitLab). One idea was to publish the Dockerfile itself along with the code and data as the record of the replicated environment.

My view is that archiving the Dockerfile/repo2docker config serves the following purposes:

  1. As a record of the computational environment at the time of publication/peer review (“it worked under these settings on this date”)
  2. To support re-building and re-running the image, if the software/systems are still around and supported. This reduces barriers to replicating/reproducing the study over time (<3 years)
  3. To support understanding the required environment without necessarily re-building or if the software/systems are no longer supported (10 years later)

A tangential point: I think there’s an opportunity to work with the archival community to define a set of practices for repositories that want to support this capability. Odum is an early adopter.

Agreed, and this is a great benefit of using repo2docker. I’d also add runtime.txt and the in-progress repo2docker.version file, which also serves as a sort of proxy for the base OS version.
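
For concreteness, these are one-line pins, something like the sketch below (values are examples only; the format of the proposed repo2docker.version file was still under discussion in #490, so that line is purely hypothetical):

```bash
# runtime.txt pins the language runtime; repo2docker understands e.g. an R
# version plus MRAN snapshot date, or a Python interpreter version.
echo "r-3.5-2018-12-01" > runtime.txt
# For a Python project it would instead look like:
# echo "python-3.6" > runtime.txt

# Proposed (not yet implemented) pin of the repo2docker release itself;
# the file name and contents here are hypothetical.
echo "0.7.0" > repo2docker.version
```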

In repo2docker#170 there’s discussion of encouraging users to pin versions (e.g., via pip freeze) for the reproducibility-minded. For the Odum workflow, I might recommend creating runtime.txt, running pip freeze (or equivalent), specifying versions in apt.txt, and including the TBD repo2docker.version file. Maybe another option would be to have a utility that “freezes” the repo2docker environment, generating the files recommended for long-term reproducibility. Until recently, I’ve been thinking of the generated Dockerfile loosely as this: a sort of repo2docker.lock.
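
A minimal sketch of what that freeze step might look like, under the assumptions above (this is not an existing repo2docker command, just ordinary tooling strung together):

```bash
# Hypothetical "freeze" for the Odum workflow: capture exact versions of the
# environment that was just replicated so the archive records what actually ran.
pip freeze > requirements.txt   # pin exact Python package versions
python -c 'import sys; print("python-%d.%d" % sys.version_info[:2])' > runtime.txt
# apt.txt versions and the TBD repo2docker.version file would be pinned by hand
# (or by the same utility), as described above.
```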