We’ve had a few general conversations about this, and I’d love to hear the community’s feedback on this idea.
tl;dr: What do people think about trying to define a “specification for reproducible repositories” to standardize across projects and organizations?
Right now, repo2docker has no “official” specification for reproducible repositories. Instead, it uses heuristics (via Build Packs) to convert a repository’s structure into a Dockerfile (e.g. if it finds environment.yml, it generates Dockerfile lines that install conda and then install the environment described in that file). We have largely piggy-backed on pre-existing practices, and generally just “adopt a new file type” to grow the functionality of repo2docker.
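To make the "file found, Dockerfile lines emitted" heuristic a bit more concrete, here is a rough sketch in Python. This is not repo2docker's actual build pack code: the `HEURISTICS` table, the `sketch_dockerfile` function, the choice of base image, and the exact Dockerfile lines are all assumptions made up for illustration.

```python
from pathlib import Path

# Hypothetical mapping from "marker" files to the Dockerfile lines they trigger.
# Illustrative only; the real build packs are considerably more involved.
HEURISTICS = {
    "environment.yml": [
        "COPY environment.yml /tmp/environment.yml",
        "RUN conda env update -n base -f /tmp/environment.yml",
    ],
    "requirements.txt": [
        "COPY requirements.txt /tmp/requirements.txt",
        "RUN pip install -r /tmp/requirements.txt",
    ],
}


def sketch_dockerfile(repo_dir, base_image="continuumio/miniconda3"):
    """Return Dockerfile text built from whichever marker files exist in repo_dir."""
    repo = Path(repo_dir)
    lines = [f"FROM {base_image}"]  # assumed base image that already ships conda
    for filename, steps in HEURISTICS.items():
        # The heuristic: the mere presence of a file decides what gets installed.
        if (repo / filename).is_file():
            lines.extend(steps)
    return "\n".join(lines)


if __name__ == "__main__":
    # Inspect the current directory and print the resulting sketch.
    print(sketch_dockerfile("."))
```

The thing to notice is that the "specification" here lives implicitly in the keys of that table. Nothing outside the tool says which files matter or what they mean, which is the gap a shared spec would be meant to fill.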
As other organizations come to care more about reproducible data repositories (e.g. @ellisonbg and AWS), and as more projects build tools that help facilitate this (e.g. @craig-willis and the Whole Tale team), it may be useful to have a specification for a reproducible repository that isn’t strictly attached to the repo2docker build pack system. Such a spec could then evolve to capture other use-cases that repo2docker currently doesn’t cover (like necessary hardware, datasets, etc).
If there were a spec that the community (Jupyter or Binder or otherwise) could agree on, it would make it easier for users to move their repositories between different services and tools. If we don’t try to standardize on something, we risk people building “reproducible workflows” that only work on a particular cloud provider or service.
Curious to hear what people think about this!