Creating a specification for reproducible repositories

We’ve had a few general conversations about this, and I’d love to hear the community’s feedback on this idea.

tl;dr: What do people think about trying to define a “specification for reproducible repositories” to standardize across projects and organizations?

Right now, repo2docker has no “official” specification for reproducible repositories. It has heuristics it uses (via Build Packs) to convert a repository’s structure into a Dockerfile (e.g. if it finds environment.yml, it converts that into Dockerfile lines that install conda and add the environment’s install steps). We have largely tried to piggy-back off of other pre-existing practices, and generally just “adopt a new file type” to grow the functionality of repo2docker.
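As a rough sketch of that heuristic (hypothetical code, not repo2docker’s actual Build Pack implementation — the file names are the real conventions, but the commands are paraphrased):

```python
from pathlib import Path

def detect_build_steps(repo):
    """Hypothetical sketch of repo2docker-style detection:
    map well-known files in a repository to environment-setup steps."""
    repo = Path(repo)
    steps = []
    if (repo / "environment.yml").exists():
        # conda environment file -> install conda and build the env from it
        steps.append("conda env update -f environment.yml")
    elif (repo / "requirements.txt").exists():
        # plain pip requirements
        steps.append("pip install -r requirements.txt")
    if (repo / "apt.txt").exists():
        # system packages, one per line
        steps.append("apt-get install -y $(cat apt.txt)")
    if (repo / "postBuild").exists():
        # arbitrary script run after the environment is built
        steps.append("./postBuild")
    return steps
```

The point of a written spec would be to pin down exactly this “file X implies action Y” mapping, independent of any one implementation.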

As other organizations care more about reproducible data repositories (e.g. @ellisonbg and AWS), or as more projects build tools that help facilitate this (e.g. @craig-willis and the Whole Tale team), it may be useful to have a specification for a reproducible repository that isn’t strictly attached to the repo2docker build pack system, and that can begin to evolve to capture other use-cases that repo2docker currently doesn’t cover (like necessary hardware, datasets, etc).

If there were a spec that the community (Jupyter or Binder or otherwise) could agree on, it would make it easier for users to move their repositories between different services and tools. If we don’t try to standardize on something, we may run the risk of people building “reproducible workflows” that only work on a particular cloud provider or service.

Curious to hear what people think about this!


I know a lot of people who would be interested in helping to create an open standard for this. It is exactly the type of thing that should be an open standard/spec – the true benefit is only realized if everyone starts to use it. Good news from the binder/repo2docker side of things: in the recent discussions I have been a part of, repo2docker was taken as an informal “starting point” for understanding what it would look like.

But I think it is much broader than binder, so if we want to gather a community of interested stakeholders, I would post this somewhere broader. Maybe we need an “Open Standards and Protocols” category? As an aside, I have been thinking that it would help us communicate and collaborate on standards/protocols to have a separate GitHub org for those things – to signal that repos in that org are considered to be official standards and protocols of the project. That is off topic here though…

How did the notebook spec / kernel protocol / etc process look? Was that a process done in a similar fashion? Perhaps this is something that the Binder Project can help push forward, though right now I don’t think we have the resources/cycles to push on something like this.

What would the standard look like for repo2docker? One of the choices we made is to reuse existing files (e.g. requirements.txt) instead of having our own. Would this be part of our standard then: “reuse existing files for specifying the dependencies; if you see a file called X, this means you need to do Y because the user wanted that installed”?

Basically writing down in words instead of code what repo2docker does (or should be doing)?

The other thing I think would be good to write up (new protocol?) is the format of the link that BinderHub accepts.

I’d think that we could start with codifying the repo2docker specification. E.g., something like

apt.txt
    - mypackage

requirements.txt
    - numpy

...

postBuild
    - blah blah

We could say that this specification applies either to a text file (JSON, yaml, whatever) or to a repository, in which case the keys/values will be based on filenames / contents of the files.

Once there’s a spec, we can extend it in more intentional ways. We could also do something totally different eventually, but as a start this would just formalize what we’re already using.

I am not sure I understand what you mean. You want to create a new file format that has as its content what you posted above? So a YAML format for a file, or also files in a repository?

The specification itself wouldn’t be a file format, but the definition of a structure that comprises the “reproducible repository specification”. I just wrote it as yaml above because it’s easy to parse. E.g., here’s the specification website for a brain imaging data structure. The spec is basically documentation saying “here’s how things should be organized”.

Once you have a spec, you could create a new file format that follows the specification (e.g. binder.yaml, and the structure of that file needs to match the specification structure defined by the community).
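To illustrate that file-vs-repository duality (a hypothetical sketch — `spec_from_repo` and the idea of `binder.yaml` carrying the same structure are made up for illustration):

```python
from pathlib import Path

def spec_from_repo(repo):
    """Hypothetical: derive the same key/value structure from a
    repository layout, where keys are filenames and values are the
    file contents. A binder.yaml could carry the identical structure
    as a single document."""
    repo = Path(repo)
    spec = {}
    for name in ("apt.txt", "requirements.txt", "postBuild"):
        path = repo / name
        if path.exists():
            spec[name] = path.read_text()
    return spec
```

Either representation (a directory of files, or one file) would then be a valid instance of the same community-defined spec.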

I’m just saying that right now, we have a specification, but it’s intermingled with a reference implementation in repo2docker. The suggestion was to make the specification explicit and not directly tied to repo2docker. Similar to how the Jupyter Notebook specification should be explicit and not tied directly to nbformat (which I think is what @ellisonbg was getting at about a separate specifications organization)

Ok, I think we are on the same page then. A description with words (and examples) of what is/we think is currently implemented in repo2docker.

I was thinking RFC style document which is why I was confused by the YAML.

Want to actually make it an RFC using the new RFC template in the JEP repo?

Maybe this is out of scope or has already been discussed, but would leveraging any of the work or thinking around Cloud Native Buildpacks apply here?


I knew repo2docker’s build packs were based on the build packs I know from Heroku, but I hadn’t connected Heroku and CNB together. I am off to do some reading now :slight_smile:

Slightly OT: maybe for future decisions in repo2docker we should look at how Cloud Native Buildpacks work and handle things, then copy as much as possible from them. This is the “the less code we have to maintain the better” part of me seeing an opportunity to retire repo2docker :older_man:.


Great resources @parente.

When we create standards for Jupyter, it would be nice to do so outside of the Linux Foundation. Open Source Initiative and others would be well worth a look.


Hi all - good discussion and pointers - thank you!

In agreement with Brian’s comment I also see the binder approach (requirements.txt, environment.yml, Dockerfile, …) as a practical starting point to find a description of computational environments for reproducible science.

A major advantage of MyBinder is that it is there and it can be used (and is used) - so we get some feedback on the usefulness of environment specification etc.

In that context, has anybody thought about how hardware requirements could be specified? One could imagine that a repository relies on a GPU for its computation. Or needs a certain amount of memory to execute.

I appreciate this is outside the current MyBinder instance - but I thought maybe somebody here knows about ongoing efforts in this direction. I think this will be relevant for emerging projects such as the European Open Science Cloud (EOSC).

I think we should ponder how to specify hardware resources. I also wonder how much value there is in being able to do it. We’d have to find some generic way of specifying things that works for the most popular kinds of hardware a repository can need. By doing so, we would not be able to specify “exotic” hardware needs, which (I guess) are the ones that need this most.

Specifying basic RAM and CPU is kinda not so interesting. At least, as someone who used to use shared compute resources, my experience is that most people request resources that are way too high (because it is hard to know in advance, and “just in case”). Given this experience, as an operator I wouldn’t trust users to make good choices here that could be used for scheduling/resource allocation. I’d probably still overcommit resources (assuming users won’t use what they asked for).

For more complex things, maybe GPUs, people will have a better idea of what they actually need. This is good but also not, because they will want a “Nvidia V100 Max Awesome”, not just a “GPU”. So we need a spec that is able to cope with such very specific requests.

So I think we should focus on covering simple hardware requests first, because that is much easier to do. We can punt things like “I need an FPGA for my code!” to the humans, with a bit of documentation that says: “we host three BinderHubs. One with CPUs, one with GPUs and one with FPGAs. You as repository author know which one you need, so please embed the corresponding link in your README.” Then we can see how well that works and figure out how to specify these “exotic” hardware requests.
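To make that “cover the simple cases, punt on the exotic ones” idea concrete, a spec could whitelist a few basic fields and explicitly reject everything else. This is entirely hypothetical — none of these field names exist in repo2docker or any Binder spec today:

```python
# Hypothetical hardware-requirements block and a minimal validator.
# Field names are invented for illustration only.
hardware = {
    "cpu": 2,          # cores
    "memory": "4G",
    "gpu": True,       # "I need *a* GPU", not a specific model
}

ALLOWED_KEYS = {"cpu", "memory", "gpu"}

def validate_hardware(spec):
    """Accept only the simple, generic fields; anything else is an
    'exotic' request that should be routed to a human instead."""
    unknown = set(spec) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"exotic hardware request, ask a human: {unknown}")
    return True
```

Rejecting unknown keys loudly (rather than ignoring them) keeps the failure mode visible while the spec is still small.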

This issue got me thinking:

What are the minimum base requirements you can define for the repo2docker spec to be useful? How tightly is the repo2docker spec coupled to whatever specifications are used for the underlying infrastructure? Would something like mybinder have to support multiple base images?

It seems reasonable to me to try keeping this relatively granular as a first shot. I agree w/ @betatim that things could get very messy if we tried to cover all the things that can change w/ hardware. I’m always a fan of the “we want the tool to be helpful, even if it doesn’t 100% solve everybody’s problems w/ reproducibility” approach.

I’d love to hear @craig-willis’s thoughts on this as I’m sure they’ve considered it in the wholetale world too

I think the key for me for repo2docker is the mission of “automating existing best practices” as much as possible, and not defining any new standards. I think we’ve missed the goal if we define a new format for reproducible environments, or if people are building repos that only work reliably when built with repo2docker. There should be nothing in repo2docker that doesn’t make sense entirely outside the repo2docker context. So to that end, I like @betatim’s proposal that the repo2docker “standard” be a documentation in words of what we do, i.e. “see requirements.txt, that means: 1. we need Python, default 3.7 (or runtime.txt); 2. pip install -r requirements.txt”.
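That requirements.txt rule, written as illustrative pseudo-spec logic (a sketch of the rule described above, not anything repo2docker actually contains):

```python
def steps_for_requirements_txt(runtime_txt=None):
    """Sketch of the rule: requirements.txt implies Python
    (default 3.7, unless runtime.txt overrides it) plus a pip install."""
    python_version = "3.7"
    if runtime_txt and runtime_txt.startswith("python-"):
        # e.g. a runtime.txt containing "python-3.8"
        python_version = runtime_txt.split("-", 1)[1]
    return [
        f"provide Python {python_version}",
        "pip install -r requirements.txt",
    ]
```

The spec document would be the prose version of functions like this, with the defaults and ambiguity-resolutions spelled out.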

I think an RFC-style JEP defining the spec we follow is a good idea, especially codifying repo2docker as an “automation of existing standards, here’s the currently supported list, and assumptions/defaults where the standards leave something ambiguous (e.g. default Python version)”.


+💯! Thanks for saying it and reminding us about this. Maybe we should codify this sentence so that we don’t (keep) forgetting about it or get tempted to create a new standard instead of following where others have gone before.

repo2docker - syntactic sugar for your reproducible repos (or something slightly more catchy) :slight_smile:

I think that’s an important point to re-emphasize - perhaps we should add a “guiding principles of the specification” section? Something like:

  1. Utilize pre-existing standards and specifications wherever possible
  2. Any implementation of the spec should be understandable by a human
  3. The specification should use tools that are standard in their respective communities
  4. The specification shouldn’t require any technology that is unique to repo2docker

I’m +1 on an RFC for this. If we wanna go that route, I think somebody should volunteer to be the “shepherd” that moves the conversation forward.

Is there anything that could go into the spec to avoid issues such as


Either regarding how requirements are specified, or perhaps expectations of external repositories that could be implemented in future?