Dataverse Community Meeting short talk?

I’m participating in a panel at the upcoming virtual Dataverse Community Meeting (https://projects.iq.harvard.edu/dcm2020/agenda) and we are looking for a guest speaker who would be interested in talking briefly (~10 minutes + Q&A) about the current Binder vision as it relates to preserving computational environments in research repositories.

Breakout Session: Encapsulation
https://projects.iq.harvard.edu/dcm2020/agenda
June 18th, 3-4:30pm ET
In this session, we will discuss what an adequate and sustainable way to deposit and store virtual containers and computational workflows looks like. This is a complicated issue: on the one hand, it is not economical to store whole Docker images in Dataverse; on the other hand, Dockerfiles are prone to errors. We will use the recent Dataverse developments described in “Advancing computational reproducibility in the Dataverse data repository platform” (arXiv:2005.02985) as a starting point for this session.

These short talks would touch on how they relate to:

  1. Computational reproducibility
  2. Preservation
  3. Discoverability and reuse (FAIR, etc.)
  4. Potential ties and applications in data repositories (such as Dataverse, Zenodo)

Sorry for the short notice. Let me know if interested.

I’d be interested.

Could you explain a bit about the format? Is the 10-minute slot a discussion or a short talk (with slides)?

I +1 Tim being interested :slight_smile:

Great, thanks @betatim (and @choldgraf).

We’re thinking of 10-minute short talks with slides from three different presenters that inform a broader group discussion. From my understanding, Community Meeting attendees are largely Dataverse curators or operators (there are 50+ installations worldwide), so the repository integration perspective is key.

Hi all,

I think that would be great. I recently saw this paper https://osf.io/fsd7t/, and I think it would be interesting to hear how these guidelines could be applied to research dissemination in a data repository. In particular, the general sentiment is that preserving the whole image is much better than preserving a Dockerfile, so it would be good to hear whether it is possible to achieve comparable outcomes with a ‘good’ Dockerfile. Also, could automatically generating a Dockerfile (with, for example, Code Ocean or Whole Tale), as opposed to having users write one, somehow “guarantee” reproducibility?

As Craig said, there will be attendees who probably won’t know much about containers so a general introduction would be good, but also this session is supposed to drive some of the future developments in Dataverse, so expert opinions are also welcome :slight_smile:

Hi again,

I forgot to mention - we need to confirm the talks ASAP so we can plan the agenda, so please let us know if you are still interested in this, and if we can count on you. Also, I apologize for the short notice.

I’d be happy to give a short talk.

Thanks for clarifying things on the format as well as ideas for the direction that the audience is interested in.

Could you explain a bit what “the repository integration perspective is key” means? What are we trying to integrate into a system like Dataverse? It feels like I am missing something because storing container images or Dockerfiles seems no different from storing other kinds of research data. (They are all just bits in the end.) You upload it to the repository, add metadata, done. The fact that this is a very simplistic opinion tells you (and me) that I am missing something :slight_smile:

Sorry about that, I should defer to Ana since she’s the organizer, but I can try to clarify what I meant.

  • What should a Dataverse developer, repository operator, or curator know about Binder to enable researchers to publish binders and support related capabilities?
  • Are there any ongoing or new developments that would be of interest to repository operators, developers, or curators trying to support publishing/archiving of binders?

Sure, it’s all bits, but they enable different features/capabilities and there are some assumptions, I think. A few thoughts (from my limited perspective):

  • Binder supports building and running images published to repositories like Dataverse and Zenodo, but relies on others to get them into the repository (and needs community contribution). For example, publishing to Zenodo uses the GitHub integration, but there’s currently no comparable way to publish to Dataverse. I think this is part of the project philosophy: it’s up to interested developers or users to figure out how to get their binders into the repository.
  • The focus seems to be primarily on the repo2docker configuration files with the assumption that the image would be rebuilt via repo2docker. Is it important to also support preserving the built image at the time of deposit (as in the tarball buildpack https://github.com/jupyter/repo2docker/pull/778)? Is there any other work/thinking in this direction?

That’s amazing! I’m looking forward to catching up with you, Tim. :slight_smile: All that Craig is saying is right, and here are the same ideas in a slightly different format:

So there is already an integration between Binder and Dataverse: you can launch an environment on mybinder.org if you have a Dataverse DOI. However, most of the files are not re-executable on Binder because Dataverse does not typically encourage capturing the environment (e.g., a requirements.txt or a Dockerfile). So one question is how we could make these existing replication packages executable (this is a hard question). And what should be incorporated into Dataverse so that they are executable in Binder?
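To make the “launch from a DOI” step concrete, here is a minimal sketch of how such a launch link could be assembled. It assumes the `/v2/dataverse/<DOI>/` URL layout on mybinder.org, which should be checked against the BinderHub documentation; the function name `binder_url_for_doi` and the DOI in the usage line are made-up placeholders.

```python
from urllib.parse import quote

BINDER_HOST = "https://mybinder.org"  # public BinderHub mentioned above


def binder_url_for_doi(doi: str, host: str = BINDER_HOST) -> str:
    """Build a Binder launch URL for a Dataverse dataset DOI.

    Assumption: the hub exposes Dataverse datasets under
    '/v2/dataverse/<DOI>/'. Verify this for your deployment.
    """
    doi = doi.strip()
    # Accept 'doi:10.x/...', 'https://doi.org/10.x/...', or a bare DOI.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Keep '/' separators; percent-encode any other unusual characters.
    return f"{host}/v2/dataverse/{quote(doi, safe='/')}/"


# Placeholder DOI, not a real dataset.
print(binder_url_for_doi("doi:10.1234/ABC/XYZ"))
```

A link like this only gets you a running environment; whether the deposited code actually executes still depends on the environment files being part of the dataset.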

Another way to look at this: hypothetically, Dataverse can choose to keep either whole images or Dockerfiles to capture the runtime environment. Dockerfiles can be volatile and nondeterministic: if software versions are not fixed (i.e., they always pull the latest), the environment is constructed on the fly and can differ between builds. So when it comes to storing bits, it’s better to store whole images, and Dataverse cares about preserving research and enabling reproducibility after some time has passed. Should Dataverse therefore immediately look into creating a Docker registry, or would it somehow be possible to improve Dockerfiles and still get good reproducibility by storing only them (e.g., if the Dockerfile is generated automatically rather than written by a user)?
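As a rough illustration of the “versions are not fixed” problem, here is a minimal sketch of a check a repository could run at deposit time. It is an assumption on my part, not an existing Dataverse or Binder feature: it only looks at a pip-style requirements file and flags every line that lacks an exact `==` pin. The helper name `unpinned_requirements` is hypothetical.

```python
import re
from typing import List

# A requirement counts as pinned only if it uses an exact '==' version
# with no wildcard, e.g. 'numpy==1.18.5' (optionally with extras).
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?\s*==\s*[^\s*]+$")


def unpinned_requirements(text: str) -> List[str]:
    """Return requirement lines that do not pin an exact version."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            loose.append(line)
    return loose


example = "numpy==1.18.5\npandas\nscipy>=1.4  # any newer\n"
print(unpinned_requirements(example))
```

Exact pins reduce drift but do not remove it (base images and system packages can still change), so a check like this would complement storing the built image rather than replace it.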

I hope this makes sense. These are some pointers about the direction we are thinking. But of course, these are also really hard questions and we don’t hope to get all the answers at this session :slight_smile:

Thanks a lot for these! I will think about them and see for which of these (and others) I can offer a useful opinion or experience!

See you online :smiley: