Possible to unpack Zenodo .zip files from manually preserved Zenodo archives?

Hi. I’m not sure if this is something that falls under BinderHub’s control, or if this is something that I should talk to the Zenodo devs about, so I thought I’d ask here first:

Is it possible to have Binder detect if the Zenodo .zip archive it has access to has been properly unpacked?

Binder has the fantastic feature of being able to create Binder images from a Zenodo DOI. Example: Binder for DOI.

However, this nice functionality will only unpack the Zenodo archive if the archive was made with the Zenodo GitHub importer. If, instead, the archive was manually uploaded to Zenodo (like mplhep: bridging Matplotlib and HEP from PyHEP 2020 DOI), then the resulting valid Zenodo DOI cannot be used to create a useful Binder image.
Instead, the built Binder image will launch the Jupyter server with the archive zip file being the only file in the server top level directory, as opposed to being unpacked like the importer method.


Is this the intended behavior? If so, can someone point me to where I can better understand why? If not, can this get fixed so that all valid Zenodo DOIs get treated the same (happy to help if possible/useful)? If this behavior could be changed that would be pretty huge, as it would also ensure that past physics workshops like PyHEP 2020 could have most of the projects in their Zenodo community collection be runnable on Binder far into the future, given Zenodo’s archive stability.

For clarity, example links are repeated here:

  • Working as expected launching into archive’s contents (Zenodo entry generated by importer tool): Binder
  • Not working as expected (entry manually uploaded to Zenodo): Binder

I agree it should be possible for Binder to handle manually created Zenodo records, though the burden then lies with the author to make sure everything works - that is a lot easier if the GitHub repo that the Zenodo record is based on can be tested on BinderHub.

The code starting at this line in repo2docker/zenodo.py (at commit bbb88aceb8316957b6f697d907d2ffc1d8d57c8f of jupyterhub/repo2docker on GitHub) would have to be changed. First, the record type might not have to be limited to “software”, and second, the extraction of the ZIP archive needs to be implemented (not sure whether this would be caught later in the process).
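To make the second point concrete, here is a minimal sketch of what the extraction step could look like, using only the Python standard library. The helper name `unpack_if_zip` is hypothetical and is not existing repo2docker code; how it would be wired into the Zenodo content provider is left open.

```python
import zipfile
from pathlib import Path


def unpack_if_zip(archive_path, output_dir):
    """Extract archive_path into output_dir if it is a ZIP file.

    Hypothetical helper for illustration only.
    Returns True if the file was extracted, False if it was left untouched.
    """
    archive_path = Path(archive_path)
    if not zipfile.is_zipfile(archive_path):
        # Not a ZIP archive (e.g. a plain notebook or PDF): leave it as-is.
        return False
    with zipfile.ZipFile(archive_path) as zf:
        # extractall() creates missing directories and strips absolute
        # paths and ".." components from member names.
        zf.extractall(output_dir)
    return True
```

With something like this, a manually uploaded record whose only file is a ZIP would end up with its contents in the top-level directory, the same as a record created by the GitHub importer.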

There is a caveat though: if a Zenodo record is created from a GitHub repository, there is an implicit size limit, which manually uploaded records do not have. Maybe we need a check here to ensure that we’re not fetching 50 GB Zenodo records? Should the BinderHub operator be able to set a limit here?
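A rough sketch of such a check, assuming the record metadata has already been fetched and each file entry is a dict carrying a `size` field in bytes (that shape is an assumption here, as are the function name `check_record_size` and the `max_bytes` setting; none of this is an existing repo2docker or BinderHub option):

```python
def check_record_size(files, max_bytes):
    """Return the total size of a record's files, or raise if over the limit.

    `files` is assumed to be a list of dicts with a "size" entry in bytes.
    `max_bytes` stands in for a hypothetical operator-configured limit.
    """
    total = sum(int(entry["size"]) for entry in files)
    if total > max_bytes:
        raise ValueError(
            f"record is {total} bytes, which exceeds the limit of {max_bytes} bytes"
        )
    return total
```

Running the check before any download starts would let a BinderHub refuse an oversized record without fetching it first.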


that is a lot easier if the GitHub repo that the Zenodo record is based on can be tested on BinderHub.

Yeah, I 100% agree that using the Zenodo GitHub importer is the way to go here. Though when trying to make a “reproducible workshop” like the PyHEP series, we’ve found that while most presenters are willing to follow the steps, some just get busy and never finish (just like how conference proceedings can linger), and we’d still like to include them in the Zenodo community for that year’s workshop.

The code starting in this line repo2docker/zenodo.py at bbb88aceb8316957b6f697d907d2ffc1d8d57c8f · jupyterhub/repo2docker · GitHub would have to be changed. First, the record type might not have to be limited to “software”, and second, the extraction of the ZIP archive needs to be implemented (not sure this would be caught later in the process).

Cool! This is a great starting point. Is there a technical reason why it would need to be limited to “software” though? The PyHEP 2020 Workshop Zenodo collection has all of the archives as “presentation”.

Maybe we need a check here to ensure that we’re not fetching 50GB large Zenodo records? Should the BinderHub operator be able to set a limit here?

Yeah this definitely seems reasonable / a good idea. Having BinderHubs be able to place limits on the size of the archive it is trying to containerize makes a lot of sense to me.

I don’t know why there is the limit for “software”; maybe it’s simply to filter for repos that are likely from GitHub?

@betatim it seems you introduced that check in the first implementation, in the commit “Add basic Zenodo content provider” (jupyterhub/repo2docker@dce6c1e on GitHub) - do you recall why you added that check?