Possible to unpack Zenodo .zip files from manually preserved Zenodo archives?

matthewfeickert · June 30, 2021, 7:24am

Hi. I’m not sure whether this falls under BinderHub’s control or is something I should raise with the Zenodo devs, so I thought I’d ask here first:

Is it possible to have Binder detect if the Zenodo .zip archive it has access to has been properly unpacked?

Binder has the fantastic feature of being able to create Binder images from a Zenodo DOI.

However, this nice functionality will only unpack the Zenodo archive if the archive was made with the Zenodo GitHub importer. If instead the archive was manually uploaded to Zenodo (like mplhep: bridging Matplotlib and HEP from PyHEP 2020), then the resulting valid Zenodo DOI cannot be used to create a useful Binder image.
Instead, the built Binder image will launch the Jupyter server with the archive zip file as the only file in the server’s top-level directory, rather than unpacked as it is with the importer method.

Is this the intended behavior? If so, can someone point me to where I can better understand why? If not, can this get fixed so that all valid Zenodo DOIs get treated the same (happy to help if possible/useful)? Changing this behavior would be pretty huge, as it would also ensure that most of the projects in the Zenodo community collections of past physics workshops like PyHEP 2020 stay runnable on Binder far into the future, given Zenodo’s archive stability.

For clarity, example links are repeated here:

Working as expected launching into archive’s contents (Zenodo entry generated by importer tool): Binder
Not working as expected (entry manually uploaded to Zenodo): Binder

nuest · June 30, 2021, 8:02am

I agree it should be possible for Binder to handle manually created Zenodo records, though the burden then lies with the author to make sure everything works. That is a lot easier if the GitHub repo that the Zenodo record is based on can be tested on BinderHub.

The code starting at this line in repo2docker/zenodo.py (jupyterhub/repo2docker at commit bbb88aceb8316957b6f697d907d2ffc1d8d57c8f) would have to be changed. First, the record type might not have to be limited to “software”, and second, the extraction of the ZIP archive needs to be implemented (not sure this would be caught later in the process).
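As a rough illustration of the second point, here is a minimal sketch of the "unpack a lone ZIP" step using only the Python standard library. It is not repo2docker's actual code: the function name `unpack_single_zip` and the choice to delete the archive after extraction are assumptions made for this example.

```python
import zipfile
from pathlib import Path


def unpack_single_zip(directory):
    """If `directory` holds exactly one file and it is a ZIP, extract it in place.

    Hypothetical helper, not part of repo2docker.
    Returns True when an archive was unpacked, False otherwise.
    """
    directory = Path(directory)
    entries = list(directory.iterdir())
    # Only handle the case from this thread: a single archive and nothing else.
    if len(entries) != 1:
        return False
    archive = entries[0]
    if not archive.is_file() or not zipfile.is_zipfile(archive):
        return False
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(directory)
    # Assumption: drop the archive so only its contents remain in the directory.
    archive.unlink()
    return True
```

Directories that contain several files, or a single non-ZIP file, are left untouched, so records that already unpack correctly would not be affected by a step like this.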

There is a caveat though: if a Zenodo record is created from a GitHub repository, there is an implicit size limit. Maybe we need a check here to ensure that we’re not fetching 50GB large Zenodo records? Should the BinderHub operator be able to set a limit here?
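A size check of this kind could be sketched as below. This assumes the record metadata is available as a dict with a `files` list whose entries carry a `size` in bytes; that shape should be verified against the actual Zenodo API response. The names `total_record_size`, `check_size_limit`, and `max_bytes` are hypothetical, not existing BinderHub or repo2docker settings.

```python
def total_record_size(record):
    """Sum the sizes (in bytes) of all files listed in a record's metadata.

    Assumes `record` looks like {"files": [{"size": <int>}, ...]}.
    """
    return sum(f.get("size", 0) for f in record.get("files", []))


def check_size_limit(record, max_bytes):
    """Raise ValueError if the record's files exceed `max_bytes` in total.

    `max_bytes` stands in for a limit an operator might configure.
    Returns the total size when the record is within the limit.
    """
    total = total_record_size(record)
    if total > max_bytes:
        raise ValueError(
            f"record is {total} bytes, over the configured limit of {max_bytes} bytes"
        )
    return total
```

Running the check on the metadata before any file is downloaded would let an oversized record be rejected without fetching it.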

matthewfeickert · July 1, 2021, 9:49pm

that is a lot easier if the GitHub repo that the Zenodo record is based on can be tested on BinderHub.

Yeah, I 100% agree that using the Zenodo GitHub importer is the way to go here. Though when trying to make a “reproducible workshop” like the PyHEP series, we’ve found that while most presenters are willing to follow the steps, some just get busy and never finish (just like how conference proceedings can linger), but we’d still like to include them in the Zenodo community for that year’s workshop.

The code starting at this line in repo2docker/zenodo.py (jupyterhub/repo2docker at commit bbb88aceb8316957b6f697d907d2ffc1d8d57c8f) would have to be changed. First, the record type might not have to be limited to “software”, and second, the extraction of the ZIP archive needs to be implemented (not sure this would be caught later in the process).

Cool! This is a great starting point. Is there a technical reason why it would need to be limited to “software” though? The PyHEP 2020 Workshop Zenodo collection has all of the archives as “presentation”.

Maybe we need a check here to ensure that we’re not fetching 50GB large Zenodo records? Should the BinderHub operator be able to set a limit here?

Yeah this definitely seems reasonable / a good idea. Having BinderHubs be able to place limits on the size of the archive it is trying to containerize makes a lot of sense to me.

nuest · July 2, 2021, 1:20pm

I don’t know why there is the limit to “software”. Maybe it’s simply to filter for repos that are likely from GitHub?

@betatim it seems you introduced that check in the first implementation, in commit dce6c1e of jupyterhub/repo2docker (“Add basic Zenodo content provider”). Do you recall why you added it?

Anton_Akhmerov · October 22, 2021, 10:05am

Just to report back here: Pull Request #1043 in jupyterhub/repo2docker, “always unpack a single zenodo zip” by akhmerov, addressed this feature.

matthewfeickert · October 22, 2021, 2:47pm

Fantastic. Thank you for implementing that!
