Allow for multiple, different dependencies per repository

rbavery · April 22, 2020, 9:36pm

Moving discussion from this github issue: Allow for multiple, different dependencies per repository · Issue #555 · jupyterhub/binderhub · GitHub

The Problem

Thanks for the clarification @kikocorreoso ! I think your best bet is to try and install a list of dependencies you’ll conceivably need for most posts. Binder will turn the environment into a Docker image, and installing extra dependencies that aren’t used all the time generally isn’t a big issue (unless the image is really big).

Projects/posts will inevitably have non-overlapping dependencies and may require different versions of the same package, especially (from my experience) where geospatial python packages are required. So a single environment.yml file won’t be able to serve all posts for our use case. Branches are tricky to use for our case where users would need to merge in posts but not the environment file, and there would need to be a branch for every post, which would be hard to organize/update/manage.

We (our research group at UCSB) are interested in spinning up a submission-based blog to showcase environmental data science projects/packages/viz. Ideally we’d like to be able to create binder links while specifying an (optional) environment or Dockerfile. This could either be relative to the root of a github repo or a URL to the file, doesn’t matter.

Possible Solution

I’m wondering if the create Binder page could have an optional “path to an environment file” argument, like the existing optional “path to a notebook file” argument. When supplied, and only then, it would stop Binder from looking for and selecting the repo’s own Dockerfile/environment.yaml/etc. and use the given file instead. I’d be interested in working on this if the maintainers are open to it.
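A minimal sketch of the lookup this proposal describes, assuming the override is a path relative to the repo root (the URL variant is not covered). The function name `pick_environment_file` and the candidate list are hypothetical, chosen for illustration; this is not BinderHub or repo2docker code and does not reproduce their real detection rules.

```python
from typing import Iterable, Optional

# Illustrative candidate list only; NOT repo2docker's actual detection logic.
DEFAULT_CANDIDATES = ("Dockerfile", "environment.yml", "requirements.txt")


def pick_environment_file(repo_files: Iterable[str],
                          env_path: Optional[str] = None) -> Optional[str]:
    """Return the environment file a build would use (hypothetical helper).

    If env_path is supplied, it overrides the default lookup entirely.
    Otherwise the first default candidate present in the repo is used.
    """
    files = set(repo_files)
    if env_path is not None:
        # The optional argument wins, but it must point at a real file.
        if env_path not in files:
            raise FileNotFoundError(f"no such environment file: {env_path}")
        return env_path
    for candidate in DEFAULT_CANDIDATES:
        if candidate in files:
            return candidate
    return None  # nothing found; the build would fall back to a default image
```

With a layout like this, each post could carry its own file (for example `posts/rasterio-demo/environment.yml`) while the repo-level file stays the default for links that do not pass the argument.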

Such a feature would enable us to host totally reproducible blog posts from a variety of different authors and handle different dependencies and package versions for each post.

Why alternatives aren’t optimal for code blogs with multiple, distinct dependencies

I appreciate the suggestions in the github issue on how to work around the “1 repo 1 environment” model. Unfortunately, they don’t suit our needs.

In a pinch, for users that write a script requiring some very specific dependency, you could also consider having them explicitly install that dependency at the top of the post. This could be informative for readers anyway, as often it’s useful to highlight things that are not part of the “standard scipy stack”. Think that’d work?

This would work for some cases but not others. For example, rasterio is difficult to install with pip, since it depends on hefty C libraries like GDAL that must be installed separately. Using conda is the easiest way to make these installs happen. Installing with conda from a notebook cell is possible, but not intuitive. It also delays the start of experimenting, since the packages must first be solved and downloaded:

# Install a conda package in the current Jupyter kernel
import sys
!conda install --yes --prefix {sys.prefix} numpy

I had to google the above snippet and found it in this helpful article: Installing Python Packages from a Jupyter Notebook | Pythonic Perambulations

But sometimes Linux system dependencies that conda can’t install are required, like tzdata for working with time series (I’ve personally run into this case).

In these cases, which I’ve found are not rare when you use domain specific scientific packages, it’d be very useful to be able to supply a path/url to a Dockerfile or environment.yml when creating the Binder link. Packages like rasterio and geopandas are widely used, but working code often depends on particular versions.

This would enable applications like fastpages or any other personal code blog to support totally reproducible blog posts, since each post would have a binder built with that post’s minimal environment. I think this would be a game changer for reproducible, self published science/analysis.

I’m eager to hear what other folks think! Thanks @manics for suggesting I post here.

betatim · April 23, 2020, 5:03am

(repost from the issue)

I think https://github.com/jupyterhub/binderhub/issues/555#issuecomment-390112357 remains the best compromise.

An alternative option is the idea of “binder boxes” or splitting the environment from the content. I did a bit of searching on the forum and Tip: embed custom github content in a Binder link with nbgitpuller was the best I could find. There are more threads to read though. For your use case you’d have several repos that define the environments (more than one but fewer than the number of blog posts) and then pull in the notebook that you want (per blog post) via nbgitpuller or the like.
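As a rough illustration of the “binder box” approach, the sketch below builds a Binder link that launches an environment repo and asks nbgitpuller to pull a notebook from a separate content repo. It assumes nbgitpuller is installed in the environment repo, that the content repo is cloned into a folder named after the repo, and that the classic `tree/` URL prefix is wanted. The function name `binder_box_link`, the repo names, and the default branch `main` are all made up for the example.

```python
from urllib.parse import urlencode


def binder_box_link(env_repo: str, content_repo_url: str, notebook_path: str,
                    env_ref: str = "HEAD", content_branch: str = "main",
                    binder_host: str = "https://mybinder.org") -> str:
    """Build a Binder link: environment from env_repo, notebook via nbgitpuller."""
    # Assumption: nbgitpuller clones into a directory named after the repo.
    repo_name = content_repo_url.rstrip("/").rsplit("/", 1)[-1]
    if repo_name.endswith(".git"):
        repo_name = repo_name[:-4]

    # Inner query, handled by nbgitpuller inside the running server.
    pull_query = urlencode({
        "repo": content_repo_url,
        "branch": content_branch,
        "urlpath": f"tree/{repo_name}/{notebook_path}",
    })

    # Outer query, handled by Binder; the inner query is encoded a second time.
    launch_query = urlencode({"urlpath": f"git-pull?{pull_query}"})
    return f"{binder_host}/v2/gh/{env_repo}/{env_ref}?{launch_query}"


# Hypothetical repos: one geospatial environment shared by several posts.
link = binder_box_link(
    env_repo="org/envs-geo",
    content_repo_url="https://github.com/org/blog-posts",
    notebook_path="posts/rasterio-demo.ipynb",
)
print(link)
```

Each blog post would then get its own link, with `env_repo` chosen from the small set of environment repos and `notebook_path` pointing at that post's notebook.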

I realise both are workarounds but I think they represent the best trade-off between maintenance complexity, usability and functionality.

Honourable mention of the Kaggle docker image which seems to have every single data science related library under the sun installed at the same time. A post about using it as a binder box. So maybe getting all libraries for all posts installed at the same time is physically possible. Though every time I look at the Kaggle image I wonder how it is possible that it works.

