Making open data sources accessible in JupyterLab

Hi, as part of a project we are involved in at GWDG, we want to make various data sources easily accessible from within JupyterLab, and I am curious if anybody else is working on something similar. Such data sources could be open access sources (like census data, weather data, Kaggle datasets, …) but also research data repositories, or a cloud storage folder.

We’ve found that one speed bump when using Jupyter is just getting data files into the system. Notebooks often have a step where you ‘wget’ files from some server, or an admin has to copy the data to a shared directory.
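
To make that concrete, the manual step often looks something like the following sketch (the URL and file names are placeholders):

```python
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical dataset URL -- stand-in for whatever server hosts the files.
DATA_URL = "https://example.org/open-data/census_2020.csv"

target = Path("data") / "census_2020.csv"
target.parent.mkdir(exist_ok=True)
if not target.exists():  # avoid re-downloading on every notebook run
    urlretrieve(DATA_URL, target)
```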

I am prototyping an extension that lets you search for data sources from within JupyterLab and download files to your working directory with a click. If it is a larger dataset, you can mount it as a directory (e.g. using HttpFS). Later, one could add the ability to mount datasets from OwnCloud, via SSH, and so on.

A second step would be to automate the fetching of data. Like in binder, where an environment.yml sets up the kernel and the dependencies, you could imagine a datasources.yml that says which data sources to mount when checking out a repository! (Something similar exists with Curious Containers RED (Reproducible Experiment Description), although it involves whole containers and is more tailored towards research than teaching.)
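
As a very rough sketch of what consuming such a file could look like (the format and field names below are invented, not an existing specification):

```python
# Rough sketch of a datasources.yml consumer. The file layout assumed here
# is invented for illustration:
#
#   sources:
#     - name: bechdel
#       url: https://example.org/data/bechdel.csv   # placeholder URL
#       path: data/bechdel.csv
from pathlib import Path
from urllib.request import urlretrieve

import yaml  # PyYAML

config = yaml.safe_load(Path("datasources.yml").read_text())
for source in config.get("sources", []):
    dest = Path(source["path"])
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urlretrieve(source["url"], dest)
```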

Let me know what you think and if something like this would be useful to you!

Like in binder where you have an environment.yml

For datasets of reasonable size, there is precedent for using exactly environment.yml for datasets and models. Here’s an example of automating the publishing of spacy-models.
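
For instance, once a packaged model lands in the environment that way, the notebook just imports it like any other dependency (assuming the environment.yml pulled in en_core_web_sm):

```python
# Assuming the environment.yml pulled in a packaged spaCy model
# (here en_core_web_sm), the notebook consumes it like any other dependency --
# no download step inside the notebook itself.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Open data should be one import away.")
print([token.text for token in doc])
```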

Using an existing package manager architecture (rather than a whole-computer manager like docker) allows for:

  • reusing existing development and automation workflows
  • simplifying downstream user configuration needs
    • a binder with a “well-behaved” environment.yml gets cached even if other content changes
  • requirements for the versions of parsers, clients, etc. can be included

The conda(-forge) ecosystem is semi-well-suited to this approach as it:

  • uses hard links such that multiple copies of the upstream data result in no extra disk usage
  • supports multiple namespaces with priority
    • if appropriate, conda-forge is already a default on binder/docker-stacks, and will host almost anything
      • as long as it doesn’t change too fast
    • if not, it’s still free to host on anaconda.org
      • or host a dumb-as-rocks channel on any static host
  • has multiple implementations, including conda (the reference), mamba (faster) and micromamba (no python, smaller and even faster)
  • “soft” dependencies can be used with run_constrained e.g. the python parser library should be >=X1,<X2, the r parser should be >=Y1,<Y2

That’s a brilliant solution! Especially for datasets that are already maintained in Git, like github.com/fivethirtyeight/data, it should be easy to make a conda package. And you get versioning for free.
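
For example, a notebook could then read the packaged files straight out of the environment; the package name and install layout below are hypothetical:

```python
# Hypothetical consumption of a conda data package that installs its CSVs
# under $PREFIX/share/<package-name>/ -- the name and layout are made up.
import sys
from pathlib import Path

import pandas as pd

data_dir = Path(sys.prefix) / "share" / "fivethirtyeight-data"
movies = pd.read_csv(data_dir / "bechdel" / "movies.csv")
print(movies.head())
```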

Still, I have to deal with datasets that are too big to comfortably put in Git, or too one-off to put in conda (although I might just try a custom channel for that case). Or something massive like https://atlas-opendata.web.cern.ch. There is a long tail of interesting data out there, and it would be nice to consume it without much additional effort on the data holders’ side.

If it is a larger dataset, you can mount it as a directory (e.g. using HttpFS). Later one could add the ability to mount datasets from OwnCloud, via SSH, and so on.

As long as there is good metadata associated with the upstream data, one could still publish a package that provides that metadata (and/or a single “row” of data, especially if the format isn’t self-describing), the LICENSE, the hashes of the data, and, crucially, a means and location from which to download it. Offering a couple of ways to download something would probably be best, e.g. from peer-to-peer (torrent, ipfs, etc.) all the way down to stdlib tools.
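
At the bottom of that ladder, the stdlib-only path is already enough for a fetch-and-verify step (the URL and expected hash below are placeholders):

```python
# Minimal stdlib-only fetch-and-verify, as the lowest common denominator;
# the URL and expected hash are placeholders.
import hashlib
from pathlib import Path
from urllib.request import urlretrieve

DATA_URL = "https://example.org/datasets/big-table.parquet"
EXPECTED_SHA256 = "0" * 64  # would come from the published package metadata

target = Path("big-table.parquet")
if not target.exists():
    urlretrieve(DATA_URL, target)

digest = hashlib.sha256(target.read_bytes()).hexdigest()
if digest != EXPECTED_SHA256:
    raise ValueError(f"checksum mismatch for {target}: {digest}")
```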

For things that are “bigger than a comfy conda package but smaller than binder”, a conda package can defer certain actions until after the package (and all its dependencies) have been installed. While frowned upon, such post-link scripts definitely still work, and are required for a number of packages. This would make the conda-forge build feasible, and the download would still happen atomically inside the install action on binder, at an optimal point for caching.
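
In that setup the post-link step is essentially an idempotent “fetch if missing” into the environment; a rough sketch (the dataset name and URL are invented; conda exposes the target environment to post-link scripts as PREFIX):

```python
# Sketch of what a post-link script might do for a dataset that is too big to
# ship inside the package itself. The dataset name and URL are invented;
# conda sets the PREFIX environment variable for post-link scripts.
import os
from pathlib import Path
from urllib.request import urlretrieve

data_dir = Path(os.environ["PREFIX"]) / "share" / "my-dataset"
data_dir.mkdir(parents=True, exist_ok=True)

archive = data_dir / "data.tar.gz"
if not archive.exists():  # no-op if the data is already there
    urlretrieve("https://example.org/my-dataset/data.tar.gz", archive)
```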

While a really good data package manager wouldn’t limit which language you consume the data from, specifically for python the fsspec and intake tools might be good fits, and they are pretty much sister projects to conda.
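
e.g. fsspec already makes a remote file readable without an explicit download step (placeholder URL; fsspec’s HTTP support needs aiohttp installed):

```python
# Reading a remote CSV lazily via fsspec's HTTP filesystem; the URL is a
# placeholder, and the "https" protocol requires aiohttp to be installed.
import fsspec
import pandas as pd

with fsspec.open("https://example.org/open-data/weather.csv", mode="rt") as f:
    weather = pd.read_csv(f)

print(weather.head())
```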

I might just try a custom channel for that case

This is actually a pretty interesting case, and may be worth bringing up on the conda forum. At scale, such an effort would want all the automation that the conda-forge/conda-smithy/autotick-bot has, but maybe it needs a data-forge sister for this kind of use case, which can rely on conda-forge for tools.
