Guix-Jupyter: Towards self-contained, reproducible notebooks

I’m happy to announce the first beta release of Guix-Jupyter, a kernel that allows users to annotate their notebooks with the list of software dependencies the notebook requires, and have them deployed in a reproducible fashion through GNU Guix:

https://hpc.guix.info/blog/2019/10/towards-reproducible-jupyter-notebooks/

As I wrote in this post, we’re more familiar with Guix than with Jupyter and we’d very much welcome your feedback on this approach!

Thanks,
Ludo’.

This is really cool! Thank you for sharing this approach. I have been chatting with my coworkers at Quansight about needing something like this for a while now. In our discussions, we had been thinking about storing this in the notebook metadata instead of in the cells. Did you consider that option?

Nope, we hadn’t considered that option! I guess storing deployment info in the notebook metadata would work too, and would have the advantage of not “getting in the way”.

One downside is that the UI to access deployment info would be less convenient (you’d have to edit JSON right from the “Edit Notebook Metadata”, right?). More importantly, this would be at the notebook level rather than the cell level: “good enough” for some use cases, but maybe not all.
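To make the notebook-level versus cell-level trade-off concrete, here is a minimal sketch using only the `json` module and the nbformat 4 layout, where both the notebook and each cell carry a `metadata` dictionary. The `dependencies` key, the helper names, and the package lists are assumptions made up for illustration; neither nbformat nor Guix-Jupyter defines them.

```python
import json

# Hypothetical metadata key; not defined by nbformat or by Guix-Jupyter.
DEPS_KEY = "dependencies"


def make_notebook():
    """Build a tiny two-cell notebook as a plain dict (nbformat 4 layout)."""
    def code_cell(source):
        return {"cell_type": "code", "metadata": {}, "source": source,
                "outputs": [], "execution_count": None}

    return {
        "nbformat": 4,
        "nbformat_minor": 2,
        "metadata": {},
        "cells": [code_cell("import numpy"), code_cell("library(ggplot2)")],
    }


def set_notebook_deps(nb, packages):
    """Record one dependency list for the whole notebook."""
    nb["metadata"][DEPS_KEY] = list(packages)


def set_cell_deps(nb, index, packages):
    """Record a dependency list for a single cell."""
    nb["cells"][index]["metadata"][DEPS_KEY] = list(packages)


def deps_for_cell(nb, index):
    """Return the cell's own list if present, else the notebook-level one."""
    cell_meta = nb["cells"][index]["metadata"]
    if DEPS_KEY in cell_meta:
        return cell_meta[DEPS_KEY]
    return nb["metadata"].get(DEPS_KEY, [])


nb = make_notebook()
set_notebook_deps(nb, ["python", "python-numpy"])   # illustrative names
set_cell_deps(nb, 1, ["r", "r-ggplot2"])            # per-cell override

for i in range(len(nb["cells"])):
    print(i, deps_for_cell(nb, i))

# The result is still ordinary JSON, so it can be saved as an .ipynb file.
serialized = json.dumps(nb, indent=1)
```

With notebook-level metadata only, the second cell could not ask for a different environment; the per-cell entry is what allows a multilingual notebook in this sketch.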

Thoughts?


Yeah I think you would wanna build some custom UI for this in JupyterLab. You can add things to the cell tools so you could surface this more easily. Or add a button to edit it in the toolbar.

Ah yeah I wasn’t aware of the cell level use cases. Could you explain one?

Thanks, I’ll take a look at the Jupyter Lab goodies (so far I was targeting Notebook).

One might want to have a multilingual notebook, as mentioned in the blog post, or to run code in different environments, for instance with different sets of dependencies. It remains to be seen how it plays out in practice, but I suspect this added flexibility can be useful.

Let me know if you have any questions there, happy to help since I am also interested in this use case.

That makes sense. I think we were just trying to keep things simple initially, so allowing only one language per notebook. cc @tonyfast @yuvipanda

FWIW in binder we are definitely interested in this too, we’ve had a few conversations about embedding metadata in the notebook and letting binder use that, but nothing tangible has come out of it yet (mostly just an “hours in the day” kinda thing)

Nice, it would be great to see some convergence here! Though like I wrote in the blog post, I think the kind of information Binder would need to store is likely higher-level and more opaque than the Guix bits, which precisely describe a software package dependency graph.

It seems to be conflating two separate things - notebooks / analytics scripts and packaging the environments required to run them.

conda is the best tool we have for creating reproducible analytics environments and as you point out containers solve the “system software” issue.

[conda] is generally not very good at reproducing software environments at different points in time or space

In your linked tweet, it sounds like the environment wasn’t saved with explicit specs. While it’s appropriate to (optimistically) pin your dependency versions loosely in your meta.yaml, to ensure reproducibility of an environment you should export an explicit env-spec.txt, which pins down the exact dependency versions, build numbers, and even channels.
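As a rough illustration of what such an explicit spec pins down, the sketch below parses a file in the format produced by `conda list --explicit`: an `@EXPLICIT` marker followed by one package URL per line, where each file name has the form `name-version-build`. The sample URLs and package names are invented for the example, and `parse_explicit_spec` is a hypothetical helper, not part of conda.

```python
from typing import List, NamedTuple


class Pin(NamedTuple):
    location: str  # channel URL including the platform subdirectory
    name: str
    version: str
    build: str


def parse_explicit_spec(text: str) -> List[Pin]:
    """Extract (location, name, version, build) from an explicit spec."""
    pins = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and the @EXPLICIT marker.
        if not line or line.startswith("#") or line == "@EXPLICIT":
            continue
        url = line.split("#", 1)[0]  # drop an optional "#<checksum>" suffix
        location, filename = url.rsplit("/", 1)
        for ext in (".tar.bz2", ".conda"):
            if filename.endswith(ext):
                filename = filename[: -len(ext)]
                break
        # Package names may contain dashes, so split from the right.
        name, version, build = filename.rsplit("-", 2)
        pins.append(Pin(location, name, version, build))
    return pins


# Made-up sample content, for illustration only.
EXAMPLE = """\
# platform: linux-64
@EXPLICIT
https://conda.anaconda.org/conda-forge/linux-64/example-pkg-1.2.3-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/linux-64/other-0.9-h1234567_1.conda
"""

pins = parse_explicit_spec(EXAMPLE)
for pin in pins:
    print(pin.name, pin.version, pin.build, pin.location)
```

Every entry names one exact pre-built binary at one URL, which is why reproducing the environment later depends on those binaries still being hosted.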

Creating a docker container with this environment ensures replicability and publishing the explicit env-spec.txt allows the environment to be reproduced locally.

Our internal CI/CD automatically builds docker images and as part of that bakes in the env-spec.txt for the environment so that it’s always available. In the case of web-app containers the env-spec.txt is made available on a /api/env-spec endpoint.

Packaging is complicated, but that can be alleviated by automation, exactly as is done with Binder. Other than not listing the explicit specs for the environment, I’m not sure what reproducibility issues Binder doesn’t solve.

Last but not least, we still haven’t solved the core issue, which is that notebooks are not self-contained: they do not describe the dependencies they need.

I think this is where we disagree: I don’t think they should. IMHO that’s the job of a proper package manager (conda) and a package specification DSL (meta.yaml).

Hello,

Thanks for the pointer. You are right that “pinning” exact versions with explicit specs greatly improves reproducibility.

It remains that Conda falls short when it comes to capturing the complete dependency graph. This is in contrast with the functional approach of GNU Guix and Nix, where the complete dependency graph is captured, down to the “compiler’s compiler”, and each package in the graph can be rebuilt at any time, with a bit-for-bit identical result, thanks to reproducible builds. In other words, Guix provides bitwise software environment reproducibility without relying on an archive of pre-built binaries on the project’s server.

I hope this clarifies the distinction I’m making when it comes to reproducibility! Others wrote about their experience with Guix and Conda in the context of genomics.

I agree that deployment is the job of a package manager; Guix-Jupyter uses GNU Guix for that task.

Perhaps what we disagree on is where dependency meta-data should be stored. Guix-Jupyter is an experiment to embed dependency information in annotations directly in the notebook, making it easy to share notebooks and have their environment automatically deployed without further ado.

Storing it in a separate file such as meta.yaml is another option, but I would argue that dependency info should not be a “second-class citizen” given that it defines the results of the notebook’s computations.
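The idea of embedding dependency information as annotations can be sketched as follows. This assumes an annotation of the form `;;guix environment NAME <- PACKAGE...` on the first line of a cell, modelled on the linked blog post; treat the exact syntax as an assumption and check it against the Guix-Jupyter documentation. The parser is a hypothetical illustration, not how the kernel is implemented.

```python
import re

# Assumed annotation shape: ";;guix environment NAME <- PKG1 PKG2 ..."
ANNOTATION = re.compile(r"^;;guix\s+environment\s+(\S+)\s*<-\s*(.*)$")


def parse_environment_annotation(cell_source):
    """Return (environment_name, [packages]) or None if not annotated."""
    stripped = cell_source.strip()
    if not stripped:
        return None
    first_line = stripped.splitlines()[0]
    match = ANNOTATION.match(first_line)
    if match is None:
        return None
    return match.group(1), match.group(2).split()


annotated = ";;guix environment matplotlib-env <- python-ipykernel python-matplotlib"
plain = "print('no annotation here')"

print(parse_environment_annotation(annotated))
print(parse_environment_annotation(plain))
```

Because the annotation lives in a cell, anyone who receives the notebook file also receives the dependency list, with no separate file to keep in sync.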
