How do you manage (kernel-)dependencies?

Dear Jovians,

I’ve used JupyterLab (and previously Jupyter and even IPython Notebook) for quite a while now and have spent some time developing a workflow for how I set up projects that are centered around notebooks.
I am mostly happy with how this has worked out for me so far, but there is one point that strikes me as suboptimal: dependency management. More precisely: dependency management for Python kernels.
I am interested to hear how you do this.

Here is a description of what I do and what I view as problematic:

Whenever I start a new project, I make a dedicated virtual environment for it. Since I tend to be minimalist and did not have many (if any) problems with this so far, I just use the venv module that is part of the Python 3 standard library to make the environment and pip to install packages:

$ cd path/to/my/project

# create the environment
$ python -m venv env
$ source env/bin/activate

# pip install your dependencies
(env) $ pip install ipykernel scipy holoviews ... all the good stuff

# register the kernel
(env) $ python -m ipykernel install --name yet-another-kernel --user

# pin your dependencies
(env) $ pip freeze > requirements.txt

So I get a new kernel for my project and all is good: I can install and upgrade packages as I need without breaking anything in my other projects.
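
For completeness, getting such an environment back later from the pinned requirements is just the reverse, plain venv/pip usage again:

$ cd path/to/my/project
$ python -m venv env
$ source env/bin/activate
(env) $ pip install -r requirements.txt
(env) $ python -m ipykernel install --name yet-another-kernel --user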

The problem

It is not uncommon for me that some project is idle for quite a while. For example, I teach a lab course where I regularly have to do some data evaluation, e.g. for a student experiment that is new or got revised. I prepare a couple of notebooks to do this data evaluation. They are part of a dedicated project and have their own environment. Then the experiment runs for a couple of semesters and the notebooks are left untouched until I decide that I might want to work on it again. Naturally, I would want to reuse my previous project and environment and pick up the work from there, but in the meantime the world kept spinning and all my favorite packages got upgrades that I would like to make use of.

So what now?

I could just update the packages in the old environment, but I run the risk of breaking the notebooks I made before. So no. Or I could just make another environment and register another kernel. Better, but I can tell you that you can collect a lot of kernels over time like this. Also not very nice. Or one could just leave it and use the old package versions, but hey …
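
As an aside, at least the accumulated kernels are easy to prune again with the standard jupyter kernelspec commands, e.g.

$ jupyter kernelspec list
$ jupyter kernelspec remove yet-another-kernel

but that only treats the symptom, not the cause.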

The solution?

What I like very much is how it works out with Julia. There you have your Project.toml and Manifest.toml and you do not create a dedicated kernel for a project. Rather, you have one Julia kernel (OK, or more, if you run several versions of Julia) and then you just

import Pkg
Pkg.activate("path/to/whatever/project/you/like")

(and perhaps Pkg.instantiate()). So as long as you put your notebook(s) in a separate folder with its own Project.toml and Manifest.toml, you’re good, and you do not clutter your kernel selection window.

Stand-alone / Self-contained notebooks

Because of the way Python handles its environments (mainly that you cannot change the environment from within a running Python interpreter session, like you can in Julia), the solution has to look different here. So what I am thinking about is that the dependencies are somehow stored inside the notebook to make it “stand-alone” or “self-contained” or however one might want to call it, with the environment itself being a throw-away thing (pip’s cache is what counts, so you do not have to re-download everything all the time).
Self-contained notebooks are of course not a new idea: Pluto.jl notebooks, for example, have been able to store their dependencies since some of the more recent releases.
For Jupyter I stumbled upon jupyterlab-requirements, but that crashed on me with both the thoth and the pipenv backend (in a fresh jupyter/base-notebook docker container). Then I found the Guix-Jupyter kernel in this forum, but I was unlucky again: I tried to install jupyter and guix-jupyter via the guix package manager but got errors (something about competing packages and failed python tests). The Guix approach sounds very nice to me, but I am a bit worried w.r.t. the added complexity and the time I would need to invest to learn this tool, plus the kernel seems to lack proper syntax highlighting (and perhaps other features?) because it is language agnostic.

So what I ended up doing, at least as an experiment for now …
I added a %%bash cell to my notebook which creates an environment on the fly and pins the dependencies explicitly:

%%bash
# create and activate a fresh throw-away virtual environment
VENV=/tmp/env
rm -r "$VENV" 2> /dev/null || echo "No temporary environment found, creating a fresh one."
python -m venv "$VENV"
source "$VENV/bin/activate"
pip install -U pip wheel

# dependencies go here (ipykernel always required!)
echo "
astroscrappy==1.1.0
holoviews==1.14.9
h5py==3.6.0
ipykernel
matplotlib==3.5.2
pandas==1.4.2
scipy==1.8.0
" | xargs pip install

# register (or replace) the kernel named "notebook"
python -m ipykernel install --name="notebook" --user

Using this, my kernel will always be called notebook, but its environment gets replaced as needed. I may have to reinstall a lot, but thanks to pip’s cache, I am not too worried about that.
Granted, this does not work for non-Python dependencies, but one could of course swap venv/pip for conda in this approach.
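
Just to sketch what I mean (untested; the prefix path is arbitrary and the pins are simply carried over from the cell above):

%%bash
# conda variant of the same idea: throw-away prefix environment with pinned packages
# (non-python dependencies could be listed here as well)
ENV=/tmp/conda-env
rm -rf "$ENV"
conda create --yes --prefix "$ENV" \
    ipykernel scipy=1.8.0 matplotlib=3.5.2 pandas=1.4.2 h5py=3.6.0

# register (or replace) the kernel, pointing at the python inside the new prefix
"$ENV/bin/python" -m ipykernel install --name="notebook" --user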
Anyway, I am sure that this approach has some flaws that I do not see yet, and maybe I overlooked some obvious tool that is available. I also noted that there is some criticism w.r.t. notebooks that declare their own dependencies, but I cannot recall where I read that and do not really see the problem.

How do you manage your dependencies? Should the notebook not be responsible for its own dependencies? Do you see some obvious flaw in my %%bash hack?

I am eager to learn a better way.

Sorry for the long post.

Kind regards
Nils


I don’t have too much time to reply here, so I’ll keep it very short.

I treat my analyses like applications - using a pinning project manager (https://pdm.fming.dev/). This means that I will likely be able to install the project again in the future if my Python version isn’t too new.


Thanks for sharing your approach, @agoose77 :slightly_smiling_face:

I was not aware of PDM and tried it out right away.
The result looks similar to the Julia way, i.e. you just make a folder for one or several notebooks and have a pyproject.toml and a pdm.lock file tracking the dependencies.

I had to guess how to manage the jupyter kernel and came up with the following:

If you make a new project starting from an environment that includes jupyter-lab,* you can just do a pdm run jupyter-lab from within your project folder (jupyter environment active), and the default kernel will include the packages installed with PDM in the project’s __pypackages__ folder.
That is really convenient and puts a stop to adding ever more kernels.
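
For reference, the rough sequence looks like this (the folder name is just an example; pdm init and pdm add are the standard PDM commands for creating pyproject.toml and pinning dependencies in pdm.lock):

# new folder per project, as with the Julia approach
$ mkdir my-notebook-project && cd my-notebook-project

# create pyproject.toml, then resolve + lock + install into __pypackages__
$ pdm init
$ pdm add holoviews scipy

# run JupyterLab from the surrounding jupyter environment
$ pdm run jupyter-lab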

One concern I had was that the __pypackages__ folders of many projects would eventually take up a lot of disk space. Indeed, I tested it: when I installed holoviews in a fresh PDM project, the __pypackages__ folder was already 270 MB! But looking a bit deeper into the PDM docs reveals that you can activate linking from a central package cache. So I set

pdm config -g install.cache True

and the size of __pypackages__ shrank to 1.4 MB. Very cool :sunglasses:

So yeah, that seems to check pretty much all the boxes. I will experiment with it and see if that works out long term. Thanks again!


*I guess many people will just have that installed system-wide, in which case the starting environment is simply the system environment.


The __pypackages__ feature is actually a nice side effect of using PDM (it is useful for other things too), and it can be disabled if you prefer not to use it.