Using git hooks to maintain a "cleaned output" notebook branch

This is slightly off topic for this forum unless regarded as a Jupyter workflow problem, so apologies in advance…

I have a git repository with:

  • a set of run notebooks in the master branch, with output cells populated;

  • a set of notebooks in a clean branch, derived from master, that are the same notebooks but with output cells stripped…

Tools such as nbstripout help automate the creation of “output cleaned” notebooks.

Is there a set of git commit hooks / filters I can use so that a commit of one or more notebooks to the master branch will result in a cleaned copy of the same notebook(s) being committed to the clean branch?

The master branch may contain a lot of notebooks in nested directories, so ideally I only want to run the notebook cleaner over notebooks that have been updated.
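To make the "only updated notebooks" part concrete, here is a minimal sketch of what a hook script could call. It is not nbstripout itself: `strip_outputs` is a simplified stand-in that only clears outputs and execution counts, and the helper names (`select_notebooks`, `changed_notebooks`, `strip_outputs`) are made up for illustration. It also does not commit anything to the clean branch; that step is left out.

```python
import subprocess


def select_notebooks(paths):
    """Keep .ipynb paths, dropping anything inside an .ipynb_checkpoints/ directory."""
    return [
        p for p in paths
        if p.endswith(".ipynb") and ".ipynb_checkpoints/" not in p
    ]


def changed_notebooks(commit="HEAD"):
    """List notebooks added or modified by `commit` (assumes a non-merge, non-root commit)."""
    out = subprocess.run(
        ["git", "diff-tree", "--no-commit-id", "--name-only", "-r", "-z",
         "--diff-filter=AM", commit],
        check=True, capture_output=True, text=True,
    ).stdout
    # -z separates paths with NUL bytes, so unusual filenames are not quoted.
    return select_notebooks(out.split("\0"))


def strip_outputs(nb):
    """Clear outputs and execution counts of code cells in a parsed nbformat-4 notebook dict."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb
```

A post-commit hook could call `changed_notebooks()`, load each file with `json.load`, pass it through `strip_outputs` and write the result somewhere else, for example a separate worktree checked out on the clean branch. In practice nbstripout would replace `strip_outputs`, since it also handles notebook metadata.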

It would also be useful if the clean branch were configured from the start to ignore .ipynb_checkpoints/ directories.

Sounds like a cool idea, and I am sure the almighty Git can do it, but I don’t know how.

I’d start investigating from https://git-scm.com/docs/gitattributes. I use it to automatically en-/decrypt files in a repo with:

secrets/** filter=git-crypt diff=git-crypt

or to automatically use nbdime for diffs:

*.ipynb	diff=jupyternotebook
*.ipynb	merge=jupyternotebook

where the command/tool called jupyternotebook is defined in .git/config:

[diff "jupyternotebook"]
	command = git-nbdiffdriver diff
[merge "jupyternotebook"]
	driver = git-nbmergedriver merge %O %A %B %L %P
	name = jupyter notebook merge driver

I wonder what happens if you have a different .gitattributes in each branch and then cherry-pick commits from your master branch to the clean branch. Does that apply the filters defined in the .gitattributes of the clean branch?

I don’t know if this can be automated with a commit hook (probably not), but I would suggest going in a different direction:

Only commit “clean” notebooks, and have a single branch (with a single commit) where the notebooks are executed. I’ve described the suggested workflow here:
https://mg.readthedocs.io/git-jupyter.html#executing-notebooks-in-a-separate-branch

The setup is mostly automatic, but the daily use requires (for now) some manual steps.

The advantage of this approach is that the whole history is “clean” and all diffs are readable. But still, having a branch with executed notebooks allows the outputs to be visible, e.g., on nbviewer.

@mgeier Thanks for that suggestion and the link.

My preferred way of working would be to commit .md files into master and then use commit hooks to run the md via jupytext etc to produce run notebooks automatically committed into a separate branch. Having that branch empty of commit messages would be interesting.
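As a rough sketch of the md-to-run-notebook step, something like the following could be called from a hook. It shells out to the jupytext command line with `--to ipynb --execute --output`; the function names (`jupytext_command`, `build_notebooks`) are hypothetical, and it assumes jupytext and nbconvert are installed and that each .ipynb should sit next to its .md source. Committing the results into the separate branch is not shown.

```python
import subprocess
from pathlib import PurePosixPath


def jupytext_command(md_path):
    """Build the jupytext CLI call that turns one Markdown file into an executed .ipynb."""
    # Git reports paths with forward slashes, so PurePosixPath is used here.
    out_path = PurePosixPath(md_path).with_suffix(".ipynb")
    return [
        "jupytext", "--to", "ipynb", "--execute",
        "--output", str(out_path), str(md_path),
    ]


def build_notebooks(md_paths):
    """Convert and execute each Markdown file; raises CalledProcessError if jupytext fails."""
    for md in md_paths:
        subprocess.run(jupytext_command(md), check=True)
```

Calling `build_notebooks(["docs/intro.md"])` would write `docs/intro.ipynb` with outputs populated, provided the notebook runs without errors.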

Unfortunately, I’m having trouble persuading anyone else of the merits of: a) a Jupytext-mediated approach; b) empty notebooks, rather than run notebooks, as the thing committed by users. (The wider feeling is that a user should commit the run notebook, so they can see that they are committing the notebook as they expect it to be run, with the output-stripped notebook then derived from that.)

–tony

I think the approach from my link above doesn’t work if you use Markdown files. Rebasing the “executed” branch would cause conflicts.

You could re-convert and re-execute all files (not only changed files) each time to avoid rebasing, then it could work. But it would be annoying if you have many notebooks (and you change only a few of them).

An alternative approach would be to have only the original Markdown files in your repo (without a separate branch for converted/executed notebooks).
Then you could use jupytext as a “contents manager”, see https://jupytext.readthedocs.io/en/latest/using-server.html#global-configuration.

You can also use this “contents manager” on Binder. For an example configuration see https://github.com/PlasmaPy/PlasmaPy/blob/0876fb363d3ff57eac662655e275bcca48f36884/.jupyter/jupyter_notebook_config.py.
The disadvantage is that all your users will have to install and configure jupytext if they want to open the files locally.
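For reference, the "contents manager" setup is a short fragment in the Jupyter config file. This is a sketch based on the Jupytext documentation for the classic notebook server; the `c` object is supplied by Jupyter when it loads the file, so the fragment is not meant to be run on its own, and newer Jupytext or Jupyter Server versions may need a different setting or none at all.

```python
# jupyter_notebook_config.py (for example in ~/.jupyter/ or a repo-level .jupyter/)
# `c` is the config object Jupyter provides when it loads this file.
# Assumption: classic Notebook server; other servers may use a different option name.
c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"
```

With this in place, .md and .py files open as notebooks in the notebook UI without an .ipynb file having to exist in the repo.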

Instead of having executed notebooks in your repo, you can provide static HTML pages (including the cell outputs) with nbsphinx (full disclosure, I’m the author). See https://nbsphinx.readthedocs.io/en/0.4.3/custom-formats.html.
Here are two random example pages created from Markdown files:

Re: using Jupytext: yes, that would be the idea… Generate ipynb from md, essentially, and treat (as far as repo is concerned) the md as the first class source.

(By default, I use Jupytext in all my environments, with no pairing… it means I can edit py and md files in the notebook UI.)