Using git hooks to maintain a "cleaned output" notebook branch

psychemedia · September 26, 2019, 12:24pm

This is slightly off topic for this forum unless regarded as a Jupyter workflow problem, so apols in advance if it is off-topic…

I have a git repository with:

a set of run notebooks in the master branch, with output cells populated;
a set of notebooks in a clean branch, derived from master, that are the same notebooks but with output cells stripped…

Tools such as nbstripout help automate the creation of “output cleaned” notebooks.

Is there a set of git commit hooks / filters I can use so that a commit of one or more notebooks to the master branch will result in a cleaned copy of the same notebook(s) being committed to the clean branch?

The master branch may contain a lot of notebooks in nested directories, so ideally I only want to run the notebook cleaner over notebooks that have been updated.

It would also be useful if the clean branch were configured from the start to ignore .ipynb_checkpoints/ directories.
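To make the "only updated notebooks" requirement concrete, here is a minimal sketch in Python of how a hook script might pick out the notebooks touched by a commit. The function names `changed_notebooks` and `paths_in_commit` are hypothetical, and the sketch assumes the hook runs inside the repository after the commit to master has been made. It does not switch branches or commit anything; it only produces the list of files a cleaner would need to process.

```python
import subprocess
from pathlib import PurePosixPath


def changed_notebooks(paths):
    """Keep only .ipynb paths that are not inside an .ipynb_checkpoints directory."""
    selected = []
    for raw in paths:
        path = PurePosixPath(raw)  # git reports paths with forward slashes
        if path.suffix != ".ipynb":
            continue
        if ".ipynb_checkpoints" in path.parts:
            continue
        selected.append(raw)
    return selected


def paths_in_commit(commit="HEAD"):
    """List files added or modified by a commit, as reported by git diff-tree.

    Assumption: the commit has a parent; a root commit would also need --root.
    """
    result = subprocess.run(
        ["git", "diff-tree", "--no-commit-id", "--name-only", "-r", "-z",
         "--diff-filter=AM", commit],
        check=True, capture_output=True, text=True,
    )
    # With -z, names are NUL-terminated and never quoted.
    return [name for name in result.stdout.split("\0") if name]
```

A post-commit hook could then call `changed_notebooks(paths_in_commit("HEAD"))` and pass each result to the cleaner. Separately, a `.gitignore` file on the clean branch containing the line `.ipynb_checkpoints/` would keep checkpoint copies out of that branch.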

betatim · September 27, 2019, 5:16am

Sounds like a cool idea and I am sure the almighty Git can do it but I don’t know how.

I’d start investigating from https://git-scm.com/docs/gitattributes. I use it to automatically en-/decrypt files in a repo with:

secrets/** filter=git-crypt diff=git-crypt

or to automatically use nbdime for diffs:

*.ipynb	diff=jupyternotebook
*.ipynb	merge=jupyternotebook

where the command/tool called jupyternotebook is defined in .git/config:

[diff "jupyternotebook"]
	command = git-nbdiffdriver diff
[merge "jupyternotebook"]
	driver = git-nbmergedriver merge %O %A %B %L %P
	name = jupyter notebook merge driver

I wonder what happens if you have a different .gitattributes in each branch and then cherry-pick commits from your master branch to the clean branch. Does that apply the filters defined in the .gitattributes of the clean branch?
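For reference, a git "clean" filter is just a command that receives the file content on standard input and writes the version to be stored on standard output. The following is a rough Python sketch of such a filter for notebooks, assuming nbformat 4 JSON; it is not nbstripout's implementation, and the names `strip_outputs` and `clean_text` are made up for illustration.

```python
import json


def strip_outputs(notebook):
    """Remove outputs and execution counts from the code cells of a notebook dict."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clean_text(text):
    """Take notebook JSON as a string and return the stripped notebook as a string."""
    notebook = json.loads(text)
    # indent=1 matches the indentation Jupyter itself uses when saving notebooks.
    return json.dumps(strip_outputs(notebook), indent=1, ensure_ascii=False) + "\n"
```

A small script that does `sys.stdout.write(clean_text(sys.stdin.read()))` could be registered under a hypothetical filter name such as `stripnb` via the `filter.stripnb.clean` setting in .git/config, with `*.ipynb filter=stripnb` in .gitattributes. In practice, `nbstripout --install` sets up an equivalent filter for you.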

mgeier · October 14, 2019, 12:01pm

I don’t know if this can be automated with a commit hook (probably not), but I would suggest going in the other direction:

Only commit “clean” notebooks, and have a single branch (with a single commit) where the notebooks are executed. I’ve described the suggested workflow here:
https://mg.readthedocs.io/git-jupyter.html#executing-notebooks-in-a-separate-branch

The setup is mostly automatic, but the daily use requires (for now) some manual steps.

The advantage of this approach is that the whole history is “clean” and all diffs are readable. But still, having a branch with executed notebooks allows the outputs to be visible, e.g., on nbviewer.

psychemedia · October 16, 2019, 12:39pm

@mgeier Thanks for that suggestion and the link.

My preferred way of working would be to commit .md files into master and then use commit hooks to run the md via jupytext etc to produce run notebooks automatically committed into a separate branch. Having that branch empty of commit messages would be interesting.

Unfortunately, I’m having trouble persuading anyone else of the merits of: a) a Jupytext mediated approach; b) empty notebooks, rather than run notebooks, as the thing committed by users. (The wider feeling is that a user should commit the run notebook, so they can see that they are committing the notebook as they expect it to be run, and then the output-stripped notebook is derived from that.)

–tony

mgeier · October 20, 2019, 4:03pm

I think the approach from my link above doesn’t work if you use Markdown files. Rebasing the “executed” branch would cause conflicts.

You could re-convert and re-execute all files (not only changed files) each time to avoid rebasing, then it could work. But it would be annoying if you have many notebooks (and you change only a few of them).

An alternative approach would be to have only the original Markdown files in your repo (without a separate branch for converted/executed notebooks).
Then you could use jupytext as a “contents manager”, see https://jupytext.readthedocs.io/en/latest/using-server.html#global-configuration.

You can also use this “contents manager” on Binder. For an example configuration see https://github.com/PlasmaPy/PlasmaPy/blob/0876fb363d3ff57eac662655e275bcca48f36884/.jupyter/jupyter_notebook_config.py.
The disadvantage is that all your users will have to install and configure jupytext if they want to open the files locally.
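As a sketch of what that configuration looked like in the Jupytext documentation at the time, the contents manager is enabled with a single line in the Jupyter config file. The `c` object is supplied by Jupyter when it loads the file, so this fragment is not meant to run on its own, and newer Jupytext releases may register the contents manager automatically.

```python
# jupyter_notebook_config.py (for example in ~/.jupyter/ or a repo-level .jupyter/)
# `c` is the configuration object that Jupyter provides when it loads this file.
c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"
```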

Instead of having executed notebooks in your repo, you can provide static HTML pages (including the cell outputs) with nbsphinx (full disclosure, I’m the author). See https://nbsphinx.readthedocs.io/en/0.4.3/custom-formats.html.
Here are two random example pages created from Markdown files:

psychemedia · October 21, 2019, 12:49pm

Re: using Jupytext: yes, that would be the idea… Generate ipynb from md, essentially, and treat (as far as repo is concerned) the md as the first class source.

(By default, I use Jupytext in all my environments, with no pairing… it means I can edit py and md files in the notebook UI.)

WesleyTheGeolien · November 9, 2020, 2:24pm

I wonder if something like this will work?

I agree they only do it on one branch but I am guessing the pre-commit hook can be used to switch branches and commit to the other branch as well?

gabrieltorcat-arc · April 8, 2021, 9:40pm

This is better: kynan/nbstripout on GitHub (strip output from Jupyter and IPython notebooks).
