Jupyter and GitHub - alternative file format

A quick comment from someone who wrote that first JEP about the notebook format.

In retrospect, I didn’t choose my title very well. The real issue I wanted to address is not the format, but the data model for the notebook. The data model says what is stored in a notebook, the format says how the data is encoded as a byte sequence (text or otherwise).

The distinction matters because conversion between two formats that implement the same data model is lossless, and can be done automatically in a workflow according to convenience. You can store a list of cells as a JSON file, or as an alternation of text/code in Markdown - it doesn’t really matter.

The real problem is the data model. For Jupyter, it’s roughly (neglecting metadata) a list of cells, each cell being one of “text”, “code”, and “output”. Compare e.g. to RMarkdown, where the data model is also a list of cells, but only of type “text” or “code”. The big problem with the Jupyter data model is that it mixes human-edited data (text, code) with computed data (output). That’s what makes version control difficult: you want all human-edited data version-controlled, but not computed output. The provenance of computed output is tracked via reproducibility, not version control.
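To make the mixing concrete, here is a minimal sketch of the current model (simplified from the real nbformat v4 schema; only a few representative fields are shown):

```python
# A minimal sketch of the Jupyter data model, simplified from nbformat v4.
# Human-edited input and computed output live side by side in one structure.
notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Analysis\n"]},  # human-edited
        {
            "cell_type": "code",
            "source": ["print(1 + 1)\n"],     # human-edited
            "execution_count": 7,             # computed
            "outputs": [                      # computed
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}

# Any version-control policy has to untangle these key by key:
human_keys = {"cell_type", "source"}
computed_keys = {"execution_count", "outputs"}
```

Every code cell carries both sets of keys, which is exactly why a textual diff of an `.ipynb` file cannot separate "what the author changed" from "what the kernel produced".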

The core of my old JEP proposal is splitting the notebook into two separate data items: the human input and the execution trace, which would be a ledger (append-only list) storing input sent to the kernel and output received back from the kernel. This ledger would be reproducible because it preserves all communication with the kernel, not just the output associated with the last interactive execution.
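A hypothetical sketch of that split (my own illustration here, not the JEP's actual schema - names and structure are made up) might look like:

```python
# Hypothetical split of a notebook into human input + an execution ledger.
# Structure and field names are illustrative, not the JEP's actual schema.
document = {
    "cells": [
        {"cell_type": "markdown", "source": "# Analysis"},
        {"cell_type": "code", "id": "c1", "source": "print(1 + 1)"},
    ]
}

ledger = []  # append-only: every kernel round-trip is recorded, never rewritten


def record_execution(cell_id, source, outputs):
    """Append one kernel interaction to the ledger (old entries are never mutated)."""
    ledger.append({"cell": cell_id, "input": source, "outputs": outputs})


# Running the same cell twice leaves two entries, preserving the full trace
# rather than only the output of the last interactive execution:
record_execution("c1", "print(1 + 1)", [{"name": "stdout", "text": "2\n"}])
record_execution("c1", "print(1 + 1)", [{"name": "stdout", "text": "2\n"}])
```

Only `document` would live under version control; `ledger` captures the complete communication with the kernel and can be regenerated, archived, or discarded separately.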

3 Likes

Just a complement to the main thread: I’d also like to mention another class of Jupyter-using Github users, and that’s folk who use Github primarily for sharing outputs, either as publishers or as readers/consumers. For that class of user, good previewing of notebooks and the ability to easily publish via Github Pages are key considerations.

I note that Github Pages provides first-class support for Jekyll, and wonder if an official Jupyter Book publishing route via a Github Action would be appropriate.

Seeing how users make use of MyST, e.g. as part of a publishing strategy, may further inform MyST development and also reveal insights about how end-user readers consume Pages output. It might also throw up examples of folk who are not (yet) Github repo users but who do want to be able to comment on published pages. Using services such as Hypothesis to annotate notebooks (related discussion) is one way, but another way might be to encourage folk to comment / question as part of repo comments. If that is appropriate, what would the on-ramp to that be? If it’s not appropriate, is there another level of commenting that Github supports? (I also note, for example, that Github is adding Discussions as well as Issues. Could there similarly be a split between code comments and Pages comments?)

First off, thank you all so much for jumping on this forum and taking so much time to participate in this discussion!

From the community side, one leap in this conversation I don’t quite understand: why try to change the notebook format without first getting the most you can out of the existing one?

Trying to read between the lines of your responses, it seems like you feel the current experience you have for rendering and diffing notebooks is as good as it can get without changing the format? Or just that it seems like a waste of time to improve it, when you could have an even bigger improvement by changing the format?

I, obviously, really appreciate all the work you do at GitHub, but it’s kinda common knowledge in the Jupyter space that the Github rendering of notebooks is very slow and unusable in a lot of cases. I usually end up having to use nbviewer.

And for diffing it has no support at all. Like @echarles said above, the JupyterLab Git plugin has support for rich notebook diffs.

Obviously, this doesn’t help that much with commenting… But it seems like you could figure out a way to deal with comments on notebook files; it would just require a little bit of finagling on your end. The UI seems doable, it just might be different from your current (internal, so I don’t know it) data model for how comments are stored, i.e. instead of having comments point to a line number, they could point to a cell number and a line in that cell.
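For illustration, mapping flat line numbers onto cell-relative anchors is mechanically simple. This is a sketch with hypothetical names, assuming cell sources are plain strings:

```python
# Sketch: translate a line number over the concatenated cell sources into a
# (cell_index, line_in_cell) comment anchor. Names here are hypothetical.
def anchor_for_flat_line(cells, flat_line):
    """Map a 0-based line number over all cell sources back to a
    (cell_index, line_in_cell) pair."""
    offset = 0
    for cell_index, cell in enumerate(cells):
        lines = cell["source"].splitlines()
        if flat_line < offset + len(lines):
            return cell_index, flat_line - offset
        offset += len(lines)
    raise IndexError("line beyond last cell")


cells = [
    {"source": "import numpy as np\nx = np.arange(3)"},
    {"source": "print(x)"},
]
```

With this toy input, flat line 2 resolves to line 0 of the second cell - i.e. a comment could survive cells being reordered, which a flat line number cannot.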

FWIW, you might hope/expect that there is a process you can go through to get user input and move forward in some linear way on this idea. The JEP is the current best take on that. Socially, however, Jupyter is still very much focused on consensus. What I have found this means (and @Zsailer could give a good example of this from his work on splitting out the Jupyter Server) is that you have to have conversations with a bunch of different stakeholders in different subprojects: for example, Jupyter Hub folks, Jupyter Notebook folks, nteract folks, to name at least a few of the larger subgroups.

Just trying to give you some of my context here; this is an important conversation and I hope it will move forward.

2 Likes

Going back to the user experience, I see functionalities of rising complexity:

  • View Notebook (needs the requested visualisation libraries)
  • View Notebook Diff in a 2 Tab way
  • Comment on a Notebook
  • Comment on a Notebook Pull Request
  • Executable Notebook (needs kernel)
  • Reproducible Notebook = Execute a Notebook + have needed datasets mounted.

I guess a single format (json, markdown…) supporting all those functionalities is hard to find.

1 Like

I think that the dream of having a single, text-based format that has both the flexibility of the ipynb standard and the readability/simple diffability/etc. of a plain-text format will be difficult to achieve. For example:

  • Most notebook- and cell-level metadata isn’t relevant (and is potentially confusing) to humans
  • Many packages embed gigantic HTML/JS bundles in notebooks (e.g., Bokeh or ipywidgets)
  • Many packages embed data in notebooks (e.g., Altair encodes all of the data for any plot in the cell output metadata)
  • And then there’s everybody’s favorite example of base64 encoded images

Some of this speaks to @khinsen’s concerns around interweaving human and machine information in a single format. I’ve quite enjoyed using the workflow of “jupytext text files for human-readable files, and pair them with ipynb files when you need all the complexity of what jupyter can do”.

Or put another way, I don’t know that we want a text-based notebook that also has all of the metadata, outputs information, etc of the current notebooks. We may end up re-creating the same problems that ipynb has, but with a markdown or yaml format instead of JSON.

Also - I second @saulshanabrook’s enthusiasm that you can do a lot of good stuff with the pre-existing notebook standard, I also second his points about building consensus within the community :slight_smile:

2 Likes

I just woke up, let me respond to each comment :wink:

@betatim I think the prototype you’re asking for is exactly what I’m going to work on this week. I’d love as many eyes on that work as possible once I get the basic stuff working.

@khinsen I’ll second @psychemedia 's point - a lot of people actually want outputs in the repo; there are lots of good use cases for that. We already have ways to save a notebook as code or to clear outputs, so I’d leave this out of this discussion and assume we want to save everything from the original ipynb, and that includes outputs.

@saulshanabrook we’ve spent a lot of time trying to figure out a good way to do this today and came up with nothing:( To me, inline comments are just a critical part of any review process anywhere. I can’t imagine having an open source notebook project without being able to comment on lines changed during review. That will only ever happen for textual files, so I don’t see any other way to solve it than readable textual notebooks tbh…
IMO, a linear way to proceed with the idea would go like this:

  1. this discussion and more of it - to get as much prior art and ideas on the table
  2. prototype that’s outside of the jupyter codebase, using a contents manager
  3. iterate over and over to get most use cases solved and some traction

Then, if we think it’s worthwhile to add it to the jupyter core codebase:

  1. write a JEP with all the learnings and go from there

I think we’re at 1 and 2, now’s the time for coding:)

@echarles - lots of these are handled for human-readable text files. So if we manage to save all the necessary metadata to the notebook and retain human readability, at least for code cells and output cells, I think we can get 99% of the way there. That’s my goal.

@choldgraf all of the issues you listed are valid, but I think they’re orthogonal and will not affect the review experience in a hugely negative way. Things like bokeh imports are long and opaque, but they also won’t change all that much between iterations of a notebook, so they’re going to be hidden in later revisions of the file. base64 images are one-line changes that you just ignore. Obviously these aren’t ideal scenarios, but it’s still a big improvement over having all of that in JSON:)

one quick thought there: if it’s acceptable to have a bunch of ugly stuff committed to GitHub so long as it doesn’t change often, then couldn’t many of these problems be resolved by re-working how things are stored internally in an ipynb file? E.g. if the outputs or maybe even the metadata were always in a separate list at the bottom of the JSON structure, and referenced in the list of cells w/ the code etc, then most ipynb diffs as “regular text files” would look relatively reasonable. Just a random thought :slight_smile:
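A toy transformation in that spirit (purely illustrative, not a proposed nbformat change - the `output_ref` field is made up) could hoist outputs into a trailing table keyed by cell id:

```python
# Sketch: move per-cell outputs into a separate top-level mapping, leaving a
# reference in each cell, so output churn clusters at the bottom of diffs.
# The "output_ref" key is hypothetical, not part of nbformat.
def hoist_outputs(notebook):
    cells, outputs = [], {}
    for i, cell in enumerate(notebook["cells"]):
        cell = dict(cell)  # shallow copy; don't mutate the input
        if "outputs" in cell:
            cell_id = cell.get("id", f"cell-{i}")
            outputs[cell_id] = cell.pop("outputs")
            cell["output_ref"] = cell_id
        cells.append(cell)
    return {"cells": cells, "outputs": outputs}


nb = {"cells": [
    {"cell_type": "markdown", "source": "# Title"},
    {"cell_type": "code", "source": "1 + 1", "outputs": [{"text": "2"}]},
]}
restructured = hoist_outputs(nb)
```

After the transform, edits to cell sources show up near the top of a textual diff while all output churn is confined to the `"outputs"` block at the end.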

Interesting idea. I was thinking in a similar way, only instead of having JSON with a duplicated notebook, extract all the not-so-interesting stuff into YAML blocks of MyST and emphasize the interesting stuff, namely cell input and output. Review will still have a lot of uninteresting info, like changes in metadata, but as long as the interesting info is easy to find and comment on, I think that’s already a big improvement. From a reviewer’s standpoint, I’m perfectly ok with mentally ignoring changes in unintelligible metadata YAML as long as I see pretty Python below it that I can focus on.
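For concreteness, one hypothetical shape of such a file (my own illustration, not actual MyST-NB syntax - just the “opaque metadata in YAML up front, readable code below” idea):

````markdown
---
# notebook-level metadata: uninteresting during review, easy to skip past
kernelspec: {name: python3, display_name: Python 3}
nbformat: 4
---

# Analysis

```{code-cell} python
---
# cell-level metadata in a YAML block; a reviewer can mentally ignore this
id: c1
tags: [parameters]
---
print(1 + 1)
```
````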

Quick clarification: ipywidgets works very hard to not embed HTML/JS in notebooks. It can embed widget data as metadata in a notebook if requested, but that’s not required either.

1 Like

TL;DR: This is an advertisement for my project GitHub - mgeier/jupyter-format: An Experimental New Storage Format For Jupyter Notebooks.

Just to clarify, my format does not use YAML!

In the documentation (Motivation — Jupyter Format version 67bf141) I’m describing how YAML is almost a solution, but not quite, and that we need a new format.

Therefore, my suggested format is a custom non-YAML format. It is very simple and very easy to parse. I would love to hear ideas how to make it even simpler and easier to parse!

To keep complexity low and interoperability high, all metadata is stored as JSON within my suggested format. It probably sounds complicated, but you should just convert a few of your notebooks to my suggested format, and many things will become clearer.

There are a few things:

  • Like in all Markdown-related formats, special strings (e.g. backtick fences) are used as markup elements, which means they cannot be contained verbatim in the actual content.
  • Outputs are not supported (AFAIK), therefore there is no 2-way lossless conversion.
  • The markup elements are still quite bulky and distracting; my approach leads to (of course IMHO) a cleaner file that’s nicer to look at.

I fully agree.
I’m not suggesting to abolish or fully replace the .ipynb format.
I’m also not suggesting to change the data model.
I’m just suggesting an additional serialization format that’s more human readable (and to some limited extent even human-editable) and better suited for diffing.

Absolutely!

I would like to suggest the .jupyter extension.

I see .ipynb as the (still relevant) format from the good old times where only IPython existed. Creating a new serialization format would give us the possibility to also introduce a more modern file name suffix that is not Python-specific: .jupyter.

Of course.
My suggested format is fully compatible.

In fact, it only changes the top-level structure of the serialized file; all the gory details in the notebook metadata are kept as JSON. So even if the metadata structure changes between versions, my format will stay fully compatible!

Only very fundamental changes (e.g. adding a new cell type) would affect my format.

I tried YAML and I found that it’s not really suitable, see Motivation — Jupyter Format version 67bf141.

I think we should allow both.
Diffing tools can be extremely useful.
And they will stay useful even if we get a better “diffable” serialization format.

This is the most important aspect for me personally.
I really like using Github’s review tools, but the JSON-based serialization format, while somewhat readable, still has too much cruft when viewed in such a context. I assume the same is true for other non-Github services as well.

My suggested format is supposed to generate minimal noise in the review process.

I agree. That’s a fundamental problem of storing binary stuff in a text file.
But there are still different degrees of this problem.
My suggested format will not solve everything, but IMHO it would still be a significant improvement over the current JSON format (for many, but of course not all use cases).

Also, the current JSON format is unnecessarily human-unfriendly even when the outputs are removed from the file.

What is missing in my proposed format (GitHub - mgeier/jupyter-format: An Experimental New Storage Format For Jupyter Notebooks)?

Yes, absolutely: GitHub - mgeier/jupyter-format: An Experimental New Storage Format For Jupyter Notebooks.

I would love to hear some feedback!

Contents manager is already available: API Documentation — Jupyter Format version 67bf141

There is also a tool for batch-converting the whole history of a repository at once: API Documentation — Jupyter Format version 67bf141

For that use case (but not necessarily using Github Pages), I’ve created GitHub - spatialaudio/nbsphinx: 📒 Sphinx source parser for Jupyter notebooks.

The documentation of my suggested format is created with Sphinx and nbsphinx: https://jupyter-format.readthedocs.io/.

I also am just paging in - if I missed anyone in the thread, forgive me.

TL;DR: It’s my opinion that we need to meet users where they are around many of these things - not GitHub, but Git. To that end, ultimately, if we can’t make diff-based tooling work, I think we’re going to have a bad time™ for a long time. And FBOW, JSON-based notebooks do not play well with diff tools.

@choldgraf

decided that JSON formatted notebooks was the solution O(forever)
it may mean that the two formats don’t need complete parity. E.g., sometimes it’s useful to have a big machine-readable blob of data like JSON, other times it’s useful to have something human-readable, but there are tradeoffs between both.

There’s something that gives me the willies about this. It’s basically all the problems you can have with caching and layering transformations on top. I’m NOT saying replace the default format but… I’m not not saying that either? I just hate the idea of two formats that users would interact with.

@choldgraf

specifically, being able to collab on a notebook - JSON’s flow is just too hard for that
saulshanabrook has been recently spearheading an effort to figure out a data model etc for real-time collaboration in Jupyter environments.

I’d love to learn more! The thing is, FBOW, 35% of data scientists use GitHub today (JetBrains & internal surveys), and Git (not necessarily GitHub) seems to be where people collaborate. If we forced them to do RTC only through Jupyter, I feel like that’d leave out a bunch of folks.

@choldgraf

I think step one is having a non-GitHub person lead
I’m less-concerned that the lead is or isn’t from any one particular company. I think what’s more important is that the process of discussion, deciding, etc is one that is open and inclusive, and that many stakeholders in the jupyter community (other companies, open source contributors, researchers and educators, etc) have a chance to weigh in on.

I hear you, but I think having someone NOT at GH is the safest default. We’re just going to (unintentionally) be biased with our view of the world and someone else leading will keep everything honest.

@khinsen
The real problem is the data model. For Jupyter, it’s roughly (neglecting metadata) a list of cells, each cell being one of “text”, “code”, and “output”.

I agree with this wholeheartedly. TBH, I’ve ALWAYS disliked the fact that compute output is in the same file as the control input. It just feels wrong - and it’s probably as big an issue for diffing as anything. I’m an outsider though :slight_smile:

@khinsen
The core of my old JEP proposal is splitting the notebook into two separate data items: the human input and the execution trace, which would be a ledger (append-only list) storing input sent to the kernel and output received back from the kernel.

STRONG endorse.

Trying to read between the lines of your responses, it seems like you feel the current experience you have for rendering and diffing notebooks is as good as it can get without changing the format? Or just that it seems like a waste of time to improve it, when you could have an even bigger improvement by changing the format?

This is the issue, IMO. I do not think it is reasonably possible to create a format based on JSON that:

  • Has clean diffs
  • Is human readable/interactable in a MOSTLY error free way
  • Doesn’t let outputs/etc bleed through
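The first bullet is easy to demonstrate with nothing but the standard library: re-running a single cell, with no source change at all, already dirties the diff. (The notebook dict below is a simplified stand-in for a real `.ipynb` file.)

```python
# Demonstration: re-executing a cell changes execution_count, which pollutes
# the JSON diff even though the human-edited source is untouched.
import copy
import difflib
import json

before = {
    "cells": [{
        "cell_type": "code",
        "source": ["print(1 + 1)\n"],
        "execution_count": 1,
        "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}],
    }]
}
after = copy.deepcopy(before)
after["cells"][0]["execution_count"] = 2  # only "change": the cell was re-run

diff = list(difflib.unified_diff(
    json.dumps(before, indent=1).splitlines(),
    json.dumps(after, indent=1).splitlines(),
    lineterm="",
))
# Keep only the real +/- lines, dropping the "---"/"+++" file headers:
changed = [l for l in diff
           if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
```

The source lines are byte-identical, yet `changed` is non-empty - every one of its lines is `execution_count` churn that a reviewer has to scroll past.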

AGAIN, I’m new and have probably missed plenty. Happy to learn!

Obviously, this doesn’t help that much with commenting… But it seems like you could figure out a way to deal with comments on notebook files, it would just require a little bit of finagling on your end.

This is not and will not be possible on the order of the heat death of the universe :frowning: The reality is that everything about Git (let alone GitHub) is based on diff, and expecting to change everyone’s flow to support a use case of (ask for a diff → pass through a parser → view/comment/fix diff → round trip → record changes back in the original file) is, sadly, just never going to happen. Again, this is my opinion as a coder/PM, not as a MSFT employee.

Going back to the user experience, I see functionalities of rising complexity:

  • View Notebook (needs the requested visualisation libraries)
  • View Notebook Diff in a 2 Tab way
  • Comment on a Notebook
  • Comment on a Notebook Pull Request
  • Executable Notebook (needs kernel)
  • Reproducible Notebook = Execute a Notebook + have needed datasets mounted.

I guess a single format (json, markdown…) supporting all those functionalities is hard to find.

I think this is the right list! However, I’d argue that ANY format other than JSON already has this built in. JSON does as well, btw, it’s just how we use JSON that doesn’t. :frowning:

@choldgraf
One quick thought there: if it’s acceptable to have a bunch of ugly stuff committed to GitHub so long as it doesn’t change often, then couldn’t many of these problems be resolved by re-working how things are stored internally in an ipynb file? E.g. if the outputs or maybe even the metadata were always in a separate list at the bottom of the JSON structure, and referenced in the list of cells w/ the code etc, then most ipynb diffs as “regular text files” would look relatively reasonable.

I like this idea a lot! Though, realistically, do they NOT change that often? I just don’t know.

I’d love to have a table that represents the different solutions vs. the different tradeoffs. I think most of the discussion here is about which format one wants to use, instead of which tradeoffs are acceptable in a given condition. Even just a list of those tradeoffs would be useful, to say “oh, format X lost tradeoff T, how can we get it back”.

I would also love for people to think at a potentially bigger level than “a single notebook”. For example, the PG contents manager allows storing notebooks in a database, which potentially allows sharing a cell between multiple notebooks, or “replaying” the edits to a notebook. And a notebook is also sometimes just an “interface” to interact with code.

I would also encourage thinking about “saving” and “loading” notebooks more as “exporting” and “importing” them, to remove the implicit cognitive bias and limitations attached to thinking that the on-disk format contains all the information of a notebook session. A notebook file is an “exchange” format, and it’s ok for the exchange format to be different from the one you store locally in your work environment.

A couple of notes on the “.ipynb” JSON format, and why it is inherently incompatible with Git/GitHub. Remember that Jupyter is from 2012, a time when GitHub was just starting to grow (it was created in 2008), and the main communication channel in the Jupyter community at that time was emailing files back and forth. We settled on a single JSON file for a few reasons:

  1. It’s a single file; you can send it to someone and that’s all they need.
  2. Merging code changes without rerunning is a recipe for getting code and output out of sync. Diffing is interesting, but if there is a merge, there should either:
    • be no conflict, or
    • be no output.
      (I think this should be kept in the new formats; there are ways to make sure of that.)
  3. JSON is ubiquitous and easy to read regardless of the language, and we want notebooks to remain archivable. A custom, hard-to-implement format has a high chance of dying.
  4. Regardless of which edits you made, there is minimal hidden state, and a deterministic computation would give you a deterministic notebook.

Now if you think about “exporting” notebooks, it is perfectly fine to change the tradeoffs:

  • No output storing, single file, and allow merging -> jupyter_text
  • Remove human readability, and no diff merging, but space efficient -> Binary blob.
  • Remove “single file”, and no consistency of input/output -> store multiple files in a directory tree.
  • Comments on notebooks from many users, but no files -> PostGres backend; export to file loses comments.

Personally I would be in favor of having a default opaque way of storing notebooks, but an easy(ier) way to export/import them.

5 Likes

Can I ask why you’d like the intermediate (or permanent?) format to be opaque and unusable by anything but a machine? I just worry that lands us where we are today - where notebooks work great inside the Jupyter ecosystem, but are essentially unusable by the majority of GNU tools without JSON importing and a whole lot of munging. A flat text format offers a huge amount of additional tooling - awk, sed, wc, etc.

I agree the final format should be in a trivial/ubiquitous format - anything custom needs to be in core jupyter or (ideally) core languages. JSON, YAML, markdown are ideal for this.

FWIW, given the state of the current tech toolchain, my vote is for #1 (no output storing, single file, allow merging). It’s just never felt right to me that outputs are included at all - certainly not in the same file. If you needed outputs, that’d be a different use case - we’re merging editing and outputting into a single file format, and I think that hurts our options.

I’m actually in the camp of storing outputs. This opens up a lot of use cases like reports etc.

That said, I think this is a very valid conversation to have, but maybe separately, as it’s not exactly related to this topic? My goal here is to keep all the information from the ipynb file and just change the on-disk representation. Let’s not diverge too far from this topic in the thread, please.

Btw, we’ve created the repo we’ll be working in, https://github.com/machine-learning-apps/mystify . There is hardly anything there right now, but we’ll be working on the prototype over the next several days.

Oh, I’m not AGAINST storing outputs at all! I just think munging them together in a single file is mixing concerns, and probably isn’t great.

I’m ok delaying this conversation since our stated goal is 100% lossless roundtripping.

2 Likes

@inc0 There are indeed good use cases that involve storing computed output in a versioned repository. My point is that it’s not a good idea to mix input and output and thus make it impossible to version (and diff, and patch, and merge) only the inputs.

BTW, the worst part is merging. Merging two diffs on .ipynb files is semantically a four-way merge: two streams of human edits plus two incomplete execution traces. And it’s the human doing the merge who is supposed to judge the cohesion of the four streams.

1 Like

@mgeier How does your GitHub - mgeier/jupyter-format: An Experimental New Storage Format For Jupyter Notebooks implementation compare with GitHub - executablebooks/MyST-Parser: An extended commonmark compliant parser, with bridges to docutils/sphinx?

Same question also about your sphinx extension (GitHub - spatialaudio/nbsphinx: 📒 Sphinx source parser for Jupyter notebooks) compared with GitHub - jupyter/jupyter-sphinx: Sphinx extension for rendering of Jupyter interactive widgets.?

The notion of different formats for different reasons is a powerful one I think:

If you “restart kernel run all / export as ipynb” the document is a complete record of code + outputs; it doesn’t mean you can replicate the run (you’d need a definition of the computational environment for that, which is the subject of other discussions (eg guix)) but it does mean you have something that is shareable and works as a discussion object.

One workflow I’ve found useful is jupytext pairing either markdown or python files with an ipynb doc in a hidden directory; this gives me a simple diff-able text file, with a complete view (text + code + outputs) when I open the doc in a notebook UI. So here I am using the md/py file for the text-code and then a separate file to give me a view of the outputs.

It’s also worth noting that my view of the notebook may differ in semantics from the content of the code in the text file, and the visual appearance may differ too. For example:

  • I use the active-ipynb tag a lot on code cells that lets me see and run the code cell normally in the notebook view, but comments it out in the py/md text files.
  • I use extensions to dynamically style the view I have of a notebook; this might include collapsed sections, dynamic application of style based on tag values etc.

To render the “styled” view of the notebook would require either the previewer to run the same presentation extensions I run over the view, or me to render the document as eg html via a custom export template and share that.

I think this is a really powerful idea (cf. nbexplode, sqlitebiter etc.).

Pondering differently granular table definitions that could support such a backend (where documents are essentially just views over the database) and comparing them to structural elements in a text representation could be interesting…

2 Likes

We have prototyped some commenting in notebooks. Not to get too far into the nitty gritty, but could you elaborate here on what the blockers are?

Notebooks still have line numbers; they are just divided up into a number of cells. So it seems like there should be some way to lay them out such that commenting becomes possible, no?


Ah I see, so this is an issue with Github’s view of doing commenting. It makes sense! I get that special casing one file format over all others in the world is a big ask :slight_smile: But to me whenever I see a problem that could be solved internally technically, that could be the easier route than making some large change in an existing community (just because of the number of people involved!) For example, if GitHub figured out a way to do this, then maybe it would be a bit painful for you all to maintain internally, but it would cause a lot of ease for existing users in the community.

I am not quite clear about the Git/Github grouping here. As brought up earlier, nbdime has support for terminal based and GUI based diffing of Jupyter notebooks. So it seems like that works for local git usage?

So it seems like the driving motivation here is not local people on their machine with git, but really Github’s integration? Both for viewing, diffing, and commenting?

I do understand the point you are making, that the solution here we have is globally suboptimal, and addressing it would fix a bunch of use cases, not just GitHub’s. Just trying to play a bit of devil’s advocate, like I would if I was also trying to solve this need internally, with an understanding of the time scales and costs of large community change.


It’s interesting how this work also overlaps with the current explorations around a richer real time data model for Jupyter clients, by normalizing the notebook model in a relational way.

I don’t want to take this topic too far off track, so I opened an issue on the RTC repo to continue discussion of possible overlaps.

To be clear, this is Git & GNU’s way of diffing, not GitHub :slight_smile: We just inherit it - and with it a whole ecosystem of tooling becomes available. Special casing means all of these other tools are not available - which I think would be a real miss. To be clear, this is the Jupyter community’s decision to make!

The failure mode for something like nbdime is that it’s a one-way format (it doesn’t support round-tripping, AFAICT). However, even if it did support round-tripping, that wouldn’t work either, because the comments would only be on the transformed artifact, not the original, meaning we’d have to come up with an entirely separate way to pull comments back in. Even further, even if it were perfectly easy to round-trip AND preserve comments, that’s still not great, because you’re just developing a new intermediate format, meaning it’s just a pernicious form of additional caching - e.g. one more thing that can get out of sync and out of date.

Whatever we choose should have these core goals (IFF THE JUPYTER COMMUNITY DECIDES THIS IS WORTH ADDRESSING):

  1. GNU tooling support
  2. Native support for diffing/commenting/etc (no intermediate formats)
  3. Lossless conversion to/from .ipynb (this was @choldgraf’s point; I completely agree)

I’m sure there are some other ones.

1 Like