Sure - my take is that in the ~96 hours that make up this thread, we’ve ALREADY uncovered a million things that have been done before, edge use cases that make certain formats unusable/unreliable, goals that the project had, etc. And we’re just at the start - we’re going to have tons more uncovered as we turn over more rocks. And, because this is such a rich community with so many opinions, anyone who doesn’t DEEPLY understand all these issues will always miss something (core people will also miss stuff, just … less, and they can do it without stepping on toes because they’ve put in a decade of work with the community). It’s got to be someone who knows all of these issues like the back of his/her hand.
I’d love to have @inc0 do a JEP, but I’ll be honest, as great as he is, we’re all outsiders. This could end up being a core change to the way data scientists use Jupyter (again, without changing the .ipynb format, but if this is done right, this new format could become the new default) - it really should be led by someone from core.
How about regrouping a bit and making a first step towards a JEP? My proposal would be that someone who favours the idea of a “new alternative format for storing notebooks on disk” (we’ve touched on variations on this in this thread, but let’s focus on something) write a set of features that this new format should have and should not have.
Then we can start lining up the proposed ideas and existing alternative formats and check them off against this list of features/out-of-scope requirements.
This would focus the discussion and be an important building block of a JEP (if someone wants to write one).
This really is fantastic, and I completely agree. Makes the diffing even easier since one regex could cover the entire block.
The beauty of this is it starts to treat the notebook format as it should be - a full data structure - rather than what it currently is, which is just a flow from top to bottom of, effectively, text.
Wow, just started to dive into this discussion. It is great to see interest around this topic. Given all the comments, I am not sure I can provide much new insight. Maybe a bit of lost history… The first notebook format we created back in the summer of 2006 (Min Ragan-Kelley did this while working as my intern) was a relational database (I think it was sqlite) with tables for notebooks, cells, inputs, outputs, etc. Our vision at the time was that we might want to build rich, cross notebook search into the UX. At the time it was overly complex relative to everything else we had and we didn’t go anywhere with it. The next notebook format was done in parallel with the JSON format in the summer of 2011 and was XML. Don’t remember why we thought that was important, but we quickly realized it didn’t provide anything over the JSON format and was harder to maintain. But the biggest shift moving from the SQL DB to the JSON/XML formats was to really embrace the idea that the notebook is just a document on the filesystem.
Regardless of the details of a new format, the biggest challenge will be dealing with the millions of notebooks users already have. But a notebook format that solves the pain points described in this thread would definitely incentivize people to adopt the new format.
Thanks for the feedback @ellisonbg - and for everything you’ve done!
May I ask why? That is to say, if the new format is valuable and provides tooling/workflow that is more in line with how people work today, we could just leave it in the user’s hands to convert. Put another way, we do absolutely nothing to the existing .ipynb format (and continue supporting it as the default), but if people need line-based diffing, inline commenting, git-based workflows, GNU tooling, integration with hosted Git visualization, etc., they make one change (e.g. “Save as…”, enter a field in a cell, whatever) and they get all the benefit.
Then if they need to go back to the original notebook format, they could simply remove the configuration setting or “save as” in the old format, and they’re golden. We have an explicit goal to be lossless back and forth.
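The "lossless back and forth" goal can be checked mechanically. Here is a minimal sketch, assuming a purely hypothetical toy text format: `to_text` and `from_text` are illustrative names, not part of any proposal in this thread, and the sketch assumes each cell's `source` is a single string as in the in-memory notebook model.

```python
import json

def to_text(nb):
    """Serialize a notebook dict into a line-oriented toy format (hypothetical)."""
    top_level = {k: v for k, v in nb.items() if k != "cells"}
    lines = ["# nb " + json.dumps(top_level, sort_keys=True)]
    for cell in nb["cells"]:
        source_lines = cell["source"].split("\n")
        # Everything except the source goes into a one-line JSON header.
        header = {k: v for k, v in cell.items() if k != "source"}
        # Storing the line count means the source never has to be escaped.
        header["n_lines"] = len(source_lines)
        lines.append("# cell " + json.dumps(header, sort_keys=True))
        lines.extend(source_lines)
    return "\n".join(lines)

def from_text(text):
    """Parse the toy format back into a notebook dict."""
    lines = text.split("\n")
    nb = json.loads(lines[0][len("# nb "):])
    nb["cells"] = []
    i = 1
    while i < len(lines):
        cell = json.loads(lines[i][len("# cell "):])
        n_lines = cell.pop("n_lines")
        cell["source"] = "\n".join(lines[i + 1 : i + 1 + n_lines])
        nb["cells"].append(cell)
        i += 1 + n_lines
    return nb

def is_lossless(nb):
    """True if converting to text and back reproduces the notebook exactly."""
    return from_text(to_text(nb)) == nb
```

A real format would make the headers far more readable than this; the point is only that the round-trip property is easy to state as a test and run over a corpus of existing notebooks.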
I don’t want to take anything away from the existing format! It’s great - but I worry that trying to solve two problems with one technology makes this challenge much harder.
It’s great to see such a lively debate! I can understand the disadvantages of the existing format, but what’s not yet clear to me is why all of the many other existing formats are unsuitable, though admittedly I may have missed it in the other 86 posts.
Has anyone already done, or would anyone be willing to do, a write-up comparing all the existing formats and/or related tools? Something like tables of features, good/bad points, etc. I think this would be helpful for anyone joining this conversation late, and could also provide useful input to a JEP- the JEP might even write itself! Even if you already have a favourite format/tool by looking at others you may find ideas you hadn’t thought of, which you might want to include in a new format.
I think it’s a good idea to start building some support from the community and getting feedback informally (e.g. in conversations like these!) before beginning a full-on JEP process, so +1 to getting feedback here first!
I really like the shape this discussion is taking. Just wanted to chip in and add my two cents:
There are different use cases and the proposed format would need to take them into consideration
I believe the goal should not be to replace the .ipynb format. This is clear from the above discussion, wanted to reiterate.
I see the text representation as the source code “compiling” to .ipynb. This means that the output cells could be ignored for diff purposes. This is the way any other programming language is diffed.
My personal preference would be a Markdown-based format. More specifically, Pandoc’s Markdown format (already supported by jupytext). This is subjective, but I feel that the tooling around MyST is more complex. Besides, nbconvert already incorporates pandoc and it could be leveraged in the future.
Those are two quite different things. One is a serialization format for Jupyter notebooks (implemented in pure Python), the other is a tool for parsing a CommonMark dialect to be used with docutils/Sphinx (AFAICT).
I haven’t used jupyter-sphinx, but according to their docs they provide a jupyter-execute directive for Sphinx (to be used in a reStructuredText file). You can use multiple of those on a page and I guess the local variables are available in subsequent directives (but I’m not sure about that). In the end, the resulting page looks somewhat like a notebook even though the source file isn’t a notebook.
In nbsphinx, you can use Jupyter notebooks directly as source files. You don’t have to use reStructuredText at all (but you still can, if you want). You can use Notebooks stored as .ipynb files, but you can also use any format supported by Jupytext (see Custom Notebook Formats — nbsphinx version 0.7.1). In fact, you can use any tool that can read some file and produce a nbformat-compatible in-memory representation (such as my own https://jupyter-format.readthedocs.io/).
Thanks for actually trying it out!
And yes indeed, the Binder link can be used as a quick way to play with my suggested format. Thanks Binder!
Sure, that’s expected for a format that stores inline outputs. IMHO it’s still more human-readable than the raw JSON of an .ipynb file.
Theoretically, the cell outputs could also be stored at the end of the file (as already suggested in this thread) or in a separate file (also suggested somewhere around here), but this would itself add complexity to the whole situation, which might or might not be desired.
I think the most straightforward way for diffing and merging is to remove the outputs beforehand, and I think for that use case my format is quite well suited.
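Stripping outputs before diffing can be sketched with the standard library alone. This is only an illustration operating on a notebook already loaded as a dict; `strip_outputs` and `diff_sources` are hypothetical helper names, and the diff here is taken over pretty-printed JSON rather than over any of the text formats discussed in this thread.

```python
import copy
import difflib
import json

def strip_outputs(nb):
    """Return a copy of a notebook dict with outputs and execution counts cleared."""
    stripped = copy.deepcopy(nb)  # leave the caller's notebook untouched
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped

def diff_sources(old_nb, new_nb):
    """Unified diff of two notebooks after their outputs have been stripped."""
    old = json.dumps(strip_outputs(old_nb), indent=1, sort_keys=True).splitlines()
    new = json.dumps(strip_outputs(new_nb), indent=1, sort_keys=True).splitlines()
    return list(difflib.unified_diff(old, new, "old", "new", lineterm=""))
```

With this, two notebooks that differ only in their outputs or execution counts produce an empty diff, which is the behaviour wanted for version control.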
However, I didn’t want to throw away the possibility to store outputs when desired (which I also do regularly). See also Motivation — Jupyter Format version 67bf141.
I don’t know whether there is a more straightforward way, but you can use “File → Save Notebook As …” and use a file name with the suffix .jupyter. It should also work if you simply rename the file using the suffix .jupyter and then hit “Save” again.
This could probably be improved by creating a Jupyter/JupyterLab extension (similar to how Jupytext does it), but I didn’t have time for that yet.
Thanks!
If you have ideas for improvements, please let me know!
But in general, I agree, merging conflicting outputs (e.g. PNG plots) is basically impossible, at best you can select one out of two (or more?) alternatives and throw away the rest.
Thanks for the concrete example!
I’ve taken the liberty of including it here for easy comparison:
Small update on progress (outside of JEP) - we’re throwing formats at the wall and seeing what sticks. Current objective is to figure out the actual format + conversion code between mystnb <-> ipynb.
Late to the thread, but wanted to chime in with a few brief comments. I work on Google Colab.
Outputs - I believe it’s critical to be able to render notebooks with outputs consistently over long periods of time and across notebook viewers. I believe that viewing of notebooks should remain possible long after the execution of those notebooks is no longer possible. Colab works very hard to ensure long-term rendering fidelity and I am very interested in ensuring output rendering works across more notebook viewing applications.
Comments - comments often use separate storage so the commenter does not need to have write access to the document they are commenting on.
Improving this would be excellent - it’s a pain in the butt to figure out what kinds of goats you have to sacrifice to get <insert your favorite web-based plotting library> to work in other interfaces.
I see the desire for improved notebook diffs as an initial step towards better integration into larger project workflows.
Consider integration with existing language tools:
Static type checkers such as mypy
Code formatters
Linters
Rich semantic code viewing (Github’s ‘definition’ and ‘references’ links when viewing python files)
The path towards better project integration will be difficult with a notebook-specific file format. It’s not impossible, but consider native source files (.py) if the rationale is that the format should change because it is too difficult to change the tools (diff).
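To make the "native source files" option concrete, here is a rough sketch of pulling the code cells of a notebook dict into a single Python source string that a linter, formatter, or type checker could consume. `code_cells_to_script` is a hypothetical helper; the `# %%` separator is a cell-marker convention that several editors and Jupytext recognise. Cells containing IPython magics would not be valid Python and are not handled here.

```python
def code_cells_to_script(nb):
    """Join the code cells of a notebook dict into one Python source string."""
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells are dropped in this sketch
        source = cell["source"]
        if isinstance(source, list):
            # On disk, .ipynb stores the source as a list of lines.
            source = "".join(source)
        chunks.append("# %%\n" + source.rstrip("\n") + "\n")
    return "\n".join(chunks)
```

The resulting string can be written to a `.py` file and handed to existing tools unchanged. Mapping their line numbers back to notebook cells is the part that needs real design work.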
Jupytext Markdown, R Markdown, Pandoc Markdown, MyST Markdown
Maybe post additional features here, prefixed with e.g.:
REQUIRED:
NICETOHAVE:
Linked Data for Linked Research
Neither YAML-LD nor Markdown-LD are yet things. JSON-LD metadata would be great to have; as these are graphs of resources that link to other resources (#LinkedData, #LinkedResearch, #LinkedReproducibility). There have been some discussions regarding notebook-level and cell-level linked data.
STATUSQUO: Embedded JSON is parseable by search engines with just a JSON parser
Add it to Jupytext
Is there any reason that any new (pre-JEP) format cannot be prototyped in Jupytext?
Your jupyter format looks very interesting. I’m new to the debate between formats though I’m a frequent Jupyter user and have been around the community for a long while. I think your approach has promise.
I would be eager to see your format take advantage of the fact that it is defining a custom (though human readable) format and look into at least 2 things:
collect all cell outputs at the end of the file (along with storing a hash of the input cell that produced it).
allow storing multiple input cells in the file (with an easy tool to remove all but the latest). In this way, you could actually replay someone’s workflow as they modified cells and re-ran them.
Have a section that allows storing the execution order (a simple list of cells could do it).
Because you define the format, you could add these kinds of carrots that would allow very capable tools and standards to be built around the notebook concepts.
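The first and third suggestions above can be sketched together. This is a rough illustration only, not part of any existing format: `source_hash` and `split_outputs` are hypothetical names, and SHA-256 is an arbitrary choice of hash. Outputs are keyed by the hash of the input that produced them, and the execution order is recovered from the cells' `execution_count`.

```python
import hashlib

def source_hash(source):
    """Hash of a cell's input text, used as the key for its outputs."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()

def split_outputs(nb):
    """Split a notebook dict into inputs, outputs keyed by input hash, and execution order."""
    inputs, outputs, executed = [], {}, []
    for cell in nb["cells"]:
        entry = {"cell_type": cell["cell_type"], "source": cell["source"]}
        if cell["cell_type"] == "code":
            digest = source_hash(cell["source"])
            entry["hash"] = digest
            if cell.get("outputs"):
                # Note: two cells with identical source would collide here.
                outputs[digest] = cell["outputs"]
            if cell.get("execution_count") is not None:
                executed.append((cell["execution_count"], digest))
        inputs.append(entry)
    execution_order = [digest for _, digest in sorted(executed)]
    return inputs, outputs, execution_order
```

If a cell is edited after it was run, its new hash no longer appears among the stored outputs, so a tool can flag those outputs as stale. Keeping older inputs around, as in the second suggestion, would need more than this.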
I don’t really want to define a “new format” in this sense, though.
My goal with GitHub - mgeier/jupyter-format: An Experimental New Storage Format For Jupyter Notebooks is just to provide a serialization format for the current notebook data model. It is supposed to be two-way losslessly compatible with the .ipynb serialization format. But that also means that it cannot have any new features that are not available with .ipynb.
Ideally, my suggested .jupyter format (or something along those lines, probably very much improved) would be adopted by the Jupyter project and would co-evolve with the data model (and with the existing .ipynb serialization format).
I didn’t really try this and don’t have any practical experience with such a feature. I have the feeling that it might increase the “human readability”, but it might actually decrease the “diffability” (and “mergeability”). IDs and hashes are inherently not diffable, right?
Apart from that, it will definitely increase the complexity of the format, so I would only want to do it if there is a real improvement to be made.
Which do you think is more important: “human readability” or “diffability”?
This sounds like a nice feature, but this should be implemented in the Jupyter notebook data model, not in the serialization format, right?
Again, this is a feature of the data model, right?