Jupyter and GitHub - alternative file format

aronchick · July 1, 2020, 8:47pm

Sure - my take is that in the ~96 hours that make up this thread, we’ve ALREADY uncovered a million things that have been done before, edge use cases that make certain formats unusable/unreliable, goals that the project had, etc. And we’re just at the start - we’re going to uncover tons more as we turn over more rocks. And, because this is such a rich community with so many opinions, anyone who doesn’t DEEPLY understand all these issues will always miss something (core people will also miss stuff, just … less, and they can do it without stepping on toes because they’ve put in a decade of work with the community). It’s got to be someone who knows all of these issues like the back of his/her hand.

I’d love to have @inc0 do a JEP, but I’ll be honest, as great as he is, we’re all outsiders. This could end up being a core change to the way data scientists use Jupyter (again, without changing the .ipynb format, but if this is done right, this new format could become the new default) - it really should be led by someone from core.

betatim · July 1, 2020, 8:56pm

How about regrouping a bit and making a first step towards a JEP? My proposal would be that someone who favours the idea of a “new alternative format for storing notebooks on disk” (we’ve touched on variations on this in this thread, but let’s focus on something) write a set of features that this new format should have and should not have.

Then we can start lining up the proposed ideas and existing alternative formats and check them off against this list of features/out-of-scope requirements.

This would focus the discussion and be an important building block of a JEP (if someone wants to write one).

aronchick · July 1, 2020, 9:14pm

This really is fantastic, and I completely agree. Makes the diffing even easier since one regex could cover the entire block.

The beauty of this is that it starts to treat the notebook format as it should be - a full data structure - rather than what it currently is, which is just a flow from top to bottom of, effectively, text.

inc0 · July 2, 2020, 1:25am

I’ve started to prototype a format here: https://github.com/machine-learning-apps/mystify/blob/8cfd8e776ea5cd6dbc6d1bfde8843e4f24a240ad/examples/example_notebook.mystnb <- this is a simplistic notebook corresponding to the ipynb with the same name. Now working towards more complex cases.

ellisonbg · July 2, 2020, 3:36am

Wow, just started to dive into this discussion. It is great to see interest around this topic. Given all the comments, I am not sure I can provide much new insight. Maybe a bit of lost history… The first notebook format we created back in the summer of 2006 (Min Ragan-Kelley did this while working as my intern) was a relational database (I think it was sqlite) with tables for notebooks, cells, inputs, outputs, etc. Our vision at the time was that we might want to build rich, cross notebook search into the UX. At the time it was overly complex relative to everything else we had and we didn’t go anywhere with it. The next notebook format was done in parallel with the JSON format in the summer of 2011 and was XML. Don’t remember why we thought that was important, but we quickly realized it didn’t provide anything over the JSON format and was harder to maintain. But the biggest shift moving from the SQL DB to the JSON/XML formats was to really embrace the idea that the notebook is just a document on the filesystem.

Regardless of the details of a new format, the biggest challenge will be dealing with the millions of notebooks users already have. But a notebook format that solves the pain points described in this thread would definitely incentivize people to adopt the new format.

aronchick · July 2, 2020, 9:25am

Thanks for the feedback @ellisonbg - and for everything you’ve done!

May I ask why? That is to say, if the new format is valuable and provides tooling/workflow that is more in line with how people work today, we could just leave it in the user’s hands to convert. Put another way, we do absolutely nothing to the existing .ipynb format (and continue supporting it as the default), but if people need line-based diffing, inline commenting, git-based workflows, GNU tooling, integration with hosted Git visualization, etc., they make one change (e.g. “Save as…”, enter a field in a cell, whatever) and they get all the benefit.

Then if they need to go back to the original notebook format, they could simply remove the configuration setting or “save as” in the old format, and they’re golden. We have an explicit goal to be lossless back and forth.

I don’t want to take anything away from the existing format! It’s great - but I worry that trying to solve two problems with one technology makes this challenge much harder.

manics · July 2, 2020, 3:11pm

It’s great to see such a lively debate! I can understand the disadvantages of the existing format, but what’s not yet clear to me is why all of the many other existing formats are unsuitable, though admittedly I may have missed it in the other 86 posts.

Has anyone already done, or would anyone be willing to do, a write-up comparing all the existing formats and/or related tools? Something like tables of features, good/bad points, etc. I think this would be helpful for anyone joining this conversation late, and could also provide useful input to a JEP - the JEP might even write itself! Even if you already have a favourite format/tool, by looking at others you may find ideas you hadn’t thought of, which you might want to include in a new format.

choldgraf · July 2, 2020, 4:43pm

@manics I think that’s a cool idea. Off the top of my head and from this thread, there are:

- The Jupyter Notebook format (https://nbformat.readthedocs.io/en/latest/format_description.html)
- Any jupytext-compatible file format (https://jupytext.readthedocs.io/en/latest/formats.html)
- MyST Notebooks (https://myst-nb.readthedocs.io/en/latest/use/markdown.html)
- The jupyter-format format (https://jupyter-format.readthedocs.io/en/latest/)

And there are obviously non-Jupyter notebooks like Matlab, RMarkdown notebooks, etc., though I’m not sure if we’d want to include them on the list.

aronchick · July 2, 2020, 5:03pm

I’m working on a tentative / very gross / toe in the water JEP doc as we speak. I’ll paste it back here tomorrow?

choldgraf · July 2, 2020, 5:22pm

I look forward to reading it! Just a note in case you haven’t caught it: I’d recommend using this JEP template as a guide:

https://jupyter.org/enhancement-proposals/jupyter-enhancement-proposal-guidelines/JEP-TEMPLATE.html

I think it’s a good idea to start building some support from the community and getting feedback informally (e.g. in conversations like these!) before beginning a full-on JEP process, so +1 to getting feedback here first!

teucer · July 3, 2020, 9:27am

I really like the shape this discussion is taking. Just wanted to chip in and add my two cents:

- There are different use cases and the proposed format would need to take them into consideration.
- I believe the goal should not be to replace the .ipynb format. This is clear from the above discussion, but I wanted to reiterate it.
- I see the text representation as the source code “compiling” to .ipynb. This means that the output cells could be ignored for diff purposes. This is the way any other programming language is diffed.
- My personal preference would be a markdown-based format. More specifically, Pandoc’s Markdown format (already supported by jupytext). This is subjective, but I feel that the tooling around MyST is more complex. Besides, nbconvert already incorporates pandoc and it could be leveraged in the future.
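
To make the “ignore outputs when diffing” point concrete, here is a minimal sketch using only the Python standard library. It assumes notebooks are already loaded as nbformat-4-style dicts; the helper names `source_lines` and `diff_sources` are made up for this illustration and are not part of nbformat, nbdime, or jupytext.

```python
import difflib


def source_lines(notebook: dict) -> list[str]:
    """Flatten cell sources into lines, ignoring outputs and execution counts."""
    lines = []
    for cell in notebook.get("cells", []):
        # A marker line keeps cell boundaries visible in the diff (hypothetical convention).
        lines.append(f"# --- {cell['cell_type']} cell ---")
        source = cell.get("source", "")
        # On disk, nbformat allows "source" to be a string or a list of strings.
        if isinstance(source, list):
            source = "".join(source)
        lines.extend(source.splitlines())
    return lines


def diff_sources(old: dict, new: dict) -> list[str]:
    """Unified diff of two notebooks that only looks at cell sources."""
    return list(
        difflib.unified_diff(
            source_lines(old),
            source_lines(new),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


def code_cell(source: str, text: str) -> dict:
    """Build a small code cell with one stream output, for the demo below."""
    return {
        "cell_type": "code",
        "source": source,
        "outputs": [{"output_type": "stream", "name": "stdout", "text": text}],
    }


old = {"cells": [code_cell("print('foo')", "foo\n")]}
rerun = {"cells": [code_cell("print('foo')", "something else\n")]}
edited = {"cells": [code_cell("print('bar')", "bar\n")]}

output_only_diff = diff_sources(old, rerun)   # outputs differ, sources do not
source_diff = diff_sources(old, edited)       # the source itself changed
```

A re-run that only changes outputs yields an empty diff here, while an edited cell shows up as ordinary `-`/`+` lines.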

willingc · July 4, 2020, 2:56am

Thanks David. I left some thoughts and did some reorganization of the file. Have a nice long weekend.

mgeier · July 5, 2020, 7:02pm

Those are two quite different things. One is a serialization format for Jupyter notebooks (implemented in pure Python), the other is a tool for parsing a CommonMark dialect to be used with docutils/Sphinx (AFAICT).

However, the “MyST” name is a bit confusing, because it can mean two quite different things, see my question Hello from a similar project · Issue #420 · spatialaudio/nbsphinx · GitHub and this answer by @choldgraf: Hello from a similar project · Issue #420 · spatialaudio/nbsphinx · GitHub. Probably the discussion over there answers your question?

I haven’t used jupyter-sphinx, but according to their docs they provide a jupyter-execute directive for Sphinx (to be used in a reStructuredText file). You can use multiple of those on a page and I guess the local variables are available in subsequent directives (but I’m not sure about that). In the end, the resulting page looks somewhat like a notebook even though the source file isn’t a notebook.

In nbsphinx, you can use Jupyter notebooks directly as source files. You don’t have to use reStructuredText at all (but you still can, if you want). You can use Notebooks stored as .ipynb files, but you can also use any format supported by Jupytext (see Custom Notebook Formats — nbsphinx version 0.7.1). In fact, you can use any tool that can read some file and produce a nbformat-compatible in-memory representation (such as my own https://jupyter-format.readthedocs.io/).

Thanks for actually trying it out!

And yes indeed, the Binder link can be used as a quick way to play with my suggested format. Thanks Binder!

Sure, that’s expected for a format that stores inline outputs. IMHO it’s still more human-readable than the raw JSON of an .ipynb file.

Theoretically, the cell outputs could also be stored at the end of the file (as already suggested in this thread) or in a separate file (also suggested somewhere around here), but this would itself add complexity to the whole situation, which might or might not be desired.

I think the most straightforward way for diffing and merging is to remove the outputs beforehand, and I think for that use case my format is quite well suited.
However, I didn’t want to throw away the possibility to store outputs when desired (which I also do regularly). See also Motivation — Jupyter Format version 67bf141.
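
For readers who want to see what “remove the outputs beforehand” amounts to, a rough sketch follows. It assumes an nbformat-4-style dict and is independent of any particular serialization format; `strip_outputs` is a hypothetical helper name, not a function from jupyter-format or nbformat.

```python
import copy


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of the notebook with code-cell outputs removed."""
    stripped = copy.deepcopy(notebook)  # leave the caller's notebook untouched
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            # An unexecuted code cell has a null execution count in nbformat 4.
            cell["execution_count"] = None
    return stripped


notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "source": "print('foo')",
            "outputs": [{"output_type": "stream", "name": "stdout", "text": "foo\n"}],
        },
        {"cell_type": "markdown", "metadata": {}, "source": "Some text"},
    ],
}

clean = strip_outputs(notebook)
```

Running this before committing (or inside a git filter) keeps the stored file free of outputs, while the version with outputs can still be saved separately when desired.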

I don’t know whether there is a more straightforward way, but you can use “File → Save Notebook As …” and use a file name with the suffix .jupyter. It should also work if you simply rename the file using the suffix .jupyter and then hit “Save” again.

This could probably be improved by creating a Jupyter/JupyterLab extension (similar to how Jupytext does it), but I didn’t have time for that yet.

Thanks!

If you have ideas for improvements, please let me know!

I’m still accepting breaking changes, BTW.

Theoretically, if your rich output uses a text-based format like SVG, this could actually work. But I guess it would only work in very simple cases.
Note that SVG diffs can be improved by using the svg.hashsalt option, see https://nbviewer.jupyter.org/github/mgeier/python-audio/blob/master/plotting/matplotlib-inline-defaults.ipynb.

But in general, I agree, merging conflicting outputs (e.g. PNG plots) is basically impossible, at best you can select one out of two (or more?) alternatives and throw away the rest.

Thanks for the concrete example!

I’ve taken the liberty of including it here for easy comparison:

.mystnb

```{metadata}
---
nbformat_minor: 4
nbformat: 4
metadata:
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
  language_info:
    codemirror_mode:
      name: ipython
      version: 3
    file_extension: .py
    mimetype: text/x-python
    name: python
    nbconvert_exporter: python
    pygments_lexer: ipython3
    version: 3.7.7
```
% cell
```{cell_meta}
---
cell_type: code
execution_count: 1
```
```{source}
print("foo")
```
```{output} stream
foo
```
% endcell
% cell
```{cell_meta}
---
cell_type: markdown
```
```{source}
testtest
---------

*test*

of markdown
```
% endcell

And that’s how the same notebook would look in my proposed format:

.jupyter

nbformat 4
nbformat_minor 4
code 1
    print("foo")
 stream stdout
    foo
markdown
    testtest
    ---------
    
    *test*
    
    of markdown
notebook_metadata
    {
     "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
     },
     "language_info": {
      "codemirror_mode": {
       "name": "ipython",
       "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.3"
     }
    }
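
To give a feel for how little machinery the indented layout needs, here is a toy serializer that produces text shaped like the .jupyter example above. It is not the jupyter-format library and makes no claim to match its grammar: the layout rules are inferred only from the example, the function names are made up, and only `stream` outputs are handled.

```python
import json

INDENT = " " * 4  # content lines are indented by four spaces in the example


def indent_block(text: str) -> list[str]:
    """Indent every line of a text block; blank lines become indentation only."""
    return [INDENT + line for line in text.splitlines()]


def to_text(nb: dict) -> str:
    """Serialize a small nbformat-4-style dict to the indented layout (toy version)."""
    lines = [f"nbformat {nb['nbformat']}", f"nbformat_minor {nb['nbformat_minor']}"]
    for cell in nb["cells"]:
        header = cell["cell_type"]
        if cell["cell_type"] == "code" and cell.get("execution_count") is not None:
            header += f" {cell['execution_count']}"
        lines.append(header)
        lines.extend(indent_block(cell["source"]))
        for output in cell.get("outputs", []):
            # Assumption: only stream outputs are supported in this sketch.
            if output["output_type"] == "stream":
                lines.append(f" stream {output['name']}")
                lines.extend(indent_block(output["text"]))
    lines.append("notebook_metadata")
    lines.extend(indent_block(json.dumps(nb["metadata"], indent=1)))
    return "\n".join(lines) + "\n"


nb = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {"kernelspec": {"name": "python3"}},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "source": 'print("foo")',
            "outputs": [{"output_type": "stream", "name": "stdout", "text": "foo\n"}],
        },
        {"cell_type": "markdown", "source": "testtest\n---------\n\n*test*\n\nof markdown"},
    ],
}

text = to_text(nb)
```

Because cell sources appear as plain indented lines rather than JSON string arrays, a line-based diff of two such files reads much like a diff of ordinary source code.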

inc0 · July 8, 2020, 6:06pm

Small update on progress (outside of the JEP) - we’re throwing formats at the wall and seeing what sticks. The current objective is to figure out the actual format + conversion code between mystnb <-> ipynb.

blois · July 10, 2020, 2:09am

Late to the thread, but wanted to chime in with a few brief comments. I work on Google Colab.

Outputs - I believe it’s critical to be able to render notebooks with outputs consistently over long periods of time and across notebook viewers. I believe that viewing of notebooks should remain possible long after the execution of those notebooks is no longer possible. Colab works very hard to ensure long-term rendering fidelity and I am very interested in ensuring output rendering works across more notebook viewing applications.

Comments - comments often use separate storage so the commenter does not need to have write access to the document they are commenting on.

choldgraf · July 10, 2020, 3:10pm

Improving this would be excellent - it’s a pain in the butt to figure out what kinds of goats you have to sacrifice to get <insert your favorite web-based plotting library> to work in other interfaces

blois · July 10, 2020, 4:36pm

I see the desire for improved notebook diffs as an initial step towards better integration into larger project workflows.

Consider integration with existing language tools:

- Static type checkers such as mypy
- Code formatters
- Linters
- Rich semantic code viewing (GitHub’s ‘definition’ and ‘references’ links when viewing Python files)

The path towards better project integration will be difficult with a notebook-specific file format. It’s not impossible, but consider native source files (.py) if the rationale is that the format should change because it is too difficult to change the tools (diff).
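
As a rough illustration of the “native source file” route, the sketch below writes a notebook’s cells into a plain .py string using `# %%` cell markers, similar in spirit to the percent format that Jupytext documents. It does not use Jupytext itself, `to_percent_script` is a hypothetical helper, and it assumes the code cells contain plain Python (IPython magics would not parse).

```python
import ast


def to_percent_script(notebook: dict) -> str:
    """Render cells as a Python script with '# %%' markers (outputs are dropped)."""
    chunks = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        if cell["cell_type"] == "code":
            chunks.append("# %%\n" + source)
        elif cell["cell_type"] == "markdown":
            # Markdown becomes comments so the file stays valid Python.
            commented = "\n".join(("# " + line).rstrip() for line in source.splitlines())
            chunks.append("# %% [markdown]\n" + commented)
    return "\n\n".join(chunks) + "\n"


notebook = {
    "cells": [
        {"cell_type": "markdown", "source": "# Title\n\nSome text"},
        {"cell_type": "code", "source": "def add(a, b):\n    return a + b", "outputs": []},
    ]
}

script = to_percent_script(notebook)

# Any tool that understands Python source can now work on the result;
# the standard-library parser stands in for a linter or type checker here.
tree = ast.parse(script)
function_names = [node.name for node in tree.body if isinstance(node, ast.FunctionDef)]
```

The trade-off is the one raised in this thread: such a file works with existing language tools out of the box, but it has no place for outputs.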

westurner · July 13, 2020, 9:37am

Feature / Format Matrix (as a Markdown table in a JEP?)

Name, Source URL, Spec, Docs URL
Required features:
- Binary outputs (that are stable for the future)
- Notebook metadata
- Cell metadata
- Renders read-only with just JS (is this a requirement?)
- Line-diffable (with indentation)
  - nbdime is an existing solution for diffing notebooks written in Python
- Source-editable
  - https://jupytext.readthedocs.io/en/latest/formats.html#markdown-formats
    - Jupytext Markdown, R Markdown, Pandoc Markdown, MyST Markdown
Maybe post additional features here prefixed with e.g.?:
- REQUIRED:
- NICETOHAVE:

Linked Data for Linked Research
Neither YAML-LD nor Markdown-LD are yet things. JSON-LD metadata would be great to have; as these are graphs of resources that link to other resources (#LinkedData, #LinkedResearch, #LinkedReproducibility). There have been some discussions regarding notebook-level and cell-level linked data.

STATUSQUO: Embedded JSON is parseable by search engines with just a JSON parser

Add it to Jupytext
Is there any reason that any new (pre-JEP) format cannot be prototyped in Jupytext?

teoliphant · August 7, 2020, 9:21am

Your jupyter format looks very interesting. I’m new to the debate between formats though I’m a frequent Jupyter user and have been around the community for a long while. I think your approach has promise.

I would be eager to see your format take advantage of the fact that it is defining a custom (though human-readable) format and look into at least 2 things:

- Collect all cell outputs at the end of the file (along with storing a hash of the input cell that produced it).
- Allow storing multiple input cells in the file (with an easy tool to remove all but the latest). In this way, you could actually replay someone’s workflow as they modified cells and re-ran them.
- Have a section that allows storing the execution order (a simple list of cells could do it).

Because you define the format, you could add these kinds of carrots that would allow very capable tools and standards to be built around the notebook concepts.
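
One way to picture the first suggestion (outputs collected separately and keyed by a hash of the input that produced them) is sketched below. Everything here is hypothetical: the `input_hash` field, the shape of the output section, and the helper names are assumptions for illustration, not part of any existing notebook format.

```python
import hashlib


def source_hash(source: str) -> str:
    """Hash of a cell's source text (SHA-256 chosen arbitrarily for this sketch)."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def attach_outputs(cells: list[dict], output_section: list[dict]) -> list[dict]:
    """Pair each cell with stored outputs whose hash matches its current source."""
    by_hash = {entry["input_hash"]: entry["outputs"] for entry in output_section}
    # A cell whose source changed since execution no longer matches any hash,
    # so it comes back with no outputs instead of stale ones.
    return [{**cell, "outputs": by_hash.get(source_hash(cell["source"]), [])} for cell in cells]


executed_source = "x = 1\nprint(x)"
output_section = [
    {
        "input_hash": source_hash(executed_source),
        "outputs": [{"output_type": "stream", "name": "stdout", "text": "1\n"}],
    }
]

unchanged = attach_outputs([{"cell_type": "code", "source": executed_source}], output_section)
edited = attach_outputs([{"cell_type": "code", "source": "x = 2\nprint(x)"}], output_section)
```

With this layout the input part of the file contains no outputs at all, and an edited cell simply stops matching its old outputs.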

mgeier · August 9, 2020, 2:45pm

I know, you’re a legend!

I don’t really want to define a “new format” in this sense, though.

My goal with GitHub - mgeier/jupyter-format: An Experimental New Storage Format For Jupyter Notebooks is just to provide a serialization format for the current notebook data model. It is supposed to be two-way losslessly compatible with the .ipynb serialization format. But that also means that it cannot have any new features that are not available with .ipynb.

Ideally, my suggested .jupyter format (or something along those lines, probably very much improved) would be adopted by the Jupyter project and would co-evolve with the data model (and with the existing .ipynb serialization format).
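
The “two-way lossless” requirement can be stated as a simple round-trip property: parsing what was serialized must give back exactly the same notebook data. The sketch below checks that property for two stand-in converters; neither is jupyter-format, and all function names are made up for this illustration.

```python
import copy
import json


def round_trips(notebook: dict, serialize, parse) -> bool:
    """True if parse(serialize(notebook)) reproduces the notebook exactly."""
    original = copy.deepcopy(notebook)  # guard against converters that mutate their input
    return parse(serialize(notebook)) == original


def sources_only(notebook: dict) -> str:
    """Deliberately lossy stand-in: keeps cell sources and drops everything else."""
    return json.dumps([cell["source"] for cell in notebook["cells"]])


def from_sources(text: str) -> dict:
    """Inverse of sources_only; it cannot recover what was never stored."""
    return {"cells": [{"cell_type": "code", "source": s} for s in json.loads(text)]}


notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {"kernelspec": {"name": "python3"}},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": 'print("foo")',
            "outputs": [{"output_type": "stream", "name": "stdout", "text": "foo\n"}],
        }
    ],
}

lossless_ok = round_trips(notebook, json.dumps, json.loads)  # plain JSON keeps everything
lossy_ok = round_trips(notebook, sources_only, from_sources)  # metadata and outputs are lost
```

A check like this, run over a corpus of real notebooks, is one way a candidate format could be compared against the “lossless back and forth” goal mentioned earlier in the thread.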

I didn’t really try this and don’t have any practical experience with such a feature. I have the feeling that it might increase the “human readability”, but it might actually decrease the “diffability” (and “mergeability”). IDs and hashes are inherently not diffable, right?

Apart from that, it will definitely increase the complexity of the format, so I would only want to do it if there is a real improvement to be made.

Which do you think is more important “human readability” or “diffability”?

This sounds like a nice feature, but this should be implemented in the Jupyter notebook data model, not in the serialization format, right?

Again, this is a feature of the data model, right?

This sounds a bit like GitHub - microsoft/gather: Spit shine for Jupyter notebooks 🧽✨, doesn’t it?

Again, this doesn’t seem to me like something the serialization format should do on its own.
