Should Jupyter recommend a text-based representation of the notebook?

A human-friendly text file that only contains the content of the notebook and some kind of cell structure

Spyder implements notebook cell-like behaviour by splitting up a single text file with # %% markers/sentinels:

https://docs.spyder-ide.org/editor.html#defining-code-cells

…and it seems VS Code has taken a leaf out of Spyder’s book.

I haven’t really used either feature, so I can’t comment further - I’m just trying to summarise the current APIs.

I’ve been trying to come up with a new serialization format for Jupyter notebooks: https://jupyter-format.readthedocs.io/.

I was originally thinking about using YAML, but now I think it should be a custom format, see https://jupyter-format.readthedocs.io/motivation.html for more details.

The idea of this format is that it is really human readable (unlike JSON) and that it produces nice diffs (unlike JSON) but it can still contain arbitrary outputs and metadata (just like JSON; in fact, using JSON for metadata). No information is lost, it’s just a different serialization format.
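To illustrate the readability point, here is a small sketch of how a two-line code cell is stored in .ipynb JSON compared with its plain source text. This is not the jupyter-format syntax itself; it only shows the JSON escaping that a text-first serialization would avoid.

```python
import json

# A minimal code cell as it appears inside an .ipynb file (nbformat 4 keys).
cell = {
    "cell_type": "code",
    "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": ["print(\"hello\")\n", "print(\"world\")"],
}

# What ends up on disk (and in diffs): quotes and newlines are escaped.
as_json = json.dumps(cell, indent=1)

# What a human actually wrote in the cell.
as_text = "".join(cell["source"])

print(as_json)
print(as_text)
```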

nbexplode is a fun experiment that I’m still kind of fond of, but I don’t think anything like that will be practical unless version control systems change dramatically. The idea was that version control systems don’t know about any structure within the file, so why not use the structure they do know about: nested folders. But it falls down because folders are only dicts - there’s no equivalent of an ordered list/array of files.
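To make the ordering problem concrete, here is a minimal sketch. It is not nbexplode’s actual layout: the `explode` helper and the `cells/0000/source` naming are assumptions for illustration. If position is encoded in the path, inserting one cell changes the content behind every later path.

```python
def explode(cells):
    """Map each cell to a hypothetical path whose name encodes its position."""
    return {f"cells/{i:04d}/source": src for i, src in enumerate(cells)}


before = explode(["import os", "print(os.getcwd())"])
after = explode(["# new first cell", "import os", "print(os.getcwd())"])

# Paths that are new or whose content differs between the two versions.
changed = sorted(path for path in after if before.get(path) != after[path])
print(changed)
```

Only one cell was added, yet a version control system would see every file as modified or new.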

We discussed at one point the idea of a ‘companion file’: the idea that each notebook would comprise a text-based, version control friendly file of code and markdown, and a zip file containing things like rich outputs. You could either track only the first part in version control, or check in the companion file as well, but treat that like a binary blob, not trying to diff or merge it. But I don’t think it ever got implemented.
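As a rough sketch of the companion-file idea (since it was never implemented, everything here is hypothetical: the `split_notebook` helper, the `outputs/<index>.json` member names, and the assumption that each cell’s `source` is a single string):

```python
import io
import json
import zipfile


def cell_to_text(cell):
    """Render one cell as script text (markdown lines are commented out)."""
    if cell["cell_type"] == "code":
        return "# %%\n" + cell["source"]
    commented = "\n".join("# " + line for line in cell["source"].splitlines())
    return "# %% [markdown]\n" + commented


def split_notebook(nb):
    """Return (text, zip_bytes): inputs as text, outputs as a zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for i, cell in enumerate(nb["cells"]):
            if cell.get("outputs"):
                # One JSON member per cell that has outputs.
                zf.writestr(f"outputs/{i}.json", json.dumps(cell["outputs"]))
    text = "\n\n".join(cell_to_text(c) for c in nb["cells"]) + "\n"
    return text, buf.getvalue()


nb = {
    "cells": [
        {"cell_type": "markdown", "source": "# Title"},
        {
            "cell_type": "code",
            "source": "1 + 1",
            "outputs": [{"output_type": "execute_result", "data": {"text/plain": "2"}}],
        },
    ]
}
text, blob = split_notebook(nb)
```

The text part is what you would track and diff; the zip bytes are the part you would either ignore or check in as an opaque blob.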

Thanks all for some great links and ideas. A few quick points from me:

To me this question is more around standards than around technology. E.g., there could be a simplified text-based specification for a notebook that didn’t have any specific tech behind it (maybe Jupytext could have a reference implementation or something).

As the links in this thread show, people out there are trying many things, and at some point there’s value in collapsing this space back to a smaller number of things so that we can standardize and start building on top of those new standards.

I think it would be a simpler process if we didn’t frame this as trying to replace the ipynb format, but instead how we can define a human-friendly way to store the content of a notebook, so that the ipynb file can be used to store all of the extra information most humans don’t care about in machine-readable ways.

One starting point proposal

To me, the most obvious structure to use would be something like Pandoc markdown block syntax mixed w/ RMarkdown cell syntax (I think it would be worth drawing on both of these as they are pre-existing de-facto standards even if neither is an “actual” standard in the way that CommonMark is)

  • Code cells are denoted with (ignore the backslashes):

    \```{language}
    \```
    
  • Markdown cells are anything in-between code cells unless explicitly specified otherwise

  • Markdown renders code blocks the same way that we do now, with e.g.: \```python

  • Manual splits between markdown cells are created with ::: syntax, like

    # This is my markdown
    here is content
    
    :::
    ## Here is another markdown cell
    :::
    
    And this would now be a third markdown cell.
    
  • If you wanted a different type of text cell (e.g. raw etc) you’d specify it with a name after the colons:

    ::: raw
    Some raw text
    :::
    
  • Metadata could be given one of two ways

    • As in-line attributes given in { }, where vals starting with . are treated as tags.

      ::: {key=val .tag1 .tag2}
      Some content
      :::
      
    • As YAML front-matter that is parsed first within the containing content of that cell, e.g. (again ignore slashes)

      \```{python}
      mygroup:
         - mykey: myval
         - mykey2: myval2
      mykey3: myval3
      ---
      # This is valid python
      print('hi')
      \```
      
  • Notebook-level metadata would be stored in a YAML header at the top of the page

  • Any metadata stored in the “content” format would be loaded into the ipynb format, and there could be metadata in the ipynb format that doesn’t make it into the text format.

  • The file extension of this format could be .jmd or .imd, or I suppose it could also just be .md since all the other markdown flavors also just overload that file extension too…

  • Taken by itself, this would only define the content structure of a notebook; it wouldn’t know anything about outputs or programmatically generated metadata. Over time, tooling could be built to more cleverly handle synchronization between these formats.

  • Over time, perhaps this specification could be extended to handle more complex information like outputs, but at a start we’d keep it content-focused.
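To check that the cell rules above are at least parseable, here is a rough stdlib-only sketch. It is an assumption-laden illustration, not a reference implementation: the `parse` function is hypothetical, and it ignores inline attributes, YAML front-matter and notebook-level metadata entirely.

```python
import re

FENCE = "`" * 3  # built this way so the sketch can itself live in a fenced block
CODE_OPEN = re.compile(r"^" + FENCE + r"\{(\w+)\}\s*$")
DIV_OPEN = re.compile(r"^:::\s*(\w+)?\s*$")


def parse(text):
    """Split text into (cell_type, source) pairs following the rules above."""
    cells = []
    kind, closer, buf = "markdown", None, []

    def emit(cell_kind, lines, keep_empty=False):
        source = "\n".join(lines).strip("\n")
        if source or keep_empty:
            cells.append((cell_kind, source))

    for line in text.splitlines():
        if closer is not None:  # inside a code fence or a ::: block
            if line.strip() == closer:
                emit(kind, buf, keep_empty=True)
                kind, closer, buf = "markdown", None, []
            else:
                buf.append(line)
            continue
        code = CODE_OPEN.match(line)
        div = DIV_OPEN.match(line)
        if code or div:
            emit("markdown", buf)  # implicit markdown cell before the block
            kind = "code" if code else (div.group(1) or "markdown")
            closer = FENCE if code else ":::"
            buf = []
        else:
            buf.append(line)  # plain markdown, including ordinary code blocks
    emit(kind, buf)  # trailing text (or an unclosed block)
    return cells


doc = "\n".join([
    "# This is my markdown",
    "",
    ":::",
    "## Here is another markdown cell",
    ":::",
    "",
    FENCE + "{python}",
    "print('hi')",
    FENCE,
    "",
    "::: raw",
    "Some raw text",
    ":::",
])
cells = parse(doc)
```

Note that an ordinary markdown code block (a fence without braces) stays inside its markdown cell, which is the “runnable code” versus “code block” distinction the braces are meant to carry.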

That’s one idea but I’m sure there are many others to explore. I think it’d be useful to do so in a structured way.

I’m unsure about this: it depends on the kind of problem the text representation solves. To me the main use of jupytext is the ease of version control and the ability of manual editing. What would be the goal of the globally recommended text representation?

Doesn’t this go against the notebook spec where the language would be global to a complete notebook? How would the case of multiple languages be interpreted? Or would this be an illegal notebook representation?

I think the main goal would be a combination of:

  1. Minimize the number of text-based versions of notebooks by agreeing on a standard
  2. Having more voices and opinions be considered in the creation of any one standard

Right now for example, Jupytext supports many text-based representations, each of which was created with a particular perspective in mind. That’s fine, but I’m sure that to some degree there is overlapping functionality and goals in each of those perspectives, and they’d benefit from a single format that could be jointly used rather than multiple formats that were created as one-off solutions for a particular tool, community, etc.

Good point - I still think notebooks should have a “one kernel per notebook” mapping. I was just trying to think of a way to distinguish “runnable code” from “code blocks”. An alternative to this was proposed by @mwouts in https://github.com/mwouts/jupytext/issues/422#issuecomment-582952022. He suggested using ~~~ to denote “runnable code blocks” and backticks to denote markdown code blocks. In the end, I care more that there’s a standard than that a particular syntax gets used :slight_smile:
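A minimal sketch of that alternative, assuming tildes mark runnable cells and backtick fences are left alone as ordinary markdown (the `runnable_blocks` helper is hypothetical and not part of Jupytext):

```python
import re

# A block that opens with ~~~ (plus an optional language) and closes with ~~~.
RUNNABLE = re.compile(r"^~~~(\w*)\n(.*?)^~~~\s*$", re.MULTILINE | re.DOTALL)


def runnable_blocks(markdown):
    """Return the source of every ~~~ block; backtick blocks are ignored."""
    return [match.group(2) for match in RUNNABLE.finditer(markdown)]


fence = "`" * 3
doc = (
    fence + "bash\npip install foo\n" + fence + "\n\n"
    "~~~python\nprint(1)\n~~~\n"
)
blocks = runnable_blocks(doc)
```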

The purpose of that sort of thing is for “typesetting” of multiple languages. For example, instructions in bash, an example in yaml intermixed with executable code. The bookdown documentation has lots of great examples for this.

To what extent does Rmd fall short of what you’re thinking?

One advantage of the ipynb format is that it goes some way to capturing cell outputs, which can be a wide variety of mimetypes.

Document formats like docx have directories containing media assets (I think?) which allow documents to be self-contained in a zip package. A similar storage format could be useful if for example, you create a video or iframed HTML assets, although it would be nice if these assets were linked in a simple relative address way from within the main document.

Thanks @choldgraf for starting this conversation!

Well, text representations are useful for

  • version control
  • copy/pasting the content of a notebook to another one (& templating)
  • refactoring the code in a notebook
  • executing or debugging a notebook as a script
  • rendering a notebook in another context than in Jupyter (e.g. as Markdown on Github)

In my opinion the format that is the closest to being a standard text representation for the notebook is the double-percent format (scripts with cells indicated with # %%, markdown cells with # %% [markdown] or # %% [md]). It has the longest history (it was introduced by Spyder 5-6 years ago), and is supported by many other editors (Atom/Hydrogen, PTVS, VS Code, PyCharm Pro).

I think that, if Jupyter wanted to recommend a text format for notebooks, it should start with that one.
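For readers who have not seen the double-percent format, here is a minimal reader sketch for Python scripts. It is only an illustration (the `read_percent` function is hypothetical, and it ignores cell titles and metadata), not how any particular editor or Jupytext implements it.

```python
import re


def read_percent(script):
    """Split a '# %%' script into (cell_type, source) pairs."""
    cells = []
    for line in script.splitlines():
        if line.startswith("# %%"):
            rest = line[4:]
            is_md = "[markdown]" in rest or "[md]" in rest
            cells.append(("markdown" if is_md else "code", []))
        elif cells:  # lines before the first marker are dropped in this sketch
            cells[-1][1].append(line)

    result = []
    for kind, lines in cells:
        if kind == "markdown":
            # Markdown is stored as comments; strip the leading '# '.
            lines = [re.sub(r"^# ?", "", text_line) for text_line in lines]
        result.append((kind, "\n".join(lines).strip("\n")))
    return result


script = "# %% [markdown]\n# # Title\n# Some text\n\n# %%\nx = 1\nprint(x)\n"
cells = read_percent(script)
```

The same file still runs unchanged as a normal Python script, since every marker and every markdown line is a comment.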

But to tell the truth, it will probably not be easy to get everyone to agree on the format, even on this one. Clearly the specs should say how a code cell should be represented (all editors seem to agree on # %%). And markdown cells (not all editors agree yet, but I’m sure they will follow the Jupyter choice). Then, should the spec include notebook and cell metadata? Should the cell name have a special representation (Spyder cells may have titles, which unlike Jupyter cell names, may not be unique…)? Should Markdown cells be encoded in multiline strings? (Probably not easy to implement for all languages…) Should the script follow PEP8 when all the input cells do?

That being said, another format that I like a lot is Markdown. It is a great format for writing documentation, and it can be edited/previewed in many editors and platforms[^1]. It naturally accepts all the programming languages that one can use in Jupyter. But there again, it’s easy to start, i.e. decide to represent Markdown cells as text, and include code cells within Markdown code blocks prefixed with ```python, but it’s harder to go to the next step and decide how to separate consecutive Markdown cells, how to represent notebook or cell metadata, or as you mention, define which parts of the code are executable, or not.

For now in Jupytext we have given the precedence to the principle that the text version should look like the notebook when previewed, and hence used e.g. HTML comments to represent the cell breaks and include the metadata, but that does make manual typing a bit cumbersome.

Regarding the idea of companion files, @takluyver, we did implement that in Jupytext, and it really is a great idea. It’s so convenient to be able to edit any of the representations, either text or .ipynb. Also the complete notebook (with outputs and metadata) is always available in the .ipynb file.
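The pairing idea can be sketched as follows: inputs always come from the text file, and saved outputs are reused for cells whose source has not changed. This is a simplified illustration under my own assumptions (the `merge` helper and the matching-by-source rule are hypothetical), not the actual Jupytext code.

```python
def merge(text_sources, ipynb_cells):
    """Combine inputs from the text file with outputs saved in the .ipynb."""
    # Pool of saved outputs, keyed by the source of the cell that produced them.
    saved = {}
    for cell in ipynb_cells:
        saved.setdefault(cell["source"], []).append(cell.get("outputs", []))

    merged = []
    for source in text_sources:
        candidates = saved.get(source)
        # Reuse outputs only when an unchanged cell is found; otherwise start empty.
        outputs = candidates.pop(0) if candidates else []
        merged.append({"source": source, "outputs": outputs})
    return merged


ipynb_cells = [
    {"source": "x = 1", "outputs": []},
    {"source": "print(x)", "outputs": ["1"]},
]
merged = merge(["x = 1  # edited in the text file", "print(x)"], ipynb_cells)
```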

Finally, a word on outputs. I like very much the idea of saving them in a companion folder, in addition to the Markdown file, and I’d be curious to work on that when time permits. This way, we would avoid the duplication of the notebook inputs. And, if things were done well, we could directly use that representation of the notebook as the input for a Jekyll or Hugo blog post or chapter… But maybe that leads to too many other questions (e.g. how to include the outputs in a Markdown file?)!

[^1]: for that reason I have a preference for Markdown with .md extension rather than for R Markdown with .Rmd extension

Thanks, that’s interesting to hear.

One minor part of the idea, where it sounds like your version differs, was that the companion file would be a binary format (most likely zip), so that tools like git wouldn’t even try to diff or merge it except with a plugin. The idea was that treating it like a binary blob, where you just see that it’s changed with no details, would be a better experience than text-based diffs of JSON.

This is maybe not that important, and maybe designing formats around a specific external tool is a bad idea anyway, but it seems worth remembering for a discussion like this. :slight_smile:

Thanks @takluyver, yes indeed, I liked the idea of the zip file! A zip file is easier to share than a master file + directory, and it’s true that I’ve seen plugins able to show the nested differences when required.

Also I agree with your comment above that files are not naturally ordered. Maybe something we could do is to give explicit default names to the cells like e.g. unnamed_code_cell_1. That would be an invitation for the users to name their cells if they want a) fewer diffs in the output names and b) more meaningful output names. R Markdown does this, and I found it more natural than the random output names generated by nbconvert :slight_smile:.
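A small sketch of that naming scheme (the `output_names` helper and the `name` key are assumptions for illustration): named cells keep a stable output name, and only unnamed cells depend on their position.

```python
def output_names(cells):
    """Use the cell name for its output, or a numbered default when unnamed."""
    names, unnamed = [], 0
    for cell in cells:
        name = cell.get("name")
        if name is None:
            unnamed += 1
            name = f"unnamed_code_cell_{unnamed}"
        names.append(name)
    return names


names = output_names([{"name": "load_data"}, {}, {"name": "plot"}, {}])
```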

But before that I’d like to find a convincing way to include the outputs (other than images, e.g. text and HTML outputs) in the main Markdown document… Did anyone look at that question before? Can I use <iframe>, or something like Jekyll’s {% include_relative ... %}? Any chance that I can use a shortcode in Hugo that would be compatible with Jekyll’s include_relative?

Having # %% presupposes a line comment starting with #, which may not be true in a particular notebook’s language.

The one language we do know exists in a notebook is Markdown. I think we give up something if a text-based format cannot be run as a normal file in a target language, but we also gain something if the text-based format is language agnostic.

Yes, I did not mention that… the comment char # in # %% actually stands for the language’s line comment, at least that’s how we implemented the support for 18 languages in Jupytext (I wish we had the information of the line comment in the notebook itself :slight_smile:)
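In other words, the marker is built from the language’s line-comment prefix. A tiny sketch, where the `LINE_COMMENT` table and `cell_marker` helper are illustrative assumptions and not Jupytext’s actual table of languages:

```python
# Hypothetical lookup of line-comment prefixes for a few languages.
LINE_COMMENT = {"python": "#", "javascript": "//", "haskell": "--"}


def cell_marker(language, cell_type="code"):
    """Build the double-percent marker for a cell in the given language."""
    prefix = LINE_COMMENT[language]
    suffix = "" if cell_type == "code" else f" [{cell_type}]"
    return f"{prefix} %%{suffix}"


print(cell_marker("python"))
print(cell_marker("javascript", "markdown"))
```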

Agreed. And for some notebooks (tutorials, books, documentation…) I do love using that format. But for some other notebooks (those with a lot of code), I may prefer the script format, since it lets me refactor and edit the notebook in the IDE.

A quick update from us: over in https://github.com/ExecutableBookProject/MyST-NB/issues/12 we worked out a prototype for how we’ll represent notebooks in the Jupyter Book project (using a flavor of markdown called MyST). We welcome thoughts or feedback!

Here’s a short example of how a notebook will be represented in that format:

---
kernel_info:
    name: python3
language_info:
    name: Python
title: "My notebook title"
comment: "If any of the above aren't specified then use jupyter defaults"
---

# Markdown syntax

## Cell breaks

We can manually break markdown cells quickly with this syntax

+++ {"cell": "meta", "cell2": "meta2"}

## Markdown metadata

We can also explicitly separate a markdown cell and configure it like so:

```{markdown} tag1, tag2
---
key: val
---
## Here is some *configured* markdown!
```

We can also provide a `:key: val` shorthand for configuring

```{markdown} tag1, tag2
:key: val
## Here is some *configured* markdown!
```

## Executable code

Code is always executed with 'execute' blocks, like so:

```{code-cell}
print('this would be run by the front-matter-specified, or default, kernel')
```

You can also add metadata to these

```{code-cell} kernelname
:key: val
:key2: val2
:tags: ["tag1", "tag2"]
print('some python with cell metadata')
```
and that's it!
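
As a rough illustration of the `+++` cell-break rule in the example above, here is a stdlib-only sketch. It is not the MyST-NB parser: the `split_markdown_cells` helper is hypothetical, and it does not skip `+++` lines that happen to sit inside fenced blocks.

```python
import json


def split_markdown_cells(text):
    """Split markdown on '+++' lines, reading optional JSON cell metadata."""
    cells = [({}, [])]
    for line in text.splitlines():
        if line.startswith("+++"):
            meta = line[3:].strip()
            # The break line carries the metadata of the cell that follows it.
            cells.append((json.loads(meta) if meta else {}, []))
        else:
            cells[-1][1].append(line)
    # Drop cells that have neither content nor metadata.
    result = []
    for metadata, lines in cells:
        source = "\n".join(lines).strip()
        if source or metadata:
            result.append((metadata, source))
    return result


text = 'First cell\n\n+++ {"cell": "meta"}\n\nSecond cell\n'
cells = split_markdown_cells(text)
```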

This is a great discussion! I’ve started a similar thread, Jupyter and GitHub - alternative file format, and I’m glad it’s already being thought of.

I was experimenting with a similar approach to jupyter-format - namely a content_manager class as a means to allow users to work on text-format notebooks transparently.

I was experimenting with a markdown + YAML mix that would look similar to this:

MyST looks very promising! Is there any work to create a MyST notebook content manager?

Not yet, but that’s mostly just because we’re running on limited resources and time right now. Our plan for MyST is to see how people use it, what they like about it, and whether it gains more adoption and interest from the community. We’d also like to add extensions (e.g. a JupyterLab extension, or a notebook extension) that add functionality for MyST markdown in Jupyter interfaces. This would be much easier if https://github.com/jupyterlab/jupyterlab/issues/272 happens, because MyST uses a Python implementation of markdown-it in the hope that we can easily port a MyST parser over to JavaScript. In fact, some of this is already done, as there is now a MyST markdown plugin for VS Code.

If at some point there seems to be enough momentum and we think MyST has made the right technical and syntax choices, then we can build in more “core” support and, one day, potentially try and standardize some of its syntax within Jupyter. But I think it’d be a long road to that point - as I mentioned in the other thread, it would need to be a community discussion and decision, probably a JEP.

(happy to chat more about myst if it’s helpful)

I think the way pandoc exports ipynb files is really interesting. Given the ubiquitous use of pandoc and the ability to convert to other formats (notably latex), IMHO it is a strong candidate.

@choldgraf what were the reasons to develop a new format?

Some of our reasoning is detailed here: https://myst-parser.readthedocs.io/en/latest/#why-myst-markdown and the notebook version here: https://myst-nb.readthedocs.io/en/latest/use/markdown.html

the tl;dr is: we really like markdown, we wanted something that could handle more complex use-cases than CommonMark. We really like Sphinx and think it has most of the features one would want for publishing. We also didn’t want to force people to use reStructuredText, and so we created MyST, which is an attempt at “best of both worlds” for reStructuredText and markdown. (side note: we also wanted something that was heavily inspired by RMarkdown)

It seems that the MyST and pandoc markdown formats are really close. The major difference is the directives.

E.g. codebraid uses pandoc’s markdown. One can further break blocks with :::.

One idea that I am exploring is to build the document interactively with jupyter, export it to pandoc’s markdown, execute with codebraid, and convert to latex.

I understand that jupyter book is aiming for a similar goal using sphinx.