Should Jupyter recommend a text-based representation of the notebook?

Over time we have seen an increasing interest in using Jupyter as a part of traditional text-based collaborative workflows (e.g. git, diffing, etc). There seem to be more and more projects that create their own text-based specification for how a notebook is structured. These projects generally do this in a very ad-hoc way that suits their own needs.

I suspect that this will only become more common, especially as excellent projects like jupytext gain traction.

To that end, I am curious if folks think it would be useful to try and decide on a “recommended” text-based specification for Jupyter Notebooks. This doesn’t have to be an officially-supported spec in, e.g., JupyterLab etc, but it could be a community recommendation as others build tooling in the ecosystem, to try and reduce the forking paths of one-off specifications.

Do folks think it would be a good idea to, say, open up a JEP to begin discussion of this?

Note - I’m not talking about changing the default-supported markdown in Jupyter interfaces, but instead just talking about how one could represent the structure of the notebook (e.g., the content blocks, metadata about the blocks, and if/how to include outputs).

cc @mwouts who has been thinking through this recently… I wonder if he thinks it would be helpful for Jupyter to provide this kind of guidance.

IIUC both jupyterlab-debugger and jupyterlab-lsp have their own shadow/virtual filesystems since standard tooling doesn’t work with the notebook JSON format.

I’m not really across the details, so you might have to reach out to the devs for those.

In terms of prior art, I’ve always thought @takluyver’s nbexplode was a good idea which could solve a lot of problems. Sure, it might introduce some other issues but maybe it will solve more problems than it creates. If nothing else, there may be some lessons to be learned from it…

Thanks for the extra links @dhirschfeld! I also think nbexplode was a clever idea. I kind of wonder if there is an intermediate between nbexplode and the current state of things, where each notebook is 2 files:

  • A human-friendly text file that only contains the content of the notebook and some kind of cell structure
  • An ipynb file that has all of the metadata/outputs/etc that one needs to actually render the notebook

Then you’d find a system to ensure that the two files are in sync with one another. Maybe something like a server extension to auto-mirror changes when using a Jupyter interface, and default to “the text file is the master copy” if any ambiguities pop up. You could already almost do this using the Jupytext server extension.

On the explode idea, I was using a decomposition of a notebook into a sqlite db for some search experiments way back when using sqlbiter.

I’m finding using Jupytext with linked ipynb and py documents useful. Tagging notebook cells with active-ipynb means these are commented out in the py paired document. The notebook can then have full rich output display, as well as lots of demonstrations of calling functions and displaying their outputs, running tests etc. The paired py document is loadable as a module exposing just the functions that aren’t in active-ipynb tagged code cells. A separate script can be used to filter out active-ipynb tagged cells exported to a py file to give a cleaner py file if required.
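For anyone curious what such a filter might look like, here is a rough sketch. It is not the script mentioned above: it works on the ipynb JSON rather than on the exported py file, and the function names `drop_tagged_cells` and `code_as_script` are made up for illustration. It only relies on cell tags living under `metadata.tags` in the notebook JSON.

```python
def drop_tagged_cells(notebook: dict, tag: str = "active-ipynb") -> dict:
    """Return a shallow copy of the notebook without cells carrying `tag`."""
    kept = [
        cell
        for cell in notebook.get("cells", [])
        if tag not in cell.get("metadata", {}).get("tags", [])
    ]
    return {**notebook, "cells": kept}


def code_as_script(notebook: dict) -> str:
    """Join the source of the code cells into one plain script."""
    chunks = []
    for cell in notebook["cells"]:
        if cell["cell_type"] != "code":
            continue
        source = cell["source"]
        # The notebook format allows `source` to be a string or a list of lines.
        chunks.append(source if isinstance(source, str) else "".join(source))
    return "\n\n".join(chunks) + "\n"


# A tiny hand-written notebook: one reusable function, one demo-only cell.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None,
         "source": "def double(x):\n    return 2 * x"},
        {"cell_type": "code", "metadata": {"tags": ["active-ipynb"]},
         "outputs": [], "execution_count": None,
         "source": "double(21)"},
    ],
}

script = code_as_script(drop_tagged_cells(example))
```

Here `script` ends up containing only the function definition, so it could be imported as a module without running the demo call.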

A human-friendly text file that only contains the content of the notebook and some kind of cell structure

spyder implements notebook cell-like behaviour by splitting up a single text file with # %% markers/sentinels:

https://docs.spyder-ide.org/editor.html#defining-code-cells

…and it seems VSCode has taken a leaf out of spyder’s book:

I haven’t really used either functionality so couldn’t comment further - just trying to summarise current APIs.

Iā€™ve been trying to come up with a new serialization format for Jupyter notebooks: https://jupyter-format.readthedocs.io/.

I was originally thinking about using YAML, but now I think it should be a custom format, see https://jupyter-format.readthedocs.io/motivation.html for more details.

The idea of this format is that it is really human readable (unlike JSON) and that it produces nice diffs (unlike JSON) but it can still contain arbitrary outputs and metadata (just like JSON; in fact, using JSON for metadata). No information is lost, it’s just a different serialization format.

nbexplode is a fun experiment that I’m still kind of fond of, but I don’t think anything like that will be practical unless version control systems change dramatically. The idea was that version control systems don’t know about any structure within the file, so why not use the structure they do know about: nested folders. But it falls down because folders are only dicts - there’s no equivalent of an ordered list/array of files.

We discussed at one point the idea of a ‘companion file’: the idea that each notebook would comprise a text-based, version control friendly file of code and markdown, and a zip file containing things like rich outputs. You could either track only the first part in version control, or check in the companion file as well, but treat that like a binary blob, not trying to diff or merge it. But I don’t think it ever got implemented.
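To make the companion-file idea a bit more concrete, here is a minimal sketch of the split, assuming a notebook already loaded as a dict. Everything about the layout is hypothetical: the `[cell_type]` separator in the text part and the `cell_NNNN.json` names inside the zip are placeholders, not an agreed format.

```python
import io
import json
import zipfile


def split_notebook(notebook: dict) -> tuple[str, bytes]:
    """Split a notebook into (diff-friendly text, zip bytes holding outputs)."""
    text_parts = []
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for index, cell in enumerate(notebook["cells"]):
            source = cell["source"]
            if not isinstance(source, str):
                source = "".join(source)
            # Hypothetical separator: the cell type in square brackets.
            text_parts.append(f"[{cell['cell_type']}]\n{source}")
            outputs = cell.get("outputs")
            if outputs:
                # Outputs are keyed by cell position (an assumption of this sketch).
                archive.writestr(f"cell_{index:04d}.json", json.dumps(outputs))
    return "\n\n".join(text_parts) + "\n", buffer.getvalue()


example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Demo"},
        {
            "cell_type": "code",
            "metadata": {},
            "source": "1 + 1",
            "execution_count": 1,
            "outputs": [
                {"output_type": "execute_result", "execution_count": 1,
                 "data": {"text/plain": "2"}, "metadata": {}}
            ],
        },
    ]
}

text, companion = split_notebook(example)
with zipfile.ZipFile(io.BytesIO(companion)) as archive:
    stored = archive.namelist()
```

Keying outputs by position runs into the same ordering problem as nbexplode: inserting a cell near the top renames every later entry. Stable cell names or ids would avoid that, but that is a design question this sketch leaves open.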

Thanks all for some great links and ideas. A few quick points from me:

To me this question is more around standards than around technology. E.g., there could be a simplified text-based specification for a notebook that didn’t have any specific tech behind it (maybe Jupytext could have a reference implementation or something).

As the links in this thread show, people out there are trying many things, and at some point there’s value in collapsing this space back to a smaller number of things so that we can standardize and start building on top of those new standards.

I think it would be a simpler process if we didn’t frame this as trying to replace the ipynb format, but instead how we can define a human-friendly way to store the content of a notebook, so that the ipynb file can be used to store all of the extra information most humans don’t care about in machine-readable ways.

One starting point proposal

To me, the most obvious structure to use would be something like Pandoc markdown block syntax mixed w/ RMarkdown cell syntax (I think it would be worth drawing on both of these as they are pre-existing de-facto standards even if neither is an “actual” standard in the way that CommonMark is).

  • Code cells are denoted with (ignore the backslashes):

    \```{language}
    \```
    
  • Markdown cells are anything in-between code cells unless explicitly specified otherwise

  • Markdown renders code blocks the same way that we do now, with e.g.: \```python

  • Manual splits between markdown cells are created with ::: syntax, like

    # This is my markdown
    here is content
    
    :::
    ## Here is another markdown cell
    :::
    
    And this would now be a third markdown cell.
    
  • If you wanted a different type of text cell (e.g. raw etc) you’d specify it with a name after the colons:

    ::: raw
    Some raw text
    :::
    
  • Metadata could be given one of two ways

    • As in-line attributes given in { }, where vals starting with . are treated as tags.

      ::: {key=val .tag1 .tag2}
      Some content
      :::
      
    • As YAML front-matter that is parsed first within the containing content of that cell, e.g. (again ignore slashes)

      \```{python}
      mygroup:
         - mykey: myval
         - mykey2: myval2
      mykey3: myval3
      ---
      # This is valid python
      print('hi')
      \```
      
  • Notebook-level metadata would be stored in a YAML header at the top of the page

  • Any metadata stored in the “content” format would be loaded into the ipynb format, and there could be metadata in the ipynb format that doesn’t make it into the text format.

  • The file extension of this format could be .jmd or .imd, or I suppose it could also just be .md since all the other markdown flavors also just overload that file extension too…

  • Taken by itself, this would only define the content structure of a notebook, it doesnā€™t know anything about outputs or programmatically generated metadata. Over time, tooling could be built to more cleverly handle synchronization between these formats

  • Over time, perhaps this specification could be extended to handle more complex information like outputs, but at a start we’d keep it content-focused.

That’s one idea but I’m sure there are many others to explore. I think it’d be useful to do so in a structured way.
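As a rough illustration of how small the parsing side of the `:::` syntax above could be, here is a sketch that reads a single fence line. The names `parse_fence` and `parse_attributes` are made up, values containing spaces are not handled, and treating a bare `:::` as a markdown cell break is my reading of the proposal rather than a settled rule.

```python
import re

# Matches ":::", "::: raw" or "::: {key=val .tag1 .tag2}".
FENCE = re.compile(r"^:::\s*(?:\{(?P<attrs>[^}]*)\}|(?P<name>\w+))?\s*$")


def parse_attributes(attrs: str) -> dict:
    """Turn 'key=val .tag1 .tag2' into cell metadata; '.x' tokens become tags."""
    metadata = {}
    tags = []
    for token in attrs.split():
        if token.startswith("."):
            tags.append(token[1:])
        elif "=" in token:
            key, value = token.split("=", 1)
            metadata[key] = value
    if tags:
        metadata["tags"] = tags
    return metadata


def parse_fence(line: str):
    """Return cell info for a ':::' line, or None if the line is not a fence."""
    match = FENCE.match(line)
    if match is None:
        return None
    if match.group("attrs") is not None:
        return {"cell_type": "markdown",
                "metadata": parse_attributes(match.group("attrs"))}
    # A bare ':::' is assumed to start a plain markdown cell.
    return {"cell_type": match.group("name") or "markdown", "metadata": {}}


configured = parse_fence("::: {key=val .tag1 .tag2}")
raw = parse_fence("::: raw")
```

With this, `configured["metadata"]` holds both the key/value pair and the two tags, and `raw["cell_type"]` is `"raw"`.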

I’m unsure about this: it depends on the kind of problem the text representation solves. To me the main use of jupytext is the ease of version control and the ability to edit manually. What would be the goal of the globally recommended text representation?

Doesn’t this go against the notebook spec where the language would be global to a complete notebook? How would the case of multiple languages be interpreted? Or would this be an illegal notebook representation?

I think the main goal would be a combination of:

  1. Minimize the number of text-based versions of notebooks by agreeing on a standard
  2. Having more voices and opinions be considered in the creation of any one standard

Right now for example, Jupytext supports many text-based representations, each of which was created with a particular perspective in mind. That’s fine, but I’m sure that to some degree there is overlapping functionality and goals in each of those perspectives, and they’d benefit from a single format that could be jointly used rather than multiple formats that were created as one-off solutions for a particular tool, community, etc.

Good point - I still think notebooks should have a “one kernel per notebook” mapping. I was just trying to think of a way to distinguish “runnable code” from “code blocks”. An alternative to this was proposed by @mwouts in Consider using pandoc markdown for "div"s and RMarkdown for code cells in Jupytext markdown · Issue #422 · mwouts/jupytext · GitHub: he suggested using ~~~ to denote “runnable code blocks” and backticks to denote markdown code blocks. In the end, I care more that there’s a standard than that a particular syntax gets used :slight_smile:

The purpose of that sort of thing is for ā€œtypesettingā€ of multiple languages. For example, instructions in bash, an example in yaml intermixed with executable code. The bookdown documentation has lots of great examples for this.

To what extent does Rmd fall short of what you’re thinking?

One advantage of the ipynb format is that it goes some way to capturing cell outputs, which can be a wide variety of mimetypes.

Document formats like docx have directories containing media assets (I think?) which allow documents to be self-contained in a zip package. A similar storage format could be useful if for example, you create a video or iframed HTML assets, although it would be nice if these assets were linked in a simple relative address way from within the main document.

Thanks @choldgraf for starting this conversation!

Well, text representations are useful for

  • version control
  • copy/pasting the content of a notebook to another one (& templating)
  • refactoring the code in a notebook
  • executing or debugging a notebook as a script
  • rendering a notebook in another context than in Jupyter (e.g. as Markdown on Github)

In my opinion the format that is the closest to being a standard text representation for the notebook is the double-percent format (scripts with cells indicated with # %%, markdown cells with # %% [markdown] or # %% [md]). It has the longest history (it was introduced by Spyder 5-6 years ago), and is supported by many other editors (Atom/Hydrogen, PTVS, VS Code, PyCharm Pro).

I think that, if Jupyter wanted to recommend a text format for notebooks, it should start with that one.
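To show how little is needed to read the double-percent format, here is a minimal sketch of a splitter for Python scripts. It is not how Spyder, VS Code or Jupytext implement it; `split_percent_script` is a made-up name, and treating the text after the marker as a cell title and stripping one leading `# ` from markdown lines are assumptions of the sketch.

```python
import re

# "# %%", "# %% Some title", "# %% [markdown]" or "# %% [md]".
CELL_MARKER = re.compile(
    r"^# %%(?:\s*\[(?P<kind>markdown|md)\])?(?P<title>.*)$"
)


def split_percent_script(text: str) -> list[dict]:
    """Split a '# %%' script into a list of code and markdown cells."""
    cells = []
    current = None
    for line in text.splitlines():
        match = CELL_MARKER.match(line)
        if match:
            kind = "markdown" if match.group("kind") else "code"
            current = {"cell_type": kind,
                       "title": match.group("title").strip(),
                       "lines": []}
            cells.append(current)
        elif current is None:
            # Text before the first marker is treated as a code cell.
            current = {"cell_type": "code", "title": "", "lines": [line]}
            cells.append(current)
        else:
            current["lines"].append(line)
    for cell in cells:
        lines = cell.pop("lines")
        if cell["cell_type"] == "markdown":
            # Markdown is stored as comments; drop one leading "# ".
            lines = [re.sub(r"^# ?", "", item) for item in lines]
        cell["source"] = "\n".join(lines).strip("\n")
    return cells


script = """# %% [markdown]
# # Title
# Some text

# %%
x = 1
print(x)
"""

cells = split_percent_script(script)
```

The open questions in this thread (metadata, cell names, multiline-string markdown) are exactly the parts this sketch leaves out.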

But to tell the truth, it will probably not be easy to get everyone to agree on the format, even on this one. Clearly the spec should say how a code cell should be represented (all editors seem to agree on # %%), and how markdown cells should be represented (not all editors agree yet, but I’m sure they will follow the Jupyter choice). Then, should the spec include notebook and cell metadata? Should the cell name have a special representation (Spyder cells may have titles which, unlike Jupyter cell names, may not be unique…)? Should Markdown cells be encoded in multiline strings? (Probably not easy to implement for all languages…) Should the script follow PEP8 when all the input cells do?

That being said, another format that I like a lot is Markdown. It is a great format for writing documentation, and it can be edited/previewed in many editors and platforms[^1]. It naturally accepts all the programming languages that one can use in Jupyter. But there again, it’s easy to start, i.e. decide to represent Markdown cells as text and include code cells within Markdown code blocks prefixed with ```python, but it’s harder to go to the next step and decide how to separate consecutive Markdown cells, how to represent notebook or cell metadata, or, as you mention, define which parts of the code are executable.

For now in Jupytext we have given precedence to the principle that the text version should look like the notebook when previewed, and hence used e.g. HTML comments to represent the cell breaks and include the metadata, but that does make manual typing a bit cumbersome.

Regarding the idea of companion files, @takluyver, we did implement that in Jupytext, and it really is a great idea. It’s so convenient to be able to edit any of the representations, either text or .ipynb. Also the complete notebook (with outputs and metadata) is always available in the .ipynb file.

Finally, a word on outputs. I like very much the idea of saving them in a companion folder, in addition to the Markdown file, and I’d be curious to work on that when time permits. This way, we would avoid the duplication of the notebook inputs. And, if things were done well, we could directly use that representation of the notebook as the input for a Jekyll or Hugo blog post or chapter… But maybe that leads to too many other questions (e.g. how to include the outputs in a Markdown file?)!

[^1]: for that reason I have a preference for Markdown with .md extension rather than for R Markdown with .Rmd extension

Thanks, thatā€™s interesting to hear.

One minor part of the idea, where it sounds like your version differs, was that the companion file would be a binary format (most likely zip), so that tools like git wouldn’t even try to diff or merge it except with a plugin. The idea was that treating it like a binary blob, where you just see that it’s changed with no details, would be a better experience than text-based diffs of JSON.

This is maybe not that important, and maybe designing formats around a specific external tool is a bad idea anyway, but it seems worth remembering for a discussion like this. :slight_smile:

Thanks @takluyver, yes indeed I liked the idea of the zip file! Zip files are easier to share than a master file + directory, and that’s true, I’ve seen plugins able to show the nested differences when required.

Also I agree with your comment above that files are not naturally ordered. Maybe something we could do is to give explicit default names to the cells, e.g. unnamed_code_cell_1. That would be an invitation for users to name their cells if they want a) fewer diffs in the output names and b) more meaningful output names. R Markdown does this, and I found it more natural than the random output names generated by nbconvert :slight_smile:.

But before that I’d like to find a convincing way to include the outputs (other than images, e.g. text and HTML outputs) in the main Markdown document… Did anyone look at that question before? Can I use <iframe>, or something like Jekyll’s {% include_relative ... %}? Any chance that I can use a shortcode in Hugo that would be compatible with Jekyll’s include_relative?

Having # %% presupposes a line comment starting with #, which may not be true in a particular notebook’s language.

The one language we do know exists in a notebook is Markdown. I think we give up something if a text-based format cannot be run as a normal file in a target language, but we also gain something if the text-based format is language agnostic.

Yes, I did not mention that… the comment char # in # %% actually stands for the language’s line comment, at least that’s how we implemented the support for 18 languages in Jupytext (I wish we had the information of the line comment in the notebook itself :slight_smile:)
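Roughly, the idea can be sketched as a lookup from language to line-comment prefix. The table below is a small illustrative sample, not Jupytext's actual list of supported languages, and `cell_marker` is a made-up helper.

```python
# Illustrative sample only; a real tool would need a fuller, maintained table.
COMMENT_PREFIX = {
    "python": "#",
    "r": "#",
    "julia": "#",
    "javascript": "//",
    "c++": "//",
    "haskell": "--",
}


def cell_marker(language: str, cell_type: str = "code") -> str:
    """Build the '%%' cell marker using the language's line-comment prefix."""
    prefix = COMMENT_PREFIX[language.lower()]
    suffix = "" if cell_type == "code" else f" [{cell_type}]"
    return f"{prefix} %%{suffix}"


python_marker = cell_marker("python")
js_markdown_marker = cell_marker("JavaScript", "markdown")
```

Because the notebook itself does not record the comment character, a table like this has to live in the tool, which is the limitation mentioned above.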

Agreed. And for some notebooks (tutorials, books, documentation…) I do love using that format. But for some other notebooks (those with a lot of code), I may prefer the script format, since it lets me refactor and edit the notebook in the IDE.

A quick update from us: over in https://github.com/ExecutableBookProject/MyST-NB/issues/12 we worked out a prototype for how we’ll represent notebooks in the Jupyter Book project (using a flavor of markdown called MyST). We welcome thoughts or feedback!

Here’s a short example of how a notebook will be represented in that format:

---
kernel_info:
    name: python3
language_info:
    name: Python
title: "My notebook title"
comment: "If any of the above aren't specified then use jupyter defaults"
---

# Markdown syntax

## Cell breaks

We can manually break markdown cells quickly with this syntax

+++ {"cell": "meta", "cell2": "meta2"}

## Markdown metadata

We can also explicitly separate a markdown cell and configure it like so:

```{markdown} tag1, tag2
---
key: val
---
## Here is some *configured* markdown!
```

We can also provide a `:key: val` shorthand for configuring

```{markdown} tag1, tag2
:key: val
## Here is some *configured* markdown!
```

## Executable code

Code is always executed with 'execute' blocks, like so:

```{code-cell}
print('this would be run by the front-matter-specified, or default, kernel')
```

You can also add metadata to these

```{code-cell} kernelname
:key: val
:key2: val2
:tags: ["tag1", "tag2"]
print('some python with cell metadata')
```
and that's it!
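To give a feel for the cell-break part of the example, here is a minimal sketch that splits markdown text on `+++` lines. It is not the MyST-NB implementation: it ignores the YAML front matter and fenced blocks entirely, `split_markdown_cells` is a made-up name, and attaching the JSON metadata to the cell that follows the break is my assumption.

```python
import json
import re

# "+++" on its own line, optionally followed by a JSON object of metadata.
BREAK = re.compile(r"^\+\+\+\s*(?P<meta>\{.*\})?\s*$")


def split_markdown_cells(text: str) -> list[dict]:
    """Split markdown text into cells at '+++' break lines."""
    cells = [{"metadata": {}, "lines": []}]
    for line in text.splitlines():
        match = BREAK.match(line)
        if match:
            meta = json.loads(match.group("meta")) if match.group("meta") else {}
            # Assumption: the metadata belongs to the cell after the break.
            cells.append({"metadata": meta, "lines": []})
        else:
            cells[-1]["lines"].append(line)
    return [
        {"cell_type": "markdown",
         "metadata": cell["metadata"],
         "source": "\n".join(cell["lines"]).strip()}
        for cell in cells
        if "".join(cell["lines"]).strip()  # skip empty cells
    ]


sample = 'First cell\n\n+++ {"cell": "meta"}\n\nSecond cell\n'
parsed = split_markdown_cells(sample)
```

Here `parsed` contains two markdown cells, and only the second one carries the `{"cell": "meta"}` metadata.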