Should Jupyter recommend a text-based representation of the notebook?

choldgraf · February 7, 2020, 11:37pm

Over time we have seen an increasing interest in using Jupyter as a part of traditional text-based collaborative workflows (e.g. git, diffing, etc). There seem to be more and more projects that create their own text-based specification for how a notebook is structured. These projects generally do this in a very ad-hoc way that suits their own needs.

I suspect that this will only become more common, especially as excellent projects like jupytext gain traction.

To that end, I am curious if folks think it would be useful to try and decide on a “recommended” text-based specification for Jupyter Notebooks. This doesn’t have to be an officially-supported spec in, e.g., JupyterLab etc, but it could be a community recommendation as others build tooling in the ecosystem, to try and reduce the forking paths of one-off specifications.

Do folks think it would be a good idea to, say, open up a JEP to begin discussion of this?

Note - I’m not talking about changing the default-supported markdown in Jupyter interfaces, but instead just talking about how one could represent the structure of the notebook (e.g., the content blocks, metadata about the blocks, and if/how to include outputs)

choldgraf · February 7, 2020, 11:39pm

cc @mwouts who has been thinking through this recently…I wonder if he thinks it would be helpful for Jupyter to provide this kind of guidance

dhirschfeld · February 8, 2020, 3:20am

IIUC both jupyterlab-debugger and jupyterlab-lsp have their own shadow/virtual filesystems since standard tooling doesn’t work with the notebook JSON format.

I’m not really across the details so you might have to reach out to the devs for details.

In terms of prior-art, I’ve always thought @takluyver’s nbexplode was a good idea which could solve a lot of problems. Sure, it might introduce some other issues but maybe it will solve more problems than it creates. If nothing else, there may be some lessons to be learned from it…

choldgraf · February 8, 2020, 10:15am

Thanks for the extra links @dhirschfeld! I also think nbexplode was a clever idea. I kind of wonder if there is an intermediate between nbexplode and the current state of things, where each notebook is 2 files:

A human-friendly text file that only contains the content of the notebook and some kind of cell structure
An ipynb file that has all of the metadata/outputs/etc that one needs to actually render the notebook

Then you’d find a system to ensure that the two files are in-sync with one another. Maybe something like a server extension to auto-mirror changes when using a Jupyter interface, and default to “the text file is the master copy” if any ambiguities pop up. You could already almost do this using the Jupytext server extension.
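As a rough illustration of the “text file is the master copy” rule, here is a minimal sketch using only the standard library. It is not how the Jupytext server extension works; `sync_from_text` is a hypothetical helper, and it assumes cells are matched by position, which is the simplest possible policy. Outputs survive only when a cell's type and source are unchanged.

```python
import copy


def sync_from_text(text_cells, notebook):
    """Return a copy of `notebook` whose cells follow `text_cells`.

    text_cells: list of (cell_type, source) tuples read from the text file.
    notebook:   dict laid out like an ipynb (nbformat 4) file.
    Hypothetical helper: cells are matched by position only.
    """
    old_cells = notebook.get("cells", [])
    merged = []
    for index, (cell_type, source) in enumerate(text_cells):
        old = old_cells[index] if index < len(old_cells) else None
        unchanged = (
            old is not None
            and old.get("cell_type") == cell_type
            # ipynb sources may be a string or a list of strings
            and "".join(old.get("source", "")) == source
        )
        if unchanged:
            # Same content: keep outputs and metadata from the ipynb file
            merged.append(copy.deepcopy(old))
            continue
        # The text file wins: rebuild the cell and drop stale outputs
        cell = {"cell_type": cell_type, "metadata": {}, "source": source}
        if cell_type == "code":
            cell["execution_count"] = None
            cell["outputs"] = []
        merged.append(cell)
    result = copy.deepcopy(notebook)
    result["cells"] = merged
    return result
```

A real tool would need a smarter matching strategy (for example by cell name or by diffing sources), since inserting one cell at the top would otherwise discard every output below it.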

psychemedia · February 8, 2020, 11:55am

On the explode idea, I was using a decomposition of a notebook into a sqlite db for some search experiments way back when using sqlbiter.

I’m finding using Jupytext with linked ipynb and py documents useful. Tagging notebook cells with active-ipynb means these are commented out in the py paired document. The notebook can then have full rich output display, as well as lots of demonstrations of calling functions and displaying their outputs, running tests etc. The paired py document is loadable as a module exposing just the functions that aren’t in active-ipynb tagged code cells. A separate script can be used to filter out active-ipynb tagged cells exported to a py file to give a cleaner py file if required.
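The filtering step could look something like the sketch below. This is not the actual script; `drop_tagged_cells` is a hypothetical name, and it assumes the filter runs on the notebook's JSON structure (tags stored under `metadata.tags`) before the py file is exported.

```python
import copy


def drop_tagged_cells(notebook, tag="active-ipynb"):
    """Return a copy of an ipynb-style dict without cells carrying `tag`.

    Hypothetical helper: assumes tags live in cell["metadata"]["tags"].
    """
    cleaned = copy.deepcopy(notebook)
    cleaned["cells"] = [
        cell
        for cell in cleaned.get("cells", [])
        if tag not in cell.get("metadata", {}).get("tags", [])
    ]
    return cleaned
```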

dhirschfeld · February 8, 2020, 12:39pm

A human-friendly text file that only contains the content of the notebook and some kind of cell structure

Spyder implements notebook cell-like behaviour by splitting up a single text file with # %% markers/sentinels:

https://docs.spyder-ide.org/editor.html#defining-code-cells

…and it seems VSCode has taken a leaf out of Spyder’s book:

I haven’t really used either functionality so couldn’t comment further - just trying to summarise current APIs.

mgeier · February 8, 2020, 1:15pm

I’ve been trying to come up with a new serialization format for Jupyter notebooks: https://jupyter-format.readthedocs.io/.

I was originally thinking about using YAML, but now I think it should be a custom format, see https://jupyter-format.readthedocs.io/motivation.html for more details.

The idea of this format is that it is really human readable (unlike JSON) and that it produces nice diffs (unlike JSON) but it can still contain arbitrary outputs and metadata (just like JSON; in fact, using JSON for metadata). No information is lost, it’s just a different serialization format.

takluyver · February 8, 2020, 4:32pm

nbexplode is a fun experiment that I’m still kind of fond of, but I don’t think anything like that will be practical unless version control systems change dramatically. The idea was that version control systems don’t know about any structure within the file, so why not use the structure they do know about: nested folders. But it falls down because folders are only dicts - there’s no equivalent of an ordered list/array of files.

We discussed at one point the idea of a ‘companion file’: the idea that each notebook would comprise a text-based, version control friendly file of code and markdown, and a zip file containing things like rich outputs. You could either track only the first part in version control, or check in the companion file as well, but treat that like a binary blob, not trying to diff or merge it. But I don’t think it ever got implemented.
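To make the companion-file idea concrete, here is a minimal sketch using only the standard library. Since the idea was never implemented, everything here is an assumption: the `split_notebook` name, the `# %%` layout of the text part, and the `outputs/cell_NNNN.json` naming inside the zip are all hypothetical.

```python
import io
import json
import zipfile


def split_notebook(notebook):
    """Split an ipynb-style dict into (text, companion_zip_bytes).

    Hypothetical layout: inputs go to a '# %%' script, outputs go to
    one JSON file per code cell inside a zip archive.
    """
    text_parts = []
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for index, cell in enumerate(notebook.get("cells", [])):
            # ipynb sources may be a string or a list of strings
            source = "".join(cell.get("source", ""))
            if cell.get("cell_type") == "code":
                text_parts.append("# %%\n" + source)
                outputs = cell.get("outputs", [])
                if outputs:
                    archive.writestr(
                        f"outputs/cell_{index:04d}.json",
                        json.dumps(outputs, indent=1),
                    )
            else:
                # Non-code cells are stored as comment lines
                commented = "\n".join(
                    "# " + line if line else "#" for line in source.split("\n")
                )
                text_parts.append("# %% [markdown]\n" + commented)
    return "\n\n".join(text_parts) + "\n", buffer.getvalue()
```

The text part is what you would diff and merge; the zip bytes would be checked in (or not) as an opaque blob.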

choldgraf · February 8, 2020, 9:54pm

Thanks all for some great links and ideas. A few quick points from me:

To me this question is more around standards than around technology. E.g., there could be a simplified text-based specification for a notebook that didn’t have any specific tech behind it (maybe Jupytext could have a reference implementation or something).

As the links in this thread show, people out there are trying many things, and at some point there’s value in collapsing this space back to a smaller number of things so that we can standardize and start building on top of those new standards.

I think it would be a simpler process if we didn’t frame this as trying to replace the ipynb format, but instead how we can define a human-friendly way to store the content of a notebook, so that the ipynb file can be used to store all of the extra information most humans don’t care about in machine-readable ways.

One starting point proposal

To me, the most obvious structure to use would be something like Pandoc markdown block syntax mixed w/ RMarkdown cell syntax (I think it would be worth drawing on both of these as they are pre-existing de-facto standards even if neither is an “actual” standard in the way that CommonMark is)

Code cells are denoted with (ignore the backslashes):
```
\```{language}
\```
```
Markdown cells are anything in-between code cells unless explicitly specified otherwise
Markdown renders code blocks the same way that we do now, with e.g.: \```python

Manual splits between markdown cells are created with ::: syntax, like

# This is my markdown
here is content

:::
## Here is another markdown cell
:::

And this would now be a third markdown cell.

If you wanted a different type of text cell (e.g. raw etc) you’d specify it with a name after the colons:
```
::: raw
Some raw text
:::
```
Metadata could be given one of two ways
- As in-line attributes given in { }, where values starting with . are treated as tags.
```
::: {key=val .tag1 .tag2}
Some content
:::
```
- As YAML front-matter that is parsed first within the containing content of that cell, e.g. (again ignore slashes)
```
\```{python}
mygroup:
   - mykey: myval
   - mykey2: myval2
mykey3: myval3
---
# This is valid python
print('hi')
\```
```
Notebook-level metadata would be stored in a YAML header at the top of the page
Any metadata stored in the “content” format would be loaded into the ipynb format, and there could be metadata in the ipynb format that doesn’t make it into the text format.
The file extension of this format would be .jmd or .imd, or I suppose it could also just be .md since all the other markdown flavors also just overload that file extension too…
Taken by itself, this would only define the content structure of a notebook, it doesn’t know anything about outputs or programmatically generated metadata. Over time, tooling could be built to more cleverly handle synchronization between these formats
Over time, perhaps this specification could be extended to handle more complex information like outputs, but at a start we’d keep it content-focused.

That’s one idea but I’m sure there are many others to explore. I think it’d be useful to do so in a structured way.
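To get a feel for how the ::: opening lines might be read, here is a minimal parsing sketch. It is only an illustration of the proposal above, not an existing tool: the function names are hypothetical, and it assumes attribute values contain no spaces and no quoting.

```python
import re

# Matches '::: {attrs}' or '::: name'; a bare ':::' does not match
_OPENER = re.compile(r"^:::\s*(?:\{(?P<attrs>[^}]*)\}|(?P<name>\w+))\s*$")


def parse_attributes(text):
    """Parse 'key=val .tag1 .tag2' into (metadata, tags)."""
    metadata, tags = {}, []
    for token in text.split():
        if token.startswith("."):
            tags.append(token[1:])
        elif "=" in token:
            key, value = token.split("=", 1)
            metadata[key] = value
    return metadata, tags


def parse_opening_line(line):
    """Describe a ':::' opener, or return None for any other line."""
    match = _OPENER.match(line)
    if match is None:
        return None
    if match.group("name"):
        # e.g. '::: raw' selects a different cell type
        return {"cell_type": match.group("name"), "metadata": {}, "tags": []}
    metadata, tags = parse_attributes(match.group("attrs"))
    return {"cell_type": "markdown", "metadata": metadata, "tags": tags}
```

For example, `parse_opening_line("::: {key=val .tag1 .tag2}")` gives a markdown cell with metadata `{"key": "val"}` and tags `["tag1", "tag2"]`.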

Anton_Akhmerov · February 8, 2020, 10:20pm

I’m unsure about this: it depends on the kind of problem the text representation solves. To me the main use of jupytext is the ease of version control and the ability of manual editing. What would be the goal of the globally recommended text representation?

Doesn’t this go against the notebook spec where the language would be global to a complete notebook? How would the case of multiple languages be interpreted? Or would this be an illegal notebook representation?

choldgraf · February 8, 2020, 10:36pm

I think the main goal would be a combination of:

Minimize the number of text-based versions of notebooks by agreeing on a standard
Having more voices and opinions be considered in the creation of any one standard

Right now for example, Jupytext supports many text-based representations, each of which was created with a particular perspective in-mind. That’s fine, but I’m sure that to some degree there is overlapping functionality and goals in each of those perspectives, and they’d benefit from a single format that could be jointly-used rather than multiple formats that were created as one-off solutions for a particular tool, community, etc.

Good point - I still think notebooks should have a “one kernel per notebook” mapping. I was just trying to think of a way to distinguish “runnable code” from “code blocks”. An alternative was proposed by @mwouts in mwouts/jupytext issue #422 (“Consider using pandoc markdown for "div"s and RMarkdown for code cells in Jupytext markdown”): he suggested using ~~~ to denote “runnable code blocks” and backticks to denote markdown code blocks. In the end, I care more that there’s a standard than that a particular syntax gets used.

jlperla · February 9, 2020, 2:39pm

The purpose of that sort of thing is for “typesetting” of multiple languages. For example, instructions in bash, an example in yaml intermixed with executable code. The bookdown documentation has lots of great examples for this.

psychemedia · February 9, 2020, 3:44pm

To what extent does Rmd fall short of what you’re thinking?

One advantage of the ipynb format is that it goes some way to capturing cell outputs, which can be a wide variety of mimetypes.

Document formats like docx have directories containing media assets (I think?) which allow documents to be self-contained in a zip package. A similar storage format could be useful if for example, you create a video or iframed HTML assets, although it would be nice if these assets were linked in a simple relative address way from within the main document.

mwouts · February 9, 2020, 6:01pm

Thanks @choldgraf for starting this conversation!

Well, text representations are useful for

version control
copy/pasting the content of a notebook to another one (& templating)
refactoring the code in a notebook
executing or debugging a notebook as a script
rendering a notebook in another context than in Jupyter (e.g. as Markdown on Github)

In my opinion the format that is the closest to being a standard text representation for the notebook is the double-percent format (scripts with cells indicated with # %%, markdown cells with # %% [markdown] or # %% [md]). It has the longest history (it was introduced by Spyder 5-6 years ago), and is supported by many other editors (Atom/Hydrogen, PTVS, VS Code, PyCharm Pro).

I think that, if Jupyter wanted to recommend a text format for notebooks, it should start with that one.

But to tell the truth, it will probably not be easy to get everyone to agree on the format, even on this one. Clearly the specs should say how a code cell should be represented (all editors seem to agree with # %%). And markdown cells (not all editors agree yet, but I’m sure they will follow the Jupyter choice). Then, should the spec include notebook and cell metadata? Should the cell name have a special representation (Spyder cells may have titles, which unlike Jupyter cell names, may not be unique…)? Should Markdown cells be encoded in multiline strings? (Probably not easy to implement for all languages…) Should the script follow PEP8 when all the input cells do?
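For readers who haven't seen the double-percent format, a minimal reader for it might look like the sketch below. This is not Jupytext's or Spyder's parser: `split_percent_script` is a hypothetical name, it assumes Python-style `#` comments, treats any text after the marker as a cell title, and ignores metadata entirely.

```python
import re

_MARKER = re.compile(r"^# %%(?P<rest>.*)$")


def _uncomment(line):
    """Strip the leading '# ' from a commented Markdown line."""
    if line.startswith("# "):
        return line[2:]
    return "" if line == "#" else line


def split_percent_script(text):
    """Split a '# %%' script into dicts with cell_type, title and source."""
    cells = []
    current = None
    for line in text.splitlines():
        match = _MARKER.match(line)
        if match:
            rest = match.group("rest").strip()
            is_markdown = "[markdown]" in rest or "[md]" in rest
            title = rest.replace("[markdown]", "").replace("[md]", "").strip()
            current = {
                "cell_type": "markdown" if is_markdown else "code",
                "title": title,
                "lines": [],
            }
            cells.append(current)
            continue
        if current is None:
            if not line.strip():
                continue  # skip blank lines before the first cell
            # Text before any marker is treated as an untitled code cell
            current = {"cell_type": "code", "title": "", "lines": []}
            cells.append(current)
        current["lines"].append(line)
    result = []
    for cell in cells:
        lines = cell["lines"]
        if cell["cell_type"] == "markdown":
            lines = [_uncomment(line) for line in lines]
        result.append({
            "cell_type": cell["cell_type"],
            "title": cell["title"],
            "source": "\n".join(lines).strip("\n"),
        })
    return result
```

Even this small sketch has to take a position on several of the open questions above (titles, how Markdown is encoded, what happens before the first marker), which is part of why a written spec would help.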

mwouts · February 9, 2020, 6:17pm

That being said, another format that I like a lot is Markdown. It is a great format for writing documentation, and it can be edited/previewed in many editors and platforms[^1]. It naturally accepts all the programming languages that one can use in Jupyter. But there again, it’s easy to start, i.e. decide to represent Markdown cells as text, and include code cells within Markdown code blocks prefixed with ```python, but it’s harder to go to the next step and decide how to separate consecutive Markdown cells, how to represent notebook or cell metadata, or as you mention, define which parts of the code are executable, or not.

For now in Jupytext we have given the precedence to the principle that the text version should look like the notebook when previewed, and hence used e.g. HTML comments to represent the cell breaks and include the metadata, but that does make manual typing a bit cumbersome.

Regarding the idea of companion files, @takluyver, we did implement that in Jupytext, and it really is a great idea. It’s so convenient to be able to edit any of the representations, either text or .ipynb. Also the complete notebook (with outputs and metadata) is always available in the .ipynb file.

Finally, a word on outputs. I like very much the idea of saving them in a companion folder, in addition to the Markdown file, and I’d be curious to work on that when time permits. This way, we would avoid the duplication of the notebook inputs. And, if things were done well, we could directly use that representation of the notebook as the input for a Jekyll or Hugo blog post or chapter… But maybe that leads to too many other questions (e.g. how to include the outputs in a Markdown file?)!

[^1]: for that reason I have a preference for Markdown with .md extension rather than for R Markdown with .Rmd extension

takluyver · February 10, 2020, 9:49am

Thanks, that’s interesting to hear.

One minor part of the idea, where it sounds like your version differs, was that the companion file would be a binary format (most likely zip), so that tools like git wouldn’t even try to diff or merge it except with a plugin. The idea was that treating it like a binary blob, where you just see that it’s changed with no details, would be a better experience than text-based diffs of JSON.

This is maybe not that important, and maybe designing formats around a specific external tool is a bad idea anyway, but it seems worth remembering for a discussion like this.

mwouts · February 10, 2020, 11:24pm

Thanks @takluyver, yes indeed I liked the idea of the zip file! They are easier to share than a master file + directory, and that’s true, I’ve seen plugins able to show the nested differences when required.

Also I agree with your comment above that files are not naturally ordered. Maybe something we could do is to give explicit default names to the cells like e.g. unnamed_code_cell_1. That would be an invitation for the users to name their cells if they want a) fewer diffs in the output names and b) more meaningful output names. R Markdown does this, and I found it more natural than the random output names generated by nbconvert.
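A minimal sketch of that default-naming idea, under my own assumptions: the name is stored under a `name` key in the cell metadata, and the counter is per cell type and counts named cells too, so that naming one cell does not renumber the others. `assign_default_names` is a hypothetical helper.

```python
def assign_default_names(cells):
    """Return copies of `cells` where every cell has metadata['name'].

    Hypothetical scheme: unnamed cells get 'unnamed_<type>_cell_<n>',
    where n is the position among cells of the same type.
    """
    counters = {}
    named = []
    for cell in cells:
        metadata = dict(cell.get("metadata", {}))
        cell_type = cell.get("cell_type", "code")
        counters[cell_type] = counters.get(cell_type, 0) + 1
        if "name" not in metadata:
            metadata["name"] = f"unnamed_{cell_type}_cell_{counters[cell_type]}"
        named.append({**cell, "metadata": metadata})
    return named
```

Inserting a new cell would still shift the default names of the cells after it, which is exactly the incentive to give cells real names.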

But before that I’d like to find a convincing way to include the outputs (other than images, e.g. text and HTML outputs) in the main Markdown document… Did anyone look at that question before? Can I use <iframe>, or something like Jekyll’s {% include_relative ... %}? Any chance that I can use a shortcode in Hugo that would be compatible with Jekyll’s include_relative?

jasongrout · February 11, 2020, 12:30am

Having # %% presupposes a line comment starting with #, which may not be true in a particular notebook’s language.

The one language we do know exists in a notebook is Markdown. I think we give up something if a text-based format cannot be run as a normal file in a target language, but we also gain something if the text-based format is language agnostic.

mwouts · February 11, 2020, 1:29am

Yes, I did not mention that… the comment char # in # %% actually stands for the language’s line comment; at least that’s how we implemented the support for 18 languages in Jupytext (I wish we had the information of the line comment in the notebook itself).
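In other words, the marker is built from the language's line comment rather than hard-coded. A tiny sketch of that idea follows; the table is a partial, illustrative one that I wrote for this example, not the list Jupytext actually uses, and `cell_marker` is a hypothetical name.

```python
# Illustrative, partial table of line-comment prefixes (an assumption,
# not Jupytext's actual language table)
LINE_COMMENT = {
    "python": "#",
    "r": "#",
    "julia": "#",
    "javascript": "//",
    "cpp": "//",
    "haskell": "--",
}


def cell_marker(language, cell_type="code"):
    """Build the double-percent marker for a language and cell type."""
    prefix = LINE_COMMENT[language.lower()]
    marker = f"{prefix} %%"
    if cell_type == "markdown":
        marker += " [markdown]"
    return marker
```

Because the notebook itself does not record the comment prefix, any tool has to carry a table like this one.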

Agreed. And for some notebooks (tutorials, books, documentation…) I do love using that format. But for some other notebooks (those with a lot of code), I may prefer the script format, since it lets me refactor and edit the notebook in the IDE.

choldgraf · March 13, 2020, 10:12pm

A quick update from us: over in https://github.com/ExecutableBookProject/MyST-NB/issues/12 we worked out a prototype for how we’ll represent notebooks in the Jupyter Book project (using a flavor of markdown called MyST). We welcome thoughts or feedback!

Here’s a short example of how a notebook will be represented in that format:

---
kernel_info:
    name: python3
language_info:
    name: Python
title: "My notebook title"
comment: "If any of the above aren't specified then use jupyter defaults"
---

# Markdown syntax

## Cell breaks

We can manually break markdown cells quickly with this syntax

+++ {"cell": "meta", "cell2": "meta2"}

## Markdown metadata

We can also explicitly separate a markdown cell and configure it like so:

```{markdown} tag1, tag2
---
key: val
---
## Here is some *configured* markdown!
```

We can also provide a `:key: val` shorthand for configuring

```{markdown} tag1, tag2
:key: val
## Here is some *configured* markdown!
```

## Executable code

Code is always executed with 'execute' blocks, like so:

```{code-cell}
print('this would be run by the front-matter-specified, or default, kernel')
```

You can also add metadata to these

```{code-cell} kernelname
:key: val
:key2: val2
:tags: ["tag1", "tag2"]
print('some python with cell metadata')
```
and that's it!
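To show how the +++ cell breaks in the example above could be read, here is a minimal sketch. It is not the MyST-NB implementation: `split_on_cell_breaks` is a hypothetical helper, it assumes the metadata after +++ is a single-line JSON object, it ignores the YAML header and the code-cell blocks, and it drops cells that contain only blank lines.

```python
import json

FENCE = "`" * 3  # three backticks, opening or closing a fenced block


def split_on_cell_breaks(text):
    """Split Markdown on '+++' lines into dicts with metadata and source."""
    cells = [{"metadata": {}, "lines": []}]
    in_fence = False
    for line in text.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence
        if not in_fence and line.startswith("+++"):
            # Optional JSON metadata follows the break marker
            payload = line[3:].strip()
            metadata = json.loads(payload) if payload else {}
            cells.append({"metadata": metadata, "lines": []})
            continue
        cells[-1]["lines"].append(line)
    return [
        {"metadata": cell["metadata"], "source": "\n".join(cell["lines"]).strip("\n")}
        for cell in cells
        if any(line.strip() for line in cell["lines"])
    ]
```

Lines starting with +++ inside a fenced block are left alone, so code that happens to contain the marker does not split a cell.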
