Over time we have seen an increasing interest in using Jupyter as a part of traditional text-based collaborative workflows (e.g. git, diffing, etc). There seem to be more and more projects that create their own text-based specification for how a notebook is structured. These projects generally do this in a very ad-hoc way that suits their own needs.
I suspect that this will only become more common, especially as excellent projects like jupytext gain traction.
To that end, I am curious if folks think it would be useful to try and decide on a "recommended" text-based specification for Jupyter Notebooks. This doesn't have to be an officially-supported spec in, e.g., JupyterLab etc, but it could be a community recommendation as others build tooling in the ecosystem, to try and reduce the forking paths of one-off specifications.
Do folks think it would be a good idea to, say, open up a JEP to begin discussion of this?
Note - I'm not talking about changing the default-supported markdown in Jupyter interfaces, but instead just talking about how one could represent the structure of the notebook (e.g., the content blocks, metadata about the blocks, and if/how to include outputs).
IIUC both jupyterlab-debugger and jupyterlab-lsp have their own shadow/virtual filesystems since standard tooling doesn't work with the notebook JSON format.
I'm not really across the details so you might have to reach out to the devs for details.
In terms of prior art, I've always thought @takluyver's nbexplode was a good idea which could solve a lot of problems. Sure, it might introduce some other issues but maybe it will solve more problems than it creates. If nothing else, there may be some lessons to be learned from it...
Thanks for the extra links @dhirschfeld! I also think nbexplode was a clever idea. I kind of wonder if there is an intermediate between nbexplode and the current state of things, where each notebook is 2 files:
A human-friendly text file that only contains the content of the notebook and some kind of cell structure
An ipynb file that has all of the metadata/outputs/etc that one needs to actually render the notebook
Then you'd find a system to ensure that the two files are in sync with one another. Maybe something like a server extension to auto-mirror changes when using a Jupyter interface, and default to "the text file is the master copy" if any ambiguities pop up. You could already almost do this using the Jupytext server extension.
On the explode idea, I was using a decomposition of a notebook into a sqlite db for some search experiments way back when using sqlbiter.
I'm finding using Jupytext with linked ipynb and py documents useful. Tagging notebook cells with active-ipynb means these are commented out in the py paired document. The notebook can then have full rich output display, as well as lots of demonstrations of calling functions and displaying their outputs, running tests etc. The paired py document is loadable as a module exposing just the functions that aren't in active-ipynb tagged code cells. A separate script can be used to filter out active-ipynb tagged cells exported to a py file to give a cleaner py file if required.
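For illustration, the filtering step can be sketched at the notebook-JSON level rather than on the exported py file. This is only a minimal sketch, not Jupytext's implementation: `drop_tagged_cells` is a hypothetical helper, and it assumes the nbformat v4 layout where tags live in each cell's `metadata.tags` list.

```python
import copy


def drop_tagged_cells(notebook, tag="active-ipynb"):
    """Return a copy of a notebook dict without the cells carrying `tag`."""
    filtered = copy.deepcopy(notebook)  # leave the caller's notebook untouched
    filtered["cells"] = [
        cell
        for cell in filtered["cells"]
        if tag not in cell.get("metadata", {}).get("tags", [])
    ]
    return filtered


# A tiny hand-written notebook: one function definition, one demo call.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "metadata": {},
         "source": "def add(a, b):\n    return a + b"},
        {"cell_type": "code", "metadata": {"tags": ["active-ipynb"]},
         "source": "add(1, 2)"},
    ],
}

module_only = drop_tagged_cells(notebook)
print([cell["source"] for cell in module_only["cells"]])
```

Only the function definition survives, which is the "cleaner py file" idea applied before export instead of after.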
The idea of this format is that it is really human readable (unlike JSON) and that it produces nice diffs (unlike JSON) but it can still contain arbitrary outputs and metadata (just like JSON; in fact, using JSON for metadata). No information is lost, it's just a different serialization format.
nbexplode is a fun experiment that I'm still kind of fond of, but I don't think anything like that will be practical unless version control systems change dramatically. The idea was that version control systems don't know about any structure within the file, so why not use the structure they do know about: nested folders. But it falls down because folders are only dicts - there's no equivalent of an ordered list/array of files.
We discussed at one point the idea of a "companion file": the idea that each notebook would comprise a text-based, version control friendly file of code and markdown, and a zip file containing things like rich outputs. You could either track only the first part in version control, or check in the companion file as well, but treat that like a binary blob, not trying to diff or merge it. But I don't think it ever got implemented.
Thanks all for some great links and ideas. A few quick points from me:
To me this question is more around standards than around technology. E.g., there could be a simplified text-based specification for a notebook that didn't have any specific tech behind it (maybe Jupytext could have a reference implementation or something).
As the links in this thread show, people out there are trying many things, and at some point there's value in collapsing this space back to a smaller number of things so that we can standardize and start building on top of those new standards.
I think it would be a simpler process if we didn't frame this as trying to replace the ipynb format, but instead how we can define a human-friendly way to store the content of a notebook, so that the ipynb file can be used to store all of the extra information most humans don't care about in machine-readable ways.
One starting point proposal
To me, the most obvious structure to use would be something like Pandoc markdown block syntax mixed w/ RMarkdown cell syntax (I think it would be worth drawing on both of these as they are pre-existing de-facto standards even if neither is an "actual" standard in the way that CommonMark is).
Code cells are denoted with (ignore the backslashes): \```{python}
Markdown renders code blocks the same way that we do now, with e.g.: \```python
Manual splits between markdown cells are created with ::: syntax, like
# This is my markdown
here is content
:::
## Here is another markdown cell
:::
And this would now be a third markdown cell.
If you wanted a different type of text cell (e.g. raw etc) you'd specify it with a name after the opening :::
::: raw
Some raw text
:::
Metadata could be given one of two ways
As in-line attributes given in { }, where vals starting with . are treated as tags.
::: {key=val .tag1 .tag2}
Some content
:::
As YAML front-matter that is parsed first within the containing content of that cell, e.g. (again ignore slashes)
\```{python}
mygroup:
- mykey: myval
- mykey2: myval2
mykey3: myval3
---
# This is valid python
print('hi')
\```
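To make the in-line attribute idea a bit more concrete, here is a rough sketch of how a `{key=val .tag1 .tag2}` header could be read. `parse_attributes` is a hypothetical helper, not part of any existing tool, and it assumes values contain no spaces or quotes (a real spec would need to say how those are escaped).

```python
import re


def parse_attributes(header):
    """Read 'key=val' pairs and '.tag' tokens from a '{...}' attribute block."""
    metadata, tags = {}, []
    match = re.search(r"\{(.*?)\}", header)
    if match:
        for token in match.group(1).split():
            if token.startswith("."):
                # Tokens starting with '.' are treated as tags.
                tags.append(token[1:])
            elif "=" in token:
                key, value = token.split("=", 1)
                metadata[key] = value
    if tags:
        metadata["tags"] = tags
    return metadata


print(parse_attributes("::: {key=val .tag1 .tag2}"))
```

Storing the tags under a `tags` key is an assumption that mirrors where ipynb cell metadata keeps them, so the result could be copied into a cell's metadata as-is.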
Notebook-level metadata would be stored in a YAML header at the top of the page
Any metadata stored in the "content" format would be loaded into the ipynb format, and there could be metadata in the ipynb format that doesn't make it into the text format.
The file extension of this format would be .jmd or .imd, or I suppose it could also just be .md since all the other markdown flavors also just overload that file extension too...
Taken by itself, this would only define the content structure of a notebook; it doesn't know anything about outputs or programmatically generated metadata. Over time, tooling could be built to more cleverly handle synchronization between these formats.
Over time, perhaps this specification could be extended to handle more complex information like outputs, but at the start we'd keep it content-focused.
That's one idea but I'm sure there are many others to explore. I think it'd be useful to do so in a structured way.
I'm unsure about this: it depends on the kind of problem the text representation solves. To me the main use of jupytext is the ease of version control and the ability of manual editing. What would be the goal of the globally recommended text representation?
Doesn't this go against the notebook spec where the language would be global to a complete notebook? How would the case of multiple languages be interpreted? Or would this be an illegal notebook representation?
Minimize the number of text-based versions of notebooks by agreeing on a standard
Having more voices and opinions be considered in the creation of any one standard
Right now for example, Jupytext supports many text-based representations, each of which was created with a particular perspective in mind. That's fine, but I'm sure that to some degree there is overlapping functionality and goals in each of those perspectives, and they'd benefit from a single format that could be jointly used rather than multiple formats that were created as one-off solutions for a particular tool, community, etc.
The purpose of that sort of thing is for "typesetting" of multiple languages. For example, instructions in bash, an example in yaml intermixed with executable code. The bookdown documentation has lots of great examples for this.
To what extent does Rmd fall short of what you're thinking?
One advantage of the ipynb format is that it goes some way to capturing cell outputs, which can be a wide variety of mimetypes.
Document formats like docx have directories containing media assets (I think?) which allow documents to be self-contained in a zip package. A similar storage format could be useful if for example, you create a video or iframed HTML assets, although it would be nice if these assets were linked in a simple relative address way from within the main document.
copy/pasting the content of a notebook to another one (& templating)
refactoring the code in a notebook
executing or debugging a notebook as a script
rendering a notebook in another context than in Jupyter (e.g. as Markdown on Github)
In my opinion the format that is the closest to be a standard text representation for the notebook is the double-percent format (scripts with cells indicated with # %%, markdown cells with # %% [markdown] or # %% [md]). It has the longest history (was introduced by Spyder 5-6 years ago), and is supported by many other editors (Atom/Hydrogen, PTVS, VS Code, PyCharm Pro).
I think that, if Jupyter wanted to recommend a text format for notebooks, it should start with that one.
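As a rough illustration of how little machinery the double-percent format needs, here is a minimal sketch of a reader for Python scripts. It is not the parser used by Spyder, Jupytext or any editor: `split_percent_script` is a hypothetical name, and the sketch ignores cell titles and metadata that may follow the marker.

```python
import re

MARKER = re.compile(r"^# %%(.*)$")          # a cell boundary
MARKDOWN_FLAG = re.compile(r"\[(markdown|md)\]")  # marks a markdown cell


def split_percent_script(text):
    """Split a '# %%' script into a list of {'cell_type', 'source'} dicts."""
    raw_cells = []
    for line in text.splitlines():
        marker = MARKER.match(line)
        if marker:
            is_markdown = MARKDOWN_FLAG.search(marker.group(1)) is not None
            cell_type = "markdown" if is_markdown else "code"
            raw_cells.append({"cell_type": cell_type, "lines": []})
        elif raw_cells:
            raw_cells[-1]["lines"].append(line)
        elif line.strip():
            # Code before the first marker becomes an implicit code cell.
            raw_cells.append({"cell_type": "code", "lines": [line]})

    cells = []
    for cell in raw_cells:
        lines = cell["lines"]
        if cell["cell_type"] == "markdown":
            # Markdown is stored as comments; drop the leading '# '.
            lines = [re.sub(r"^# ?", "", line) for line in lines]
        source = "\n".join(lines).strip("\n")
        cells.append({"cell_type": cell["cell_type"], "source": source})
    return cells


script = "\n".join([
    "# %% [markdown]",
    "# # A title",
    "# Some text",
    "",
    "# %%",
    "x = 1",
    "print(x)",
])

for cell in split_percent_script(script):
    print(cell["cell_type"], repr(cell["source"]))
```

The questions raised in this thread (metadata, cell names, multiline strings for markdown) are exactly the parts this sketch leaves out.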
But to tell the truth, it will probably not be easy to have everyone agree on the format, even on this one. Clearly the specs should say how a code cell should be represented (all editors seem to agree with # %%). And markdown cells (not all editors agree yet, but I'm sure they will follow the Jupyter choice). Then, should the spec include notebook and cell metadata? Should the cell name have a special representation (Spyder cells may have titles, which unlike Jupyter cell names, may not be unique...)? Should Markdown cells be encoded in multiline strings? (Probably not easy to implement for all languages...) Should the script follow PEP8 when all the input cells do?
That being said, another format that I like a lot is Markdown. It is a great format for writing documentation, and it can be edited/previewed in many editors and platforms[^1]. It naturally accepts all the programming languages that one can use in Jupyter. But there again, it's easy to start, i.e. decide to represent Markdown cells as text, and include code cells within Markdown code blocks prefixed with ```python, but it's harder to go to the next step and decide how to separate consecutive Markdown cells, how to represent notebook or cell metadata, or as you mention, define which parts of the code are executable, or not.
For now in Jupytext we have given precedence to the principle that the text version should look like the notebook when previewed, and hence used e.g. HTML comments to represent the cell breaks and include the metadata, but that does make manual typing a bit cumbersome.
Regarding the idea of companion files, @takluyver, we did implement that in Jupytext, and it really is a great idea. It's so convenient to be able to edit any of the representations, either text or .ipynb. Also the complete notebook (with outputs and metadata) is always available in the .ipynb file.
Finally, a word on outputs. I like very much the idea of saving them in a companion folder, in addition to the Markdown file, and I'd be curious to work on that when time permits. This way, we would avoid the duplication of the notebook inputs. And, if things were done well, we could directly use that representation of the notebook as the input for a Jekyll or Hugo blog post or chapter... But maybe that leads to too many other questions (e.g. how to include the outputs in a Markdown file?)!
[^1]: for that reason I have a preference for Markdown with .md extension rather than for R Markdown with .Rmd extension
One minor part of the idea, where it sounds like your version differs, was that the companion file would be a binary format (most likely zip), so that tools like git wouldn't even try to diff or merge it except with a plugin. The idea was that treating it like a binary blob, where you just see that it's changed with no details, would be a better experience than text-based diffs of JSON.
This is maybe not that important, and maybe designing formats around a specific external tool is a bad idea anyway, but it seems worth remembering for a discussion like this.
Thanks @takluyver, yes indeed I liked the idea of the zip file! They are easier to share than a master file + directory, and that's true, I've seen plugins able to show the nested differences when required.
Also I agree with your comment above that files are not naturally ordered. Maybe something we could do is to give explicit default names to the cells, e.g. unnamed_code_cell_1. That would be an invitation for the users to name their cells if they want a) fewer diffs in the output names and b) more meaningful output names. R Markdown does this, and I found it more natural than the random output names generated by nbconvert.
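A minimal sketch of that naming scheme, assuming a cell's name (when the user has set one) is stored under a `name` key in the cell metadata. `default_cell_names` is a hypothetical helper and only follows the unnamed_code_cell_1 pattern mentioned above.

```python
def default_cell_names(cells):
    """Return one name per cell, generating defaults for unnamed cells."""
    counters = {}
    names = []
    for cell in cells:
        name = cell.get("metadata", {}).get("name")
        if not name:
            # Count unnamed cells per cell type: code, markdown, raw...
            kind = cell["cell_type"]
            counters[kind] = counters.get(kind, 0) + 1
            name = f"unnamed_{kind}_cell_{counters[kind]}"
        names.append(name)
    return names


cells = [
    {"cell_type": "code", "metadata": {}},
    {"cell_type": "code", "metadata": {"name": "load_data"}},
    {"cell_type": "code", "metadata": {}},
]
print(default_cell_names(cells))
```

Note the downside this makes visible: inserting an unnamed cell near the top renumbers every unnamed cell after it, which is the incentive for users to name the cells whose outputs they care about.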
But before that I'd like to find a convincing way to include the outputs (other than images, e.g. text and HTML outputs) in the main Markdown document... Did anyone look at that question before? Can I use <iframe>, or something like Jekyll's {% include_relative ... %}? Any chance that I can use a shortcode in Hugo that would be compatible with Jekyll's include_relative?
Having # %% presupposes a line comment starting with #, which may not be true in a particular notebook's language.
The one language we do know exists in a notebook is Markdown. I think we give up something if a text-based format cannot be run as a normal file in a target language, but we also gain something if the text-based format is language agnostic.
Yes, I did not mention that... the comment char # in # %% actually stands for the language's line comment, at least that's how we implemented the support for 18 languages in Jupytext (I wish we had the information of the line comment in the notebook itself).
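In other words, the marker is "line comment + %%". A small sketch of the idea, where both the `LINE_COMMENT` table and the `cell_marker` function are hypothetical and cover only a handful of languages (this is not Jupytext's actual table):

```python
# Hypothetical lookup table: each language's usual line-comment prefix.
LINE_COMMENT = {
    "python": "#",
    "r": "#",
    "julia": "#",
    "javascript": "//",
    "scala": "//",
    "haskell": "--",
}


def cell_marker(language, cell_type="code"):
    """Build the double-percent marker for a cell in the given language."""
    marker = f"{LINE_COMMENT[language]} %%"
    if cell_type != "code":
        marker += f" [{cell_type}]"
    return marker


print(cell_marker("python"))
print(cell_marker("javascript", "markdown"))
```

Because the notebook itself does not record the comment prefix, a table like this has to live in the tool, which is the limitation mentioned above.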
Agreed. And for some notebooks (tutorials, books, documentation...) I do love using that format. But for some other notebooks (those with a lot of code), I may prefer the script format, since it lets me refactor and edit the notebook in the IDE.
Here's a short example of how a notebook will be represented in that format:
---
kernel_info:
  name: python3
language_info:
  name: Python
title: "My notebook title"
comment: "If any of the above aren't specified then use jupyter defaults"
---
# Markdown syntax
## Cell breaks
We can manually break markdown cells quickly with this syntax
+++ {"cell": "meta", "cell2": "meta2"}
## Markdown metadata
We can also explicitly separate a markdown cell and configure it like so:
```{markdown} tag1, tag2
---
key: val
---
## Here is some *configured* markdown!
```
We can also provide a `:key: val` shorthand for configuring
```{markdown} tag1, tag2
:key: val
## Here is some *configured* markdown!
```
## Executable code
Code is always executed with 'code-cell' blocks, like so:
```{code-cell}
print('this would be run by the front-matter-specified, or default, kernel')
```
You can also add metadata to these
```{code-cell} kernelname
:key: val
:key2: val2
:tags: ["tag1", "tag2"]
print('some python with cell metadata')
```
and that's it!
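To get a feel for how such a file could be read, here is a minimal, non-authoritative sketch of a reader for the syntax shown above. Everything in it is an assumption made for illustration: `parse_text_notebook` is a hypothetical name, the front matter is returned as raw lines instead of being parsed as YAML, the JSON after `+++` is assumed to apply to the markdown cell that follows the break, and `{markdown}` blocks are not handled (they stay inside the markdown text).

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep this block readable
FENCE_OPEN = re.compile("^" + FENCE + r"\{code-cell\}\s*(.*)$")
OPTION = re.compile(r"^:(\w+):\s*(.*)$")   # ':key: val' shorthand
BREAK = re.compile(r"^\+\+\+\s*(.*)$")     # '+++ {json metadata}'


def parse_text_notebook(text):
    """Return (front_matter_lines, cells) for the sketch format above."""
    lines = text.splitlines()
    front_matter = []
    if lines and lines[0].strip() == "---":
        end = lines.index("---", 1)  # raises ValueError if never closed
        front_matter, lines = lines[1:end], lines[end + 1:]

    cells, markdown, markdown_meta = [], [], {}

    def flush_markdown():
        nonlocal markdown, markdown_meta
        source = "\n".join(markdown).strip()
        if source:
            cells.append({"cell_type": "markdown",
                          "metadata": markdown_meta, "source": source})
        markdown, markdown_meta = [], {}

    i = 0
    while i < len(lines):
        fence = FENCE_OPEN.match(lines[i])
        brk = BREAK.match(lines[i])
        if fence:
            flush_markdown()
            metadata = {}
            i += 1
            # Leading ':key: val' lines configure the cell.
            while i < len(lines) and OPTION.match(lines[i]):
                key, value = OPTION.match(lines[i]).groups()
                try:
                    metadata[key] = json.loads(value)  # e.g. a list of tags
                except json.JSONDecodeError:
                    metadata[key] = value              # keep plain strings
                i += 1
            code = []
            while i < len(lines) and lines[i].strip() != FENCE:
                code.append(lines[i])
                i += 1
            cells.append({"cell_type": "code",
                          "kernel": fence.group(1) or None,
                          "metadata": metadata, "source": "\n".join(code)})
        elif brk:
            flush_markdown()
            markdown_meta = json.loads(brk.group(1)) if brk.group(1) else {}
        else:
            markdown.append(lines[i])
        i += 1
    flush_markdown()
    return front_matter, cells


example = "\n".join([
    "---",
    'title: "My notebook title"',
    "---",
    "# Markdown syntax",
    '+++ {"cell": "meta"}',
    "## Executable code",
    FENCE + "{code-cell} kernelname",
    ":key: val",
    ':tags: ["tag1", "tag2"]',
    "print('some python with cell metadata')",
    FENCE,
])

front_matter, cells = parse_text_notebook(example)
for cell in cells:
    print(cell["cell_type"], cell["metadata"])
```

A real implementation would also need to skip `+++` lines that appear inside ordinary fenced code blocks, and decide what to do with unknown `{...}` block names.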