Sustainability of the ipynb/nbformat Document Format

psychemedia · January 13, 2020, 12:23pm

By chance, I just came across the Library of Congress Sustainability of Digital Formats site, which has a schema for cataloguing digital document formats as well as a set of criteria against which the sustainability of digital document formats can be tracked.

Sustainability factors include:

Disclosure: specifications, schemata;
Adoption: extent of use;
Transparency: eg human readability, text format;
Self-documentation: extent to which format is self-documenting;
External dependencies: eg hardware, o/s;
Impact of patents: patent encumbered; ("…and licensing" would perhaps be a more useful generalisation of this field?)
Technical protection mechanisms: eg encryption.

There are also fields associated with Quality and functionality factors which for text documents include: normal rendering, integrity of document structure, integrity of layout and display, support for mathematics/formulae etc., functionality beyond normal rendering.

I note that .ipynb is not currently on the list of mentioned formats. Records for geojson and Rdata provide a steer for the sorts of thing that an ipynb record might initially contain. (I also note that Python / Jupyter kernels don’t have a standardised serialisation format akin to R’s .rdata workspace serialisation; dill goes some way towards this, maybe also data-vault. I also appreciate this is complicated by the wide variety of custom objects created by Python packages, but just as IPython supports rich display integration through _repr_*_ methods such as _repr_html_ (see also the notes at the end of the IPython.display.display docs for a description of what methods are supported), it might also be timely to start thinking about __serialise__ methods (they may already exist; there is so much I don’t know about Python! I do know that things don’t always work though; eg Python’s json package in my py envt breaks when trying to serialise numpy.int64 objects…).)
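As a rough illustration of the __serialise__ idea, here is a minimal sketch using only the standard library. The __serialise__ hook is hypothetical (it is not an existing Python or IPython protocol), the Measurement class and the serialise_default and dumps helpers are invented for the example, and Decimal stands in for numpy.int64 as a numeric type that the json package rejects by default.

```python
import json
from decimal import Decimal


class Measurement:
    """Stand-in for a custom object that the json module cannot encode."""

    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

    def __serialise__(self):
        # Hypothetical hook: return a JSON-compatible representation.
        return {"value": self.value, "unit": self.unit}


def serialise_default(obj):
    """Fallback passed to json.dumps(default=...) for types json cannot encode."""
    hook = getattr(obj, "__serialise__", None)
    if callable(hook):
        return hook()
    if isinstance(obj, Decimal):
        # Decimal plays the role of numpy.int64 here: a number json rejects.
        return float(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serialisable")


def dumps(obj):
    """Serialise obj to JSON text, consulting the hypothetical hook when needed."""
    return json.dumps(obj, default=serialise_default, sort_keys=True)
```

For real numpy scalars, the equivalent fallback would probably call obj.item() to get a plain Python number, but that branch is left out so the sketch runs without numpy installed.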

There is now a significant number of notebooks on eg Github, as well as signs that notebooks are starting to be used as a publishing format (or at least, as a feedstock for publication, whether rendered using nbconvert or more elaborate tools such as Jupyter-book, nbsphinx, ipypublish, or howsoever).

I wonder if it would be timely to review the ipynb document format in terms of its sustainability and whether getting it included on the LoC list (or other appropriate forum) would be an appropriate thing to do for several reasons, including:

signal the existence of the document format to the Library / sustainability community in terms they are familiar with and may be able to help with;
help identify ways in which nbformat should not develop in future because they might harm its sustainability as a format;
help identify things that might help improve its sustainability;
help inform workflows and behaviours regarding how eg cell metadata / tags feed into sustainability.

If .ipynb is to remain the core data-structure for representing Jupyter executable documents and their outputs, and as other third party applications (such as VSCode, or Google Colab) start to support the format, and if it doesn’t already exist, I also wonder whether a simple RFC style document (cf. the GeoJSON RFC) would be appropriate alongside the slightly less formal nbformat documentation as a formal statement of the document standard?

Interoperability is driven by convention as well as standard, and if we are going to see external services developing around Jupyter from individuals or organisations not previously associated with the Jupyter community, but offering interoperability with it, there needs to be a clear basis for what the standards are. This includes not just the base ipynb format, but also messaging and state protocols.

The nbformat format description docs pages seem to act as the normative reference work for the .ipynb standard, and I assume the Jupyter client - messaging docs are the normative reference for the client-server messaging? For ipywidgets, the widget messaging protocol and widget model state docs in the ipywidgets repo appear to provide the normative reference.
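To give a feel for what a third-party tool has to get right, here is a minimal structural check, assuming nbformat v4 and using only the standard library. The four required top-level keys and the three cell types follow the v4 format description; the structural_problems helper and the MINIMAL_NOTEBOOK example are invented for illustration and are far weaker than the JSON schema that ships with nbformat.

```python
import json

# Top-level keys required by the nbformat v4 description.
REQUIRED_TOP_LEVEL = ("metadata", "nbformat", "nbformat_minor", "cells")
KNOWN_CELL_TYPES = {"code", "markdown", "raw"}


def structural_problems(nb):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [
        f"missing top-level key: {key}" for key in REQUIRED_TOP_LEVEL if key not in nb
    ]
    for index, cell in enumerate(nb.get("cells", [])):
        cell_type = cell.get("cell_type")
        if cell_type not in KNOWN_CELL_TYPES:
            problems.append(f"cell {index}: unknown cell_type {cell_type!r}")
        if "source" not in cell:
            problems.append(f"cell {index}: missing source")
    return problems


# A small notebook with one markdown cell and one unexecuted code cell.
MINIMAL_NOTEBOOK = json.loads("""
{
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 4,
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
    {"cell_type": "code", "metadata": {}, "source": ["1 + 1"],
     "execution_count": null, "outputs": []}
  ]
}
""")
```

In practice nbformat.validate is the tool to reach for; the point of the sketch is only that a standalone statement of the standard would let other implementers write checks like this without reading the reference implementation.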

PS see also this recent workshop roundup on preserving computational notebooks.

bollwyvl · January 13, 2020, 8:47pm

Thanks for starting this!

A related discussion is on the JEP for DAP. In general, we might want to separate the specification from the reference implementation, and embed the human-readable documentation inside the formal specification. DAP itself is a good example of this, with its toolchain. In the DAP case, the ability for a Jupyter spec to reference another spec, in both a machine- and human-resolvable way would likely be preferable to re-implementing or re-documenting it.

Once (more) formalized, including a concrete reference to these specs in so-constrained objects would go a long way towards self-description, while actually including the schema might be a bit too Goedel-Escher-Bach. Including $schema seems like the most straightforward approach. What this does not provide, however, is an easy means for a document to be, for example, both an nbformat.v4 document as well as a particular, more-constrained format. I am not sure if schema could be crafted in such a way as to make this self-describing.
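A rough sketch of the $schema idea, including one possible way around the "more than one format" limitation. Everything named here is an assumption: both URLs are placeholders under example.org rather than published Jupyter schema locations, and the conforms_to metadata key and both helper functions are invented for the example.

```python
# Hypothetical identifiers: neither URL is a published Jupyter schema location.
BASE_SCHEMA = "https://example.org/jupyter-specs/notebook/v4.schema.json"
PROFILE_SCHEMA = "https://example.org/jupyter-specs/profiles/teaching.schema.json"

# A notebook that names its base schema and one stricter profile.
NOTEBOOK = {
    "$schema": BASE_SCHEMA,
    "metadata": {"conforms_to": [PROFILE_SCHEMA]},
    "nbformat": 4,
    "nbformat_minor": 4,
    "cells": [],
}


def declared_schemas(document):
    """Collect every schema a document claims to conform to, base schema first."""
    declared = []
    if "$schema" in document:
        declared.append(document["$schema"])
    # "conforms_to" is an invented metadata key for additional, stricter profiles.
    declared.extend(document.get("metadata", {}).get("conforms_to", []))
    return declared


def unknown_schemas(document, registry):
    """Return the declared schemas that the given registry cannot resolve."""
    return [uri for uri in declared_schemas(document) if uri not in registry]
```

A consumer that only knows the base schema could still open the document and report that it cannot verify the stricter profile, rather than rejecting the file outright.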

As this would generally necessitate a major (breaking) change on both ends of the pipe, I would also advocate for (optional) inclusion of a list of JSON-LD context, which would permit much deeper, unambiguous integration with high-value metadata formats like W3C Web Annotation and PROV.
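One way the optional context list could look, sketched with the standard library only. The @context keyword itself is standard JSON-LD, and the Web Annotation context URL is the one published by the W3C. Storing the list under the notebook's top-level metadata is an assumption, as is the add_contexts helper, and the PROV entry is a placeholder URL rather than a real context.

```python
import copy

ANNOTATION_CONTEXT = "http://www.w3.org/ns/anno.jsonld"  # W3C Web Annotation context
PROV_CONTEXT = "https://example.org/contexts/prov.jsonld"  # placeholder, not a real context


def add_contexts(notebook, contexts):
    """Return a copy of the notebook with JSON-LD contexts merged into its metadata."""
    updated = copy.deepcopy(notebook)
    metadata = updated.setdefault("metadata", {})
    existing = metadata.get("@context", [])
    if not isinstance(existing, list):
        # JSON-LD also allows a single context value; normalise it to a list.
        existing = [existing]
    merged = list(existing)
    for context in contexts:
        if context not in merged:
            merged.append(context)
    metadata["@context"] = merged
    return updated
```

Because the list lives in metadata and is optional, tools that ignore it would keep working, which is what makes it cheap to add at the same time as a breaking change.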

Finally, setting a goal for a computationally-lossless, yet publication-ready format would make sense. I submit that PDF/A-2 is a format really worth considering for this role, as it is already the de facto (or indeed, de jure) format in a number of domains. In addition to the familiar features of PDF, it includes a virtual file system, so that a “Jupyter PDF” could mean:

a PDF/A-2
at least one .ipynb

betatim · January 15, 2020, 4:23pm

Minimally off topic: I like the idea of storing the ipynb inside a PDF when creating a PDF from a notebook. You could even store more stuff like files that describe the environment in which the notebook should be executed, etc. Super cool idea.

bollwyvl · January 15, 2020, 5:22pm

To further ~~not~~ digress,

storing the ipynb inside a PDF

this has been possible with PyPDF2 (though maybe not up to PDF/A-2 spec) for five-ish years!

files that describe the environment

yeah, a somewhat limited version of this existed for a time, at least for conda, as it could be used like conda env update --file Untitled.ipynb (looking in #/metadata/environment), but was pulled a few years ago. it was also brittle w/r/t cross-platform concerns (happens), and didn’t really capture more complex concerns: a formalization of repo2[not-necessarily-docker-pretty-please] would certainly be part of a more modern jupyter environment description specification, with initial concerns being addressed on the Kernel Provider JEP.
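For anyone who never saw it, reading an environment description from #/metadata/environment could look roughly like the sketch below. The shape of the environment object (name, channels, dependencies, mirroring a conda environment file) is an assumption, since the thread does not record the original layout, and both helper functions are invented for the example.

```python
import json

# Example notebook text; the layout of "environment" is assumed, not documented.
NOTEBOOK_TEXT = """
{
  "metadata": {
    "environment": {
      "name": "demo",
      "channels": ["conda-forge"],
      "dependencies": ["python=3.8", "numpy"]
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4,
  "cells": []
}
"""


def environment_from_notebook(notebook_text):
    """Return the object at #/metadata/environment, or None if it is absent."""
    notebook = json.loads(notebook_text)
    return notebook.get("metadata", {}).get("environment")


def dependency_lines(environment):
    """List the declared dependencies as strings; empty when nothing is declared."""
    if not environment:
        return []
    return [str(item) for item in environment.get("dependencies", [])]
```

The brittleness mentioned above shows up immediately: a pinned list like this says nothing about platform-specific builds, which is part of why a richer environment specification would be needed.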

bollwyvl · January 15, 2020, 6:01pm

I guess a concrete next step of this discussion would be a Big Ol’ JEP that proposed a roadmap for describing the motivation, work required, etc. before we even had a chance at

A straw man:

A new, top-level Jupyter org, e.g. jupyter-spec tasked with owning/publishing versioned machine- and human-readable specifications without reference implementations…

jupyter-specs        # comes from... 
  content/           # notebook
  environment/       # repo2docker
  notebook/          # nbformat
  kernel-messages/   # jupyter_client
  kernelspec/        # jupyter_client?
  markdown/          # ???
  pdf/               # ???
  well-known/        # notebook? nbconvert? repo2docker?
  widgets/           # ipywidgets

The rough structure of each:

(README|LICENSE|CONTRIBUTING|CODE_OF_CONDUCT|ROADMAP).md  # the usual
Makefile
.github/(templates|actions)
specifications-proposed/
specifications/
implementations/

Where each (specification|proposal) directory contained whatever is needed to:

common metadata (version, points of contact, etc)
formally describe the specification in a machine readable format (to the extent possible)…
- JSON Schema
- EBNF
… augmented with narrative-style docs
- notebooks
… from which, generate human-readable, cross-linkable HTML documentation
- probably a sphinx pipeline
… as well as a chosen spec target, e.g. IETF, W3C, ISO, whatever makes sense
a conformance suite of good (and pathological!) examples in non-language-specific formats
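A conformance suite in a non-language-specific format could be as simple as a JSON file of named documents with an expected verdict. The sketch below assumes that layout; the field names (name, expect_valid, document), the toy_validator stand-in for a real implementation, and the run_suite helper are all invented for illustration.

```python
import json

# Good and pathological examples, stored as plain JSON so any language can read them.
CASES_JSON = """
[
  {"name": "minimal", "expect_valid": true,
   "document": {"metadata": {}, "nbformat": 4, "nbformat_minor": 4, "cells": []}},
  {"name": "no-cells", "expect_valid": false,
   "document": {"metadata": {}, "nbformat": 4, "nbformat_minor": 4}},
  {"name": "cells-not-a-list", "expect_valid": false,
   "document": {"metadata": {}, "nbformat": 4, "nbformat_minor": 4, "cells": "oops"}}
]
"""


def toy_validator(document):
    """Stand-in for an implementation under test; checks top-level structure only."""
    required = ("metadata", "nbformat", "nbformat_minor", "cells")
    return all(key in document for key in required) and isinstance(
        document["cells"], list
    )


def run_suite(validator, cases):
    """Map each case name to True when the validator agreed with the expectation."""
    return {
        case["name"]: validator(case["document"]) == case["expect_valid"]
        for case in cases
    }


CASES = json.loads(CASES_JSON)
```

Calling run_suite(toy_validator, CASES) gives one pass/fail entry per case, and an implementation in any other language could consume the same JSON file with its own runner.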

While each implementation would provide:

common metadata (repo, license)
the specifications supported
- to what level, perhaps
links to conformance test results…
- probably xunit
…or, if open source, or in some other way executable from CI (e.g. SaaS), a way to run the suite
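If the results were published in an xunit flavour, the report could be produced with the standard library alone. This is a minimal sketch assuming the common JUnit-style testsuite/testcase/failure layout; the junit_report helper, the suite name and the failure message are invented for the example, and real CI systems may expect additional attributes.

```python
import xml.etree.ElementTree as ET


def junit_report(suite_name, results):
    """Build a JUnit-style XML string from a {case name: passed} mapping."""
    failures = sum(1 for passed in results.values() if not passed)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failures),
    )
    for name, passed in results.items():
        case = ET.SubElement(suite, "testcase", name=name)
        if not passed:
            # A failing case carries a child element describing the failure.
            ET.SubElement(
                case, "failure", message="implementation disagreed with the suite"
            )
    return ET.tostring(suite, encoding="unicode")


REPORT = junit_report("notebook-conformance", {"minimal": True, "no-cells": False})
```

An implementation could publish a file like this at a predictable location and link to it from its listing.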

Thoughts?

psychemedia · January 17, 2020, 1:48pm

I think a top level Jupyter org would make sense as a single place to go to look-up specifications. There are increasing numbers of tools out there that offer ipynb support or hook in to Jupyter kernels, so having a single point of reference for developers of those tools and/or services would be a Good Thing.

Providing a conformance test suite would also be useful.

I note that the suggestion is to avoid reference implementations in such a site. This makes sense (the specification stands separate from the implementation), though it might be handy for separate reference implementations to include (reference) examples of running things like conformance tests?

For each specification, making sure it’s available as a single, standalone text document is also useful. There is no ambiguity then in wondering if you printed off all the pages.

Noting the .github/(templates|actions), are there official or reference examples for any Jupyter related actions? eg in repo2docker context, I use a variant of the repo2docker-action to build and push containers from a repo to DockerHub. Reference actions for running conformance tests etc might simplify the test process for others etc. Things like Jupyter Book might benefit from a simple publishing action etc.

bollwyvl · January 29, 2020, 3:04am

Sorry for the delay in reply!

avoid reference implementations

Yeah: there’s nothing stopping a reference implementation from implementing a new, out-of-spec feature (sometimes you have to do it first), and using that as part of the evidence in a PR to get the specification changed, but it wouldn’t be “blessed” in any particular way by “living” next to the spec.

separate reference implementations to include (reference) examples of running things like conformance tests?

I think I’d see that as part of the data in the separate listing for the implementations: I would think self-certification by PR’ing a link to a predictable report location (and test methodology) is probably the simplest approach. The spec history shouldn’t have to deal with a bunch of automated PRs updating the conformance test results every time some implementation passes/fails the test (though generating something on gh-pages every day would be just fine).

single, standalone text document is also useful.

A very fine goal indeed! And whether as a PDF or HTML, it could still contain its machine-readable content.

reference examples for any Jupyter related actions

The various repos are kind of all over the place, with a mixture of free CI services being (not) used. While I have personally not used GH actions, I see very little downside to making them the go-to for new start stuff…
