By chance, I just came across the Library of Congress Sustainability of Digital Formats site, which has a schema for cataloguing digital document formats as well as a set of criteria against which the sustainability of digital document formats can be tracked.
Sustainability factors include:
- Disclosure: specifications, schemata;
- Adoption: extent of use;
- Transparency: eg human readability, text format;
- Self-documentation: extent to which format is self-documenting;
- External dependencies: eg hardware, o/s;
- Impact of patents: patent encumbered; ("…and licensing" would perhaps be a more useful generalisation of this field?)
- Technical protection mechanisms: eg encryption.
There are also fields associated with Quality and functionality factors which for text documents include: normal rendering, integrity of document structure, integrity of layout and display, support for mathematics/formulae etc., functionality beyond normal rendering.
I note that `.ipynb` is not currently on the list of mentioned formats. Records for `geojson` and `Rdata` provide a steer for the sorts of thing that an `ipynb` record might initially contain. (I also note that Python / Jupyter kernels don’t have a standardised serialisation format akin to R’s `.rdata` workspace serialisation (`dill` goes some way towards this, maybe also `data-vault`). I also appreciate this is complicated by the wide variety of custom objects created by Python packages, but just as IPython supports rich display integration through `__repr__` methods (see also the notes at the end of the `IPython.display.display` docs for a description of what methods are supported), it might also be timely to start thinking about `__serialise__` methods (they may already exist; there is so much I don’t know about Python! I do know that things don’t always work though; eg Python’s `json` package in my py envt breaks when trying to serialise `numpy.int64` objects…).)
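As a rough illustration of the serialisation point, here is a minimal sketch of how the `default` hook of `json.dumps` can be used as a fallback for objects the `json` module does not know about. Everything named here is an assumption made for the example: `__serialise__` is the imagined protocol mentioned above, not an existing Python convention, `Point` is a hypothetical custom object, and `Int64Like` is a stand-in for a NumPy scalar so that the sketch runs without NumPy installed (real NumPy scalars do expose an `.item()` method that returns a native Python value).

```python
import json


class Int64Like:
    """Stand-in for a NumPy scalar such as numpy.int64 (keeps the sketch stdlib-only)."""

    def __init__(self, value):
        self._value = value

    def item(self):
        # NumPy scalars expose .item(), returning the equivalent native Python value.
        return self._value


class Point:
    """Hypothetical custom object opting in to the imagined __serialise__ protocol."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __serialise__(self):
        # Hypothetical hook: return something the json module already understands.
        return {"x": self.x, "y": self.y}


def fallback(obj):
    """Called by json.dumps only for objects it cannot serialise natively."""
    if hasattr(obj, "__serialise__"):
        return obj.__serialise__()
    if hasattr(obj, "item"):
        return obj.item()
    raise TypeError(f"Cannot serialise object of type {type(obj).__name__}")


payload = {"count": Int64Like(42), "origin": Point(0, 1)}

# Without the hook, json.dumps(payload) raises TypeError.
encoded = json.dumps(payload, default=fallback)
print(encoded)
```

The hook only converts values on the way out; nothing in the resulting JSON records what type each value had, so a round trip back to the original objects would need some additional convention.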
There is now a significant number of notebooks on eg GitHub, as well as signs that notebooks are starting to be used as a publishing format (or at least, as a feedstock for publication, whether rendered using `nbconvert` or more elaborate tools such as `Jupyter-book`, `nbsphinx`, `ipypublish`, or howsoever).
I wonder if it would be timely to review the `ipynb` document format in terms of its sustainability, and whether getting it included on the LoC list (or other appropriate forum) would be an appropriate thing to do for several reasons, including:
- signal the existence of the document format to the Library / sustainability community in terms they are familiar with and may be able to help with;
- help identify ways in which `nbformat` should not develop in future because they might harm its sustainability as a format;
- help identify things that might help improve its sustainability;
- help inform workflows and behaviours regarding how eg cell metadata / tags feed into sustainability.
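To make the point about cell metadata and tags a little more concrete, here is a minimal sketch of auditing tag usage in a notebook using nothing but the `json` module. It relies on tags being stored as a list of strings under each cell's `metadata.tags` key; the inline notebook, the tag names and the helper functions `tag_counts` and `untagged_cells` are all made up for illustration.

```python
import json
from collections import Counter

# A tiny notebook written inline so the sketch is self-contained;
# a real notebook would be read from an .ipynb file instead.
NOTEBOOK_JSON = """
{
  "nbformat": 4,
  "nbformat_minor": 4,
  "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {"tags": ["intro"]},
     "source": ["# Title"]},
    {"cell_type": "code", "metadata": {"tags": ["setup", "hide-input"]},
     "execution_count": null, "outputs": [], "source": ["import math"]},
    {"cell_type": "code", "metadata": {},
     "execution_count": null, "outputs": [], "source": ["math.pi"]}
  ]
}
"""


def tag_counts(notebook):
    """Count how often each cell tag is used across the notebook."""
    counts = Counter()
    for cell in notebook.get("cells", []):
        counts.update(cell.get("metadata", {}).get("tags", []))
    return counts


def untagged_cells(notebook):
    """Return the indices of cells that carry no tags at all."""
    return [
        index
        for index, cell in enumerate(notebook.get("cells", []))
        if not cell.get("metadata", {}).get("tags")
    ]


nb = json.loads(NOTEBOOK_JSON)
print(tag_counts(nb))
print(untagged_cells(nb))
```

Because the format is plain JSON, this sort of audit needs no Jupyter tooling at all, which is arguably a point in favour of the format under the Transparency factor.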
If `.ipynb` is to remain the core data-structure for representing Jupyter executable documents and their outputs, and as other third party applications (such as VSCode, or Google Colab) start to support the format, and if it doesn’t already exist, I also wonder whether a simple RFC style document (cf. the GeoJSON RFC) would be appropriate alongside the slightly less formal `nbformat` documentation as a formal statement of the document standard?
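As a hint of the sort of thing such a document would have to pin down, here is a minimal sketch of a structural check on a version 4 notebook. The required keys listed are my reading of the `nbformat` format description, not a quotation from it, and `structural_problems` is a hypothetical helper; the `nbformat` package ships a JSON schema and a validator, which would be the thing to use in practice.

```python
# Keys assumed to be required at the top level of a v4 notebook.
REQUIRED_TOP_LEVEL = ("cells", "metadata", "nbformat", "nbformat_minor")

# Keys assumed to be required for each cell type.
REQUIRED_CELL_KEYS = {
    "markdown": ("cell_type", "metadata", "source"),
    "raw": ("cell_type", "metadata", "source"),
    "code": ("cell_type", "metadata", "source", "execution_count", "outputs"),
}


def structural_problems(notebook):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [
        f"missing top-level key: {key}"
        for key in REQUIRED_TOP_LEVEL
        if key not in notebook
    ]
    for index, cell in enumerate(notebook.get("cells", [])):
        cell_type = cell.get("cell_type")
        if cell_type not in REQUIRED_CELL_KEYS:
            problems.append(f"cell {index}: unknown cell_type {cell_type!r}")
            continue
        for key in REQUIRED_CELL_KEYS[cell_type]:
            if key not in cell:
                problems.append(f"cell {index}: missing key {key}")
    return problems


minimal_notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["Hello"]},
        {"cell_type": "code", "metadata": {}, "source": ["1 + 1"],
         "execution_count": None, "outputs": []},
    ],
}

print(structural_problems(minimal_notebook))
```

A check like this says nothing about outputs, MIME bundles or minor-version differences, which is exactly the detail a formal statement of the standard would need to cover.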
Interoperability is driven by convention as well as standard, and if we are going to see external services developing around Jupyter from individuals or organisations not previously associated with the Jupyter community, but offering interoperability with it, there needs to be a clear basis for what the standards are. This includes not just the base `ipynb` format, but also messaging and state protocols.
The `nbformat` format description docs pages seem to act as the normative reference work for the `.ipynb` standard, and I assume the Jupyter client - messaging docs are the normative reference for the client-server messaging? For `ipywidgets`, the widget messaging protocol and widget model state docs in the `ipywidgets` repo appear to provide the normative reference.
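For a flavour of what the messaging docs specify, here is a minimal sketch of the general shape of a message a client might send to a kernel to request execution. The field names follow my understanding of the Jupyter messaging documentation; the helper names `make_header` and `execute_request`, the default username and the protocol version string are assumptions for the example, and nothing here signs or sends the message over the wire.

```python
import uuid
from datetime import datetime, timezone


def make_header(msg_type, session, username="anonymous", version="5.3"):
    """Build a message header; the version string is an assumed protocol version."""
    return {
        "msg_id": uuid.uuid4().hex,
        "session": session,
        "username": username,
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": msg_type,
        "version": version,
    }


def execute_request(code, session):
    """Build the dictionary form of an execute_request message."""
    return {
        "header": make_header("execute_request", session),
        "parent_header": {},  # empty because this message is not a reply
        "metadata": {},
        "content": {
            "code": code,
            "silent": False,
            "store_history": True,
            "user_expressions": {},
            "allow_stdin": False,
            "stop_on_error": True,
        },
        "buffers": [],
    }


message = execute_request("1 + 1", session=uuid.uuid4().hex)
print(sorted(message))
```

A third party client has to agree with the kernel on every one of these fields, which is why a clearly identified normative reference matters.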
PS see also this recent workshop roundup on preserving computational notebooks.