Annotating Jupyter notebooks

This picks up a discussion, started several years ago at a workshop, about how to enable annotation for notebooks.

What is meant by annotation? Tools like https://web.hypothes.is/ that follow the W3C standard (or coming standard?). More here.

To see it in action on an example head over to http://ivory.idyll.org/blog/2019-communities-of-effort.html which has a few extra buttons in the top right:

This works for websites and also for documents like PDFs. In order for everyone to see the same annotations for the same PDF (even when you get your copy by email), Hypothesis uses a unique identifier for the file. For PDF files this identifier is part of the format (I think). For more details on how Hypothesis uses it, check out https://web.hypothes.is/help/how-hypothesis-interacts-with-document-metadata/#what-happens-when-urls-change

One idea on how to get similar functionality for notebooks would be to add a unique identifier to the metadata of the notebook. Renderers that want to allow annotations (classic notebook, JupyterLab, nteract, nbpreview) would then render something like <link rel="canonical" href="http://notebooks.jupyter.org/<identifier-for-the-metadata>"> in the HTML they generate. Or maybe a meta tag is better than a canonical link, as the canonical link is also looked at by search engines.

How would you generate this unique identifier? Maybe it is enough to generate a random 32-byte value when the notebook is created. For those who are interested, the code that PDF.js uses to generate the fingerprint for PDFs is here.

What do you think of adding an extra field in the metadata and setting it to a random value on document creation? Then rendering that in the HTML version of a notebook so annotation tools can use it as an identifier?
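
To make the proposal concrete, here is a minimal sketch using only the Python standard library. It treats the notebook as a plain dict in the nbformat 4 JSON layout. The metadata key `annotation_id` is a hypothetical name chosen for illustration; it is not part of nbformat.

```python
import json
import secrets


def new_notebook_with_id(id_key="annotation_id"):
    """Return a minimal nbformat-4-style notebook dict with a random ID.

    `id_key` is a hypothetical metadata field name, not part of nbformat.
    """
    return {
        "cells": [],
        # 32 random bytes, hex-encoded (64 characters)
        "metadata": {id_key: secrets.token_hex(32)},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


nb = new_notebook_with_id()
# The identifier survives a round trip through the on-disk JSON format.
restored = json.loads(json.dumps(nb, indent=1))
```

Because the value lives in the notebook metadata, it travels with the file when the notebook is renamed, moved, or emailed.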

There are two levels at which identifiers in a Jupyter notebook might interact usefully with annotation software.

Element level. Fernando Perez suggested, some years ago, that per-node identifiers could be important. At the time, as I recall, they didn’t exist. But anyway, the idea is this: when anchoring annotations to selections of text in a rendered notebook, Hypothesis will by default do it based only on the position of the target selection in the stream of rendered text, and on the target text itself, surrounded by a prefix/suffix context window. Maybe that’s fine, but it could be interesting to anchor annotations relative to nodes in the notebook, if they are identified, and depending on how that identification surfaces in the rendered page. If none of this is readily available, then the notebook is just a web page from Hypothesis’ point of view, and it deals with it in the way it normally does.

Document level. Here we want URL-independent identifiers. There’s no need to follow the PDF fingerprint model; something human-readable would be better. I would not recommend rel="canonical" but rather the dc.identifier/dc.relation.ispartof metadata pair described here: https://web.hypothes.is/help/how-hypothesis-interacts-with-document-metadata/#what-happens-when-urls-change. The two parts combine, and you can use the identifier/relation pair however makes sense.
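
As a rough sketch of what a renderer might emit for the document-level pair, assuming the two values are available as strings: the tag names come from the Hypothesis help page linked above, while the function name and the example values are made up for illustration.

```python
from html import escape


def dc_meta_tags(identifier, part_of):
    """Render the dc.identifier / dc.relation.ispartof pair as <meta> tags."""
    pairs = [("dc.identifier", identifier), ("dc.relation.ispartof", part_of)]
    return "\n".join(
        # Escape the values so quotes or ampersands cannot break the attribute
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in pairs
    )


# Hypothetical values; a real renderer would read them from notebook metadata.
print(dc_meta_tags("communities-of-effort", "notebooks.example.org"))
```

The output would go in the head of the HTML that the renderer generates, in place of a canonical link.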

Making the annotations independent of matching the HTML would be nice so that they end up in the same place across UIs (they presumably generate different enough HTML).

Each cell has a metadata field as well so we could have a unique ID there as well. Would the dc.relation.ispartof tags work for cells to indicate they are part of the whole document or how would it work?

On Twitter, Tony pointed to https://github.com/jupyterlab/jupyterlab-commenting as something people are working on to create a new commenting system that works in JupyterLab.

Hey Tim, I lost track of this, sorry.

If the cell’s ID surfaced as a fragment ID, then the FragmentSelector (https://www.w3.org/TR/annotation-model/#h-fragment-selector) could be appropriate.

As it happens, Hypothesis used to record a FragmentSelector with annotations but there wasn’t a compelling use for it.

Here’s an example of an annotation that did record a FragmentSelector:

https://jsoneditoronline.org/?url=https://hypothes.is/api/annotations/9mT3gsbQEeag7MOeBozdXQ

Drill down into target -> selector and you’ll find 4 selectors.

The value of the FragmentSelector (the one we don’t use any more) is main, because that’s the governing id (<div id="main" role="main">).

If cells had ids, and if we reinstituted the use of FragmentSelectors, that could be a nice combination.
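
For illustration, here is roughly what an annotation target with a FragmentSelector could look like if a cell’s ID surfaced as an element id in the rendered page. This is a sketch following the W3C annotation model, not what Hypothesis currently records; the function name, URL, and cell ID are hypothetical.

```python
import json


def fragment_target(page_url, cell_id):
    """Build a Web Annotation target that points at an element by fragment ID."""
    return {
        "source": page_url,
        "selector": {
            "type": "FragmentSelector",
            # Fragment semantics for HTML, as listed in the W3C annotation model
            "conformsTo": "http://tools.ietf.org/rfc/rfc3236",
            "value": cell_id,
        },
    }


# Hypothetical rendered notebook URL and cell ID
target = fragment_target("https://example.org/notebook.html", "cell-3f2a")
print(json.dumps(target, indent=2))
```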

If cells had ids,

This would be very helpful. If the next major revision of nbformat is JSON-LD, these ids could be the @id for, e.g., nbformat:InputCell < schema:CreativeWork.

This says comments are stored in a comments.db which presumably needs to be merged separately?

It’s likely possible to run a private instance of hypothesis/h with ideonate/jhsingle-native-proxy or ihenry42/jupyter_wsgi, but IDK how to handle spam or moderation; integration with JupyterHub authenticators would be cool.

IIUC, with the durable ID @judell describes in jupyter/nbformat issue #148 (Add unique ID to the notebook metadata), any central Hypothesis WebAnnotation server could host comments / annotations / highlights on HTML renders of Jupyter notebooks.

When would the UUID need to be changed?

  • When copying a notebook
  • When creating a notebook from a template (~copying)
  • When nbgrader copies from a template
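
A sketch of the copy case, assuming the identifier lives under a hypothetical `annotation_id` key in the notebook metadata (not part of nbformat). A tool that duplicates a notebook would assign a fresh UUID so annotations on the original do not attach to the copy.

```python
import copy
import uuid


def copy_with_new_id(nb, id_key="annotation_id"):
    """Deep-copy a notebook dict and give the copy a fresh random UUID.

    `id_key` is a hypothetical metadata field name, not part of nbformat.
    """
    duplicate = copy.deepcopy(nb)
    duplicate.setdefault("metadata", {})[id_key] = uuid.uuid4().hex
    return duplicate


template = {
    "cells": [],
    "metadata": {"annotation_id": uuid.uuid4().hex},
    "nbformat": 4,
    "nbformat_minor": 4,
}
student_copy = copy_with_new_id(template)
```

A plain file copy made outside any Jupyter tool would still carry the old value, which is one reason an explicit "generate new UUID" action may be needed.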

What sort of UI does this need?

  • “Generate new UUID” > “Confirm?” (maybe in the metadata editor?)

At this point I think we should write a JEP for the change. The problem is well outlined, and the solution seems defined enough to get potential consensus from the larger community, I think. If you wanted to do an initial draft for that it would help; I’m a little swamped in other threads around async and nbconvert 6 (if we can ever get it fully released). But I’d be glad to chime in or help review proposals with the time I do have currently.

A quick note re: cell ids: a unique cell “name” is already in the spec

https://nbformat.readthedocs.io/en/latest/format_description.html

It just doesn’t have any tooling built around it as far as I know. This is an issue we’re running into in another project as well, where we’d like to be able to refer to specific cells.

Re: a JEP, I would love to see this happen, @MSeal

The issue is that name is not required to be unique (it only should be) and is not required to be present, which makes it less than ideal for consistent identification. In almost any system, a name field is human friendly but not machine friendly to use.

I agree - just wanted to note where there were some steps in this direction already. IMO having an “ID” that is more restricted (e.g. no spaces, etc.) along with a “name” field would be great. Creating a new notebook could auto-generate IDs for each cell (e.g. generate a hash each time a cell is created), and then the UI could make it easy for people to overwrite cell IDs if they wish for something more human-referenceable.
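
Here is one way that could look, as a sketch only. The `cell_id` metadata key and the allowed character set (letters, digits, underscore, hyphen) are assumptions for illustration; neither is defined by nbformat. User-chosen IDs are kept when they are valid and unique, and everything else gets an auto-generated value.

```python
import re
import uuid

# Assumed restriction: no spaces, only letters, digits, "_" and "-"
CELL_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def ensure_cell_ids(nb, id_key="cell_id"):
    """Give every cell a valid, unique ID stored in its metadata.

    Missing, invalid, or duplicate IDs are replaced with a random one.
    `id_key` is a hypothetical metadata field name, not part of nbformat.
    """
    seen = set()
    for cell in nb.get("cells", []):
        meta = cell.setdefault("metadata", {})
        cell_id = meta.get(id_key)
        valid = isinstance(cell_id, str) and CELL_ID_PATTERN.fullmatch(cell_id)
        if not valid or cell_id in seen:
            cell_id = uuid.uuid4().hex
            meta[id_key] = cell_id
        seen.add(cell_id)
    return nb


nb = {
    "cells": [
        {"cell_type": "markdown", "source": "# Intro", "metadata": {"cell_id": "intro"}},
        {"cell_type": "code", "source": "x = 1", "metadata": {"cell_id": "has space"}},
        {"cell_type": "code", "source": "x + 1", "metadata": {}},
        {"cell_type": "code", "source": "x + 2", "metadata": {"cell_id": "intro"}},
    ]
}
ensure_cell_ids(nb)
```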

Is it worth opening an issue specifically about that, and taking conversation specific to that point over to nbformat?

Is there a template for JEPs? edit: Yes there is! https://github.com/jupyter/enhancement-proposals/blob/9a608e88be32af66757785b8e0f48541e71388a8/jupyter-enhancement-proposal-guidelines/jupyter-enhancement-proposal-guidelines.md

Otherwise I think I’d combine the first post of this thread and https://github.com/jupyter/nbformat/issues/148# to make a start.

I’ll take a stab at this in a week or two and make a JEP for it.

Ping @Zsailer as I believe he is also starting to coordinate efforts on this