This works for websites and also for documents like PDFs. In order for everyone to see the same annotations for the same PDF (even when you get your copy by email), Hypothesis uses a unique identifier for the file. For PDF files this identifier is part of the format (I think). For more details on how Hypothesis uses it, check out https://web.hypothes.is/help/how-hypothesis-interacts-with-document-metadata/#what-happens-when-urls-change
One idea on how to get similar functionality for notebooks would be to add a unique identifier to the metadata of the notebook, and for renderers (classic notebook, JupyterLab, nteract, nbpreview) that want to allow annotations to then render something like a <link rel="canonical" href="http://notebooks.jupyter.org/<identifier-for-the-metadata>"> in the HTML they generate. Or maybe a meta tag is better than a canonical link, as the canonical link is also looked at by search engines.
How would you generate this unique identifier? Maybe it is enough to generate a random 32-byte value when the notebook is created. For those who are interested the code that PDF.js uses to generate the fingerprint for PDFs is here.
What do you think of adding an extra field in the metadata and setting it to a random value on document creation? Then rendering that in the HTML version of a notebook so annotation tools can use it as an identifier?
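To make the proposal concrete, here is a minimal sketch of setting a random value at creation time. It works on a plain nbformat-4-style dict using only the standard library; the metadata key `document_id` is a hypothetical name chosen for illustration and is not part of nbformat.

```python
import json
import secrets

# Hypothetical metadata key; nbformat does not define it.
ID_KEY = "document_id"


def new_notebook_with_id():
    """Return a minimal nbformat-4-style notebook dict carrying a random ID."""
    return {
        "cells": [],
        # 32 random bytes, hex-encoded (64 characters)
        "metadata": {ID_KEY: secrets.token_hex(32)},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def ensure_document_id(nb):
    """Add an ID only if one is missing, so copies of a file keep the same ID."""
    metadata = nb.setdefault("metadata", {})
    if ID_KEY not in metadata:
        metadata[ID_KEY] = secrets.token_hex(32)
    return metadata[ID_KEY]


nb = new_notebook_with_id()
# Round-trip through JSON, as if the file had been saved and emailed around.
saved = json.loads(json.dumps(nb))
```

Because the ID travels inside the file, a copy received by email would report the same value from `ensure_document_id(saved)` as the original.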
There are two levels at which identifiers in a Jupyter notebook might interact usefully with annotation software.
Element level. Fernando Perez suggested, some years ago, that per-node identifiers could be important. At the time, as I recall, they didn’t exist. But anyway the idea is that while anchoring annotations to selections of text in a rendered notebook, by default Hypothesis will do it based only on the position of the target selection in the stream of rendered text, and on the target text itself, surrounded by a prefix/suffix context window. Maybe that’s fine, but it could be interesting to anchor annotations relative to nodes in the notebook, if they are identified, and depending on how that identification surfaces in the rendered page. If none of this is readily available, then the notebook is just a web page from Hypothesis’ point of view, and it deals with it in the way it normally does.
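For readers unfamiliar with quote-based anchoring, the prefix/suffix idea can be sketched roughly as below. This is an illustrative simplification, not Hypothesis's actual algorithm (which, as far as I understand, also uses positions and fuzzy matching); the function name `anchor_quote` is made up for this example.

```python
def anchor_quote(text, exact, prefix="", suffix=""):
    """Return the start offset of the occurrence of `exact` whose surrounding
    context best matches `prefix` and `suffix`, or None if `exact` is absent."""
    best, best_score = None, -1
    start = text.find(exact)
    while start != -1:
        end = start + len(exact)
        score = 0
        if prefix and text[:start].endswith(prefix):
            score += 1
        if suffix and text[end:].startswith(suffix):
            score += 1
        # Strict comparison keeps the earliest occurrence on ties.
        if score > best_score:
            best, best_score = start, score
        start = text.find(exact, start + 1)
    return best


rendered = "run the cell. then run the cell again."
offset = anchor_quote(rendered, "run the cell", prefix="then ")
```

The context window is what disambiguates repeated text; it also shows why anchors can drift when different front ends render different surrounding text.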
Document level. Here we want URL-independent identifiers. There’s no need to follow the PDF fingerprint model; something human-readable would be better. I would not recommend rel=“canonical” but rather the dc.identifier/dc.relation.ispartof metadata pair described here: https://web.hypothes.is/help/how-hypothesis-interacts-with-document-metadata/#what-happens-when-urls-change. The two parts combine, and you can use the identifier/relation pair however makes sense.
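A renderer could emit that pair along these lines. This is only a sketch: the helper name `dc_meta_tags` and the example values are hypothetical, and the exact tag form should be checked against the Hypothesis help page linked above.

```python
from html import escape


def dc_meta_tags(identifier, part_of):
    """Build the dc.identifier / dc.relation.ispartof <meta> tags for a page head."""
    return "\n".join([
        f'<meta name="dc.identifier" content="{escape(identifier, quote=True)}">',
        f'<meta name="dc.relation.ispartof" content="{escape(part_of, quote=True)}">',
    ])


# Hypothetical values: a human-readable notebook ID and the collection it belongs to.
tags = dc_meta_tags("intro-to-pandas", "example.org/notebooks")
```

Escaping the attribute values matters here because a human-readable identifier could legitimately contain quotes or ampersands.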
Making the annotations independent of matching the HTML would be nice so that they end up in the same place across UIs (they presumably generate different enough HTML).
Each cell has a metadata field as well so we could have a unique ID there as well. Would the dc.relation.ispartof tags work for cells to indicate they are part of the whole document or how would it work?
This would be very helpful. If the next major revision of nbformat is JSON-LD, these ids could be the @id for the e.g. nbformat:InputCell < schema:CreativeWork.
This says comments are stored in a comments.db which presumably needs to be merged separately?
It’s likely possible to run a private instance of hypothesis/h with ideonate/jhsingle-native-proxy or ihenry42/jupyter_wsgi, but IDK how to handle spam or moderation; integration with JupyterHub authenticators would be cool.
At this point I think we should make a JEP proposal for the change. The problem is well outlined and the solution seems defined enough to get potential consensus from the larger community, I think. If you wanted to do an initial draft for that it would help; I’m a little swamped in other threads around async and nbconvert 6 (if we can ever get it fully released). But I’d be glad to chime in or help review proposals with the time I do have currently.
It just doesn’t have any tooling built around it as far as I know. This is an issue we’re running into in another project as well, where we’d like to be able to refer to specific cells.
The issue is that name is not required to be unique (only should be) and has no requirement to be present, making it not ideal for consistent identification. Name is almost always going to be a field type in any system that’s human friendly but not machine friendly to use.
I agree - just wanted to note where there were some steps in this direction already. IMO having an “ID” that is more restricted (e.g. no spaces, etc) along with a “name” field would be great. Creating a new notebook could auto-generate IDs for each cell (e.g. generate a hash each time a cell is created) and then the UI could make it easy for people to overwrite Cell IDs if they wish for something more human-referenceable.
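As a rough illustration of that idea, the sketch below auto-generates IDs for cells that lack a valid one and leaves user-chosen IDs alone. Everything here is an assumption for discussion: the `id` key inside cell metadata, the allowed character set, and the 8-character length are not defined by nbformat.

```python
import re
import uuid

# Assumed restriction: letters, digits, "_" and "-" only, 1 to 64 characters.
ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def new_cell_id():
    """Generate a short random ID (8 hex characters)."""
    return uuid.uuid4().hex[:8]


def is_valid_id(value):
    return isinstance(value, str) and ID_PATTERN.fullmatch(value) is not None


def assign_cell_ids(nb):
    """Give every cell a unique, valid ID, keeping valid user-chosen ones."""
    seen = set()
    for cell in nb.get("cells", []):
        meta = cell.setdefault("metadata", {})
        cid = meta.get("id")
        if not is_valid_id(cid) or cid in seen:
            cid = new_cell_id()
            while cid in seen:  # regenerate on the unlikely collision
                cid = new_cell_id()
            meta["id"] = cid
        seen.add(cid)
    return nb


nb = assign_cell_ids({"cells": [
    {"metadata": {"id": "intro"}},       # kept: valid and unique
    {"metadata": {"id": "has space"}},   # replaced: violates the restriction
    {},                                  # filled in: no metadata at all
    {"metadata": {"id": "intro"}},       # replaced: duplicate
]})
```

A separate free-form `name` field could sit alongside the ID for human-friendly labels without having to be unique.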
Is it worth opening an issue specifically about that, and taking conversation specific to that point over to nbformat?