Best practices/recommendations on safe HTML exports

Hello,

I am looking for recommendations or best practices on how to safely do exports of arbitrary user-submitted notebooks to HTML. I am going to be accepting user-uploaded notebooks and displaying rendered versions in my web application.

(I have previously opened an associated issue in the nbconvert repository, but have so far not received any response. I think any relevant information could also be helpful to include in nbconvert’s documentation.)

I have seen that there is a sanitize-html / should_sanitize_html option when using the HTML exporter. My understanding from looking at the code is that cells are run through the clean_html filter. Some questions:

  • What are the safety implications if just using default settings and not using should_sanitize_html?
  • How should I understand this filter’s level of safety in a broader context?
  • This doesn’t appear customizable (in an obvious way—I guess some of these allow lists could be monkeypatched?). Should this not be customized? From trying to use it, it seems like paragraph and header tags are not allowed, which seems to break fairly basic markdown formatting in notebooks.
  • Are there other basic vulnerabilities to watch out for that using the sanitize option doesn’t address?

One notable model for rendering user-uploaded notebooks is GitHub. I understand that GitHub does some kind of cleaning or places restrictions on the rendering, but I haven’t been able to find details or code about what that actually is. If anyone knows of a reference about this, that would also be very helpful.

Thank you in advance!

There is no completely “safe” way to do this at the file level. Most “sanitizing” tools end up breaking legitimate user content while still not defeating the “0-day du jour”: a lot of neat things might be lost, and there are too many “special” cases to mark as “safe”… and the work-arounds for those might be faulty in their own right (e.g. a CDN is compromised).

Anything “fun” is at risk: JS (clearly), CSS, SVG, widgets, plotly, bokeh, mathjax or any of the other reasons one might want to spend the time using a notebook vs writing on a napkin and taking a picture.

… use an entirely separate renderer than nbconvert, and are often rendered incorrectly, for the reasons described above. To my knowledge, nobody from the Jupyter community has ever seen this code, but we do routinely have to field questions about “Why is my private notebook not rendering correctly on a proprietary platform?”

At the architectural level, however, one can take notes from how this is handled for other things.

nbviewer.org has no user credentials or any dynamic capabilities… even the cache reset thing reliably doesn’t work :crying_cat_face: And sure, it would render a notebook that contained 1000 iframes of itself, but the cost of that is… low, given the architecture/caching layers on top of it.

GitHub handles other user content fairly well: github.com carries read/write user credentials, but is an entirely separate domain (not just a port) from user-generated content such as githubusercontent.com or {user/org}.github.io.

If an application’s architecture allows it, having a separate domain (and cookie, etc) for rendered (or even viewed) user content is the best way to be sure, while respecting the content created, and can be relatively painless for the viewer via iframes, etc.

So the URL:

  • https://my-cool.app/{user}/{project/version}/{path/to/notebook.ipynb}

… could be a header and an iframe with a src of:

  • https://my-cool-user.themoon/{hash of(salt, user, project, version)}/{path/to/notebook}.html

The iframe’s standard sandbox attribute can then be set to, for example, allow-scripts, so that interactive content works.
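As a rough sketch of the URL scheme and iframe described above, using only the Python standard library: the function names, the salt value, and the choice of HMAC-SHA256 as the “hash of” are assumptions for illustration, not anything nbconvert or nbviewer provides.

```python
import hashlib
import hmac
from html import escape
from urllib.parse import quote

# Hypothetical placeholders: the content domain comes from the example URL,
# and the salt would be a long random secret kept on the server.
CONTENT_ORIGIN = "https://my-cool-user.themoon"
SALT = b"replace-with-a-long-random-secret"


def content_url(user: str, project: str, version: str, notebook_path: str) -> str:
    """Build the URL on the separate content domain for a rendered notebook."""
    # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
    message = b"".join(
        len(part.encode()).to_bytes(4, "big") + part.encode()
        for part in (user, project, version)
    )
    digest = hmac.new(SALT, message, hashlib.sha256).hexdigest()
    # Swap the .ipynb suffix for .html, as in the URL pattern above.
    if notebook_path.endswith(".ipynb"):
        notebook_path = notebook_path[: -len(".ipynb")]
    return f"{CONTENT_ORIGIN}/{digest}/{quote(notebook_path + '.html')}"


def iframe_markup(src: str) -> str:
    """Return an iframe that may run scripts but is denied its own origin."""
    # allow-scripts without allow-same-origin: scripts run, but the framed
    # page is treated as a unique opaque origin.
    return f'<iframe src="{escape(src, quote=True)}" sandbox="allow-scripts"></iframe>'
```

The app page at my-cool.app would then emit `iframe_markup(content_url(user, project, version, path))` under its own header. Whether to add further sandbox tokens depends on what the rendered content needs.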

It is even possible to use the “safe” postMessage API between domains to allow an (approved) set of deep links back out, e.g. for linking to other user notebooks.

Thank you for the discussion about architecture and iframes. That was very helpful.

I have some follow-up questions about the sanitization. It makes sense that there is no “completely safe” way to handle user content through sanitization. I’m wondering if there are still helpful approaches to sanitization that balance safety and utility by mitigating simple or common vulnerabilities while still allowing for some of the useful functionality. IMO, even a notebook with zero interactivity is still more useful than a picture of a napkin, because it’s a nice way to show code and outputs together, and that is good for reproducibility.

Is sanitization useful/necessary if one is following the architectural approach you described above, so that cookies/credentials are not accessible by the user content? Are there types of attacks that would be bad—for some sense of “bad”—that it’s worth generally trying to protect against? I understand that zero-day attacks can never be ruled out, but my intuition is that there is some amount of risk of zero-day attacks that is just too impractical not to accept (especially if the stakes can be limited).

If so, what might be a good rule of thumb? I know that even more generally than notebooks, GitHub, Stack Overflow, and Discourse all allow some subset of HTML when rendering Markdown. Are those reasonable baselines to emulate?
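To make the “subset of HTML” idea concrete, here is a minimal allow-list filter sketched with only the Python standard library. The tag and attribute lists are hypothetical and are not the lists used by nbconvert, GitHub, Stack Overflow, or Discourse. It is also not a vetted sanitizer, so a maintained library would be the safer choice for real use.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow lists, chosen only to cover basic Markdown output.
ALLOWED_TAGS = {
    "p", "h1", "h2", "h3", "h4", "h5", "h6", "em", "strong",
    "code", "pre", "ul", "ol", "li", "blockquote", "a",
}
ALLOWED_ATTRS = {"a": {"href"}}
ALLOWED_SCHEMES = ("http://", "https://")
# Elements whose text content is dropped along with the tags themselves.
DROP_CONTENT = {"script", "style"}


class AllowListSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags are dropped, their text is kept
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops event handlers such as onclick
            if name == "href" and not value.strip().lower().startswith(ALLOWED_SCHEMES):
                continue  # drops javascript: and other schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append("<" + tag + "".join(kept) + ">")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))  # text is always re-escaped


def sanitize(html_text: str) -> str:
    parser = AllowListSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

With lists like these, paragraph and header tags survive while script bodies, inline event handlers, and non-HTTP links are removed. Widening the lists (tables, images, SVG) is where the trade-off between utility and risk would have to be decided.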