Jupyter and GitHub - alternative file format

Hello!

I work at GitHub and I’m looking at improving the experience of Jupyter notebooks on our favorite website. As many of you have surely experienced, a Jupyter notebook, while it can be previewed and rendered on GitHub, is notoriously hard to review, diff, and resolve conflicts in. That complexity comes largely from the .ipynb format itself, namely JSON with metadata.

What I have been experimenting with is a custom contents manager that completely changes the format notebooks are saved in. I was experimenting with notebooks-as-markdown that retain all the metadata needed to restart a kernel from the file and keep working on it, while being easy to read as raw text. This raw-text readability gives us the ability to review, diff, and resolve conflicts the native git way.

Example markdown: https://gist.github.com/inc0/ff98d4eab2159a1fe8617e4799092611
As you can see, there are some wrinkles to fix (for example, figures don’t render automatically from base64), but that can be handled later during implementation. Another idea would be to create a brand new format that’s comprehensible as raw text and not constrained by markdown, which would then be rendered on GitHub as a regular notebook.

The next step would be to provide an nbconvert target to quickly migrate old-style .ipynb files to the new format.

I would love to hear your comments and criticism of this idea, and pointers to previous discussions if there were any.

@betatim @choldgraf @willingc do you know the right person from the community that we may want to talk to re: the above?

Hey all - a couple of quick thoughts:

first - thanks for working to improve the ipynb experience in GitHub, it is much appreciated!

regarding a text-based version, do you imagine this being something that users will see? Or using this just under-the-hood for diffing, merging, etc? I say this because if you imagine this as a user-facing thing then I think it’d be best done in partnership with several stakeholders around the Jupyter community so you don’t unintentionally end up creating a new standard for notebooks w/o doing so thoughtfully. The ipynb format has undergone many, many iterations and discussion and we want to make sure the same thought goes into creating representations of it that would become usable on GitHub.

there is a tool called nbdime (for “notebook diffing and merging”) (https://github.com/jupyter/nbdime) that folks have used for many years. Perhaps it could be used as a part of this functionality in GitHub? There are also products that accomplish this, though I’m not sure whether their underlying code is open source (e.g., ReviewNB).

regarding notebooks -> markdown etc, I’d check out the jupytext project - that provides two-way conversion between ipynb and a variety of text formats. I’d recommend you leverage this tool and/or contribute upstream improvements to it if possible instead of creating a new github-specific markdown version of a notebook. The Jupyter Book project recently started using jupytext for its markdown notebook representation which has worked pretty well for us. Happy to chat about that if it’s helpful.

Finally - I’d also look around this forum for previous discussions that touch on this, for example:

Thanks for the reply!

So, regarding “something that users will see”: I hope this could be totally transparent (after the proper configuration is in place), but it would definitely require action from the user, since we don’t want to mess up existing notebooks. I understand the reservations about overriding the .ipynb format, but I think we could make it robust enough. As I understand it (admittedly, I’m just learning the Jupyter codebase), .ipynb is fundamentally a JSON representation of the model dict. I think we could create a robust, plain-text-readable representation of the same data structure, with an emphasis on rendering cells and their outputs conveniently.
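
For readers less familiar with the format, here is a minimal sketch of that model dict and its JSON serialization. The field names follow the nbformat 4 layout; the cell contents are invented for illustration.

```python
import json

# A minimal notebook "model dict" in the nbformat 4 layout.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {
        "kernelspec": {"name": "python3", "display_name": "Python 3"}
    },
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}

# An .ipynb file is this dict serialized as JSON, so every output and
# every piece of metadata ends up in the text that git has to diff.
ipynb_text = json.dumps(notebook, indent=1)
restored = json.loads(ipynb_text)
```

Any alternative on-disk format has to carry the same keys (cells, outputs, metadata) in some other textual shape.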

The user story would look like this:

  1. A group of users agrees to use new-ipynb-format
  2. Everyone configures content-manager-class in their notebook configs to use the class we provide
  3. Run `nbconvert --to new-ipynb-format` on all of the existing notebooks
  4. Commit, review, etc. with the new format from now on

Jupytext is a great example of what I have in mind! The difference is that instead of having two representations (python/md and ipynb) plus sync, we would have one single file that would be an alternative to ipynb itself.

A few more quick thoughts there:

  • Whatever format ends up being used, it should absolutely have lossless, easy two-way synchronization with the current ipynb format. Even if you don’t use jupytext for two-way sync, I’d pick one of the formats that jupytext supports, and stick with it.
  • The problem is not that it would be technically difficult - as jupytext shows there are several ways to implement a notebook as text - the problem is defining and adopting a community standard. GitHub is a special case in many ways because of its large reach, so we’d need to be extra super careful about deviating from standards.
  • If it’s not obvious from the above points: I don’t think it’s a good idea to invent a new user-facing text-based notebook format as an ad-hoc solution to better diffing in github. It runs the risk of fracturing community workflows, and would also be a format that only works on one service provider, which I don’t think would be healthy for the community. Instead I’d piggy-back on a pre-existing text-based notebook format, or begin a community process of discussion around creating a non-JSON standard for Jupyter notebooks.
  • If a non-JSON standard is really needed, I think the way to do this is to use a Jupyter Enhancement Proposal (https://jupyter.org/enhancement-proposals/README.html). These are ways to facilitate discussion, brainstorming, and debate across the Jupyter community for issues that are far-reaching and impact many stakeholders.

Not trying to come across negatively here - I think there’d definitely be value in having a standard story around text-based notebooks. But, we should be extremely careful and conservative about introducing a “new” jupyter notebook format that is meant for end-users. The ipynb format is one of the most consistent and important standards across the Jupyter community, and is relied upon by countless open source projects, web-services, etc. I’d really want to avoid the path where users are told by a platform as large as GitHub to convert their notebooks into a new notebook format that is not a widespread community standard.

* Note: I made an addition to the post above - there are already tools that facilitate diffing and merging with the ipynb format (nbdime) and a web service that provides a UI around this process (ReviewNB) so it is certainly possible to do this without inventing a new text-based format.

cc also @mseal or @jasongrout who may also have thoughts on this

@choldgraf did a good job outlining things. As a first step, rather than changing the file format, I’d encourage asking for better diff options in platforms like GitHub, so that notebook diffs (with the notebook diff tools linked above or others) are supported as first-class views or installable extensions.

I don’t remember the tool but there was someone also saving .ipynb in github as .yaml for a better diffing experience, but they were translating back to ipynb json format before any other tooling read the files.

I agree completely about being conservative. The only reason I think it’s a feasible approach is that I believe we can find a good format that allows seamless and lossless two-way conversion between ipynb JSON and this new format.

To clarify, too: the reason I’m looking at this change is less about GitHub itself and more about version control in general. A change like that would benefit every tool for software development, whether it’s GitHub, GitLab, git itself, or Mercurial. That’s why I think it’s worthwhile.

I see that the thread “Should Jupyter recommend a text-based representation of the notebook?” is discussing exactly this idea, so should we move to that thread to avoid duplication?

As you can tell, a lot of thinking, discussing and work has gone into this topic already. It is also the kind of thing where it looks easy but turns out to be hard when you get beyond the first 80% of the project. A bit like crypto: it looks like you should be able to roll your own and thereby avoid complexity, but once you have gone down that road beyond the first prototypes you figure out that learning how to use an existing library, and why it is so complex, would have been the better move.

I don’t think there is any one person. Anyone can (and does) create their own contents manager and idea for new file formats to store notebooks. It happens a lot, most never reach a high enough level of adoption and get abandoned “at the end of the summer”. I think this is great because it is a hard problem to solve, so we need people to try stuff because one day someone is going to crack it.

Some thoughts on how I’d tackle this kind of project:

  • take a step back and talk to real humans about when they want to diff a notebook
    • how did they arrive in this situation?
    • what is the story that leads up to this moment, what happens after this moment in time?
  • share the result of this with others to build a community of people who agree and want to work on this
  • research and read and poke around the code of as many of the existing “turn JSON notebooks into line by line diff’able files” projects as you can stomach
  • build a few prototypes and use them on real world notebooks to get a feeling for where the gnarly edge cases are and how to handle them

My bet is that whatever you find out as the problem to solve from talking to people and learning about their situation will point towards the solution involving a large social component. This means a lot of the work will be around building consensus and having patience. Not necessarily convincing people with facts (there are people out there who “hate notebooks” which tells you that this is an emotional topic and not purely a logic puzzle).

Check out the Wrattler project. This is a polyglot notebook in Markdown form with a full dependency graph (no need to worry about running cells out of order!) https://github.com/wrattler/wrattler
There is even a Binder environment! https://github.com/wrattler/wrattler-binder

Thanks, @betatim … sorry if I caused too much confusion on this. My goal was mainly to connect some folks with my colleague who has been tasked with working on this. Thanks so much for your comments and point of view; it is certainly very helpful.

No worries. My comment was meant as a “beware, here be dragons, spiders and bottomless pits. also quicksand to get stuck in” not a “please go away and don’t open this can of worms again”.

This is exactly what I want to do with my project, mgeier/jupyter-format on GitHub: an experimental new storage format for Jupyter notebooks.

Sadly, this has received very little feedback up to now (but I haven’t really pushed it either). @inc0 It would be great if you could give it a try and give some feedback!

I think it would make sense to see this not as replacement but as an alternative format to .ipynb. My suggested format is fully two-way compatible with .ipynb.

Thank you for the pointers :) Here is my thinking about the social approach to this matter. First, yes, a lot of people are exploring it and I’m very happy about that! It means we’re onto something. I hope that, from this discussion, we will be able to get a few humans committed to this idea and together derive yet another prototype.

To answer your questions, @betatim, about how we arrived at this: when we asked data scientists (over the last few months) what they want most from GitHub, notebook improvements were definitely the most requested changes: diffs, pull requests, conflicts. We also maintain quite a few notebooks ourselves, so we feel the pain.

Building a community around this problem is where we are at; that’s my biggest goal for this post :) This also gave me a ton of research and a few prototype ideas I’ll be sharing.

I agree with @choldgraf that the biggest question is the format itself. Let me try to write down a few requirements such a format should meet:

  1. Lossless two-way conversion between it and the .ipynb format
  2. Easy to render
  3. Easy to read and comprehend in raw text, biggest emphasis on code block readability and cell output
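
The first requirement is the easiest one to state as a check: converting to text and back must return an identical notebook. Below is a toy sketch of such a round trip. The comment-marker syntax, `to_text`, and `from_text` are all made up for illustration and are not any existing format; outputs stay inside the JSON header here, so it says nothing about requirement 3. It also assumes each cell’s source is a single string and never contains the closing marker on a line of its own, which is exactly the kind of edge case a real format has to settle.

```python
import json

# Hypothetical markers for this sketch only.
NB_OPEN = "<!-- notebook "
CELL_OPEN = "<!-- cell "
CELL_CLOSE = "<!-- /cell -->"
TAIL = " -->"


def to_text(nb):
    """Render a notebook dict: JSON headers in comments, cell source as plain lines."""
    top = {k: v for k, v in nb.items() if k != "cells"}
    lines = [NB_OPEN + json.dumps(top, sort_keys=True) + TAIL]
    for cell in nb["cells"]:
        header = {k: v for k, v in cell.items() if k != "source"}
        lines.append(CELL_OPEN + json.dumps(header, sort_keys=True) + TAIL)
        lines.extend(cell["source"].split("\n"))
        lines.append(CELL_CLOSE)
    return "\n".join(lines)


def from_text(text):
    """Parse the output of to_text back into a notebook dict."""
    lines = text.split("\n")
    nb = json.loads(lines[0][len(NB_OPEN):-len(TAIL)])
    nb["cells"] = []
    i = 1
    while i < len(lines):
        cell = json.loads(lines[i][len(CELL_OPEN):-len(TAIL)])
        i += 1
        source = []
        while lines[i] != CELL_CLOSE:
            source.append(lines[i])
            i += 1
        cell["source"] = "\n".join(source)
        nb["cells"].append(cell)
        i += 1  # skip the closing marker
    return nb


example = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Demo\n\nSome text."},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 1,
            "source": "x = 1\nprint(x)",
            "outputs": [{"output_type": "stream", "name": "stdout", "text": "1\n"}],
        },
    ],
}

# The lossless requirement, expressed as a round-trip check.
assert from_text(to_text(example)) == example
```

A check like this, run over a large corpus of real notebooks, is probably the quickest way to find where a candidate format loses information.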

Thank you @sgibson91 for the Wrattler pointer! It looks amazing!

@mgeier I looked at jupyter-format yesterday and that’s exactly the kind of work I was hoping to start too. I was thinking of YAML initially too, but writing a brand new renderer for opinionated YAML will be tricky, so I went more towards markdown. I had the exact same idea about the general approach though, so it’s exciting :) My goal was to agree on a format and write a contents manager that would transparently replace the existing ipynb. What are your thoughts on MyST as a backend format? It looks very promising and I wanted to experiment with writing a two-way lossless conversion between notebook and MyST.

(Knowing that this has become a talk about community building, I wanted to continue shining light on the tools that are available because maybe we just need to raise the profile of some things before we do a huge redevelopment.) A workshop I ran back in January produced a GitHub Action to clean notebook outputs. This at least helps reduce the probability of messy diffs and merge conflicts: https://github.com/ResearchSoftwareActions/EnsureCleanNotebooksAction
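
To make “cleaning outputs” concrete, here is a minimal sketch of the idea on a notebook dict. It is not the Action’s actual implementation, and `strip_outputs` is a name invented here; it only clears the two code-cell fields that typically change on every run.

```python
import copy


def strip_outputs(nb):
    """Return a copy of a notebook dict with code-cell outputs and counters cleared."""
    cleaned = copy.deepcopy(nb)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned
```

With outputs gone, what is left to diff is mostly cell source and metadata, at the cost of no longer storing results in the repository.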

Putting in my shortish input:

I would like to reiterate the “Here be dragons” bit from above. The ipynb format is one of the most standardized parts of Jupyter. There’s dozens, if not hundreds, of libraries that rely on that format and are built on top of the ipynb json structure. It’s well specified, schema’d, and extensible with the metadata field namespaces.

That being said, if you had an alternative format for notebooks, I would insist it NOT use the .ipynb extension as that would cause much much more confusion for backwards and forward compatibility. Also ensuring it preserves the same fields (even if in a different medium) for interoperability is critical. Yaml is a good choice here for better diff / human readability if that’s the end goal (and it’s a superset of JSON).

I’d also encourage making diffing tools and plugins more readily aware of the ipynb format, rather than changing the format to match the tools; that line of thinking should be exhausted as well (some discussion on that is already present). There have been several libraries and implementations that help with this, but they’re not well integrated into common platforms like GitHub.

As other posts in this thread have identified, there has been a lot of work in various projects relating to alternative formats to ipynb over the years. FWIW, here are some idle thoughts from an amateur observer of the Jupyterverse…

Two projects that I think have a lot of interesting features are:

  1. Jupyter Book / MyST: arguably a leading candidate for the official unofficial (sic) HTML et al publishing route from notebooks to interactive HTML + other formats, it would make sense to be compliant with this and the sorts of metadata / extensions it consumes from the notebook side and renders on the output side;
  2. jupytext: to my mind, the pandoc of ipynb (if that makes sense!) at least in terms of input cell transformation across doctypes; Jupytext excels in trying to provide lossless conversion between input cell and metadata markup across a range of text document formats. If you’re inventing yet another format, seriously ask yourself why none of the current Jupytext supported formats are appropriate and beware the https://xkcd.com/927/ trap; no support for output cell content is a fair enough reason, but there is discussion across various Jupytext threads (eg https://github.com/mwouts/jupytext/issues/220 ) about how this might be supported and it would be worth considering. Bearing in mind Github’s ultimate owner, a chat with the VS Code Python extension product team probably wouldn’t go amiss…

I also note:

  1. diffing is obviously a big thing for Github on two counts: a) across input cells, which are likely simple text; b) across output cells, which gets trickier: outputs could be text, images, video, audio, embedded HTML+js+css pages etc etc. There are already diff-ers out there that have already been referenced, eg nbdime: can they be evolved in a sensible way to support their (community) aims as well as yours?
  2. perhaps not obvious in the first instance, but maybe at a second glance: should the format support embedded cell tests? This could be interesting in the Github case, if the doc format embeds cell tests that can be run as part of a CI process; there are various test frameworks in the Jupyter context eg nbval or nbcelltests and probably others…

Agree on not using the ipynb extension. That would be bad. YAML + code doesn’t work well, as you lose things like syntax highlighting and, well, indents in Python plus indents in YAML together are… uhh. That’s why I was thinking of a mixture of markdown for cells and outputs and hidden YAML for metadata. Alternatively, something like MyST: markdown with the addition of comments/data structures that won’t be rendered.

perhaps not obvious in the first instance, but maybe at a second glance: should the format support embedded cell tests? This could be interesting in the Github case, if the doc format embeds cell tests that can be run as part of a CI process; there are various test frameworks in the Jupyter context eg nbval or nbcelltests and probably others…

On that note, I’d actually think of notebooks more in the light of how you’d test a .py file in most (perhaps not all) cases. You usually don’t give someone a script with the unittests at the bottom, you have the tests as separate files to confirm behavior. There’s a Google Summer of Code project I’m helping mentor for nteract that’s aiming at this effort, which may be of interest:

It’s still pre-alpha but the first release should be in a week or two. There’s some prior work on unit testing notebooks by stashing tests in the metadata, but I think testbook has a chance to make notebook testing more natural to standard development patterns and interfaces. (Also, the student is looking for some early adopters to give feedback if anyone here wants to try it out – feel free to direct message me if you want)
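
As a rough illustration of the “tests live in a separate file” idea, here is a standard-library-only sketch. It is not testbook’s API: `load_notebook_namespace` is a hypothetical helper, the notebook content is invented, and cells that use IPython magics or shell escapes would fail under plain `exec`.

```python
def load_notebook_namespace(nb):
    """Run every code cell, in order, in one fresh namespace and return it."""
    namespace = {}
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell["source"]
            if not isinstance(source, str):
                source = "".join(source)  # nbformat also allows a list of lines
            exec(source, namespace)
    return namespace


# Stand-in for a notebook loaded with json.load(); the contents are made up.
notebook = {
    "cells": [
        {"cell_type": "markdown", "source": "# Helpers"},
        {"cell_type": "code", "source": "def add(a, b):\n    return a + b\n"},
    ]
}


def test_add():
    # The test sits outside the notebook, like a test module next to a .py file.
    namespace = load_notebook_namespace(notebook)
    assert namespace["add"](1, 2) == 3
```

A test runner such as pytest would collect `test_add` from a separate file, leaving the notebook itself free of test code.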

When people reference diffs and pull requests, do you think they mean “viewing diff on GitHub” or “viewing a diff locally”? My hunch is that what people are thinking of is “on GitHub” (or more generally “the platform they use”). If you have some data to confirm/deny this hunch, that would be interesting.

Instead of changing the on-disk format, what about changing the displayed diff? A “hacked on this for a couple of hours” way to get nice diffs when you run git diff some.ipynb is to set up a git filter that converts the notebook to markdown and leaves out all base64 encoded outputs.
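
A minimal sketch of that kind of converter is below. It assumes git’s textconv diff driver as the hook (one way to wire this up, not necessarily the setup described above); the script name `nb2md.py`, the driver name `ipynb`, and the hard-coded `python` fence language are all assumptions for illustration.

```python
import json
import sys

# Hypothetical wiring, assuming this file is saved as nb2md.py:
#   .gitattributes:   *.ipynb diff=ipynb
#   git config diff.ipynb.textconv "python nb2md.py"
# git then calls the script with a file path and diffs what it prints.

FENCE = "`" * 3


def _text(value):
    """nbformat stores multi-line strings either as a str or as a list of str."""
    return value if isinstance(value, str) else "".join(value)


def notebook_to_markdown(nb):
    """Render cells as markdown, keeping only textual outputs."""
    parts = []
    for cell in nb.get("cells", []):
        source = _text(cell.get("source", ""))
        if cell.get("cell_type") != "code":
            parts.append(source)
            continue
        # Assumption: the kernel language is Python.
        parts.append(f"{FENCE}python\n{source}\n{FENCE}")
        for out in cell.get("outputs", []):
            if out.get("output_type") == "stream":
                text = _text(out.get("text", ""))
            else:
                # Keep text/plain only; base64 images and other rich data are dropped.
                text = _text(out.get("data", {}).get("text/plain", ""))
            if text:
                parts.append(f"{FENCE}\n{text.rstrip()}\n{FENCE}")
    return "\n\n".join(parts) + "\n"


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as handle:
        sys.stdout.write(notebook_to_markdown(json.load(handle)))
```

Because the conversion only affects what `git diff` displays, the files in the repository stay ordinary .ipynb and no other tooling has to change.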

A more sophisticated version of that (dealing with edge cases etc), which tries to display something when the content can’t be easily displayed in a terminal, is nbdime (diffing and merging of Jupyter Notebooks; see the nbdime documentation). It creates something like:

[image: nbdime console output]

What is the thinking around displaying something like this as the “diff view” on GitHub?