Jupyter and GitHub - alternative file format

FWIW, I think this is great! But it’s just not going to work for the vast majority of data scientists. According to the most recent survey, the majority (57%) of data scientists do not use Jupyter as their tooling (the way the survey is written, it’s not clear whether all of the remaining 43% do, but let’s say they do). Wouldn’t it be great if we came up with a GNU-based version of commenting that supported everyone?

Another thing to consider is that open source is largely done in GitHub, GitLab, Gerrit or other existing review managers. NBDime, while awesome for what it does, is a separate tool just for one file type. If we actually succeed at a plaintext format, we will be able to use all of these in vanilla form, which is much more convenient imo.

I don’t think that anybody would argue that ipynb is the best format for diffing, merging, and commenting. The trick is that we need to optimize for much more than just these use-cases. As @carreau mentioned, there are many other constraints and optimizations for jupyter notebook formats. I don’t know if the current IPYNB/JSON structure is the right answer, but it is important to remember that the format that is best for one use-case may not be the format that is best at balancing all of the use-cases.

An example of another optimization direction is: “ubiquity of tooling to manipulate notebooks”. Any format that isn’t in a pre-existing standard data structure (e.g. custom formats like MyST markdown, or the jupyter-format described above) will have no parser in existence until somebody writes it in a particular language. That’s one reason to use a pre-existing structure like JSON.

I think that it would be helpful to define these high-level guiding principles before we start digging into prototypes etc, to ensure that the needs of Jupyter’s diverse community are being met. Things like:

  • What are the goals of the ipynb format?
  • What are its “must-have” features?
  • What are its “want to have” features?
  • What are any hard-constraints?

If you want to test-drive this, check out https://mybinder.org/v2/gh/mgeier/jupyter-format/master and open one of the existing .jupyter notebooks in the doc/ sub-directory.

This is what things look like if you add a plot to an existing notebook and then look at git diff:

mostly lots of base64 :smiley:

Is there a way to tell jupyter to save a file as .jupyter instead of as .ipynb for a newly created notebook?

This looks nice!

One difficult challenge with diff’ing and merging of notebooks (no matter how you store them on disk) is that they contain output (or links/pointers to output). This output can be text, in which case a terminal/text based viewer is enough to reason about it.

Often the output isn’t text though. In those cases a text-based viewer is nearly useless (does anyone want to take a guess at what is in the output of the cell in my screenshot above?). To make sense of it you need to render it as an image.

And if the notebook had already had output for that cell or you had a merge conflict on the cell output you need to render two or three images next to each other (potentially with a clever UI to help me spot the differences).

This means I will never be able to resolve merge conflicts in a text based medium if my notebooks contain rich output.

One solution is a technical one: use a viewer that can display the many kinds of rich output. Another is a social one: establish a convention of “no outputs in git” for your repository (it is like linting or code formatting).

Given that outputs in notebooks are one of the mega features of notebooks, I think removing the ability to store outputs (at the file format level) in notebooks is too radical.

The social convention route is an interesting one for people to deal with their pains. It has worked very well for code formatting and linting. Python, C++ and most other programming languages (one exception being Go) don’t forbid badly formatted code or files full of lint. Yet most projects use (auto)formatters and linters and enforce their use.
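
For what it’s worth, the “no outputs in git” convention can be automated in the same way formatters are. Here is a minimal sketch, assuming the nbformat 4 layout where each code cell carries `outputs` and `execution_count` keys. `strip_outputs` is a name made up for this example, not an existing API, and the sample notebook is invented.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 style notebook dict without outputs."""
    # Round-trip through JSON as a cheap deep copy; the input is left untouched.
    stripped = json.loads(json.dumps(notebook))
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# Invented sample notebook, used only for illustration.
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["hi\n"]}
            ],
            "source": ["print('hi')"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

clean = strip_outputs(example)
print(json.dumps(clean, indent=1))
```

Something like this could run from a pre-commit hook, so the convention is enforced by tooling rather than by the file format.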

I don’t think there is any “native commenting” or “GNU commenting tools”. Which means it would be nice to find a global solution for all platforms but it is also Ok to have someone take a risk on building this and reaping the rewards (aka https://www.reviewnb.com/ getting acquired by someone).

Once you have nbdime installed, then a day or two later you’ve forgotten about it. You type git diff and it just works. Is it all that much harder to install nbdime than a pre-commit hook that applies formatting rules or a “format on save” editor extension? Probably not.


If you want to slice and dice notebooks on the terminal, it is a bit tricky with awk and sed, but luckily there is jq, as well as some other JSON acrobatics tools. For any custom format those tools don’t exist (yet), and depending on the format it might get harder to slice and dice with line-based tools.
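
The same kind of slicing works in any language that ships a JSON parser. A minimal Python sketch, assuming nbformat 4 and using only the standard library (`code_sources` is a hypothetical helper name, and the sample notebook is invented), that pulls the source of every code cell out of a notebook:

```python
import json


def code_sources(notebook_json: str) -> list:
    """Return the source text of each code cell in a notebook JSON string."""
    notebook = json.loads(notebook_json)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # nbformat allows "source" to be one string or a list of lines.
        sources.append(source if isinstance(source, str) else "".join(source))
    return sources


# Invented sample input, serialized the way a notebook is stored on disk.
sample = json.dumps(
    {
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": ["import math\n", "print(math.pi)"],
            },
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }
)

for text in code_sources(sample):
    print(text)
```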


A ~2yr old issue on gitlab about rich diff’ing: https://gitlab.com/gitlab-org/gitlab/-/issues/22329

This is partially why MyST-NB is designed the way it is. It is explicitly focused on the authoring, editing, and collaborating side of notebooks, and so it’s fine for us to use markdown because we don’t promise to do things like include outputs within the file itself. We assume that if users want outputs, they can use jupytext to pair their MyST-NB files with an ipynb file and leverage that for its full complexity (e.g. nice rendering on github). I think it’d be interesting to explore these kinds of workflows more strongly. I like the idea of having one format that is designed for humans and another that is designed for machines and coming up with ways to let them play nicely with each other.

I really like @choldgraf’s thought about identifying key use cases/benefits to the format as it exists.

I’ll be honest, and I realize this puts me in the minority. I’ve always strongly disliked the merging of outputs into the notebook file. I get why that was there in 2012, but it merges concerns and, generally, screws up everything. If I wanted outputs, which I do, shouldn’t it be in an output file?

Additional idea - what if we came up with a convention that allowed us to have diff auto-exclude via regex certain sections? E.g. what if every output section included something at the beginning of the line - E_IGNORE_OUTPUT - hidden from user visibility when browsing a notebook, but trivial to add into any diff/patch based workflow.

From the man page - http://schacon.github.io/git/git-diff.html

-G<regex>

Look for differences whose added or removed line matches the given <regex>.
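
To make the marker idea concrete, here is a rough sketch that drops every line starting with the proposed `E_IGNORE_OUTPUT` prefix before diffing. It assumes a hypothetical line-oriented notebook format in which output lines carry that prefix; no such format exists today. It uses Python’s `difflib` rather than any git option, since `-G` selects diffs whose changed lines match the regex rather than hiding them. GNU diff’s `-I` (`--ignore-matching-lines`) is closer in spirit.

```python
import difflib
import re

# Marker proposed above; the file layout around it is an assumption.
IGNORE = re.compile(r"^E_IGNORE_OUTPUT")


def diff_without_outputs(old: str, new: str) -> list:
    """Unified diff of two texts, skipping lines that carry the marker."""

    def keep(text):
        return [line for line in text.splitlines() if not IGNORE.match(line)]

    return list(
        difflib.unified_diff(keep(old), keep(new), "old", "new", lineterm="")
    )


old = "x = 1\nprint(x)\nE_IGNORE_OUTPUT 1\n"
new = "x = 2\nprint(x)\nE_IGNORE_OUTPUT 2\n"

for line in diff_without_outputs(old, new):
    print(line)
```

With this filter, two files that differ only in their marked output lines produce an empty diff.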

@aronchick that’s a clever idea! I think it’s similar to my ideas around re-working the underlying ipynb structure so that it is something like:

<for cell in cells>
    <cell input>
    <reference to cell output>
<notebook metadata>
<for output in outputs>
    <cell output>

Then you could still use JSON to read/write the notebook, the only hassle would be that the outputs aren’t in-line with the inputs. But from a “diffing” perspective, you could just have the convention “forget everything after the beginning of the outputs section”
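
A rough sketch of that layout, to show it stays plain JSON. Everything here is an assumption for illustration: the `output_ref` key, the `output-N` naming, and the top-level `outputs` section are not part of nbformat, and the sample notebook is invented.

```python
import json


def split_outputs(notebook: dict) -> dict:
    """Move cell outputs into one trailing section, leaving references behind."""
    cells = []
    outputs = {}
    for index, cell in enumerate(notebook.get("cells", [])):
        cell = dict(cell)  # shallow copy so the input notebook is not mutated
        if cell.get("cell_type") == "code":
            ref = "output-{}".format(index)  # hypothetical reference scheme
            outputs[ref] = cell.pop("outputs", [])
            cell["output_ref"] = ref
        cells.append(cell)
    # Dicts keep insertion order, so "outputs" is serialized last.
    return {
        "cells": cells,
        "metadata": notebook.get("metadata", {}),
        "outputs": outputs,
    }


# Invented sample notebook with a (truncated) base64 image payload.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {
                    "output_type": "display_data",
                    "data": {"image/png": "iVBORw0KGgo="},
                    "metadata": {},
                }
            ],
            "source": ["plot()"],
        }
    ],
    "metadata": {},
}

restructured = split_outputs(notebook)
text = json.dumps(restructured, indent=1)
print(text)
```

A diff tool following the convention would only need to compare the text before the top-level `"outputs"` key.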

As a longtime lurker and enthusiastic user of Jupyter tools (but no way associated with the core team), I wanted to add a few points on usage based on my experience analyzing Jupyter usage in the past (full disclosure: I am a data scientist at IBM)

In this particular survey, I think it is more accurate to combine the general Jupyter usage (43%) with JupyterLab (11%) and Colaboratory (4%), since the usage of notebooks is typically very similar across these tools, giving an edge (58% total) to Jupyter tooling overall. If you remove things like IntelliJ IDEA, since Java is not commonly used for data science work, the share would be even higher.

There is also the fact that many of these surveys do not disambiguate usage of Jupyter as a standalone (e.g., JupyterLab or Jupyter Notebook) vs a file type within an existing IDE (e.g., if I use PyCharm to view and edit notebooks, do I put down PyCharm or Jupyter?)

Another consideration to weigh when looking at suggestions to the file format is that notebooks have a massive user base of folks that may not consider themselves data scientists or developers. Think folks like research scientists, data analysts, and even MBAs (https://www8.gsb.columbia.edu/articles/ideas-work/new-languages-business-python-sql-r) that expect notebooks and notebook files to behave a certain way.

Jupyter is replacing tools such as VBA + Excel for these users and many of the things that are a pain for source control (e.g., everything being in the same file like inputs and outputs) are actually part of the value proposition. My experience in this particular aspect is that non-developer users that adopt Jupyter (but are not aware of nbdime) end up trying and then abandoning GitHub in favor of sharing their work using alternatives such as cloud file sharing services (e.g., OneDrive, Google Drive, etc.) due to a perception that GitHub does not properly support the file type.

Other folks have articulated technical arguments around being careful with changes to the file type, so I will just add that if GitHub were able to support notebooks in a way that does not lose any of the native functionality, then it could tap into an expanded audience.

Sorry folks that I am joining this thread after 69 other messages. A few general thoughts:

  • We should pin @choldgraf’s earlier response as a good starting point.
  • I’m glad that there is an enthusiastic discussion and a variety of inputs and options being discussed.
  • While the .ipynb format may not be ideal in all cases, it is currently the format that provides the best likelihood of reproducibility and execution of the notebooks across tools.
  • While I appreciate ML as a driving use case for change, there are other use cases that are very important (education, science, etc.).
  • Forking the format or creating additional “notebook” formats will make it much more difficult to maintain the ease of execution. Since .ipynb is a “de facto” standard, as evidenced by adoption and the ACM award recognizing it as such, the highest likelihood of success would come from working through the JEP process: continuing to refine the existing notebook format as a standard and providing extension points that preserve reproducibility and compatibility.

Hi @aronchick @inc0 @hamel Great to see the good work being done with MLOps over at GitHub. I know you are all enthusiastic about improving the visibility and use of notebooks. A great next step would be to draft a JEP (Jupyter Enhancement Proposal) with detail on your proposal.

Thanks! Carol

I think this is a really intriguing idea. Conventionally, the notebook writes out files with the input broken into different lines (i.e., as an array of lines, rather than a single string with embedded newlines) to ease diffing and browsing. Practically, this is useless when there are outputs that swamp the input cells. Having the inputs together with outputs separate would help a lot.

It still wouldn’t be as easy to read as YAML or something like that, but would be a big improvement.
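
To illustrate why the array-of-lines convention matters for diffing, here is a small sketch that serializes the same cell source both ways and compares the resulting line-based diffs. The `{"source": ...}` wrapper is a simplified stand-in for a real cell, and the helper names are made up for this example.

```python
import difflib
import json


def serialize(source: str, as_lines: bool) -> list:
    """Serialize a cell source as a list of lines or as a single string."""
    value = source.splitlines(keepends=True) if as_lines else source
    return json.dumps({"source": value}, indent=1).splitlines()


def changed_lines(old: str, new: str, as_lines: bool) -> list:
    """Return only the added/removed lines of a unified diff."""
    diff = difflib.unified_diff(
        serialize(old, as_lines), serialize(new, as_lines), lineterm=""
    )
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


old = "import math\nx = 1\nprint(math.sqrt(x))\n"
new = "import math\nx = 4\nprint(math.sqrt(x))\n"

per_line = changed_lines(old, new, as_lines=True)
whole = changed_lines(old, new, as_lines=False)

print("array of lines:", per_line)
print("single string:", whole)
```

With the array form, only the edited line shows up in the diff. With a single string, the whole cell is reported as changed, which is the same problem outputs cause at a larger scale.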

That’s fair, but left off of this is the enormous number of people who just use straight python - the majority? - in a script. But, to be honest, the specific numbers don’t matter. Let’s say we use the absolute most generous number for *.ipynb usage - 60%? - all I’m saying is that a HUGE percentage of data scientists (40%?) use something else to do ML, and that number is growing as a percent - based on GH analysis.

I totally agree! However, in all cases, the Jupyter format in its current form is effectively incompatible with any GNU-based diff/patch (not just GitHub, ANY GNU tooling/workflow), at least in a way where the notebook isn’t treated like a black box.

Again, we’re the outsiders here - if you all think that things are great, then we should really just focus on other things. But if this community DOES care about Git/GNU based workflows, we should try to collaborate!

Again, as a dev/PM (not a GitHub/Azure employee), based on my analysis of the file format, I believe this to be EFFECTIVELY impossible. JSON’s flow, the way that opaque outputs (e.g. images, etc) are included in the notebook, the monolithic nature of the file all combine to make any hope of “Just flip this switch (or put 100 devs on it) at GitHub and it’ll just work” seem out of the realm of possibility. I just can’t reason in my head how to do it with the file format as it currently exists.

We COULD add additional metadata that might work but as we’re thinking about this, a pattern to target is the GNU/Linux friendly way of diffing that has been done since 1992 and has an enormous set of tooling around it:

diff originalfile updatedfile > patchfile.patch
patch originalfile -i patchfile.patch -o updatedfile

It’s a simplification to say the least, but it’s not ENTIRELY wrong to think of GitHub as a wrapper to this. If we can solve the above problem for Jupyter, you get GitHub for free :slight_smile:

Can you say more here? Would you consider an .ipynb file to be more reproducible than, for example, a .py file? I’d love to understand what I might be missing.

In that case, can you recommend the direction we should go? All the other tools mentioned in this thread - nbdime, nbconvert, jupyter-text, jupyter-format, reviewnb, jupytext, wrattler, testbook, nbviewer - use an alternate format to diff/view and, unless I’m wrong (which has happened MANY MANY MANY times before), the current .ipynb format is effectively unusable for any “good” diff experience.

@willingc is there any chance you’d like to lead this effort? Or kill it? Either way, I’d much rather someone inside the community who has awareness of the landscape and all the right folks take the reins.

I did not say that ipynb is more reproducible than .py. Using ipynb now is the best way for the millions of existing notebooks to be most reproducible with the leading notebook tools.

Replacing .ipynb as the standard notebook file format, or making changes to the format, would be a community effort. The way to do that is the JEP process. It has the greatest likelihood of long-term success and avoids fracturing the existing community, which includes science, education, government, enterprise and others.

Not to sound like “it works on my machine”, I’ve used nbdime and it does a good job for diff. Personally, I haven’t needed more than that in my daily work. Though I understand others have different needs.

While a whole new format may seem like a simple solution, I suspect a new format will bring along its own set of deficiencies that are different from ipynb.

While flattered, I definitely do not have the bandwidth to lead this effort. I also do not wish to kill it. My recommendation would be for someone to document the “needs” and propose a mechanism for addressing those needs as a JEP.

From the millions of users standpoint, I see a radical change from ipynb as a risky move, and improvement to meet the desired use cases is a better approach. My 2 cents and I’m sure others in the Jupyter community could offer thoughts too.

I agree replacing ipynb as the standard would be a gargantuan effort, and I don’t suggest we do this. What I do suggest is that for the subset of people who live and die by code reviews, this may be a configuration they’d find beneficial. Just to clarify intents here: ipynb isn’t going anywhere anytime soon, if at all.

+1 it was never my intent (or anyone’s, to my knowledge) to swap out ipynb. But at the same time, we can’t ask the folks who have been building diff/patch based workflows for the past ~30 years to swap either :frowning:

The unfortunate part here is this introduces yet another (incompatible) format - the nbdime “diff”.

This is such a bummer, but I understand. I was hoping because you have such a background and understanding of the project & the community, you could provide the guidance we need here.

I’m not optimistic about us moving forward without someone like you leading.

@inc0 @aronchick I’m sorry if I misinterpreted your earlier messages. Though happy that we are on the same page about the ipynb standard.

Refocusing a bit…perhaps let’s chat later next week after the US holiday and you can give me a bit more context.

@aronchick can you expand more upon why you are not optimistic (for the benefit of everyone)? It is certainly difficult to have this effort be contingent upon a core Jupyter contributor leading it. What I’m hearing is that everyone is asking us to lead and they can review, because the small group of core maintainers tagged on this thread do not necessarily have free bandwidth.

Perhaps we let @inc0 or @aronchick lead with a JEP (I’m happy to support with writing, as I know that is time consuming), and have that reviewed by some folks here? Perhaps that is a less onerous way of moving forward while still having buy in?

Thoughts?

Let us know if you recommend a more appropriate forum for kicking off this effort, such as some kind of Jupyter meeting, or whether we should proceed with continued comments on this thread. I want to be mindful of the fact that this thread is now an epic 80+ messages in, so I just wanted to take the temperature on whether this mode of communication is the best.

@willingc thanks as always for your support
