Jupyter and GitHub - alternative file format

ipynb combined with nbdime already allows great diff reviews today, as shown e.g. in the jupyterlab-git extension (look at demo.ipynb on https://raw.githubusercontent.com/jupyterlab/jupyterlab-git/master/docs/figs/demo-0-10-0.gif)

I agree that alternative representations like yaml or markdown have value, and I would love to have a markdown format with a preamble like [*], where one could define the server size, the type of kernel, the datasets to mount and the initialisation scripts to run. The markdown notebook renderer would be free to honor the preamble definitions or not.

This drives me to the root of the question which is IMHO the notebook spec where I see 2 issues:

  1. The format is defined by a json-schema (https://github.com/jupyter/nbformat/blob/a06f4c84738b338fee5ad6316b21918a8709b636/nbformat/v4/nbformat.v4.4.schema.json) which makes it easy to implement as JSON but hard or even impossible to apply to markdown (to my knowledge there is no standard way to define where/how to put all those definitions in markdown). So should we define how to apply the json schema to the concrete formats (md, yaml…), or move away from json schema and try to be more generic (not sure what that would look like)?

  2. The format is difficult to evolve, not from a technical standpoint, but more on a community agreement aspect. As many implementations are using ipynb and many users are also looking at it, any change seems to be very difficult to get adopted (see e.g. Parameterized Kernel Launch on https://github.com/jupyter/enhancement-proposals/pull/46#).

[*]

    name: datalayer/paper:features
    version: latest
    description: Datalayer Features
    picto:
      - variable: picto
    server:
      image: datalayer/server:base:latest
      size: S
      prune: 1h
    kernel:
      image: datalayer/kernel:base:latest
      size: S
      prune: 1h
    datasets:
      - input:
        - variable: iris
          image: datalayer/dataset:iris:latest
      - output:
        - name: iris_predict
          variable: iris_predict
          type: pandas
          format: csv
          separator: ;
    init:
      - load.ipynb
    snippets:
      - |
        import …
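To make the "renderer is free to honor or not the preamble" idea concrete, here is a minimal sketch of how a tool could separate such a preamble from the markdown body. It assumes the preamble is delimited by `---` lines at the top of the file (a common front-matter convention, not something this thread has agreed on), and the helper names `split_preamble` and `top_level_keys` are made up for illustration. It deliberately does not parse YAML values.

```python
def split_preamble(text, fence="---"):
    """Return (preamble, body); preamble is '' when the document has none."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != fence:
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == fence:
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    # Opening fence without a closing one: treat everything as body.
    return "", text


def top_level_keys(preamble):
    """List unindented `key:` names without interpreting their values."""
    keys = []
    for line in preamble.splitlines():
        if not line or line[0].isspace() or line.startswith(("-", "#")):
            continue
        if ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return keys


# Hypothetical document: the `---` delimiters are an assumption.
DOC = """---
name: datalayer/paper:features
server:
  size: S
kernel:
  size: S
---
# Title

Some prose.
"""

preamble, body = split_preamble(DOC)
keys = top_level_keys(preamble)
print(keys)
print(body)
```

A renderer that does not understand `server` or `kernel` can simply ignore those keys and render `body` as ordinary markdown.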

I think we need to think about both local and GitHub workflows, mostly because on GH only text files get first-class treatment. If we solve for text files we get a lot of features besides display, like in-line comment reviews, change suggestions etc. Look at how markdown is handled today on GitHub in pull requests. There is a button to show rendered markdown with a diff, but you lose a lot of important review features in this view.

I think, from a code reviewer’s perspective, the ability to comment inline is by itself reason enough for markdown notebooks. If I maintained an open source project built mostly with notebooks, that feature would be a must-have.

On top of that we also need to think about resolving git conflicts, which is next to impossible with the current format.

My thinking is, if we can provide an easy lossless conversion, we can allow MyST as an alternative. Since it’s just a config option, people who need it may switch to the MyST backend if they choose to, and with a set of tools to convert from one to the other, it’s not going to be a huge issue to just switch formats. I’ll work on a prototype so we have some code to talk about.

Just a quick note here that I think

“The format is difficult to evolve, not from a technical standpoint, but more on a community agreement aspect.”

is a feature, not a bug. Building community standards should take time, and creating a new standard, or evolving a pre-existing one, should be a long process especially if it is as major as using a new text format for jupyter notebooks.

As an example re: MyST - I obviously have an interest in growing the use and adoption of MyST notebooks, but I don’t think it’d be a good idea to “standardize” MyST until we’ve had many, many more months of usage, feedback, and ultimately an open community process to decide on whether it’s worth “officially” supporting.

The hard parts of open source are not technical, they’re social - but that’s also what gives projects like Jupyter their power. It is what makes these tools valuable to a broader audience instead of being designed for the particular needs of one company / online platform / individual.


Agree, but we should start somewhere, and MyST seems like a good place to start.
Just to be clear, I do not want to replace the ipynb format entirely, but to provide an option. We can evolve this new format on the side and slowly grow adoption if the idea is good. What I propose is to have an alternative that you can configure. Initially it’s going to be entirely pluggable and live outside of the Jupyter codebase.

I love that this topic is getting attention and momentum.

Here is an observation based on my experience using and developing for the SageMath notebook format. For context, the SageMath notebook format predated the ipynb format by a few years. The SageMath notebook project is now essentially abandoned and SageMath users have transitioned to Jupyter notebooks with a Sage kernel, or use CoCalc’s implementation of SageMath notebooks. The SageMath notebook format was a tarball containing what was basically an HTML file with code inputs and outputs embedded in it, encoded with a special triple-brace syntax. Also, any binary files like output images or other large cell outputs (IIRC) were stored in subdirectories in the tarball file, one subdirectory for each cell.

Since the internal document structure of a SageMath notebook was essentially an HTML file with code blocks, not unlike some of the proposals here having a notebook as a markdown file with special code blocks, it had decent diffs and was fairly readable. Though prose cells and code cells were presented as siblings that could be intermixed, in reality code cells were embedded in a single prose document, so you could, for example, have HTML in one prose cell affect the rendering of the code cells following it. For example (IIRC), you could arrange code cells in a table by sufficiently hacky uses of corresponding table tags in the surrounding prose cells.

I bring this up because the document structure and capabilities, and user expectations, are fundamentally different between the Jupyter notebook ipynb format (where cells are truly siblings in a JSON document, and can be considered independent of each other) and a notebook format where code cells are embedded in a document, such as code cells in a markdown document. Users might expect to be able to put code cells in a list, for example, because that would be valid in Markdown, even if it violates the traditional notebook structure of prose and code cells being siblings.
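The difference can be seen in a tiny example. The sketch below builds a minimal nbformat-4-style notebook as a plain Python dict (the cell contents are made up): the code cell is a sibling of the two markdown cells, so a bulleted list that "continues" after the code is really two unrelated lists, whereas a Markdown document could nest the code block inside a single list item.

```python
import json

# Minimal notebook in the nbformat v4 layout; the content is illustrative only.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "- step one"},
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": "print('hello')",
        },
        {"cell_type": "markdown", "metadata": {}, "source": "- step two"},
    ],
}

# Cells form a flat list of siblings: there is no way to express
# "this code cell lives inside that list item".
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)

# The same notebook survives a JSON round trip unchanged.
restored = json.loads(json.dumps(notebook))
print(restored == notebook)
```

A text-based format has to decide whether to forbid nesting (and validate against it) or to allow it and accept that such documents cannot be mapped back to sibling cells.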


Agreed. For local workflows I think the technical side of the problem is solved (nbdime). What is missing is widespread awareness and adoption. So I think what is needed is a “marketing push”, as well as working on the resulting feedback from users. I think requiring an additional tool is fine. There is precedent for that: SVGs, images, movies, CAD drawings. Some of these are text based formats, but most of the time you can’t really figure out if the change is “good” from looking at the textual diff only (except for the most trivial of edits, the SVG renderer in my head is terrible :wink:).

GitHub already shows “rich” diffs for images and CAD drawings for example. Why not build on that idea and offer a “rich” diff for ipynb files as well? For a first version this could be the web view that nbdime offers.

For a first version of a text based diff I’d teach GH that when it is asked to diff two ipynb files:

  1. on the fly convert both to a text based format
  2. diff the text based format
  3. show that diff to users
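The three steps above can be sketched with the standard library alone. This is only an illustration of the idea, not how GitHub or nbdime do it: the text rendering (`render_as_text`, one header line per cell followed by the cell source) is an invented format, and outputs are ignored entirely.

```python
import difflib


def render_as_text(notebook):
    """Step 1: render cell sources as plain text lines, one header per cell."""
    lines = []
    for index, cell in enumerate(notebook["cells"]):
        lines.append(f"## cell {index} [{cell['cell_type']}]")
        source = cell["source"]
        if isinstance(source, list):  # ipynb allows a list of strings
            source = "".join(source)
        lines.extend(source.splitlines())
        lines.append("")
    return lines


def text_diff(old, new):
    """Steps 2 and 3: diff the two renderings and return the diff lines."""
    return list(
        difflib.unified_diff(
            render_as_text(old),
            render_as_text(new),
            fromfile="old.ipynb",
            tofile="new.ipynb",
            lineterm="",
        )
    )


def make_notebook(value):
    """Build a tiny illustrative notebook whose code cell assigns `value`."""
    return {
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": "# Demo"},
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": [f"x = {value}\n", "print(x)"],
            },
        ]
    }


diff = text_diff(make_notebook(1), make_notebook(2))
print("\n".join(diff))
```

The resulting diff is ordinary line-based text, so the usual inline-comment machinery could in principle attach to it.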

This gives you inline comments and good diffs. Something to explore and work on in a v2 would be taking edits applied to the converted document and figuring out which edits in the ipynb they correspond to. The web view of nbdime has solved this already, but I’d have to try how well (or if) it works for the text based one.

I like the idea of using different “renderings” (fancy HTML, text based, etc) of a notebook which is stored on disk as ipynb. Decoupling the on disk format and the rendered format is powerful. For example GitHub, Gitlab and Tim’s local diff viewer wouldn’t have to agree on how exactly we each render the “diffable view” as it only ever exists as a rendering on our websites. Or we could agree and quickly change it when we discover a new use-case because we don’t have millions of users who have old versions installed locally.

Main point: a notebook doesn’t need to be stored in a new file format to allow text based diffs. You can convert it on the fly to a suitable “rendering” and then diff those. Or use a more powerful diff viewer that can handle “not text” based diffs.


@betatim The “on the fly” approach would allow diff viewing, but I am not sure it will be suited for comments/reviews like we have today on GitHub pull requests.

@choldgraf Fully agree that standards take time and should be as generic as possible and not only serve specific cases.


Can you expand on that a bit? It isn’t clear to me why that would be a problem/limitation.

I am thinking of cases where e.g. user foo comments on line 76 of cell 3, then user bar replies to foo’s comment and adds a new comment on line 87 of cell 6.

GitHub (or whoever is hosting the notebook) has to keep track of all those comments across the different versions in the notebook’s lifecycle. This tracking will probably be done in a different/external data structure or storage, separate from the notebook. The challenge there is to keep the links between the comments and the exact places where they apply accurate across the history.

If I had to implement such features, as a developer, I would feel in a better state with a text-based (markdown…) notebook than with a json-based notebook.

But this is just a feeling I have, and the challenges to solve to have that are maybe of the same magnitude with text or json format.
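As a rough sketch of what that bookkeeping could look like for an on-the-fly rendering, the converter can record, for every rendered line, which cell and which line inside that cell it came from. The rendering format and the names `render_with_line_map` and `locate_comment` are invented for illustration, and the sketch says nothing about the harder part, which is keeping these anchors valid as the notebook changes across commits.

```python
def render_with_line_map(notebook):
    """Render cell sources and remember where each rendered line came from.

    line_map[i] is (cell_index, line_in_cell) for source lines and
    None for the synthetic header lines added by the rendering.
    """
    lines, line_map = [], []
    for cell_index, cell in enumerate(notebook["cells"]):
        lines.append(f"## cell {cell_index} [{cell['cell_type']}]")
        line_map.append(None)
        source = cell["source"]
        if isinstance(source, list):
            source = "".join(source)
        for line_in_cell, line in enumerate(source.splitlines()):
            lines.append(line)
            line_map.append((cell_index, line_in_cell))
    return lines, line_map


def locate_comment(notebook, rendered_line):
    """Translate a comment on a rendered line to (cell_index, line_in_cell)."""
    _, line_map = render_with_line_map(notebook)
    return line_map[rendered_line]


notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Title"},
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": "x = 1\nprint(x)",
        },
    ]
}

# Rendered lines: 0 header, 1 "# Title", 2 header, 3 "x = 1", 4 "print(x)"
anchor = locate_comment(notebook, 4)
print(anchor)
```

A host could store the `(cell_index, line_in_cell)` pair, or a stable cell identifier if the format provides one, instead of a raw line number in the JSON file.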

Hello everyone, thanks for discussing this!

@inc0, with Jupytext 1.5 you can easily create text-only notebooks:

As you see there are many different formats. I think this is because people sometimes need more than a Jupyter Notebook. For instance, Myst-MD notebooks work well with Sphinx and allow cross references and citations, a feature that does not exist in plain Jupyter notebooks. Notebooks represented as scripts work well in IDEs (for instance, the percent format is understood by Spyder, Hydrogen, VS Code and PyCharm). In your case, I think you are looking for a format that fits nicely on GitHub.

Currently Jupytext’s text-only notebooks don’t store outputs. We have an issue to discuss how we could store the outputs in Markdown notebooks. I think outputs should not be in the main text document but in separate files, in a directory named after the notebook (not unlike the SageMath format described by @jasongrout). For instance, if my output is a pandas data frame, I want it to be stored in unnamed_cell_55.html (or even better world_pop.html if the user gave a name to the code cell) and then included in my main notebook. But it should not live in the notebook itself, otherwise changing one line of code would create hundreds of lines of diff in the main file.
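Here is a minimal sketch of that layout, to make the proposal easier to discuss. It is not Jupytext code: the function name `externalize_html_outputs` is made up, it only handles `text/html` outputs, and it assumes the user-given cell name is stored under a `name` key in the cell metadata.

```python
import tempfile
from pathlib import Path


def externalize_html_outputs(notebook, notebook_path):
    """Write text/html outputs to files in a directory named after the notebook."""
    out_dir = Path(notebook_path).with_suffix("")  # report.ipynb -> report/
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, cell in enumerate(notebook["cells"]):
        if cell["cell_type"] != "code":
            continue
        # Assumption: a user-given cell name lives in metadata["name"].
        name = cell.get("metadata", {}).get("name") or f"unnamed_cell_{index}"
        count = 0
        for output in cell.get("outputs", []):
            html = output.get("data", {}).get("text/html")
            if html is None:
                continue
            if isinstance(html, list):  # ipynb allows a list of strings
                html = "".join(html)
            suffix = "" if count == 0 else f"_{count}"
            target = out_dir / f"{name}{suffix}.html"
            target.write_text(html, encoding="utf-8")
            written.append(target)
            count += 1
    return written


def html_cell(html, name=None):
    """Build an illustrative code cell with a single HTML output."""
    metadata = {"name": name} if name else {}
    output = {"output_type": "display_data", "metadata": {}, "data": {"text/html": html}}
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": metadata,
        "outputs": [output],
        "source": "df",
    }


notebook = {
    "cells": [
        html_cell(["<table>", "</table>"], name="world_pop"),
        {"cell_type": "markdown", "metadata": {}, "source": "Some prose."},
        html_cell("<p>hi</p>"),
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    paths = externalize_html_outputs(notebook, Path(tmp) / "report.ipynb")
    names = [path.name for path in paths]
    folder = paths[0].parent.name

print(folder, names)
```

With this layout, editing one line of code changes the main text file by one line, and the regenerated output shows up as a change to a separate file.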

Concretely, one reason for which we have not finalized outputs in Markdown is that, according to my experiments, it is not possible to display an embedded HTML file in a Markdown file on GitHub. But maybe it is? If not, do you think this is something that could change?

PS: I like the hidden YAML div for the metadata in your Markdown sample notebook. The proposal also appeared at https://github.com/mwouts/jupytext/issues/527, and it’s scheduled for the next version of Jupytext.


I don’t think it solves the issue I’m thinking about. Let me rephrase the use case I’m trying to handle.

A maintainer of an open source project that is heavily based on Jupyter notebooks will need to do reviews routinely. We still want to retain outputs, as the project may be a data science one, and some notebooks would display, say, a dataset report. Currently, a PR into such a project would be nearly impossible to handle: any PR would effectively be an incomprehensible change in JSON, with no way to comment on a particular line change etc. If you use diffs locally, with say nbdime, you’d need to pull that change locally, diff it locally, see an issue in line X, go to the JSON, find line X in the diff and comment. If you use text notebooks, you lose outputs. If you store a second representation of notebooks besides ipynb, every review would require you to manually find the corresponding code in your ipynb notebook. You would also need something like a GH action that semi-automatically renders these second representations and commits them to the branch, which is just hacky in my opinion.

It’s about far more than just diff rendering. We’ve been thinking about nbdime a lot and in my personal opinion it just doesn’t solve the problems we have; collaboration on existing notebooks is the fundamental problem here.

What I’m proposing is to have what would be a MyST backend to notebooks that otherwise would behave exactly like typical ipynb. This handles easy review and everything else related to collaboration. There are a lot more tools for text file collaboration than for the ipynb format, and you’ll be able to use any of them.

That’s really cool :) If we use constructs like that we can create a lossless conversion to and from ipynb. After that we tie this conversion to the point of saving a notebook file (the save method in the contents manager) and loading it (the get method), and we end up with a fully transparent notebook experience that won’t differ in any way from how people work today; only the file-level representation changes.
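To show the shape of that idea without depending on Jupyter itself, here is a toy sketch. The text format (`%%notebook` and `%%cell` marker lines carrying JSON, followed by the raw cell source) is invented for this example and is not MyST; `TextBackedStore` is a hypothetical in-memory stand-in for a contents manager, not Jupyter's actual API. It assumes cell sources are plain strings and that no source line starts with `%%cell `.

```python
import json

NOTEBOOK_MARK = "%%notebook "
CELL_MARK = "%%cell "


def to_text(notebook):
    """Serialize a notebook dict to the toy single-file text format."""
    header = {k: v for k, v in notebook.items() if k != "cells"}
    lines = [NOTEBOOK_MARK + json.dumps(header, sort_keys=True)]
    for cell in notebook["cells"]:
        # Everything except the source (metadata, outputs, ...) goes on one line.
        meta = {k: v for k, v in cell.items() if k != "source"}
        lines.append(CELL_MARK + json.dumps(meta, sort_keys=True))
        lines.extend(cell["source"].split("\n"))
    return "\n".join(lines)


def from_text(text):
    """Rebuild the notebook dict from the toy text format."""
    lines = text.split("\n")
    notebook = json.loads(lines[0][len(NOTEBOOK_MARK):])
    cells, sources = [], []
    for line in lines[1:]:
        if line.startswith(CELL_MARK):
            cells.append(json.loads(line[len(CELL_MARK):]))
            sources.append([])
        else:
            sources[-1].append(line)
    for cell, source in zip(cells, sources):
        cell["source"] = "\n".join(source)
    notebook["cells"] = cells
    return notebook


class TextBackedStore:
    """Hypothetical store that converts on save() and get(), like the proposal."""

    def __init__(self):
        self.files = {}

    def save(self, notebook, path):
        self.files[path] = to_text(notebook)

    def get(self, path):
        return from_text(self.files[path])


notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {"kernelspec": {"name": "python3"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Title\n"},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": "1\n"}],
            "source": "x = 1\nprint(x)",
        },
    ],
}

store = TextBackedStore()
store.save(notebook, "demo.md")
round_tripped = store.get("demo.md")
print(round_tripped == notebook)
```

Keeping outputs inline on the marker line makes the round trip lossless but hurts readability for large outputs, which is exactly the trade-off discussed elsewhere in this thread.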

The author of this issue makes my point entirely :) Deprecating the usage of notebook files because they can’t handle GitLab rendering.

I am :100: in support of sub-communities and projects exploring their own text-based notebook formats, utilizing tools like jupytext to do so. I think it’s good to explore that design space, and it will get us more information when it is time to have an “official” text-based format in Jupyter.

What I’m worried about is GitHub adopting a text-based format for notebooks before this process plays out. That’s because GitHub, because of its sheer size, could “create a standard” on its own by putting it in front of many, many eyeballs. I think GitHub should be in the business of leveraging and empowering pre-existing community standards, not creating new ones that suit GitHub’s engineering / product needs.

I’d recommend the following steps

tl;dr: any new standards should begin in the Jupyter community, follow community processes, and have a decent amount of adoption before they make it into any products in a vendor’s toolchain. If GitHub wants text-based notebooks, the best way to do so is to encourage and facilitate this process in the Jupyter community, help Jupyter adopt and implement this standard, and then build functionality around it in vendor UIs (such as GitHub).

  1. Improve GitHub support of .ipynb files.
    • Since there is no text-based jupyter notebook standard yet, begin by improving GitHub support of ipynb since it’s what we’ve got.
    • As @betatim notes, GitHub already does pre-processing and rendering of other file types (like SVG)
    • since nbdime exists there is already prior tooling that could be leveraged for diffing. We know this is possible because ReviewNB exists and does this nicely.
    • some nice UI could be built around commenting or suggesting edits on a notebook (e.g. render the notebook and attach “comment” buttons to each line of a code or markdown cell)
    • An opinionated take: GitHub should not implement any kind of core functionality around a notebook format that isn’t officially supported by the Jupyter community.
  2. Continue this conversation, and eventually open a JEP for a text-based notebook format.
    • The JEP process is the primary mechanism the jupyter community has for marking large, complex decisions that cut across many communities.
    • Continue this conversation around an “official” text-based version of a Jupyter Notebook, and eventually make it a JEP.
    • GitHub team members should be a part of this process to help understand the considerations needed for a text-based notebook. The ideas that @inc0 has provided here have been quite helpful!
  3. Implement that JEP and get feedback.
    • Add functionality for Jupyter UIs, as well as downstream tools like Jupyter Book or third-party tools like vscode plugins.
    • We will get a lot of user feedback about the standard, what people like, don’t like, etc.
    • This will give us a chance to iterate a bit on the format, and crystallize how it should look
    • It will also be an opportunity for GitHub to leverage this feedback in order to create a better end-product
  4. Implement in downstream products.
    • After we’ve gotten feedback from users and have an idea of how this text-based format is working for them, and are reasonably sure that it will not evolve much, then downstream vendors can feel comfortable knowing that they are using a community standard that has been battle-tested and is worth adopting.
    • This is something I’m sure folks in the Jupyter community would be happy and excited to assist with.

extra note: As @jasongrout pointed out, it is trickier to do text-based notebooks than you’d think. With MyST-NB we’ve found people wanting to use the MyST notebook format in a way that actually breaks conventions from ipynb, so I think the design of that spec, and the documentation, validation, and tooling around it, will be important to think through carefully.

extra extra note: there is always going to be a trade-off between human readability and information that is useful for machines. I think outputs are the obvious place where this becomes clear. Even for the cleanest markdown notebook format, if you allow outputs to be embedded in the text file itself, it will become close to unreadable (e.g. if there are gigantic binary image blobs in there). That’s not a problem with JSON, it’s a problem of storing data as text.


I completely understand this worry and I want to emphasize that when I’m here having this discussion, I consider myself a Jupyter community member first and a GitHub employee second. I promise not to force the issue with corporate power; I’ve seen this happen too much in great open source projects and the last thing I want is to be part of this problem. That’s also why I’m purposefully leaving out what GitHub can or can’t do internally for rendering. I’ll just say that we won’t arrive at the same level of support that text files have, and we will always be left without tools that are otherwise available.

I think a JEP is important, but we need to take a few steps before we know enough to actually submit one.
Steps I’d like to take:

  1. Find a group of people from diverse backgrounds that are interested in this effort; this thread is a means to this end more than anything.
  2. Prototype a new package that implements some solution, and see what we learn and how it works.
  3. Let people use this 3rd party driver (via a pluggable contents manager) for some time, iterate and improve its behaviour, and build a robust community to support it.
  4. Once we feel good about the standard and have some traction and a healthy amount of testing and use cases, we can draft a JEP for it to become part of Jupyter itself.

I think step #3 is by far the hardest, and it is critical for all of us to feel good about this standard. It’s important that we give it enough time and attention before attempting to change core Jupyter. Fortunately, we don’t need to do anything in the Jupyter codebase to start working on it; we’ll write a plugin and go from there.

My main goal here is to start a conversation, start some coding on the side and build a community that will help us push this forward. We’ve gathered a healthy amount of prior art here, and it only shows that this is a real issue that people struggle with.

I hope I alleviated your worries, Chris? I really don’t want to do anything that would hurt the Jupyter community or its users.


Hey Chris–

Fellow Azure/GH person here. We DEEPLY respect the community.

In the spirit of having this be community led first, can you propose someone in the open source community to lead this so that we are ABSOLUTELY not driving the boat? We’ll ask them to come up with a process here.

We’re all speaking as passionate users first, not employees - unfortunately, we can’t dictate what our corporate strategy will be :frowning: But we can work with folks who are already passionate according to their roadmap to help!


Thanks both @inc0 and @aronchick for your responses. I’ll spend some more time thinking about this and would also love to hear what others in this thread think.

Really quickly to this point:

“I consider myself a Jupyter community member”

I want to make it clear that you’re both members of the Jupyter community, and I deeply appreciate your engagement here on this thread. I hope that I’m not coming across as nay-saying here, I think it is awesome that you and your teams are supporting Jupyter, and I am deeply appreciative of your conversations here.

What I think the Jupyter community should do is be an ally to your organizations in order to ensure that there is a path forward to solving these challenges through community processes. I hope that you see enough of a path forward in working with the Jupyter community on this, so that it is easier for you to make a case internally that it’s “worth it” to do so. I know sometimes working with open source communities can seem chaotic and unpredictable, so hopefully we can improve some of these pieces.

As @inc0 notes - I don’t know that a JEP is needed right now, but maybe soon? At that point, I think it’d be good to find a shepherd that can facilitate the JEP conversation around a proposal. We’ll also need someone / a team to write up the proposal in the first place. I’m happy to be a part of that process when the time is right, as this is something that would definitely be relevant to the executablebooks project (which is the steward of MyST markdown)

Totally - this is the approach we’ve taken with MyST markdown (https://myst-parser.readthedocs.io/) and MyST notebooks (https://myst-nb.readthedocs.io/en/latest/use/markdown.html), and also one reason that I love jupytext, since it makes it possible to move back-and-forth between these different formats.

Do you think that with these pieces we could make some progress? Or do you see obvious holes that they are missing part of the solution? (e.g., maybe storing outputs is an example). Perhaps one place to start is to brainstorm where the current text-based solutions are lacking, and think of ways that we can improve upon that. Again, I am happy to think about this for MyST markdown, though I wanna admit my own bias in that direction as one of the folks working on it :slight_smile:


That’s where my push for a prototype comes from. From my super quick testing I think we are in a good place. MyST seems to have all the tools we need (one potential gap would be rendering base64 encoded figures, but that’s just code to be written into the MyST renderer). The biggest thing will be figuring out how to structure markdown to represent a notebook, but we can just use what already exists, since we have multiple robust solutions to render notebooks into MyST. What (I think?) we don’t have is the ability to recover a notebook from a MyST file; this is where this issue becomes important. Once we make that work, it’s just a matter of some glue code to make Jupyter itself save and load files in this format. After that we’ll know if this is a feasible and useful solution. The way I see it, we are really close. I know I’m going to work on it over the next week or two to see how far I can push it (and ofc everyone here is more than welcome to join!).

It’s just my opinion - and I have about 1000x less context than the rest of the folks, so please do correct me.

My GENERAL take is if the community decided that JSON formatted notebooks was the solution O(forever), then we’d try to build around that (again “we” is @inc0, @hamel, me and others - not GitHub the company). HOWEVER, if we felt like the community WAS going to go to a different format, we’d probably want to target that first.

This is in NO WAY saying the core format for Jupyter isn’t great for Jupyter! It is! It’s just such a bad mismatch for other things that people want to do (like complicated diffing, merging, collaboration, etc) - particularly in Git (not to say GitHub) workflows.

The amount of work necessary to figure out some of these things for JSON based notebooks is many many person years, and something that we’d only want to take on IFF the community decided there was no other way. As has been mentioned many times in this thread, this is a hard problem and there are already solutions here, and an alternative would be adopting some of those solutions behind the scenes to do work.

Sadly, that still wouldn’t offer a solution for some of what @inc0 talked about (specifically, being able to collab on a notebook - JSON’s flow is just too hard for that), but we’ll jump off that bridge when we come to it.

This is where it does parallel SVG/etc - those are view-only formats, which are not terribly interesting. GitHub already offers views of Jupyter notebooks; we’re looking for better experiences beyond viewing.

Like I said, I think step one is having a non-GitHub person lead. We don’t in ANY way want to stomp around where we have limited context. Just point us at some discussions and/or the right people and we’ll be an extra set of arms and legs!


A few quick thoughts

“we don’t have is the ability to recover a notebook from a MyST file”

Could you explain a bit more what you mean by this? MyST markdown notebooks work with Jupytext, so you can two-way the content and metadata back and forth. Do you mean the outputs specifically?

“decided that JSON formatted notebooks was the solution O(forever)”

I don’t know that we’re talking about replacing the ipynb format with something text-based, more like “also accepting a particular pattern for text-based notebooks”. I think it’s an important distinction because it may mean that the two formats don’t need complete parity. E.g., sometimes it’s useful to have a big machine-readable blob of data like JSON, other times it’s useful to have something human-readable, but there are tradeoffs between both.

As others have mentioned - this quickly becomes a hairy and opinionated topic. E.g., I just checked and apparently the very first JEP was somebody proposing something like this as well. The nbformat repository may also have useful discussions from the past. Rest-assured the issue of non-human-readableness has been a part of the conversation from day 1, so others may have ideas for why the JSON-structured format is used (and maybe should continue to be used).

“specifically, being able to collab on a notebook - JSON’s flow is just too hard for that”

@saulshanabrook has recently been spearheading an effort to figure out a data model etc for real-time collaboration in Jupyter environments. I wonder if some of that would be useful for GitHub to think through collaboration etc (I’m guessing much of that conversation will be more low-level, but it might intersect with notebook format stuff).

“I think step one is having a non-GitHub person lead”

I’m less-concerned that the lead is or isn’t from any one particular company. I think what’s more important is that the process of discussion, deciding, etc is one that is open and inclusive, and that many stakeholders in the jupyter community (other companies, open source contributors, researchers and educators, etc) have a chance to weigh in on. As I mention above - you’re both part of the Jupyter community too, and as such the goal when making decisions should be to bring as many people of diverse backgrounds along as we can.

“Once we make that work, it’s just a matter of some glue code to make Jupyter itself save and load files in this format. After that we’ll know if this is a feasible and useful solution. The way I see it, we are really close.”

I think that we have certainly made progress. I’m not sure yet that we are close :slight_smile: I think we’re close to having a solution possible, but this is very different from the solution that is used for community adoption. Or put another way, in my 4-point list above you suggested that 3 (implementation) would be the most work and time. I disagree - I think that 2 will be significantly more time and work. Just look at the many, many, many other times that folks in the Jupyter community have proposed a text-based ipynb (the JEP I linked above is, I think, just the tip of the iceberg). IME, getting a diverse group of people to agree on something is always way harder than building technology.

That said, much has changed over the years and perhaps we are closer to a pattern that is worth adopting. It’s encouraging to see so much interest and enthusiasm around this. Maybe an interesting place to start is to figure out if we can get lossless two-way conversion between ipynb and MyST notebooks via jupytext. It’s something I’d be interested in iterating on (and perhaps it’d be worth making a myst-notebook specification repository instead of having it as a reference implementation inside myst-nb).

I think that a good guide for when it is time for a JEP is if we have a good answer to all of the questions that are in the JEP PR template: Summary — Jupyter Enhancement Proposals

Yes, also kernel information etc. Basically all the info from ipynb to be stored in MyST in some form, probably as yaml blocks.

I agree, but I also think that a solution is a stepping stone. I don’t think anybody here reasonably thinks about fully replacing ipynb overnight. That’d be crazy :) I’m not sure we would ever want to replace ipynb; there is a wealth of tools built on top of this format and we shouldn’t throw that away. Also, this discussion is waaay in the future as I see it. What I personally want is to start working on a first iteration and see where that gets us: have a pluggable contents manager and start using it. That will give us more information than any amount of discussion.

I’m not sure jupytext is the correct place for it. It may be, but I’m personally not a huge fan of having 2 representations at the same time; that’s just my opinion right now. I’ll dig more into it. I’d start with a standalone solution and see if it can be reused in jupytext. Once we have lossless conversion somewhere, it should be trivial to reuse it wherever.

The reason I’m reluctant to start from a specification doc is that, in my experience, it takes forever to agree on it, and then it often blows up when you actually try to implement it. I’m more in favor of prototype-driven work, where we iterate on code and the format grows naturally out of it. Only then would I be comfortable writing a JEP, because we’ll know exactly what we want to achieve and have an example of it actually working.

A lot of text has been posted (on a Sunday no less), so I’ll try to summarise to see where we are:

  1. there are a lot of existing text based notebook formats out there already. There isn’t one (or more) that does everything that would be needed.
  2. a lot of knowledge and trade-offs discussions have already happened in issues on the JEP repo as well as jupytext.
  3. one of the next steps is to create a contents manager that uses a (new) text only format that allows people to use and explore it (code first, not spec first)

One question I am left with and realised I don’t know the answer to: is there a (prototype) single file, round tripable, text based format somewhere? I think all the jupytext formats either use a second file or are not round-tripable. Myst is also missing something. Is this correct? Should we find a format that is close to checking these boxes and then work a bit on making it tick those boxes?

Then once we have that create a small prototype contents manager that uses this format?

These steps seem like a good way to get to a working bit of code that people can test drive. Which then makes it easier to submit concrete notebooks for which this “doesn’t quite work for this notebook” as issues. And then we iterate.

What do you think @inc0?
