Microsoft Word Integration (Intern Project)

Hello everyone! We are Derek and Isabela, one of the intern teams at Jupyter Cal Poly, and we will be working on Microsoft Word integration this summer. Right now we have outlined two key goals for our project:

  1. Convert between .ipynb and .docx formats, preserving the interactivity of code cells.

  2. Incorporate familiar elements of word processors, including WYSIWYG menus in markdown cells and previewing .docx files in JupyterLab.

We are still in the research phase and would love to hear your thoughts on the project.

Hi, I’m the other intern on this project. I look forward to working with the community and solving all our Word Integration problems!

Very loosely related to this, the organisation where I work uses an XML document format for mastering educational content. (The XML was originally created using a modified version of Word and is now authored using an XML editor, though perhaps with conversions from .docx docs?)

I’ve just started exploring various ways of converting the XML to a notebook format; at the moment, I’m taking a simplistic route of converting the XML doc to a markdown format that Jupytext can convert to ipynb, but I should really look at parsing it into pandoc somehow.

One of the issues I’ve found is that the schema for our XML docs supports various flavours of code style, and they aren’t used consistently. There is also the issue that some things are represented as code elements in Word docs but don’t necessarily work as code when rendered into a notebook code cell.
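As a rough illustration of the conversion route and the code-style issue described above, here is a minimal Python sketch. The element names `para` and `code` are hypothetical stand-ins (the real schema is not shown in this thread), and it assumes the target is a Markdown file in which fenced blocks tagged `python` are what Jupytext turns into code cells. Snippets that do not parse as Python are kept as indented literal text instead; that is a choice made for this sketch, not a description of the actual converter.

```python
import ast
import xml.etree.ElementTree as ET

FENCE = "`" * 3  # built programmatically so this example can sit inside a fenced block


def looks_like_python(source: str) -> bool:
    """Return True if the snippet parses as Python (it may still fail at run time)."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


def xml_to_markdown(xml_text: str) -> str:
    """Convert a toy XML document into Markdown text.

    Hypothetical schema: <para> holds prose, <code> holds anything styled as code.
    """
    root = ET.fromstring(xml_text)
    chunks = []
    for element in root:
        text = (element.text or "").strip()
        if not text:
            continue
        if element.tag == "code" and looks_like_python(text):
            # Parses as Python: emit a fenced block tagged with the language.
            chunks.append(f"{FENCE}python\n{text}\n{FENCE}")
        elif element.tag == "code":
            # Styled as code but not valid Python: keep it as indented literal text.
            chunks.append("\n".join("    " + line for line in text.splitlines()))
        else:
            chunks.append(text)
    return "\n\n".join(chunks) + "\n"


# Made-up sample input: one prose element, one real snippet, one menu path styled as code.
SAMPLE = """<doc>
  <para>Add two numbers.</para>
  <code>total = 1 + 2</code>
  <code>File > Save As...</code>
</doc>"""

print(xml_to_markdown(SAMPLE))
```

A syntax check like this only catches the obvious cases (menu paths, shell prompts, pseudo-code); a snippet can parse cleanly and still fail when run in a notebook.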

I need half a day to get my code decoupled from some internal scrapers but hope to get round to that in a week or two.

Also probably tangential, here’s another notebook - MS Office interop demo I spotted via the Twitterz: Convert Jupyter notebook to Excel spreadsheet https://github.com/ideonate/nb2xls

This is super exciting!

I keep telling people that we need a way for people to use Jupyter from their Word document: a document with words for humans first that just happens to have a few executable cells in it, versus the current model of a bunch of code with maybe some markdown in it :slight_smile:

Many moons ago I made an attempt to add a WYSIWYG markdown cell to nteract https://github.com/nteract/nteract/pull/3699 if you want to copy any (or all) ideas.

I’m super excited seeing what comes out of this. Having executable cells in a Word document would be huge (like huge huge HUGE) for Jupyter.

Thanks for weighing in with support and resources, especially since draft.js is one of the many ways we are exploring the WYSIWYG editor part of this project. It’s helpful to see what you’ve already explored and where some of the obstacles were.

Right now, having executable cells in a Word document is looking like a reach goal, but it is definitely something we want to keep in mind as we are moving forward. Thanks for your enthusiasm!

Agree, very excited to hear about this! I am involved with another project (StatTag - http://stattag.org | https://github.com/stattag) where we’ve been working on integration with Word for scientific manuscripts. Recently we started looking at how to integrate MATLAB via Jupyter kernels, not to replicate the notebook/cell effect, but to link in results (values, tables, figures).

We’ve got a start on this using C# as the development language for Windows, and will be porting to Obj-C or Swift on macOS. I don’t want to take away from your internship experience, but would love to understand if there’s a chance to collaborate on this effort.

Thank you for the feedback, the resources you listed, and the heads-up about possible conversion issues. We were also curious to see if parsing the docx file via Pandoc is a viable option. It seems like Pandoc converts a .docx file into an .ipynb with a single markdown cell, which may need to be split up further. We’ll definitely check out the conversion resources and apply what we can to our project.
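One way the single-cell result could be expanded is sketched below in Python, using only the standard library and plain dictionaries shaped like nbformat 4 cells. It assumes fenced blocks in the Markdown should become code cells and everything else stays Markdown; `split_markdown_cell` and the other helpers are hypothetical names, not part of Pandoc or Jupyter.

```python
import json

FENCE = "`" * 3  # built programmatically so this example can sit inside a fenced block


def make_cell(cell_type: str, lines: list) -> dict:
    """Build a minimal nbformat-4-style cell dictionary."""
    source = "\n".join(lines).strip("\n")  # drop blank lines at either end
    cell = {"cell_type": cell_type, "metadata": {}, "source": source}
    if cell_type == "code":
        cell.update({"execution_count": None, "outputs": []})
    return cell


def split_markdown_cell(source: str) -> list:
    """Split one Markdown string into Markdown and code cells at fenced blocks."""
    cells, buffer, in_code = [], [], False
    for line in source.splitlines():
        if line.strip().startswith(FENCE):
            # A fence line ends the current run and toggles between prose and code.
            if any(item.strip() for item in buffer):
                cells.append(make_cell("code" if in_code else "markdown", buffer))
            buffer, in_code = [], not in_code
        else:
            buffer.append(line)
    if any(item.strip() for item in buffer):
        cells.append(make_cell("code" if in_code else "markdown", buffer))
    return cells


def to_notebook(cells: list) -> dict:
    """Wrap cells in a minimal notebook structure (format 4.4, so no cell ids needed)."""
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


# Made-up content standing in for the single Markdown cell.
single = "\n".join([
    "# Title", "", "Some prose.", "",
    FENCE + "python", "print(1 + 1)", FENCE, "",
    "Closing remark.",
])
notebook = to_notebook(split_markdown_cell(single))
print(json.dumps(notebook, indent=1))
```

This treats every fenced block as executable, which ties back to the earlier point in the thread that not everything styled as code should end up in a code cell.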

Thank you again for your help.

/is too new to post links, all @ links are github projects

Good stuff!

Despite the ubiquity of the MS format, seeing good Jupyter developer effort being thrown at proprietary APIs which can shift out from under our feet (see Google Realtime, etc.) makes me :crying_cat_face:. See the recent “the books will stop working” Microsoft DRM thing. Therefore, I’ll dutifully lodge my argument for supporting .odt (whether .docx is supported or not). .odt readers can run on extremely low-power/cost devices (a $5 Raspberry Pi Zero); the same cannot be said for MSO. Having an open source backbone, such as LibreOffice, would allow the feature to actually be tested in the system-of-interest, as I don’t think there’s any free CI that comes with Word.

On that note, @rossant/ipymd already did a good-enough-to-publish-a-book roundtripping from odt, but the project has languished some, and @podoc/podoc (by the same author) never picked up much steam. Certainly worth a look!

As to the frontend implementation: I also did a rough proof-of-concept on @deathbeds/jupyterlab-outsource.

It needs updating to lab 1.0 (and the binder may be broken as a result).

This came from a @dsblank comment at last JupyterCon that the biggest impediments to a starting-from-scratch student using Jupyter are:

  • WYSIWYG text editing (so that they can write homework)
  • visual discovery of available programming constructs

Ignoring the latter (addressed in outsource by @google/blockly, and a whole other story), the first-pass WYSIWYG approach used @ProseMirror/prosemirror to replace (or rather, link with) the model of the current markdown (actually any) cell. While the bar for working with ProseMirror is a little higher than other WYSIWYGs, it’s an extremely rich API.

I got gummed up on trying to embed a CodeMirror inside a ProseMirror, but many revs have passed since then, and it’s definitely worth another look. Initially, that would just be a prose-about-code block, e.g. a ``` fence. Still, I think it’s reasonable to assume that a markdown-forward UX, one that allowed you to just start writing text, choose between literal code and to-be-interpreted code, and have the outputs appear directly in the text without even worrying about “cells”, is just a data model away.

The other Big Deal is $M_{ath}$. If/after I got code working, I’d probably take a serious look at @mathquill/mathquill.

Finally, an archival format like PDF/A-2 would be a highly desirable output of spending the effort to make your italics and tables just right. Among other things, PDF/A-2 allows one to store a whole file tree inside the artifact, which means you could stuff your source notebook (and supporting files like sample data) inside the same artifact, sign it, and every PDF viewer will be able to read it.

PDF/A-2 appears to be landing soon in libreoffice:

https://bugs.documentfoundation.org/show_bug.cgi?id=62728.

No doubt one could do this from .docx, but it’s very unlikely most (Linux) servers are going to have a WINE office around. For truly rich things, it may still be necessary to have a (headless) browser in the loop to generate fully-rendered outputs, but QtWebEngine is making this increasingly plausible (see @deathbeds/nbconvert-pdfqt).

Good luck! Happy to discuss further if there’s going to be any public process around this!

That is a pretty exciting idea and I didn’t know you can store arbitrary files in your PDF/A-2!! Once you sign the whole thing that is like a perfect (close as) little bundle to store for your compliance/reporting/auditing needs!

Another use case I’d like to point out: the folks behind NASA’s @Open-MBEE/ve expressed some interest (including prototypes) in using Jupyter-kernel-created plots inside their View Editor (not so much into branding), which is basically like multi-user Word powered by a (ridiculously) deep underlying model: e.g. The anticipated transit time from [orbit x] to [orbit y] is [z hours]. These numbers would continuously update as a space mission is planned by expert users of engineering-grade tools.

One of their desires is that, if you want to play out a what-if, you could actually click on a plot and get dropped into an interactive environment, and be able to comment on stuff.

The current editor is built on Angular, but telling a compelling story on top of Lab seems like an important play to be able to make.

It’s certainly worth dropping them a line: they’re just down the road in Pasadena :wink:

Thanks for your thoughtful reply! We really appreciate you compiling a strong list of resources for us to work with.

In response to your concerns about the proprietary nature of .docx, I personally agree with you. Our team has been asked specifically to support .docx conversion so we will still be putting time there, but we will add .odt support to our proposal since there is a good argument for it. We’ve already been looking at @rossant/ipymd, so we’ll take a look at their .odt roundtripping too. PDF/A-2 looks like it might be more of a reach with our current plans but we will keep it in mind.

We also have just started experimenting with ProseMirror and really appreciate you sharing your past work with it in notebooks. It looks like it will really help us jumpstart this part of our project. Do you know if your extension only runs with certain versions of JupyterLab or if there are extra steps needed to run it?

We do plan on updating and keeping our progress public, so we’ll be sure to add more to this thread and let you know if we have any more discussion about the points you’ve brought up.

On a related note, would you be interested in participating in usability tests later on in this project since you seem to have thought a lot about this?

Thanks again!

Pragmatism is indeed the order of the day!

Just suggesting it as a thing to not engineer away from. Having spent some of my life that I can never get back converting even worse formats (e.g. rtf) to DOM, I’d say that if ipynb -> docx/odf -> soffice --convert-to pdf -> PDF.js to preview can be stomached, it’s going to Just Work, and look better than anything that can be built in a summer. The file embedding and tagging provided by the PDF/A-2 ISO standard are just gravy, but would be increasingly-necessary boxes to tick for getting notebooks into highly-regulated environments.
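The chain above might be scripted roughly as follows. This is a sketch under assumptions: it presumes a `pandoc` build with .ipynb support and LibreOffice’s `soffice` are both on the PATH, the file names are placeholders, and the PDF.js preview step is left out. The helper names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(notebook: Path, docx: Path) -> list:
    """Command that asks pandoc to convert a notebook to .docx."""
    return ["pandoc", str(notebook), "-o", str(docx)]


def soffice_command(document: Path, outdir: Path) -> list:
    """Command that asks headless LibreOffice to export a document as PDF."""
    return [
        "soffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", str(outdir),
        str(document),
    ]


def notebook_to_pdf(notebook: Path, outdir: Path) -> Path:
    """Run both steps; raises if a tool is missing or a step fails."""
    for tool in ("pandoc", "soffice"):
        if shutil.which(tool) is None:
            raise FileNotFoundError(f"{tool} was not found on PATH")
    outdir.mkdir(parents=True, exist_ok=True)
    docx = outdir / (notebook.stem + ".docx")
    subprocess.run(pandoc_command(notebook, docx), check=True)
    subprocess.run(soffice_command(docx, outdir), check=True)
    # soffice names its output after the input file, with a .pdf suffix.
    return docx.with_suffix(".pdf")
```

Calling `notebook_to_pdf(Path("analysis.ipynb"), Path("build"))` would, if both tools are installed, leave `build/analysis.docx` and `build/analysis.pdf` behind; the intermediate .docx is kept on purpose so it can be inspected or edited in a word processor.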

Yeah, it was written for 0.35, along with the rest of the 20+ @deathbeds extensions, all of which need to be revisited for 1.0. The binder fell over due to a malicious node package (surprise), so even master won’t work. But I can prioritize getting outsource back up first!

For sure: keep it working on binder, and I will drive people to your door!

We’re back in the demo business on jupyterlab-outsource binder

Despite “working”, this one hasn’t really cooked enough to be released, was mostly written in the hallway at JupyterCon, has no tests, and to my knowledge, has not been used by anyone for serious work. Still, I hope it’s helpful!

If there’s anything else I can help with, let me know!

@isabela-pf and @dereklam
In re-reading the goals of your project, I’m curious what are your use cases/motivations for supporting Word? In our project we work with researchers who are comfortable with Word for generating manuscripts, like to use track changes when working with biostatisticians, and publish in journals that prefer or require submissions in Word. To this end, we need to support a round-trip editing workflow in Word for a manuscript, which is why we went the route we did. I’m very interested to learn more about your intended uses.

Thank you for getting jupyterlab-outsource up so quickly! This helped greatly in setting up ProseMirror inside the JupyterLab environment and allowed us to move forward on the rich text editor part of the project. We appreciate your help and will let you know if anything else comes up.
