Microsoft Word Integration (Intern Project)

dereklam · July 3, 2019, 12:34am

Hello everyone! We are Derek and Isabela, one of the intern teams at Jupyter Cal Poly and we will be working on Microsoft Word integration this summer. Right now we have outlined two key goals that our project will encompass:

Convert between .ipynb and .docx formats, preserving the interactivity of code cells.
Incorporate familiar elements of word processors, including WYSIWYG menus in markdown cells and previewing .docx files in JupyterLab.

We are still in the research phase and would love to hear your thoughts on the project.

isabela-pf · July 3, 2019, 12:43am

Hi, I’m the other intern on this project. I look forward to working with the community and solving all our Word Integration problems!

psychemedia · July 3, 2019, 2:26pm

Very loosely related to this, the organisation where I work uses an XML document format for mastering educational content. (The XML was originally created using a modified version of Word and is now authored using an XML editor, though perhaps with conversions from .docx docs?)

I’ve just started exploring various ways of converting the XML to a notebook format; at the moment, I’m taking a simplistic route of converting the XML doc to a markdown format that Jupytext can convert to ipynb, but I should really look at parsing it into pandoc somehow.
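As a rough illustration of the "XML to Markdown first" route described above, here is a minimal sketch using only the Python standard library. The element names (`Section`, `Title`, `Paragraph`, `ProgramListing`) are invented for the example and do not reflect the actual in-house schema, which is not shown in this thread.

```python
import xml.etree.ElementTree as ET

FENCE = "`" * 3  # Markdown code-fence marker

# Hypothetical input: these element names are assumptions made for this
# sketch, not the real schema discussed in the post.
SAMPLE_XML = """
<Section>
  <Title>Getting started</Title>
  <Paragraph>Print a greeting.</Paragraph>
  <ProgramListing>print("hello")</ProgramListing>
</Section>
"""


def xml_to_markdown(xml_text: str) -> str:
    """Convert the toy XML above into Markdown with fenced code blocks."""
    root = ET.fromstring(xml_text)
    parts = []
    for element in root:
        text = (element.text or "").strip()
        if element.tag == "Title":
            parts.append(f"# {text}")
        elif element.tag == "ProgramListing":
            # Code listings become fenced blocks so a later tool can
            # turn them into code cells.
            parts.append(f"{FENCE}python\n{text}\n{FENCE}")
        else:
            parts.append(text)
    return "\n\n".join(parts) + "\n"


markdown_text = xml_to_markdown(SAMPLE_XML)
print(markdown_text)
```

The resulting Markdown could then be saved to a file and handed to Jupytext (something like `jupytext --to notebook doc.md`) to produce the .ipynb. Inconsistent code styles in the source XML would need extra rules that this sketch does not attempt.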

One of the issues I’ve found is that the schema for our XML docs supports various flavours of code style, and they aren’t used consistently. There is also an issue that things are represented as code elements in Word docs that don’t necessarily work as code when rendered in to a notebook code cell.

I need half a day to get my code decoupled from some internal scrapers but hope to get round to that in a week or two.

Also probably tangential, here’s another notebook - MS Office interop demo I spotted via the Twitterz: Convert Jupyter notebook to Excel spreadsheet https://github.com/ideonate/nb2xls

betatim · July 4, 2019, 8:45am

This is super exciting!

I keep telling people that we need a way for people to use Jupyter from their Word document: a document with words for humans first that just happens to have a few executable cells in it, vs the current model of a bunch of code with maybe some markdown in it.

Many moons ago I made an attempt to add a WYSIWYG markdown cell to nteract https://github.com/nteract/nteract/pull/3699 if you want to copy any (or all) ideas.

I’m super excited seeing what comes out of this. Having executable cells in a Word document would be huge (like huge huge HUGE) for Jupyter.

isabela-pf · July 8, 2019, 4:47pm

Thanks for weighing in with support and resources, especially since draft.js is one of the many ways we are exploring the WYSIWYG editor part of this project. It’s helpful to see what you’ve already explored and where some of the obstacles were.

Right now, having executable cells in a Word document is looking like a reach goal, but it is definitely something we want to keep in mind as we are moving forward. Thanks for your enthusiasm!

lrasmus · July 8, 2019, 5:05pm

Agree, very excited to hear about this! I am involved with another project (StatTag - http://stattag.org | https://github.com/stattag) where we’ve been working on integration with Word for scientific manuscripts. Recently we started looking at how to integrate MATLAB via Jupyter kernels, not to replicate the notebook/cell effect, but to link in results (values, tables, figures).

We’ve got a start on this using C# as the development language for Windows, and will be porting to Obj-C or Swift on macOS. I don’t want to take away from your internship experience, but would love to understand if there’s a chance to collaborate on this effort.

dereklam · July 8, 2019, 5:18pm

Thank you for the feedback, the several resources listed, and possible issues in conversion. We were also curious to see if parsing the docx file via Pandoc is a viable option. It seems like Pandoc converts .docx to .ipynb into a single markdown cell, which may need to be expanded further. We’ll definitely check out the conversion resources and apply what we can to our project.
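One way the single markdown cell might be "expanded further" is to split it at fenced code blocks. The following is a minimal sketch under that assumption, using only the standard library; the function names are made up for illustration, and it assumes the converted Markdown actually contains fences, which may not hold for every .docx.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def new_cell(cell_type: str, source: str) -> dict:
    """Build a minimal nbformat-4 style cell dictionary."""
    cell = {"cell_type": cell_type, "metadata": {}, "source": source}
    if cell_type == "code":
        cell["execution_count"] = None
        cell["outputs"] = []
    return cell


def split_markdown_cell(source: str) -> list:
    """Split one Markdown source into alternating markdown and code cells."""
    cells, buffer, in_code = [], [], False

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            cells.append(new_cell("code" if in_code else "markdown", text))

    for line in source.splitlines():
        if line.startswith(FENCE):
            flush()  # a fence line ends the current chunk
            buffer, in_code = [], not in_code
        else:
            buffer.append(line)
    flush()
    return cells


# Hypothetical content of the single cell produced by a conversion.
single_cell_source = (
    f"Intro text.\n\n{FENCE}python\nx = 1 + 1\n{FENCE}\n\nClosing remarks."
)
cells = split_markdown_cell(single_cell_source)
notebook = {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}
print(json.dumps(notebook, indent=1))
```

Text styled as code in Word is not always runnable, so a real converter would probably need a way to keep some fenced blocks as markdown rather than promoting all of them to code cells.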

Thank you again for your help.

bollwyvl · July 9, 2019, 3:07pm

/is too new to post links, all @ links are github projects

Good stuff!

Despite the ubiquity of the MS format, seeing good Jupyter developer effort being thrown at proprietary APIs which can shift out from under our feet (see Google Realtime, etc.) makes me wary. See the recent “the books will stop working” Microsoft DRM thing. Therefore, I’ll dutifully lodge my argument for supporting .odt (whether .docx is supported or not). .odt readers can run on extremely low-power/cost devices (a $5 Raspberry Pi Zero); the same cannot be said for MSO. Having an open source backbone, such as LibreOffice, would allow the feature to actually be tested in the system-of-interest, as I don’t think there’s any free CI that comes with Word.

On that note, @rossant/ipymd already did a good-enough-to-publish-a-book roundtripping from odt, but the project has languished some, and @podoc/podoc (by the same author) never picked up much steam. Certainly worth a look!

As to the frontend implementation: i also did a rough proof-of-concept on @deathbeds/jupyterlab-outsource.

It needs updating to lab 1.0 (and the binder may be broken as a result).

This came from a @dsblank comment at last JupyterCon that the biggest impediments to a starting-from-scratch student using Jupyter are:

WYSIWYG text editing (so that can write homework)
visual discovery of available programming constructs

Ignoring the latter (addressed in outsource by @google/blockly, and a whole other story), the first-pass WYSIWYG approach used @ProseMirror/prosemirror to replace link with the model of the current ~~markdown~~ (actually any) cell. While the bar for working with prosemirror is a little higher than other WYSIWYGs, it’s an extremely rich API.

I got gummed up on trying to embed a CodeMirror inside a ProseMirror, but many revs have passed since then, and it’s definitely worth another look. Initially, that would just be a prose-about-code block, e.g. a triple-backtick fence. I think it’s reasonable to assume that a markdown-forward UX, one that allowed you to just start writing text, choose between literal code and to-be-interpreted code, and have the outputs appear directly in the text without even worrying about “cells”, is just a data model away.

The other Big Deal is $M_{ath}$. If/after I got code working, I’d probably take a serious look at @mathquill/mathquill.

Finally, an archival format like PDF/A-2 would be a highly desirable output of spending the effort to make your italics and tables just right. Among other things, PDF/A-2 allows one to store a whole file tree inside the artifact, which means you could stuff your source notebook (and supporting files like sample data) inside the same artifact, sign it, and every PDF viewer will be able to read it.

PDF/A-2 appears to be landing soon in libreoffice:

https://bugs.documentfoundation.org/show_bug.cgi?id=62728.

No doubt one could do this from .docx, but it’s very unlikely most (Linux) servers are going to have a WINE office around. For truly rich things, it may still be necessary to have a (headless) browser in the loop to generate fully-rendered outputs, but QTWebEngine is making this increasingly plausible (see @deathbeds/nbconvert-pdfqt).

Good luck! Happy to discuss further if there’s going to be any public process around this!

betatim · July 9, 2019, 4:24pm

That is a pretty exciting idea and I didn’t know you can store arbitrary files in your PDF/A-2!! Once you sign the whole thing that is like a perfect (close as) little bundle to store for your compliance/reporting/auditing needs!

bollwyvl · July 9, 2019, 5:41pm

Another use case I’d like to point out: the folks behind NASA’s @Open-MBEE/ve expressed some interest (including prototypes) in using Jupyter-kernel-created plots inside their View Editor (not so much into branding), which is basically like multi-user Word powered by a (ridiculously) deep underlying model: e.g. The anticipated transit time from [orbit x] to [orbit y] is [z hours]. These numbers would continuously update as a space mission is planned by expert users of engineering-grade tools.

One of their desires is if you want to play a what-if, you could actually click on a plot and get dropped into an interactive environment, and be able to comment on stuff.

The current editor is built on angular, but telling a compelling story on top of Lab seems like an important play to be able to make.

It’s certainly worth dropping them a line: they’re just down the road in Pasadena.

isabela-pf · July 9, 2019, 6:09pm

Thanks for your thoughtful reply! We really appreciate you compiling a strong list of resources for us to work with.

In response to your concerns about the proprietary nature of .docx, I personally agree with you. Our team has been asked specifically to support .docx conversion so we will still be putting time there, but we will add .odt support to our proposal since there is a good argument for it. We’ve already been looking at @rossant/ipymd, so we’ll take a look at their .odt roundtripping too. PDF/A-2 looks like it might be more of a reach with our current plans but we will keep it in mind.

We also have just started experimenting with ProseMirror and really appreciate you sharing your past work with it in notebooks. It looks like it will really help us jumpstart this part of our project. Do you know if your extension only runs with certain versions of JupyterLab or if there are extra steps needed to run it?

We do plan on updating and keeping our progress public, so we’ll be sure to add more to this thread and let you know if we have any more discussion about the points you’ve brought up.

On a related note, would you be interested in participating in usability tests later on in this project since you seem to have thought a lot about this?

Thanks again!

bollwyvl · July 9, 2019, 7:44pm

Pragmatism is indeed the order of the day!

Just suggesting it as a thing not to engineer away from. Having spent some of my life that I can never get back converting from even worse formats (e.g. rtf) to DOM, I’d say that if ipynb → docx/odf → soffice --convert-to pdf → PDF.js to preview can be stomached, it’s going to Just Work, and look better than anything that can be built in a summer. The file embedding and tagging provided by the PDF/A-2 ISO standard are just gravy, but would be increasingly necessary boxes to tick for getting notebooks into highly regulated environments.
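The server-side half of that pipeline could be sketched roughly as below. This only builds and prints the two command lines rather than running them; it assumes `pandoc` (a version with ipynb support) and LibreOffice's `soffice` are on the PATH, and the file and directory names are placeholders. The PDF.js preview step happens in the browser and is not covered.

```python
import shlex
from pathlib import Path


def conversion_commands(notebook: Path, out_dir: Path) -> list:
    """Return the commands for ipynb -> docx -> pdf, without running them."""
    docx = out_dir / (notebook.stem + ".docx")
    return [
        # Step 1: pandoc infers input and output formats from the extensions.
        ["pandoc", str(notebook), "-o", str(docx)],
        # Step 2: headless LibreOffice renders the .docx to PDF in out_dir.
        ["soffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(out_dir), str(docx)],
    ]


# Placeholder paths, chosen only for this example.
commands = conversion_commands(Path("analysis.ipynb"), Path("build"))
for command in commands:
    print(shlex.join(command))
```

To execute the steps for real, each list could be passed to `subprocess.run(command, check=True)` once the output directory exists.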

Yeah, it was written for 0.35, along with the rest of the 20+ @deathbeds extensions, all of which need to be revisited for 1.0. The binder fell over due to a malicious node package (surprise), so even master won’t work. But I can prioritize getting outsource back up first!

For sure: keep it working on binder, and I will drive people to your door!

bollwyvl · July 12, 2019, 12:01am

We’re back in the demo business on jupyterlab-outsource

Despite “working”, this one hasn’t really cooked enough to be released, was mostly written in the hallway at JupyterCon, has no tests, and to my knowledge, has not been used by anyone for serious work. Still, I hope it’s helpful!

If there’s anything else I can help with, let me know!

lrasmus · July 16, 2019, 1:07pm

@isabela-pf and @dereklam
In re-reading the goals of your project, I’m curious what are your use cases/motivations for supporting Word? In our project we work with researchers who are comfortable with Word for generating manuscripts, like to use track changes when working with biostatisticians, and publish in journals that prefer or require submissions in Word. To this end, we need to support a round-trip editing workflow in Word for a manuscript, which is why we went the route we did. I’m very interested to learn more about your intended uses.

dereklam · July 17, 2019, 8:16pm

Thank you for getting jupyterlab-outsource up so quickly! This helped greatly in setting up ProseMirror inside of the JupyterLab environment and allowed us to move forward in the project in regards to the rich text editor. We appreciate your help and will let you know if anything else comes up.

isabela-pf · July 19, 2019, 12:11am

To be honest, many of our motivations for prioritizing .docx conversion align with the use cases you outlined. There seems to be a good number of users who have workflows where they find benefits working in Word because of personal preference, because they collaborate with people not comfortable working in notebooks, or because of a need to take advantage of features in Word that notebooks lack (like Track Changes, as you mentioned). Converting files to meet submission requirements is also a key use case.

Looking at StatTag also made me want to clarify our own project. Word integration probably isn’t the best description in that we are really more trying to support workflows that interact with both word processors (especially Microsoft Word) and JupyterLab than actually integrate one into the other. It’s an important distinction, and I apologize if it was unclear. That being said, I think that there is still a large amount of overlap in the workflows we are addressing and there may be a degree of overlap in our technical approach as well. Please let me know if you have any questions; I’d love to keep this discussion open.

If you are willing to share, I’d love to hear more about your project’s current workflow. Hearing about how you are currently managing conversion and if there are any other features that you go to Word to specifically use would be helpful info. I saw your link to StatTag on your previous comment, so is that how you are currently working in Word or is it a combination of processes?

lrasmus · July 22, 2019, 1:44pm

Thanks so much for sharing more about your project! Sorry if I jumped to some conclusions - I admittedly have a mental bias when I see “Word integration”.

For StatTag, the workflow we want to support is primarily around statistical analysts collaborating with other researchers. The analyst typically does their work in one or more statistical programs of choice (R, SAS, Stata), and then is ready to collaborate with a researcher on the manuscript. The analyst uses StatTag to insert values, tables, and figures into the manuscript draft (we use Word fields as our special placeholders). The nice thing about using fields is that any one else - even someone without StatTag - can edit the document, track changes, etc. Anyone with StatTag + the data + the code can then update the results with a few clicks.

We’re now looking at using Jupyter kernels as a much smarter and more effective way to connect to other programs and languages that have kernels built (Python, MATLAB, etc.). We can invoke a kernel; now we’re working on the messaging piece to send code and get results.
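For readers unfamiliar with the "messaging piece", the sketch below builds the dictionary form of an `execute_request` as laid out in the Jupyter messaging protocol. It is not StatTag's code: the `username` default and the sample MATLAB-style snippet are placeholders, and the signing and ZeroMQ transport that a real client needs are left out.

```python
import uuid
from datetime import datetime, timezone


def execute_request(code: str, session: str, username: str = "stattag") -> dict:
    """Build an unsigned execute_request message as a plain dictionary.

    The username default is a placeholder chosen for this example.
    """
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "session": session,
            "username": username,
            "date": datetime.now(timezone.utc).isoformat(),
            "msg_type": "execute_request",
            "version": "5.3",
        },
        "parent_header": {},
        "metadata": {},
        "content": {
            "code": code,
            "silent": False,
            "store_history": True,
            "user_expressions": {},
            "allow_stdin": False,
            "stop_on_error": True,
        },
        "buffers": [],
    }


session_id = uuid.uuid4().hex
message = execute_request("mean([1 2 3])", session=session_id)
print(message["header"]["msg_type"], message["content"]["code"])
```

The kernel answers on the shell channel with an `execute_reply`, while values, tables and figures arrive on the IOPub channel as `execute_result`, `display_data` or `stream` messages whose `parent_header` matches the request. In Python, the `jupyter_client` package handles signing and transport; clients in other languages have to implement that part themselves.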

Apologies for getting a little long-winded on our project - don’t want to detract from the awesome stuff you’re doing! But yes, would love to see what pieces of this may align with yours. Likewise, happy to set up a conference call if you would like to talk more about this.

isabela-pf · August 13, 2019, 8:27pm

Thanks to everyone for your support and interest. @dereklam and I want to provide some updates on this project.

We’ve been focusing on the WYSIWYG rich text editor portion of our initial proposal by making an extension that transforms Markdown cells into rich text-editing cells powered by ProseMirror. This means that text is shown in its rendered form without running the cell, text formatting can be done via buttons, and knowing Markdown syntax is not required for text formatting. Ideally, this results in an improved writing experience in JupyterLab that makes it more pleasant for current users and more accessible for potential users who are familiar with word processors.
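To make the idea of a rich-text cell that still stores Markdown more concrete, here is a toy sketch of serializing marked-up text runs back to Markdown. The data model (a list of text and mark-set pairs) and the function names are assumptions made for illustration; this is not how Jupyter Scribe or ProseMirror are implemented, and ProseMirror's own JavaScript tooling handles far more cases.

```python
# Markdown delimiter for each supported mark (a deliberately tiny set).
MARK_SYNTAX = {"code": "`", "em": "*", "strong": "**"}

# Fixed wrapping order so that combined marks serialize consistently.
MARK_ORDER = ("code", "em", "strong")


def serialize_run(text: str, marks: set) -> str:
    """Wrap one run of text in the delimiters for each of its marks."""
    for mark in MARK_ORDER:
        if mark in marks:
            delimiter = MARK_SYNTAX[mark]
            text = f"{delimiter}{text}{delimiter}"
    return text


def serialize_paragraph(runs: list) -> str:
    """Join the serialized runs of one paragraph into a Markdown string."""
    return "".join(serialize_run(text, marks) for text, marks in runs)


# What the editor might hold after the user clicks "bold" and "code" buttons.
paragraph = [
    ("Scribe shows ", set()),
    ("rendered", {"strong"}),
    (" text and ", set()),
    ("inline code", {"code"}),
    (".", set()),
]
markdown = serialize_paragraph(paragraph)
print(markdown)
```

In this framing a toolbar button only toggles a mark on the selected run, and the Markdown syntax is produced at save time, which is why the author never has to type it.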

example

Our snazzy new name for this project is Jupyter Scribe. It describes how the extension focuses on the experience of writing text in JupyterLab. Scribe can be a noun or a verb, allowing it to represent both the activity of writing and the role of the user as author. It also (perhaps most importantly) moves us away from describing the project as Word integration, as that was a misnomer that caused confusion from the start.

Even though the writing experience Jupyter Scribe provides for JupyterLab is new, the UI is meant to blend seamlessly with existing JupyterLab components. Here are some ideas on what the design should look like in its full implementation.

Please visit our repo and check out the binder to see what progress we’ve made. We’d love to hear your feedback!

choldgraf · August 14, 2019, 12:06am

Really cool! Thanks for the demo, I think it’s neat!

A few questions:

Is it possible to expose other UI elements for common markdown things, such as links?
Is it possible to see the underlying markdown that is being generated? Or does it only show the rendered markdown?
In the future, we might want to make more arbitrary “button -> markdown syntax” mappings. E.g. if somebody implemented citation functionality. Is it possible to define these mappings within ProseMirror?

betatim · August 14, 2019, 5:17am

I love it! Sometimes the space character gets lost. You can’t tell from the recorded GIF, but I did press the spacebar at the right places between words. The space briefly appears but then gets “eaten” somehow. If you keep typing, then sometimes it starts working flawlessly.
