Gathering feedback for Wikifunctions proposal based on Jupyter kernels

Hi all!

Over at Wikimedia Foundation, we’ve got a project we’re working on called Wikifunctions, part of Abstract Wikipedia, aiming to create a library of user-written functions that will be callable from Wikipedia page markup. These functions are intended to pull data from our Wikidata RDF store or from user-contributed data files, process it, and format it into suitable table layouts or natural-language text fragments.

I’ve written up an early proposal on building this around Jupyter kernels, which I think would be a good way to balance our requirements for isolation between separate functions, support for multiple implementation languages, and scaling & security, without reinventing the wheel on scripting-language execution services.

There are a number of performance trade-offs here: we gain speed by making it explicit that functions can be non-deterministic in some ways (for instance, a function implemented in a language with mutable state may return different results on subsequent calls), but we can keep separate functions from polluting each other’s environments by spinning up each function in the call graph on its own kernel.
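
To make that concrete, here’s a toy sketch (an entirely hypothetical function) of how mutable kernel state produces different results on repeated calls:

```python
# Toy illustration (hypothetical function): a kernel-local counter makes
# the output depend on how many times the function has been called.
call_count = 0

def format_greeting(name):
    global call_count
    call_count += 1  # mutable state persists between calls in one kernel
    return f"Hello, {name}! (call #{call_count})"

print(format_greeting("world"))  # Hello, world! (call #1)
print(format_greeting("world"))  # Hello, world! (call #2) -- not the same
```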

I’d love to hear some feedback from folks more familiar with the low-level kernel interfaces and implementations.

This would involve spinning up a lot of kernels, each running for a short amount of time: each function used in a render session will get its own interpreter context for the duration of the session. A typical article render should last a fraction of a second, might call a few dozen separate functions, and might make a few hundred to a few thousand invocations in total (both numbers could grow as heavier use is made of the function system).
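
In terms of the existing jupyter_client API, the per-function lifecycle might look roughly like this sketch (the real service would need pooling and far faster startup than a naive version of this):

```python
# Sketch: one fresh kernel per function per render session, torn down
# when the session ends. Uses the stock jupyter_client blocking API.
from jupyter_client import KernelManager

km = KernelManager(kernel_name="python3")
km.start_kernel()
kc = km.client()
kc.start_channels()
kc.wait_for_ready(timeout=30)

# Every invocation of this function during the render goes to this
# kernel; its state is isolated from every other function's kernel.
kc.execute("result = 2 + 2")

# Tear down when the render session finishes.
kc.stop_channels()
km.shutdown_kernel(now=True)
```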

In particular, we could potentially get a big benefit from running multiple kernels in one process & thread, so they can make synchronous calls to each other without a context switch. But my impression is that the current kernels are designed for async communication over a socket and don’t implement a direct-call API, so this would be “hard”. It would also reduce safety isolation between kernels and make it harder to spin up standard Docker images as the kernel runtimes.

I think I can reduce the overhead of socket communication by co-locating all kernels in a call session on the same server, which still involves a context switch but no network latency. Serialized arguments could be transferred in a shared memory segment to reduce copy overhead.
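
As an illustration of the shared-memory idea, here’s a sketch using Python’s multiprocessing.shared_memory; the actual serialization format and handle-passing protocol are still open questions:

```python
# Sketch: passing serialized arguments through a shared memory segment
# instead of copying them over a socket. Assumes both kernels run on the
# same host.
import json
from multiprocessing import shared_memory

args = {"item": "Q42", "lang": "en"}
payload = json.dumps(args).encode("utf-8")

# Caller side: write the serialized arguments into a named segment.
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload
# Only the segment name and length travel over the kernel channel.
handle = {"shm_name": shm.name, "length": len(payload)}

# Callee side: attach to the same segment by name and decode in place.
peer = shared_memory.SharedMemory(name=handle["shm_name"])
received = json.loads(bytes(peer.buf[:handle["length"]]).decode("utf-8"))

peer.close()
shm.close()
shm.unlink()  # caller releases the segment when the call completes
```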

For debugging sessions and generating stack traces, we’ll need a “meta-kernel” that can act as an intermediary for the Jupyter debugging protocol, forwarding data requests and commands to the appropriate kernel implementing a given level of the call stack.
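
Sketching the routing logic (names here are hypothetical; debug_request/debug_reply messages travel on the control channel per the Jupyter debug protocol):

```python
# Hypothetical sketch: the meta-kernel keeps a map from stack-frame ids
# to the kernel client that owns that level of the cross-kernel stack,
# and forwards each Debug Adapter Protocol request accordingly.
FRAME_TO_KERNEL = {}  # frame id -> jupyter_client.BlockingKernelClient

def forward_debug_request(msg):
    frame_id = msg["content"]["arguments"].get("frameId")
    kc = FRAME_TO_KERNEL[frame_id]
    request = kc.session.msg("debug_request", msg["content"])
    kc.control_channel.send(request)
    return kc.get_control_msg(timeout=5)  # the owning kernel's debug_reply
```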

We haven’t decided on this yet, but it looks very promising and I’m hoping to research it further.

Thanks for any feedback you may have!


Fascinating stuff, thank you for sharing!

I don’t think there are many examples of running multiple kernels inside a single process, but I don’t see why it wouldn’t work. Of the one-kernel-running-other-kernels approaches, the best-known are probably beakerx, SoS, and allthekernels. Additionally, here’s a crusty old experiment treating kernels as widgets. I don’t see why a KernelManager couldn’t be implemented that quacked the same way as a ZMQ-based one but ran over some other IPC mechanism.
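
In fact, ipykernel already ships a single in-process kernel (it’s what qtconsole’s in-process mode uses), which might be a starting point; a minimal sketch:

```python
# ipykernel's in-process kernel: the manager quacks like the ZMQ one,
# but execution happens synchronously in the current process, no sockets.
from ipykernel.inprocess import InProcessKernelManager

km = InProcessKernelManager()
km.start_kernel()
kc = km.client()
kc.start_channels()
kc.execute("x = 40 + 2")              # runs directly in this process
print(km.kernel.shell.user_ns["x"])   # -> 42
km.shutdown_kernel()
```

Running several of these side by side would still need namespace isolation between them, which is presumably where it gets “hard”.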

There’s also the ongoing work related to collaborative editing in JupyterLab. It could well be that a shared object model would be effective for this purpose: I’m imagining a lowest-common-denominator JSON(-LD)/HTML/RDFa document which might be touched by many kernels, each enriching different parts of it as they know how, from “dumb” XML/JSON parsing, to XML/JSON (Table) Schema, all the way up to SHACL. With the page reader as a participant, this could even be a guided process… folks may well be willing to tolerate somewhat less performant “answers” if they feel like they are guiding the process. The LSIF format may be interesting as well, if thinking about wiki markup as “code” wouldn’t be too alien to the kinds of audiences you’re trying to serve.

Semi-related: one thing that’s been kicking around in my mind is the potential for WASM-based implementations of a number of Jupyter components, including kernels, a la pyodide, that could run on the server or in the browser. On the server side, these have nice encapsulation profiles beyond what one could get from containers alone, and some projects like enarx are going even further into the “trusted” computing space. But if executable environments could be shipped to browsers, suddenly there’s much less need for server gymnastics: you’d just need to serve up (and effectively cache) Big Old WASM, which would actually work for an organization of your scale. As to interacting with those shared objects: I don’t know if they took off much, but WAIF seemed like a path to more shared types.

We’ve been trying to get more WASM stuff delivered to the Jupyter-adjacent conda-forge space, and presently have emscripten and wasmer up and running, but there’s a lot more work to do… for example, getting the wasmer python bindings working will open up a lot of interesting avenues.
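
For a taste of those bindings, a sketch assuming the wasmer and wasmer_compiler_cranelift packages are installed:

```python
# Compile a tiny hand-written WebAssembly module and call its export.
from wasmer import Store, Module, Instance, wat2wasm

wat = """
(module
  (func (export "add") (param i32 i32) (result i32)
    local.get 0
    local.get 1
    i32.add))
"""

store = Store()                        # default engine + compiler
module = Module(store, wat2wasm(wat))  # accepts compiled .wasm bytes
instance = Instance(module)
print(instance.exports.add(1, 2))      # -> 3
```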

Excited to see what you turn up!