Hey there. I am looking for a tool that lets me string together the outputs of a series of notebooks in a UI. Does this exist?
E.g. a notebook for importing and splitting up the dataset, then a notebook for training a model, then a notebook for working with results, then a notebook for visualization.
Problem to solve: my notebook is now… thousands of lines of code. I want isolation/encapsulation from a best-practice coding perspective. Maybe I could use different Python versions or different conda envs, or even switch over to R or Scala in each notebook. If you were going all out, you would be able to specify deeper things like the Java version being used.
Something like Rabix, but for notebooks instead of containers.
Netflix has done a huge amount of work on pipelines of notebooks: https://netflixtechblog.com/notebook-innovation-591ee3221233. Also check out Papermill and Paperboy.
IBM just released some open source extensions dealing with pipelines of notebooks as well: https://github.com/elyra-ai/elyra
Netflix’s Scrapbook now has the related functionality for handling the outputs; see here for some example use/discussion.
There is also https://github.com/krassowski/nbpipeline.
I’ve not managed to play with this yet, but as @jasongrout mentioned, you might be able to co-opt https://github.com/elyra-ai/elyra to do some of what you want. It lets you use a GUI to string notebooks together in a pipeline and then execute them using Kubeflow Pipelines (though the docs say “Currently, the only supported pipeline runtime is…”, which suggests it may have been architected to allow other pipeline runtimes to slot in?)
[Cheekily asks:] If you try it and it either works for your use case, or doesn’t, could you provide a quick review as a datapoint around whether it works for your sort of use case? #lazyweb
Airflow has docs on how to use Papermill, and other systems have direct or indirect integrations with it nowadays. Dagster is one example that extends Papermill.
Scrapbook is used internally at a few companies now to save notebook outcomes. Some extend it to save into their metadata stores in parallel to the notebook document. There are a few pending features for Scrapbook that I haven’t been able to get to; would love more devs helping on that project.
Hmm. So I guess what I would really want is a UI for Airflow inside JupyterLab. Not that navigating to Airflow’s localhost would really be a problem… though in a JupyterHub scenario, users may not be able to freely navigate to other portals.
I would recommend navigating to the scheduler once you push a scheduled version. The complexities of doing a good UI for workflow management are large, and if users can’t interface with the scheduler service you’ll find the usability of running anything there dramatically reduced: you have to reimplement error handling, monitoring, alerting, etc. As someone who does this work on a daily basis, it’s best not to try to duplicate those types of efforts when possible, as you can sink years of time into making them work well.
Ploomber (disclaimer: I’m the author) is a good option for this. It allows you to create notebook-based pipelines (scripts and functions can be used as well). It requires minimal “pipeline code”: just list your notebooks in a YAML file; for full flexibility, a Python API is available. Exporting pipelines to Airflow and Kubernetes (via Argo Workflows) is supported.
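To give a feel for the YAML approach, here's a minimal sketch of what a `pipeline.yaml` for the four-notebook example above might look like (file names and the layout are made up; each task declares its source notebook and the executed copy it produces, and `upstream` references in the notebooks wire the dependencies):

```yaml
tasks:
  - source: notebooks/split_dataset.ipynb
    product: output/split_dataset.ipynb

  - source: notebooks/train_model.ipynb
    product: output/train_model.ipynb

  - source: notebooks/analyze_results.ipynb
    product: output/analyze_results.ipynb

  - source: notebooks/visualize.ipynb
    product: output/visualize.ipynb
```

Running the pipeline is then a single CLI call (`ploomber build`), which executes the notebooks in dependency order.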
Feel free to reach out directly if there are any questions! Twitter: edublancas