Brainstorm: If you could run a Jupyter Notebook as a pipeline, what would you want it to look like?

Suppose you have a Notebook for a model. It pulls data, transforms it, creates the model, perhaps does some hyperparameter optimization, and then publishes it somewhere. Usually, when you want to productize it, the content of the notebook is moved to pure Python to be run within pipelines.

Now, suppose you didn’t need to convert or migrate, but could have the notebook itself become the pipeline, potentially with multiple steps, dependencies, and parameters. What would you want the notebook to look like?

For example:

  1. How would you like the DAG to be defined? I personally really like the syntax Airflow uses: t1, t2, and t3 are tasks that are Python objects, and the >> operator defines the dependencies (a fuller sketch follows this list):

    t1 >> [t2, t3]
    
  2. How would you like each step of the DAG to be defined?

    • Each cell being a step?
    • Using Markdown cells to split sections?
    • Using tags to add names?
    • Using function decorators like Kubeflow does? (see the sketch after this list)
    • What about the parameters?
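
For option 1, here is a minimal sketch of what the Airflow-style definition looks like in context. It assumes Airflow 2.x and uses EmptyOperator purely as a stand-in for real tasks:

    # Hedged sketch, assuming Airflow 2.x: the >> operator on task objects
    # records dependencies; t1 must finish before t2 and t3 start.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator

    with DAG(dag_id="example", start_date=datetime(2024, 1, 1)):
        t1 = EmptyOperator(task_id="t1")
        t2 = EmptyOperator(task_id="t2")
        t3 = EmptyOperator(task_id="t3")
        t1 >> [t2, t3]  # fan out: t2 and t3 run in parallel after t1

And for the decorator option, a hypothetical sketch; the @step decorator and STEPS registry here are invented for illustration, not Kubeflow’s actual API:

    # Hypothetical: decorate functions to register them as named steps,
    # attaching parameters at registration time.
    STEPS = {}

    def step(name=None, **params):
        def wrap(fn):
            STEPS[name or fn.__name__] = {"fn": fn, "params": params}
            return fn
        return wrap

    @step(name="pull_data", source="s3://bucket/raw")  # 'source' is a step parameter
    def pull_data(source):
        print(f"pulling from {source}")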


Interesting question! Following 🙂

Using a new “job” magic (or a -j option on any existing magic) for jobs/steps, and a “dag” magic to define the flow explicitly.

Cell 1:

    %%job jobname1    (or %%time -j jobname1)

Cell 2:

    %%job jobname1    (or %%time -j jobname1)

Cell 3:

    %%job jobname2    (or %%presto -j jobname2)

Cell 4:

    %%job jobname3    (or %%spark -j jobname3)

Cell 5:

    %%dag
    jobname1 >> jobname3
    jobname2 >> jobname3
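
A minimal sketch of how such magics could be registered with IPython follows; the JOBS registry, the deferred execution model, and the edge parsing are assumptions for illustration, not an existing extension:

    # Sketch of a custom IPython extension: %%job collects cell bodies under a
    # job name instead of running them; %%dag parses '>>' lines into edges.
    from IPython.core.magic import register_cell_magic

    JOBS = {}  # job name -> list of cell bodies, in notebook order

    @register_cell_magic
    def job(line, cell):
        name = line.strip()
        JOBS.setdefault(name, []).append(cell)

    @register_cell_magic
    def dag(line, cell):
        edges = []
        for stmt in cell.splitlines():
            if ">>" in stmt:
                upstream, downstream = (s.strip() for s in stmt.split(">>", 1))
                edges.append((upstream, downstream))
        print("DAG edges:", edges)  # a real backend would submit these instead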

That’s a great idea. I think another magic would be necessary for configuring each job, since there can be many different parameters depending on the backend, something like:

Cell 1:

    %%job job1

Cell 2:

    %%job job1

Cell 3:

    %%job job2

Cell 4:

    %%config
    job1.config(image='python3.9:buster', max_gpu='300mb', ...)

Cell 5:

    %%dag
    job1 >> job2

Note that since both cell 1 and cell 2 are named job1, their content will be concatenated into a single job in the order they appear.
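
Concretely, with the hypothetical JOBS registry sketched above, that concatenation would amount to something like:

    # Cells 1 and 2 were both tagged job1, so their bodies are joined in order.
    script = "\n".join(JOBS["job1"])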

%% alters the behavior of the cell, right? So in this case, when we run Cell 5 it would start the DAG, or we could pass --output path to save it to a secondary file. Either way, we would need to run the notebook to generate the DAG. This covers one use case, but could we somehow generate the DAG without executing the entire notebook? I guess we could do that with an additional parser, right?
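
A sketch of what that parser could look like, assuming the hypothetical %%job/%%dag magics from above; nbformat ships with Jupyter and reads the notebook file without executing anything:

    # Statically scan the .ipynb for %%job and %%dag cells; no kernel needed.
    import nbformat

    def extract_dag(path):
        nb = nbformat.read(path, as_version=4)
        jobs, edges = {}, []
        for cell in nb.cells:
            if cell.cell_type != "code":
                continue
            first, _, rest = cell.source.partition("\n")
            if first.startswith("%%job "):
                name = first.split(maxsplit=1)[1]
                jobs.setdefault(name, []).append(rest)  # same-named cells concatenate
            elif first.startswith("%%dag"):
                for stmt in rest.splitlines():
                    if ">>" in stmt:
                        up, down = (s.strip() for s in stmt.split(">>", 1))
                        edges.append((up, down))
        return jobs, edges

That way a scheduler could build the DAG from the file alone and only execute cells when their job actually runs.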