Brainstorm: If you could run a Jupyter Notebook as a pipeline, what would you want it to look like?

Suppose you have a Notebook for a model. It pulls data, transforms it, creates the model, perhaps does some hyperparameter optimization, and then publishes it somewhere. Usually, when you want to productize it, the content of the notebook is moved to pure Python to be run within pipelines.

Now, suppose you didn’t need to convert or migrate, but could have the notebook itself become the pipeline, potentially with multiple steps, dependencies, and parameters. What would you want the notebook to look like?

For example:

  1. How would you like the DAG to be defined? I personally really like the syntax Airflow uses: t1, t2, and t3 are tasks that are Python objects, and the >> operator defines the dependencies (a fuller sketch follows this list):

    t1 >> [t2, t3]
    
  2. How would you like each step of the DAG to be defined?

    • Each cell being a step?
    • Using Markdown cells to split sections?
    • Using tags to add names?
    • Using function decorators like Kubeflow does? (see the sketch after this list)
    • What about the parameters?
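
For option 1, here is a minimal sketch of what the Airflow-style definition looks like in context. It assumes Airflow 2.x and uses EmptyOperator purely as a stand-in for real tasks:

    # Hedged sketch, assuming Airflow 2.x: the >> operator on task objects
    # records dependencies; t1 must finish before t2 and t3 start.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator

    with DAG(dag_id="example", start_date=datetime(2024, 1, 1)):
        t1 = EmptyOperator(task_id="t1")
        t2 = EmptyOperator(task_id="t2")
        t3 = EmptyOperator(task_id="t3")
        t1 >> [t2, t3]  # fan out: t2 and t3 run in parallel after t1

And for the decorator option, a hypothetical sketch; the @step decorator and STEPS registry here are invented for illustration, not Kubeflow’s actual API:

    # Hypothetical: decorate functions to register them as named steps,
    # attaching parameters at registration time.
    STEPS = {}

    def step(name=None, **params):
        def wrap(fn):
            STEPS[name or fn.__name__] = {"fn": fn, "params": params}
            return fn
        return wrap

    @step(name="pull_data", source="s3://bucket/raw")  # 'source' is a step parameter
    def pull_data(source):
        print(f"pulling from {source}")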


Interesting question! Following 🙂

Using a new “job” magic (or a -j option on any existing magic) for jobs/steps, and a “dag” magic to define the flow explicitly.

Cell 1:

    %%job jobname1    (or %%time -j jobname1)

Cell 2:

    %%job jobname1    (or %%time -j jobname1)

Cell 3:

    %%job jobname2    (or %%presto -j jobname2)

Cell 4:

    %%job jobname3    (or %%spark -j jobname3)

Cell 5:

    %%dag
    jobname1 >> jobname3
    jobname2 >> jobname3
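
A minimal sketch of how such magics could be registered with IPython follows; the JOBS registry, the deferred execution model, and the edge parsing are assumptions for illustration, not an existing extension:

    # Sketch of a custom IPython extension: %%job collects cell bodies under a
    # job name instead of running them; %%dag parses '>>' lines into edges.
    from IPython.core.magic import register_cell_magic

    JOBS = {}  # job name -> list of cell bodies, in notebook order

    @register_cell_magic
    def job(line, cell):
        name = line.strip()
        JOBS.setdefault(name, []).append(cell)

    @register_cell_magic
    def dag(line, cell):
        edges = []
        for stmt in cell.splitlines():
            if ">>" in stmt:
                upstream, downstream = (s.strip() for s in stmt.split(">>", 1))
                edges.append((upstream, downstream))
        print("DAG edges:", edges)  # a real backend would submit these instead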

That’s a great idea. I think another magic would be necessary for configuring each job, since there can be many different parameters depending on the backend, something like:

Cell 1:

    %%job job1

Cell 2:

    %%job job1

Cell 3:

    %%job job2

Cell 4:

    %%config
    job1.config(image='python3.9:buster', max_gpu='300mb', ...)

Cell 5:

    %%dag
    job1 >> job2

Note that since both cell 1 and cell 2 are named job1, their content will be concatenated into a single job in the order they appear.
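
Concretely, with the hypothetical JOBS registry sketched above, that concatenation would amount to something like:

    # Cells 1 and 2 were both tagged job1, so their bodies are joined in order.
    script = "\n".join(JOBS["job1"])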

%% alters the behavior of the cell, right? So in this case, when we run Cell 5 it would start the DAG, or we could pass --output path to save it to a secondary file. Either way, we would need to run the notebook to generate the DAG. This covers one use case, but could we somehow generate the DAG without executing the entire notebook? I guess we could do that with an additional parser, right?
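
A sketch of what that parser could look like, assuming the hypothetical %%job/%%dag magics from above; nbformat ships with Jupyter and reads the notebook file without executing anything:

    # Statically scan the .ipynb for %%job and %%dag cells; no kernel needed.
    import nbformat

    def extract_dag(path):
        nb = nbformat.read(path, as_version=4)
        jobs, edges = {}, []
        for cell in nb.cells:
            if cell.cell_type != "code":
                continue
            first, _, rest = cell.source.partition("\n")
            if first.startswith("%%job "):
                name = first.split(maxsplit=1)[1]
                jobs.setdefault(name, []).append(rest)  # same-named cells concatenate
            elif first.startswith("%%dag"):
                for stmt in rest.splitlines():
                    if ">>" in stmt:
                        up, down = (s.strip() for s in stmt.split(">>", 1))
                        edges.append((up, down))
        return jobs, edges

That way a scheduler could build the DAG from the file alone and only execute cells when their job actually runs.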