Suppose you have a notebook for a model. It pulls data, transforms it, creates the model, perhaps does some hyperparameter optimization, and then publishes it somewhere. Usually, when you want to productize it, the content of the notebook is moved to pure Python so it can run within pipelines.
Now, suppose you didn’t need to convert or migrate, and the notebook could itself become the pipeline, potentially with multiple steps, dependencies, and parameters. What would you want that notebook to look like?
How would you like the DAG to be defined? I personally really like the syntax Airflow uses: t1, t2, and t3 are tasks that are Python objects, and the DAG is defined by composing them with bitshift operators:
t1 >> [t2, t3]
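To make that concrete, here is a minimal sketch of the Airflow style (assuming Airflow 2.4+; the task names and callables are placeholders, not from any real pipeline):

```python
# Minimal Airflow DAG sketch: three placeholder tasks wired with >>.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull the data")


def clean():
    print("transform the data")


def featurize():
    print("build the features")


with DAG(dag_id="notebook_pipeline", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="clean", python_callable=clean)
    t3 = PythonOperator(task_id="featurize", python_callable=featurize)

    # t2 and t3 both run after t1, in parallel
    t1 >> [t2, t3]
```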
How would you like each step of the DAG to be defined?
- Each cell being a step?
- Using Markdown cells to split sections?
- Using tags to add names?
- Using function decorators, like Kubeflow does? (See the sketch after this list.)
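To give the decorator option some shape, here is a rough sketch in the style of Kubeflow Pipelines (assuming KFP v2; the component names and bodies are placeholders, not a real training job):

```python
# Rough sketch of decorator-defined steps, KFP v2 style.
from kfp import dsl


@dsl.component
def pull_data(source: str) -> str:
    # Placeholder: would pull and stage the raw data
    return f"data from {source}"


@dsl.component
def train(data: str, learning_rate: float) -> str:
    # Placeholder: would fit and publish the model
    return f"model trained on {data} at lr={learning_rate}"


@dsl.pipeline(name="notebook-as-pipeline")
def pipeline(source: str = "warehouse", learning_rate: float = 0.01):
    # The dependency (train runs after pull_data) falls out of the data flow
    data_task = pull_data(source=source)
    train(data=data_task.output, learning_rate=learning_rate)
```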
And what about the parameters? A few existing tools already cover parts of the picture:
- Papermill: pretty powerful; it can be used to run a notebook as a job within a pipeline, but you can’t define a multi-step pipeline with it (see the sketch after this list)
- Kubeflow operators: Python function-based components (Kubeflow docs)
- Airflow Python operators: PythonOperator (Airflow docs)
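As an illustration of the Papermill point above, a parameterized run looks roughly like this (the notebook paths and parameter names are hypothetical, and the input notebook needs a cell tagged `parameters` for the injection to work):

```python
# Run a single notebook as a parameterized job with Papermill.
import papermill as pm

pm.execute_notebook(
    "train.ipynb",          # hypothetical input notebook
    "train_output.ipynb",   # fully executed copy, with parameters injected
    parameters={"learning_rate": 0.01, "n_estimators": 200},
)
```

Each run produces one executed output notebook, which is why it works well as a single job but gives you no way to express a multi-step DAG on its own.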