Recently I have been working on an open source library, pynb-dag-runner. Below are some details in case anyone is interested.
In short, the library makes it possible to run (pipelines of) Jupyter notebooks using only the services provided with a free (personal) GitHub account and a public repo.
GitHub Actions is used for compute and for scheduling pipelines. The demo pipeline runs daily and for all pull requests.
GitHub build artifacts are used to store evaluated notebooks (and any metrics, images, or other files that have been logged).
GitHub Pages is used to host a static website where one can inspect past pipeline runs and evaluated notebooks.
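To make this concrete, below is a minimal sketch of what "execute a notebook and keep the evaluated copy" can look like, using papermill (a common notebook-execution library). The file paths, parameters, and the use of papermill itself are assumptions for illustration; this is not necessarily how pynb-dag-runner runs its tasks internally.

```python
# Hypothetical sketch: execute one notebook and write the evaluated copy to a
# directory that the CI job later uploads as a GitHub build artifact.
# (papermill is used here only as an illustration of the general flow.)
import os
import papermill as pm

os.makedirs("artifacts", exist_ok=True)

pm.execute_notebook(
    "notebooks/ingest.ipynb",             # hypothetical input notebook
    "artifacts/ingest.evaluated.ipynb",   # evaluated output, stored as artifact
    parameters={"run_date": "2022-01-01"},
)
```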
Screenshot of the main page that lists past pipeline runs:
The above screenshots are from the live reporting site below:
Of note: this site is hosted on GitHub Pages and does not require any backend service or other cloud infrastructure. The entire pipeline and reporting run using only a free personal GitHub account, and the pipeline is scheduled to run daily.
Lovely stuff for the common case of “show that a bunch of notebooks were run” without deploying (another) k8s cluster.
It’s definitely worth having a configuration approach that doesn’t rely on CDNs/GitHub. This helps folks who can get stuff from PyPI, but have to play whack-a-mole behind corporate/national firewalls at runtime.
For web assets, ideally this just means shipping all dependencies along with the pip-installable package.
Artifacts are of course harder, with the knee-jerk reaction being an S3 upload… that part might just need to be pluggable.
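For example, a rough sketch of such a pluggable artifact store could look like the following (the `ArtifactStore` interface and class names are hypothetical, not an existing pynb-dag-runner API):

```python
# Hypothetical pluggable artifact store: the pipeline writes through a small
# interface, and the backend (local directory, S3, ...) is chosen by config.
from pathlib import Path
from typing import Protocol


class ArtifactStore(Protocol):
    def put(self, local_file: Path, key: str) -> None: ...


class LocalDirStore:
    def __init__(self, root: Path):
        self.root = root

    def put(self, local_file: Path, key: str) -> None:
        target = self.root / key
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(local_file.read_bytes())


class S3Store:
    def __init__(self, bucket: str):
        import boto3  # optional dependency, only needed for this backend
        self.client = boto3.client("s3")
        self.bucket = bucket

    def put(self, local_file: Path, key: str) -> None:
        self.client.upload_file(str(local_file), self.bucket, key)
```

The pipeline would only ever call `put()`, and the backend would be selected via configuration.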
As an interesting feature direction:
There are some examples of deploying JupyterLite on GitHub Pages. In addition to the existing CSV download and notebook preview, adding a JupyterLite build to the publishing pipeline of a PBR (pynb-dag-runner) site would make it browsable and computable by co-deploying a JupyterLite site.
Agree. Currently only local execution and GitHub are supported, but other setups should not be too difficult to add. Below are some comments on this.
A main feature of the pynb-dag-runner executor is that it can run on ephemeral compute resources. This means there is no dependency on an always-on tracking server, database, or backend service (that keeps track of which tasks have started, which tasks finished successfully, console logs, logged metrics/artifacts, and so on).
Instead, pynb-dag-runner uses OpenTelemetry (an open standard) to emit these events as structured logs. Since it is an open standard, the events can be directed to a JSON file, an S3 bucket, or potentially any logging endpoint that can ingest OpenTelemetry logs.
So the approach is to capture everything of interest as OpenTelemetry logs (including any artifacts or evaluated notebooks). For reporting, the logs can then be converted into various formats.
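As a rough illustration of that idea, the sketch below emits one "task" event as an OpenTelemetry span using the standard Python SDK and serializes it as JSON to stdout (which could be redirected into a log file). The span and attribute names are made up for illustration; they are not pynb-dag-runner's actual log schema.

```python
# Minimal OpenTelemetry sketch (standard Python SDK): emit a task span and
# print it as JSON. Attribute names are illustrative only.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
# ConsoleSpanExporter writes each finished span as JSON; redirecting stdout to
# a file gives a self-contained structured log of the run.
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("notebook-pipeline")

with tracer.start_as_current_span("run-notebook") as span:
    span.set_attribute("task.notebook", "notebooks/ingest.ipynb")  # hypothetical
    span.set_attribute("task.status", "ok")
```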
Currently there is only one demo setup, and it uses GitHub Actions (as an example of ephemeral compute). Since the run environment is Dockerised, the executor should adapt easily to other setups, e.g. serverless functions or AWS ECS (Elastic Container Service). The only requirement is that one can somehow capture the logs; otherwise there is no network requirement at runtime.
Of course, if one has access to infrastructure, there are existing alternatives for running pipelines/notebooks, so the main focus so far has been on the "scaled-down, no-infra" setting (where I am not sure there are many other options). Technically, though, the pynb-dag-runner executor already uses the Ray framework for running tasks in parallel, and Ray supports both Kubernetes and VM clusters on public clouds. So scaling up to larger workloads on clusters should be possible (although this would require some work, I believe).
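For reference, this is roughly what parallel tasks look like in plain Ray (generic Ray usage, not pynb-dag-runner's internal code); the same script runs unchanged on a laptop or, pointed at a cluster head node, on a Kubernetes/VM cluster:

```python
# Generic Ray usage: independent tasks run in parallel on whatever resources
# Ray has available (local cores or a remote cluster).
import ray

ray.init()  # on a cluster this would instead connect to the Ray head node


@ray.remote
def run_task(task_id: int) -> str:
    # placeholder for "evaluate one notebook"
    return f"task {task_id} done"


results = ray.get([run_task.remote(i) for i in range(4)])
print(results)
```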
Yes, JupyterLite is so cool, and definitely relevant! Thank you for the link; I was not aware of it.
A potential direction could be to use JupyterLite to enable interactive dashboards that require some data preprocessing steps or a data pipeline (assuming interactive Python dashboards start to support JupyterLite).
JupyterLite + pynb-dag-runner could then be used in combination to maintain open source/open data pipelines with public dashboards. Something like:
If all of this could run from a GitHub account, anyone could get their own development setup (with code repo, dashboard, reporting site) by just cloning the repo.
There are likely a lot of details here, and this would require some work…