Orbiter - Image builder for Jupyter

Hello everyone,

After lots of fiddling around I’m happy to finally announce my first public release of orbiter.

Orbiter started as a small project just for myself and evolved over time into a pretty generic image builder framework for super small Jupyter Docker images.
Orbiter uses a couple of advanced Docker build techniques, and it’s interesting to see how the different methods affect build speed and overall image size.

Builders

Currently Orbiter supports two builders, micromamba and alpine. The micromamba builder uses an Alpine/glibc base image with micromamba as the package installer for conda-forge. The alpine builder is built on top of Alpine edge (without glibc) and uses pyenv and pip to install Python and its packages.

A more ‘traditional’ builder based on Ubuntu and miniforge is next on my to-do list, but for the moment I wanted the image builds to be as comparable as possible.

Images

Currently Orbiter includes definitions for two different images, base and geospatial. Base includes JupyterLab and the usual suspects like numpy, pandas, altair, matplotlib, nbgitpuller and jupyter-book. Geospatial includes a lot of additional geospatial libraries. The geospatial image hasn’t seen much testing (yet); I primarily used it to stress test my build toolchain.

Semantic Versioning

Orbiter automatically applies semantic versioning to complete Docker images. So instead of having to pin down your dependencies one by one and hoping that they will build exactly the same in the future, you can simply use a Dockerfile with ‘FROM dorgeln/orbiter:alpine-base-0.1.0’ and your notebook keeps running against exactly the same image for as long as that tag stays available.
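To make the tag scheme a bit more concrete, here is a minimal Python sketch that splits a tag such as alpine-base-0.1.0 into its parts. The `<builder>-<image>-<version>` layout is only inferred from the tags shown in this post, and `parse_tag` and `OrbiterTag` are hypothetical helpers for illustration, not part of Orbiter.

```python
import re
from typing import NamedTuple, Tuple


class OrbiterTag(NamedTuple):
    """Parts of an image tag (hypothetical helper, not part of Orbiter)."""
    builder: str
    image: str
    version: Tuple[int, int, int]


# Assumed layout: <builder>-<image>-<major>.<minor>.<patch>
TAG_RE = re.compile(
    r"^(?P<builder>[a-z0-9]+)-(?P<image>[a-z0-9-]+)-(?P<version>\d+\.\d+\.\d+)$"
)


def parse_tag(tag: str) -> OrbiterTag:
    """Split a tag like 'alpine-base-0.1.0' into builder, image and version."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a <builder>-<image>-<version> tag: {tag!r}")
    major, minor, patch = (int(part) for part in match.group("version").split("."))
    return OrbiterTag(match.group("builder"), match.group("image"), (major, minor, patch))


if __name__ == "__main__":
    print(parse_tag("alpine-base-0.1.0"))
```

Keeping the version as a tuple of integers means two tags can be compared numerically, so a 0.10.0 release sorts after 0.9.0.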

Monorepo

Instead of having separate repositories for each build, Orbiter uses a single repository and configuration file to build multiple versions of an image. This ensures that each image is built from the same packages, with consistent version numbering across all variants.

Buildkit Caching

Orbiter takes full advantage of the BuildKit --mount=type=cache option that is included in recent versions of Docker. Using cache mounts speeds up build times a lot, because packages that were already built or downloaded can be reused between builds, and neither micromamba nor pip has to start from scratch just because a single package was added to or removed from your package list.

Multistage builds

Orbiter supports recursive multistage build chains. Instead of building one image on top of the other (constantly increasing image sizes), the build chain builds a separate build image and only copies the parts needed to run the installed packages to the final image (reducing the size of the final image).

Templating & Automation

Instead of hard coding everything in Dockerfiles, Orbiter makes extensive use of Jinja templating, Invoke task automation and a YAML-based configuration system.

It can take a while to understand how the different parts fit together. But once you do, you can build your own customized Docker image just by adapting the package definitions in invoke.yaml, or simply grab one of the generated Dockerfiles and modify it by hand.
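The general idea of generating Dockerfiles from package definitions can be sketched in a few lines of Python. This is not Orbiter's actual template or configuration: the standard library's `string.Template` stands in for Jinja, the `IMAGES` dict stands in for the package definitions in invoke.yaml, and the `python:3-alpine` base image, the geopandas entry and the `render` helper are assumptions made purely for illustration.

```python
from string import Template

# Hypothetical package definitions; Orbiter keeps its real ones in invoke.yaml.
IMAGES = {
    "base": ["numpy", "pandas", "altair", "matplotlib"],
    "geospatial": ["numpy", "pandas", "altair", "matplotlib", "geopandas"],
}

# Simplified Dockerfile template (stdlib stand-in for a Jinja template).
# The RUN line uses a BuildKit cache mount on pip's default cache directory
# for root, so downloaded packages can be reused between builds.
DOCKERFILE = Template(
    "# syntax=docker/dockerfile:1\n"
    "FROM $base_image\n"
    "RUN --mount=type=cache,target=/root/.cache/pip \\\n"
    "    pip install $packages\n"
)


def render(image: str, version: str, base_image: str = "python:3-alpine"):
    """Return (tag, dockerfile_text) for one image variant."""
    tag = f"alpine-{image}-{version}"
    text = DOCKERFILE.substitute(
        base_image=base_image,
        packages=" ".join(IMAGES[image]),
    )
    return tag, text


if __name__ == "__main__":
    # Every variant is rendered from the same definitions and version number.
    for name in IMAGES:
        tag, dockerfile = render(name, "0.1.0")
        print(f"--- {tag} ---")
        print(dockerfile)
```

Because every variant comes from one set of definitions and one version string, adding a package in a single place changes all generated Dockerfiles consistently.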

Image sizes

One of the most interesting results of Orbiter is the image sizes of images that ‘should’ contain exactly the same Python packages.

[atrawog@w orbiter]$ docker images | grep build
dorgeln/orbiter   alpine-geospatial-build-0.1.0       1e71fe9c9191   8 hours ago   2.71GB
dorgeln/orbiter   alpine-base-build-0.1.0             eb9b070cf235   8 hours ago   2.35GB
dorgeln/orbiter   micromamba-geospatial-build-0.1.0   cb6aa1fd89d9   8 hours ago   2.59GB
dorgeln/orbiter   micromamba-base-build-0.1.0         3c4f635b6666   8 hours ago   1.81GB

The alpine multistage build images end up being larger than the corresponding micromamba versions, primarily because pip doesn’t support wheels for Alpine (this will change with PEP 656). So for alpine I have to include a lot of dev libraries to build everything from scratch (the first 2h are painful to watch, but once the BuildKit cache is seeded, build times speed up nicely).

[atrawog@w orbiter]$ docker images | grep -v build | grep -v core
REPOSITORY        TAG                                 IMAGE ID       CREATED       SIZE
dorgeln/orbiter   alpine-geospatial-0.1.0             4569121cc65c   8 hours ago   1.42GB
dorgeln/orbiter   alpine-base-0.1.0                   b7b4e3e8705d   8 hours ago   1.04GB
dorgeln/orbiter   micromamba-geospatial-0.1.0         6a93aa01dcfe   8 hours ago   2.41GB
dorgeln/orbiter   micromamba-base-0.1.0               5966fde966ae   8 hours ago   1.7GB

Things are completely the other way round for the final images, with a whopping 55% reduction in image size for alpine-base and only a modest 6% reduction for micromamba-base.
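For anyone who wants to check the arithmetic, here is a small Python sketch that recomputes the reductions from the sizes in the two docker images listings above. The sizes are copied from those listings as printed; `reduction_percent` is just an illustrative helper, and since docker rounds the sizes it reports, the results are approximate.

```python
# (build image size, final image size) in GB, as shown by `docker images`.
SIZES_GB = {
    "alpine-base": (2.35, 1.04),
    "alpine-geospatial": (2.71, 1.42),
    "micromamba-base": (1.81, 1.70),
    "micromamba-geospatial": (2.59, 2.41),
}


def reduction_percent(build_gb: float, final_gb: float) -> float:
    """Size reduction of the final image relative to its build image, in percent."""
    return 100.0 * (build_gb - final_gb) / build_gb


if __name__ == "__main__":
    for name, (build_gb, final_gb) in SIZES_GB.items():
        print(f"{name}: {reduction_percent(build_gb, final_gb):.1f}% smaller")
```

The same calculation shows that the geospatial variants follow the same pattern: a large reduction for alpine and a small one for micromamba.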

The reason behind this big difference is that in a conda-based image it’s somewhat hard to safely remove parts of a package in a generic way without breaking some of the packages or confusing the micromamba package manager.

The alpine build, on the other hand, simply replaces all the -dev packages with the standard libraries and doesn’t include any compilers or build packages in the final image, which leads to a surprisingly big reduction of the final image size.

The alpine-base image compresses down to a slim 292.97 MB. That is the smallest size I have seen so far for a Jupyter Docker image that includes the basic numpy, pandas, altair, matplotlib data science stack in common use today. It’s also a nice gauntlet to throw down to the Jupyter community: I challenge everyone to build an image that’s even smaller.

I’ve been waiting for there to be more open-source tooling for Jupyter Docker builds. Have you looked into incorporating S2I? It’s similar to repo2docker but lets you define custom build procedures.