Repo2docker and MyBinder: does image size make a difference?

repo2docker
#1

For https://edu.oggm.org (repo), we need an environment with a lot of packages from the Python GIS stack. Our Docker image was 4.3 GB, so I spent some time moving to a pure pip install, which reduced the image size from 4.3 GB to 3.3 GB.

However, the resulting image is still very large, I suppose because conda is still installed by default.

As per https://mybinder.readthedocs.io/en/latest/faq.html#what-factors-influence-how-long-it-takes-a-binder-session-to-start , I understand that reducing the image size is not the biggest priority, but wouldn’t there still be a gain if we managed to get down to 1 GB?

This would only be possible by ditching conda altogether, which is something repo2docker may not be able to do anyway: I just wanted to ask here if this would be a possibility, and especially if reducing the image size is really worth the trouble in the first place.

#2

If you want to chase every last bit of performance/image size I’d recommend starting from something like: https://github.com/binder-examples/minimal-dockerfile but the “minimal” in the name really does mean “minimal”.

I just built https://github.com/binder-examples/requirements which is almost a “nothing but conda and other repo2docker infrastructure” image. It installs a few dependencies like matplotlib, numpy and seaborn.

The image size comes to 527MB according to microbadger. About 70MB of those are (I think) “packages” that I asked for in requirements.txt. Of the remaining image size I’d say 50% are the Ubuntu base image and 50% are from miniconda.

My takeaway from this is that you can save around 200-400 MB by starting from a minimal Dockerfile and tuning exactly what gets installed and how.
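
As a rough sketch of the arithmetic behind that estimate, the snippet below splits the 527 MB image into the pieces described above. The 50/50 split between the Ubuntu base image and miniconda is the approximate guess from this post, not a measured value, and the variable names are only illustrative.

```python
# Rough breakdown of the 527 MB image, using the estimates quoted in the post.
total_mb = 527      # image size reported by microbadger
packages_mb = 70    # approximate size of the packages from requirements.txt

# Everything that is not a requested package (base image plus repo2docker tooling).
infrastructure_mb = total_mb - packages_mb

# Assumption from the post: roughly half Ubuntu base image, half miniconda.
ubuntu_mb = infrastructure_mb * 0.5
miniconda_mb = infrastructure_mb * 0.5

print(f"infrastructure: {infrastructure_mb} MB")
print(f"ubuntu base:    {ubuntu_mb:.1f} MB")
print(f"miniconda:      {miniconda_mb:.1f} MB")
```

Under these assumptions the miniconda share lands inside the 200-400 MB range mentioned above, which is about what dropping conda could save at best.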

It would be interesting if you could push your built image to docker hub so we can see what it looks like in microbadger.

#3

Thanks @betatim ! I’m not that interested in micro-optimisations; I was mostly wondering if it’s even worth it in the first place ;-).

I’m learning about docker in the process and just pushed an image here: https://hub.docker.com/r/fmaussion/oggm-edu-r2d-pip

The binder files are here: https://github.com/OGGM/oggm-edu-r2d-pip

My goal now is to use this base image for the set-up we discussed here: Repo2Docker: make it easy to start from arbitrary docker image

#4

Take a look at https://microbadger.com/images/fmaussion/oggm-edu-r2d-pip vs https://microbadger.com/images/betatim/binder-example-requirements. Each of the “layers” they display is the result of running a command in the docker file repo2docker generates. You can see the ones that are similar between the two images and also which ones are new.

For example your postBuild step seems to be the one that generates the biggest layer (about 400MB). Maybe there is potential for clean up in the commands you run there?

PS: I never know how to estimate Docker image sizes. Running docker images locally seems to show different numbers from microbadger, and how much actually needs transferring seems to be different again. I think most of that uncertainty goes away by looking at the ratio of layer sizes between two images :slight_smile:
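
One way to look at layer ratios rather than absolute sizes is sketched below. It is a minimal example, not a tool used in this thread: `parse_size` and `layer_shares` are hypothetical helpers, and the layer names and sizes in `example_image` are made up for illustration, not read from either image.

```python
def parse_size(text):
    """Convert a human-readable size such as '400MB' or '3.31GB' to megabytes."""
    units = {"KB": 1 / 1000, "MB": 1.0, "GB": 1000.0}
    text = text.strip().upper()
    for unit, factor in units.items():
        if text.endswith(unit):
            return float(text[: -len(unit)]) * factor
    raise ValueError(f"unrecognised size: {text!r}")


def layer_shares(layers):
    """Return each layer's fraction of the total image size."""
    sizes = {name: parse_size(size) for name, size in layers.items()}
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


# Hypothetical layer sizes, for illustration only.
example_image = {
    "base": "100MB",
    "conda": "200MB",
    "postBuild": "400MB",
    "other": "100MB",
}

shares = layer_shares(example_image)
# Print the largest layers first.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:10s} {share:.0%}")
```

Running the same function on the layer sizes of two images makes it easy to see which step dominates each one, even when different tools disagree on the totals.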

#5

Thanks - postBuild is where I install most of the pip packages because of the order in which they need to be installed: https://github.com/OGGM/oggm-edu-r2d-pip/blob/e268070b2a63a98508646b878cc5f26fbdb05554/binder/postBuild . There are also 225 MB of sample data that we download.

Yeah, I have trouble understanding what an image size really is as well. Locally, running docker images gives a size of 3.31 GB (versus 4.28 GB for the conda build). I’m already glad we managed to save space in comparison to the pure conda build :wink:
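
For a sense of scale, here is a small sketch of the saving so far and the distance to the 1 GB target raised in the first post. It only uses the local docker images numbers quoted above; the variable names are illustrative.

```python
conda_build_gb = 4.28   # conda build, as reported locally by `docker images`
pip_build_gb = 3.31     # pip build, as reported locally by `docker images`
target_gb = 1.0         # the 1 GB target asked about in the first post

saved_gb = conda_build_gb - pip_build_gb
saved_fraction = saved_gb / conda_build_gb
remaining_gb = pip_build_gb - target_gb

print(f"saved so far: {saved_gb:.2f} GB ({saved_fraction:.0%} of the conda build)")
print(f"still to cut for a {target_gb:.0f} GB image: {remaining_gb:.2f} GB")
```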