I want to create a sort of library of notebooks within one GitHub repository. The notebooks will do very different tasks, and each will have its own Python environment. In fact, they might not be exclusively Python-based. It would be great if each had its own mybinder.org link, but from my understanding repo2docker works off the GitHub repo. I would instead want Docker images created from folders within the repository. Is such an idea possible?
It’s not currently supported. Here’s a related thread with some workarounds:
Shall we move the discussion there?
With some colleagues, we might have a look at this over the weekend. I am totally not an expert in Docker and Binder, but the workarounds suggested in the previous discussion look like a good place to start. Thanks!
Taking a step back: Why do you want to have separate environments on mybinder.org?
Depending on why you want different container images for the notebooks, the answer might be “you don’t actually want different images”. Let me explain.
Assuming three things:
- the environments are not incompatible with each other (all dependencies can be installed in one env)
- launching the image is more common than building the image (for every commit you have at least two launches)
- you want fast launches for your users
Under those assumptions there are benefits to using the same container image for all of the notebooks on mybinder.org. The combined image might be larger than any individual image, but it might still lead to faster launches for your users.
This is because different images can be assigned to different clusters, in which case each cluster will have to build the image. If all your launches use the same image, there is a good chance that all launch attempts get assigned to the same cluster (no rebuilding on first launch), and they might even all get launched on the same node (no transferring of the image from the registry to the node).
When you make a change to your repo and it needs re-building, we try very hard to assign the re-build job to the same node in the same cluster that built the original image. This increases your chances of the build process reusing as many layers as possible through the magic of Docker caching.
Both launches and re-builds rely on things being cached. We have large caches on the nodes, but eventually we do have to empty them. We try to empty them starting with the least recently used stuff. This is another reason to share images: you get the combined “oomph” of all launches, instead of spreading your launches across N images, which makes each of them look less popular.
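To make the least-recently-used argument concrete, here is a toy simulation. It is only a sketch under made-up assumptions: a single node, a cache that holds three images, and two other popular images (`other-a`, `other-b`) competing for space. None of these numbers or names come from mybinder.org, and the real scheduling and eviction logic is more involved.

```python
from collections import OrderedDict


def count_cache_hits(launches, capacity):
    """Replay image launches on one node with an LRU image cache.

    Returns a dict mapping each image name to its number of cache hits.
    """
    cache = OrderedDict()  # image name -> None, least recently used first
    hits = {}
    for image in launches:
        hits.setdefault(image, 0)
        if image in cache:
            hits[image] += 1
            cache.move_to_end(image)  # mark as most recently used
        else:
            cache[image] = None  # miss: the image has to be pulled or built
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used image


    return hits


# Hypothetical traffic: every third launch is for the notebook library.
shared = ["lib", "other-a", "other-b"] * 12
split = []
for i in range(12):
    split += [f"lib-{i % 3 + 1}", "other-a", "other-b"]

shared_hits = count_cache_hits(shared, capacity=3)
split_hits = count_cache_hits(split, capacity=3)

print("shared image hits:", shared_hits["lib"])
print("split image hits:", sum(v for k, v in split_hits.items() if k.startswith("lib-")))
```

In this contrived pattern the single shared image stays in the cache after its first launch, while the three separate images keep evicting each other and never get a hit. Real traffic will be far less clear-cut, so treat it as an illustration of the mechanism rather than a prediction.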
Of course, all the caching and re-using is an optimisation. Nodes come and go, there might be other super popular images crowding you out of the caches, etc. However, the worst case is the same whether you use one shared image or many images, and sharing the image could give you an edge.
(I can construct counterexamples in my head, but I’d classify those as edge cases. For example, each individual image is super small and fast to build, but the combined one is huge and slow to build, despite consisting of instructions which are individually fast.)
So overall I’d take a step back and ponder why you want to split things. Maybe even do some experimenting to inject some data into all this hypothesising. One of my favourite quotes: “benchmarking gives me a leg up on all those who are too good to benchmark before optimising”.
Yes. The group I am working with was divided on this. We want to develop a public library of notebooks where anyone can submit their work, so the list of dependencies is to some degree unknown. But your points look spot on. The solution might be to try to anticipate all dependencies in the repository and then have individual installs within the first cell of each notebook to catch what is not already there. I gave it a shot just now here https://github.com/johnjarmitage/notebook-library and it works nicely for my one example. The Friday night speculation is because the hackathon is tomorrow…
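For anyone curious, the "install in the first cell" idea could look roughly like the sketch below. This is not necessarily what the linked repository does. The helper names `find_missing` and `ensure_installed` are made up for illustration, and it assumes the notebook is Python-based and that pip is available in the kernel's environment.

```python
import importlib.util
import subprocess
import sys


def find_missing(modules):
    """Return pip names for packages that cannot be imported here.

    `modules` maps a top-level import name to its pip name, since the two
    can differ (for example "sklearn" is installed as "scikit-learn").
    """
    return [
        pip_name
        for import_name, pip_name in modules.items()
        if importlib.util.find_spec(import_name) is None
    ]


def ensure_installed(modules):
    """Install whatever is missing and return the list of pip names installed."""
    missing = find_missing(modules)
    if missing:
        # Use the kernel's own interpreter so the packages land in the
        # environment the notebook is actually running in.
        subprocess.check_call([sys.executable, "-m", "pip", "install", *missing])
    return missing


# Example first cell (commented out so importing this sketch installs nothing):
# ensure_installed({"numpy": "numpy", "sklearn": "scikit-learn"})
```

Anything already baked into the shared image is skipped, so only the unanticipated dependencies cost time at launch. Packages installed this way are not part of the image, so they get reinstalled on every new Binder session.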
The “install it all” approach can take you a very long way. Here is every data-science package ever in one big Docker image:
(not advocating this, more of an item in the cabinet of curiosities)
And you can make it run on Binder if you want to: