Variable startup times with a RStudio based binder example

I made a minimal binder demo example that uses RStudio.
Here is the GitHub repo and here’s the launch button: Binder

When the image was first built, it took a long time (around 30 minutes or more). But that’s ok. And I realize that whenever I change the repo at all, it will rebuild the image, and that will take another 30 minutes. Following this advice, I see I could try to use two repos, one for content and one for environment, to avoid that.

However, the issue I seem to be facing now is that, despite the image being built and no further changes being made to the repo, the launch time of any instance is very variable. I just tried many times in the last hour or so. In a few cases, going from the click of the button (i.e. above) to a usable RStudio session takes about 10-15 seconds, which is great. At other times, it takes about one minute, and at other times it takes about three minutes.

Am I missing a trick here? Is there anything that can be done to make a prebuilt image consistently load at the top speed of around 10-15 seconds?

Long story short: try to make your image as small as possible. Beyond that, there aren’t many things you can control related to startup time. Below is some explanation of how things work, to show why image size is the thing I’d focus on.

There are a few reasons why launching a pre-built image can be faster or slower. The only time you get the 10-15s experience is if the image you are requesting is already present on a node in our cluster that is also available to launch it now. This is the best case.

The next best case is that a node is available to launch your image but doesn’t have a copy of it. Then it needs to fetch that image first. How much time that takes depends on which cluster you ended up on and more importantly how large the image is (you can control this factor to some extent).

The next best case is if a new node needs to be booted and then needs to fetch your image. We try to boot nodes ahead of time but sometimes the scale up of demand is so fast that we can’t stay ahead. Booting a node takes ~5-10min. I think it is very rare for an average user to run into this situation.

The next best case is that your image isn’t in the registry and needs to be built.

The next best case is that is down and you will have to wait until it is fixed :smiley:

My guess is that the difference in times you see is because you were assigned to a node that needs to fetch the image (minutes of wait time) and a node that already had the image (tens of seconds wait time).
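To make the cases above concrete, here is a rough back-of-the-envelope sketch in Python. Only the 10-15 s fast start and the ~5-10 min node boot come from this thread; the pull rate (`PULL_SECONDS_PER_GB`) and all the names are assumptions made up for illustration, not measured values from the cluster.

```python
from dataclasses import dataclass

# Figures taken from the thread (midpoints of the quoted ranges).
CONTAINER_START_S = 12.5   # "10-15 seconds" when the image is already on the node
NODE_BOOT_S = 7.5 * 60     # "~5-10min" to boot a new node

# ASSUMPTION: hypothetical effective pull cost of 20 s per GB (about 50 MB/s).
PULL_SECONDS_PER_GB = 20.0


@dataclass
class LaunchScenario:
    node_ready: bool       # a node is available without booting a new one
    image_on_node: bool    # that node already holds a copy of the image
    image_size_gb: float


def estimate_launch_seconds(s: LaunchScenario) -> float:
    """Add up the waiting time for one launch under the assumptions above."""
    total = CONTAINER_START_S
    if not s.node_ready:
        total += NODE_BOOT_S
    # A freshly booted node never has the image, so it always has to pull it.
    if not s.node_ready or not s.image_on_node:
        total += s.image_size_gb * PULL_SECONDS_PER_GB
    return total


best_case = estimate_launch_seconds(LaunchScenario(True, True, 3.0))
needs_pull = estimate_launch_seconds(LaunchScenario(True, False, 3.0))
needs_new_node = estimate_launch_seconds(LaunchScenario(False, False, 3.0))
print(best_case, needs_pull, needs_new_node)
```

Under these assumed numbers, the only term a repo owner can influence is `image_size_gb`, which is why image size is the thing to focus on.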

Unfortunately there isn’t much we can do. Popular images tend to be present on many nodes and stay in the cache of the nodes. Not so popular images get evicted from the cache. This is probably the best way to set up a cache (recently used stuff remains in the cache, least recently used is evicted). Unfortunately, from the point of view of a repo owner, most images are unpopular because the repo owner is (to be a little dramatic) the only user of that image.
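As an illustration of the "least recently used is evicted" behaviour described above, here is a minimal toy model in Python. It is only a sketch of the general LRU idea; the class name, the capacity, and the image names are hypothetical, and this is not how the nodes actually manage their image cache.

```python
from collections import OrderedDict


class NodeImageCache:
    """Toy LRU cache of images on one node (hypothetical, for illustration)."""

    def __init__(self, capacity_gb: float):
        self.capacity_gb = capacity_gb
        # Maps image name -> size in GB, ordered from least to most recently used.
        self.images = OrderedDict()

    def launch(self, name: str, size_gb: float) -> bool:
        """Return True on a cache hit, False if the image had to be pulled."""
        if name in self.images:
            self.images.move_to_end(name)  # mark as most recently used
            return True
        # Evict least recently used images until the new one fits
        # (an image larger than the whole cache is stored anyway in this toy).
        while self.images and sum(self.images.values()) + size_gb > self.capacity_gb:
            self.images.popitem(last=False)
        self.images[name] = size_gb
        return False


cache = NodeImageCache(capacity_gb=10)
cache.launch("popular-image", 4)       # pulled
cache.launch("my-rstudio-image", 4)    # pulled
cache.launch("popular-image", 4)       # hit, now most recently used
cache.launch("someone-elses-image", 4) # pulled, evicts my-rstudio-image
hit_again = cache.launch("my-rstudio-image", 4)
print(hit_again)
```

In this toy run the rarely launched image is the one that gets evicted, so its next launch has to pull it again.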

Please don’t try and artificially make your repo popular (some might think of scripts to launch it or such). If doing that becomes a popular pastime we will probably just ban the offending repos and then have to spend time on figuring out automatic defences against this. Instead of spending time on actual features :wink: .

The thing I am most excited about in terms of improving this situation is being able to start a container without having to transfer the whole image first. There is some amazing work happening on this in the Docker/container community, for example GitHub - containerd/stargz-snapshotter: Fast docker image distribution plugin for containerd, based on CRFS/stargz. However, as far as I know we are still some way from being able to deploy that, both in terms of it being ready enough and in terms of having the expertise in the team to do this. If you or someone has tried this or keeps a close eye on it … please let us know :smiley:


@betatim Thanks for that very comprehensive reply, which was very enlightening.
I think the issue I was facing recently was that the node I was using did not have the image and had to get a copy of it. Sometimes I was on a node with the image, and the launch time was fast.

I presume when we are talking about “images”, we are talking about Docker images. I use Docker for RStudio server based containers, i.e. using rocker as the base image, on my local machine. What is noticeable is how large they can become, e.g. up to 8-10GB. This is fine locally, but if the RStudio binder images were nearing that size, that would explain a lot.

In terms of keeping the image size minimal, in the example GitHub repo mentioned above from which the image was made, the contents were quite minimal: a oneliner runtime.txt, a oneliner install.R (inst, and a small R script. The install.R was getting the tidyverse package (of packages) installed. I’m not sure if I can make anything smaller than that in terms of the repo contents. But perhaps there is a way I can make the base Docker image used by repo2docker smaller. I am new to binder and have not really looked into these possibilities yet, so I will try to learn more.

In future there may be other ways to speed up large images by taking advantage of Docker’s layer caching. For instance, if there was a large 10GB base image that was widely used and you added 1GB to it, if that base image was sufficiently popular it might already be on all nodes so only the additional 1GB layer would be downloaded, whereas if you were to build a smaller fully customised 5GB image that might have to be downloaded from scratch.

That’s mostly theoretical at the moment, but there is a discussion on Some steps to explore supporting a "default environment" · Issue #1474 · jupyterhub/ · GitHub about having a “default” binder image.
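The arithmetic behind the 10GB base plus 1GB example can be sketched as follows. The layer names and the idea of tracking layers by name are simplifications assumed for illustration; real Docker layers are identified by content digests.

```python
def download_gb(image_layers: dict, layers_on_node: set) -> float:
    """Sum the sizes (in GB) of the layers the node does not already hold."""
    return sum(
        size for layer, size in image_layers.items() if layer not in layers_on_node
    )


# ASSUMPTION: the widely used base layer is already cached on the node.
layers_on_node = {"popular-base"}

# A large shared base plus a small custom layer on top of it.
layered_image = {"popular-base": 10.0, "my-additions": 1.0}

# A smaller image built from scratch that shares nothing with the cached layers.
custom_image = {"custom-all-in-one": 5.0}

layered_download = download_gb(layered_image, layers_on_node)
custom_download = download_gb(custom_image, layers_on_node)
print(layered_download, custom_download)
```

In this sketch the 11GB image needs only a 1GB download while the 5GB image needs all 5GB, but only as long as the base layer really is on the node already.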


There is already some layer caching and sharing in practice between Docker images, no? In repo2docker we try to build images so that they can share as many layers as possible. So two R images that follow the recipe from the example repo should share all layers until the one which copies in the content of the repository itself. However, I don’t know how true that is in practice, or whether someone has explicitly checked for R based images. We did a bit of work on this for conda based ones a while back though.
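The "share all layers until the one which copies in the content of the repository" idea can be pictured as a common prefix of two ordered layer lists. The layer names below are invented for illustration and are not the layers repo2docker actually produces.

```python
def shared_layer_prefix(layers_a: list, layers_b: list) -> list:
    """Return the leading layers two images have in common.

    Each layer builds on the ones below it, so in this model sharing
    stops at the first layer that differs.
    """
    shared = []
    for a, b in zip(layers_a, layers_b):
        if a != b:
            break
        shared.append(a)
    return shared


# Hypothetical layer stacks for two repos that follow the same R recipe.
repo_one = ["base-os", "r-and-rstudio", "install.R-tidyverse", "copy-repo-one"]
repo_two = ["base-os", "r-and-rstudio", "install.R-tidyverse", "copy-repo-two"]

shared = shared_layer_prefix(repo_one, repo_two)
print(shared)
```

If the install.R files differed, the shared prefix in this model would end one layer earlier, which is why identical config files matter for layer reuse.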

You’re right, layer caching already exists and it definitely helps where config files are identical. I was more thinking of a future optimisation. Variable startup times with a RStudio based binder example - #3 by Mark_Andrews mentions the rocker base image for a size comparison, and I was just pointing out that size isn’t always an overriding factor if the images can be built in a smart way.

There’s probably an interesting research project in here somewhere :smiley:
