We prepared a repo that pre-builds a docker image using GitHub Actions and then pushes the image to a free DockerHub account. Now, we’re not sure how DockerHub’s pull limits apply to mybinder.org.
Docker states that for anonymous users (no account) there’s an image pull limit of 100 per 6h per IP (https://docs.docker.com/docker-hub/download-rate-limit/, https://www.docker.com/pricing).
So then my question is: how many image pulls will mybinder.org carry out on DockerHub's registry when launching repositories?
(I’m aware that there’s a limit of 100 concurrent mybinder.org repository sessions, but I think for our workshop this should be fine (https://mybinder.readthedocs.io/en/latest/user-guidelines.html#maximum-concurrent-users-for-a-repository))
(Because new users can only put 2 links in a post, I formatted additional links as code blocks)
The short answer is: it depends. There are several different clusters that serve traffic for mybinder.org and the exact setup regarding “docker registry things” depends on the cluster.
First some things which are common to how the clusters are configured:
1. Every cluster has a docker registry which is used to store built images.
2. When a person launches a binder we check if that registry contains an up-to-date image and, if yes, use it instead of building the image again.
3. Each compute node in each of the clusters has a cache of docker image layers that were recently used. This means sometimes we don't need to pull the layers from the registry.
4. Each cluster has its own public IP.
5. All layers used to build an image are pushed to the cluster's docker registry (for example, if your Dockerfile in the repository just contains FROM someorg/somereposprebuiltimage:sometag, that layer should end up in the registry of the cluster).
The thing that is configured differently on clusters is where the cluster’s docker registry is hosted. Some have a dedicated registry (for example hosted on Google Container Registry) and some use docker hub.
Pulls from this registry are, IIRC, not using credentials, at least for the docker hub case. This means that pulls in point (3) (when the node doesn't have all layers cached) fall under the rate-limited case. This potentially happens on every launch, but the most likely case is that several launches of the same repository all get scheduled onto the same node on the same cluster, in which case there would only be one pull from the registry.
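To make the scheduling argument concrete, here is a toy Python sketch. It is not how mybinder.org is implemented. It assumes a node only contacts the registry the first time it needs an image and serves later launches from its local cache, and the node names are made up for illustration.

```python
def count_registry_pulls(launch_nodes):
    """Count registry pulls for a sequence of launches of one image.

    launch_nodes lists the (hypothetical) node each launch was scheduled on.
    Assumption: a node pulls the image once and then reuses its cached copy.
    """
    nodes_with_image = set()
    pulls = 0
    for node in launch_nodes:
        if node not in nodes_with_image:
            pulls += 1                  # first launch on this node: pull needed
            nodes_with_image.add(node)  # later launches reuse the cached layers
    return pulls


# 30 workshop participants all scheduled onto the same node
same_node = ["node-a"] * 30
# the same 30 launches spread over three nodes
spread_out = ["node-a", "node-b", "node-c"] * 10

print(count_registry_pulls(same_node))
print(count_registry_pulls(spread_out))
```

Under these assumptions the number of pulls depends on how many distinct nodes see the image, not on the number of launches.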
Pulls of layers required to build an image in the first place are probably also not authenticated (but I’d have to check). This means each time we have to build a new image for a repository there is a chance that we need to pull a layer from docker hub. Why docker hub? Because most base layers are public images which are hosted there. We try hard to schedule builds of the same repository onto the same node in the same cluster to maximise the chances of being able to reuse image layers created during other/previous builds. So in a typical case N builds should not lead to N pulls from docker hub.
In summary: clusters that host their own private image registry might be affected by the rate limit at image build time if the build requires a pull of a layer hosted on docker hub. You can minimise the chance of this happening by using popular base images, which are likely already on the build nodes because everyone else uses them. FROM buildpack-deps:bionic is what repo2docker uses.
Clusters that use docker hub as their “internal registry” will use up quota for launches as well as builds. Several launches of the same version of a repo within a short period of time probably only use “one amount” of quota because they get scheduled on the same/a few nodes.
So the final answer to your question of “how many pulls per launch?” is: I have no idea, could be as few as zero but could also be “a few”, most likely somewhere in between :-/
We (people running clusters for mybinder.org) should check if we can increase the use of credentials when pulling images from docker hub.
A way to level up your forum skills (like number of links per post) is to participate more in the forum. For example, Introduce yourself! is a place where you can introduce yourself. That gets you "points" and puts a virtual face to your name.
Thanks a lot for the quick and thorough answer! I interpret this as “Worst case scenario, we could use up the rate limit. But in a more realistic scenario, we should be fine”. We will most likely prepare a backup solution anyways, in case we get really unlucky.
Sounds about right. There is a third option: help mybinder.org figure out where/if/how we can use docker hub credentials. But that is a whole new level of commitment.
Hi @betatim, afaik the rate limit applies as soon as the image is requested - so even if the layers are cached locally and you do a docker pull it counts against the limit. Authenticating only boosts the limit to 200 (from 100) per 6-hour period. Have you been hit by this already at mybinder.org?
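For anyone who wants to see how much quota an IP has left, here is a hedged Python sketch based on the approach in Docker's rate limit documentation: request an anonymous token for the ratelimitpreview/test repository, send a HEAD request for its manifest, and read the ratelimit-remaining header (values look like 76;w=21600). The function names are my own, and I am assuming the endpoints and header format still match the docs. As far as I know, HEAD requests do not count against the limit.

```python
import json
import urllib.request

# URLs as given in Docker's rate limit documentation (assumed still current)
TOKEN_URL = (
    "https://auth.docker.io/token"
    "?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull"
)
MANIFEST_URL = "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest"


def parse_ratelimit(value):
    """Parse a header value such as '100;w=21600' into (count, window_seconds)."""
    count, _, rest = value.partition(";")
    window = None
    for part in rest.split(";"):
        key, _, val = part.strip().partition("=")
        if key == "w" and val:
            window = int(val)
    return int(count), window


def fetch_remaining_pulls():
    """Return (remaining, window_seconds) for this IP, or None if no header is sent."""
    with urllib.request.urlopen(TOKEN_URL) as resp:
        token = json.load(resp)["token"]
    request = urllib.request.Request(
        MANIFEST_URL,
        method="HEAD",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as resp:
        header = resp.headers.get("ratelimit-remaining")
    return parse_ratelimit(header) if header is not None else None


if __name__ == "__main__":
    print(fetch_remaining_pulls())
```

Running this from a cluster node would show the quota for that cluster's public IP, since anonymous limits are counted per IP.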