How to reduce mybinder.org repository startup time

choldgraf · June 24, 2020, 6:17pm

People often ask the question: how can I make my repository launch more quickly on mybinder.org?

This is a short and informal post to share some insights and a few suggestions.

What affects launch time?

The challenge of running mybinder.org, compared with a different cloud service such as Colab, is that Binder is meant to run arbitrary environments that you define in a GitHub repository. While most online notebook platforms run a “kitchen sink” environment that has a ton of pre-installed stuff, Binder’s approach is to give users control over the environment for their sessions to encourage more reproducible and well-contained code / analyses / communications / etc. This flexible environment generation adds some time to launches.

Most of the time, when a repository is (very) slow to launch (more than 30s), it is because the environment for that session must be built and initialized. This mostly happens to people “developing” on a repository (constantly changing things and launching right away).

For most users of a Binder link, the environment is already built, because someone else has previously launched the same version. This can still be slow, but not very slow (more than 30s).

mybinder.org runs on Kubernetes, which runs a cluster that grows and shrinks as necessary to take on new users. Each time a user clicks a Binder link, these things happen:

1. A slot (called a “pod”) is reserved on one of the cloud machines. This takes 1-2 seconds.
2. Binder looks to see if a Docker image exists for that repository.
   - If it doesn’t, Binder must first build the image for that repo using repo2docker (this takes time).
3. Binder looks for a built image on the machine the user will use.
   - If it isn’t on the machine, Binder must first pull the image onto that machine (this takes time).
4. Binder launches the user’s session. This includes:
   - a small amount of time to start the “init pods to limit network access”,
   - a few seconds for the Jupyter process to start,
   - a few seconds for BinderHub to notice,
   - and finally, your browser needs to follow the redirect.

Together, these steps determine how long it takes for a new session to start. How much each step contributes to the total launch time depends on the repository.

For example:

- If your repository results in a 30GB Docker image, then it will almost certainly take a long time on steps 2 and 3.
- If your repository is rarely launched, then when somebody launches it there is a good chance the Docker image won’t be on the machine. This means step 3 will take time instead of being instant.

Generally speaking, steps 2 and 3 contribute the most to a Binder launch. If the Docker image is both already built and already on the machine where a new user is starting their session, then the session should launch in a matter of seconds (our statistics say you should be waiting about 20s or so).
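
As a rough illustration of the steps above, here is a minimal Python sketch of which steps a given launch has to pay for. It is not how BinderHub is implemented; the names `LaunchState` and `launch_steps` are made up for this example, and it assumes that a freshly built image is never already on the user's machine.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class LaunchState:
    """Hypothetical description of what already exists when a link is clicked."""
    image_built: bool    # has this version of the repository been built before?
    image_on_node: bool  # is the built image already on the machine the user gets?


def launch_steps(state: LaunchState) -> list[str]:
    """Return the steps a launch goes through, in order."""
    steps = ["reserve pod"]  # always happens (about 1-2 seconds per the post)
    if not state.image_built:
        steps.append("build image with repo2docker")  # slow
    # Assumption: an image that was just built still has to be pulled.
    if not (state.image_built and state.image_on_node):
        steps.append("pull image onto machine")  # slow for large images
    steps.append("start session")  # init pods, Jupyter start, redirect
    return steps


if __name__ == "__main__":
    # A popular repository: nothing slow is left to do.
    print(launch_steps(LaunchState(image_built=True, image_on_node=True)))
    # A repository you just pushed to: both slow steps apply.
    print(launch_steps(LaunchState(image_built=False, image_on_node=False)))
```

The fast case is the one where both flags are true, which is why frequently launched repositories tend to start quickly.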

How can I reduce my launch time?

With that being said, in order to reduce the amount of time it takes your repository to launch, try these steps:

Make your repository environment more light-weight - A repository with fewer dependencies and a smaller size will be faster to both build and download into the Binder session.
Ensure your repository gets a lot of clicks - The more often a repository is launched, the more likely it will already be built and downloaded to a machine when a user starts a new session. As a result, the more popular a repository is, the faster its launches will tend to be.
Use two repositories: one for the environment, one for your content - Many people change their content much more often than they change the environment needed for it. However, Binder will re-build the environment for any change to a repository. A hack to get around this is to define an “environment repository” that Binder builds, and use a hook to pull in new content at launch from a “content repository”. This means that your “environment repository” changes less often, which should result in fewer new builds and reduced launch times. See the instructions in this post to get started.
Use the nbgitpuller.link page to automate separate content/environment repos. The above step can be (mostly) automated by using nbgitpuller.link. This is a little web form that generates JupyterHub links for you. To quickly create a link for content/environment repositories, go here:
```
nbgitpuller.link?tab=binder
```
and fill out the form.

You can also pre-populate the form with some fields. For example:
```
nbgitpuller.link/?tab=binder&repo=https://github.com/binder-examples/requirements
```
will use the binder-examples repository as the “environment” repo.
Contribute to the Binder project - mybinder.org is a volunteer-run service that uses cloud credits and donated infrastructure to operate. There are likely ways that we can improve the performance of launches, but this requires resources. Donating your time, or money, or cloud infrastructure to the Binder project can help us improve Binder for everybody. See this contributing page for inspiration.
Join the mybinder.org federation - mybinder.org is not a single BinderHub deployment, but is in fact a collection of BinderHub deployments run by various teams. If you’d like to run such a deployment, or help maintain and support one of the pre-existing deployments, this could result in more cloud resources available to mybinder.org, which may result in reduced launch times. See the mybinder.org federation page for more information.
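
For the pre-populated nbgitpuller.link form mentioned above, a small Python sketch like the following could build the form URL for any environment repository. The function name `nbgitpuller_form_url` is hypothetical, and the `https://` scheme is an assumption; only the `tab` and `repo` query parameters come from the example in this post.

```python
from urllib.parse import urlencode


def nbgitpuller_form_url(env_repo_url: str,
                         base: str = "https://nbgitpuller.link/") -> str:
    """Build a link to the nbgitpuller.link form with the Binder tab selected.

    `base` is an assumption (the post shows the host without a scheme).
    """
    # Keep ':' and '/' unescaped so the result matches the example in the post.
    query = urlencode({"tab": "binder", "repo": env_repo_url}, safe=":/")
    return f"{base}?{query}"


if __name__ == "__main__":
    print(nbgitpuller_form_url("https://github.com/binder-examples/requirements"))
```

Opening the printed link should show the form with the environment repository already filled in, leaving only the content repository to enter.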

Those are a few tips that come to mind, and I hope that they give some inspiration for what you can do to speed things up! If others have suggestions of their own, I’m marking this top post as a wiki, meaning that anybody can edit it.

hamel · June 24, 2020, 6:37pm

You can also speed up your launch time by pre-building your Docker containers with GitHub Actions, so that Binder only pulls the container and is not forced to do a full build.

You can use the repo2docker action to accomplish this. In particular, you want to set the option MYBINDERORG_CACHE: true

An example would be to drop the following file .github/workflows/binder.yaml into your repo:

```
name: Build Notebook Container
on: [push] # You may want to trigger this Action on other things than a push.
jobs:
  build:
    runs-on: ubuntu-latest
    steps:

    - name: checkout files in repo
      uses: actions/checkout@master

    - name: update jupyter dependencies with repo2docker
      uses: machine-learning-apps/repo2docker-action@master
      with:
        DOCKER_USERNAME: ${{ secrets.DOCKER_USERNAME }}
        DOCKER_PASSWORD: ${{ secrets.DOCKER_PASSWORD }}
        MYBINDERORG_CACHE: true
```

This will offload the building of Docker images to GitHub entirely, which should speed up your Binder launches considerably.

How does this work? See the README.

Please let me know if there are any questions.

choldgraf · June 24, 2020, 6:39pm

amazing! Will a new image be created every time the repository’s content changes?

hamel · June 24, 2020, 6:44pm

Yes, a new image will be created each time!

hamel · June 24, 2020, 6:46pm

Images are automatically tagged with the relevant GitHub SHA, so even if someone wants to reference a different branch or a previous commit in Binder they can!

The relevant tag is then added to .binder/Dockerfile, so if you switch branches or commits you will pick up the right image.
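
To make the tagging idea concrete, here is a minimal Python sketch, not the action's actual code. It assumes the tag is the first 12 characters of the commit SHA (inferred from the log output later in this thread, where commit 4d518923cafd2ebc... is pushed as tag 4d518923cafd) and that the generated .binder/Dockerfile only needs a single FROM line. Both function names are hypothetical.

```python
def image_reference(image_name: str, commit_sha: str, tag_length: int = 12) -> str:
    """Return `name:tag`, where the tag is a shortened commit SHA.

    The 12-character length is an assumption inferred from logs in this thread.
    """
    return f"{image_name}:{commit_sha[:tag_length]}"


def binder_dockerfile(image_ref: str) -> str:
    """Assumed minimal Dockerfile content that points Binder at a pre-built image."""
    return f"FROM {image_ref}\n"


if __name__ == "__main__":
    ref = image_reference(
        "story645/repo2docker-test",
        "4d518923cafd2ebcc0a0063942372bcef7825e1b",
    )
    print(binder_dockerfile(ref), end="")
```

Because every commit gets its own tag, a Binder link for an older commit would resolve to the Dockerfile, and therefore the image, that was pushed for that commit.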

betatim · June 24, 2020, 7:21pm

A search query that brings up (many) other threads where the idea of a “default env” has been discussed and cool ideas pop up: https://discourse.jupyter.org/search?q=binder%20boxes

betatim · June 24, 2020, 7:34pm

Second part of https://github.com/jupyter/repo2docker/issues/917#issuecomment-649016533 is an idea/hunch that what people want is a more compact notation for “i don’t care about env, just want content and will fix env up later”.

Another idea: https://twitter.com/MeganRisdal/status/1262952384487206914 provide some UI to do this.

Another another idea (which I can’t find the link to on twitter any more) is working on letting people edit their notebooks before the kernel has launched. Kaggle recently launched this as a feature and it is what I think basically all variations on BinderHub do. Let people edit/read the notebook while the kernel is still starting. Because people are busy reading they don’t notice that it takes just as long to start the kernel.

matthewfeickert · June 24, 2020, 8:45pm

@hamel This is super great to see! pyhf is a heavy user of Binder, and we would be really interested in using this action (I have a branch already playing with it). In addition to the README in repo2docker-action's GitHub, if I want to learn more can you point me to any public projects using repo2docker-action in the wild as well? This may be a non-issue, but I’m curious given that pyhf uses both the postBuild and apt.txt Binder config files in our binder dir if this then requires us to move those instead into RUN commands in the Dockerfile (I will also play more with this to learn).

choldgraf · June 24, 2020, 9:48pm

I think that the github actions approach is really interesting - though I want to note that I’m a little bit worried about encouraging people to re-work their repositories just to work with GitHub Actions. The goal of Binder is to encourage standardized, reproducible repositories that work across a variety of infrastructure and providers. If you start to add a lot of configuration that is GHA-specific (particularly with Dockerfiles, which Binder generally discourages), then your repository will be less reproducible.

In general, statements like:

I’m curious given that pyhf uses both the postBuild and apt.txt Binder config files in our binder dir if this then requires us to move those instead into RUN commands in the Dockerfile

worry me

matthewfeickert · June 24, 2020, 10:11pm

Agreed. In the ideal case we would be able to use repo2docker-action to prebuild our image for Binder (and the pyhf one is huge given how many dependencies we have if we include all of our computational backends for tutorial reasons) to remove that burden, and then have some way to tell any of the Binder federation servers about the image so that all they have to do is pull it from our Docker Hub registry, but without us having to edit our Binder config file structures.

hamel · June 24, 2020, 11:27pm

@matthewfeickert & @choldgraf I hear you and 100% understand.

The GitHub Action, as it works today, only expects that you do NOT have a .binder/ or binder/ directory at the root of the repo. Other than that, everything (should) just work.

Assuming most people do not have these directories, the only change required to your repo is to drop that yaml file in the .github/workflows directory, and you also must register your DockerHub (or other registry) secrets with GitHub so you can push to the docker registry.

@matthewfeickert let me know what part you need help with (maybe it’s just getting started with Actions?). Or an example? Let me know how I can help.

matthewfeickert · June 25, 2020, 5:51am

@hamel Thanks for the follow up. I’ll open up an Issue on repo2docker-action tomorrow to ask more detailed questions so that I don’t derail the conversation here with pyhf’s specifics.

The GitHub Action as works today only expects that you do NOT have a .binder/ or binder/ directory at the root of the repo

Ah okay good to know. pyhf’s Binder setup does use a binder/ directory at the top level where we then take advantage of apt.txt and postBuild to install a particular version of our project.

Or an example? Let me know how I can help.

pyhf is a huge fan of GitHub Actions, and you can say that we maybe even overuse/abuse them. If you have an example project of yours that is public that would be a help.

betatim · June 25, 2020, 6:29am

Maybe we can even further “loosen” the requirement to “must not have one of these directories”. How? There is an order of preference between .binder/ and binder/ (I’d have to check which one wins). This means the GH action could use the highest priority directory for its “magic” leaving the “no binder dir” and “lower priority binder sub-dir” option to the repository owners.

hamel · June 25, 2020, 3:22pm

See this list of Actions: Machine Learning Ops | Learn how to use GitHub for automation, collaboration and reproducibility in your machine learning workflows.
Here are some unique projects built with Actions: Machine Learning Ops | Learn how to use GitHub for automation, collaboration and reproducibility in your machine learning workflows.

I can add more to this if you are interested in specific examples. For repo2docker examples, you can see these workflows.

matthewfeickert · June 25, 2020, 3:38pm

Ah, no sorry. I meant if you have examples of public projects that are actively using the repo2docker-action. Though I guess the test.yaml workflow in the repo2docker-action covers this.

choldgraf · June 29, 2020, 12:19am

@GeorgianaElena was fast! There is now a mybinder.org section of the nbgitpuller form! (https://github.com/jupyterhub/nbgitpuller/pull/129)

To try it out

http://nbgitpuller.link/
click on mybinder.org
enter in the environment / content repositories you want (make sure the environment repository has nbgitpuller installed)
copy the link and paste it in your browser

that’s it!
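
For anyone who wants to generate such links in a script instead of through the form, here is a hedged Python sketch. The link layout (a mybinder.org `v2/gh` URL whose `urlpath` parameter carries an encoded `git-pull?repo=...&branch=...` path) reflects my understanding of nbgitpuller's conventions and should be treated as an assumption; compare the output with what the form generates before relying on it. The function name and the `example-user/example-content` repository are hypothetical.

```python
from urllib.parse import quote, urlencode


def binder_gitpull_url(env_repo: str, env_ref: str,
                       content_repo_url: str, content_branch: str) -> str:
    """Build a mybinder.org link that launches `env_repo` and pulls content.

    env_repo is "owner/name" on GitHub; the environment must have nbgitpuller
    installed. The URL layout is an assumption, not taken from the post.
    """
    # Path handled by nbgitpuller inside the running session.
    inner = "git-pull?" + urlencode(
        {"repo": content_repo_url, "branch": content_branch}
    )
    # The whole inner path is escaped again so it survives as one parameter.
    return (
        f"https://mybinder.org/v2/gh/{env_repo}/{env_ref}"
        f"?urlpath={quote(inner, safe='')}"
    )


if __name__ == "__main__":
    print(binder_gitpull_url(
        "binder-examples/requirements", "master",
        "https://github.com/example-user/example-content", "main",
    ))
```

Since only the content repository changes between launches, the environment image built for `env_repo` keeps being reused.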

story645 · July 1, 2020, 9:01am

@hamel I love the idea of your action but have also been banging my head against it for a while. I’m landing on this error that I’m not even sure where it is (inside docker, out of it?)

```
Removing login credentials for https://index.docker.io/v1/
Verified that ***/repo2docker-test:4d518923cafd is publicly visible.
Successfully pushed ***/repo2docker-test:4d518923cafd
/create_docker_image.sh: line 89: python: command not found
```

Is there any chance anyone knows what I’m missing? https://github.com/story645/EAS213/blob/4d518923cafd2ebcc0a0063942372bcef7825e1b/.github/workflows/binder.yaml

```
name: Build Notebook Container
on: [push] # You may want to trigger this Action on other things than a push.
jobs:
  build:
    runs-on: ubuntu-latest
    steps:

    - name: checkout code
      uses: actions/checkout@v2
      with:
        ref: ${{ github.event.pull_request.head.sha }}

    - name: update jupyter dependencies with repo2docker
      uses: machine-learning-apps/repo2docker-action@0.2
      with:
        IMAGE_NAME: "story645/repo2docker-test"
        DOCKER_USERNAME: ${{ secrets.DOCKER_USERNAME }}
        DOCKER_PASSWORD: ${{ secrets.DOCKER_PASSWORD }}
        BINDER_CACHE: true
        PUBLIC_REGISTRY_CHECK: true
```

thanks!

hamel · July 1, 2020, 6:41pm

story645:

    - name: update jupyter dependencies with repo2docker
      uses: machine-learning-apps/repo2docker-action@0.2
      with:
        IMAGE_NAME: "story645/repo2docker-test"
        DOCKER_USERNAME: ${{ secrets.DOCKER_USERNAME }}
        DOCKER_PASSWORD: ${{ secrets.DOCKER_PASSWORD }}
        BINDER_CACHE: true
        PUBLIC_REGISTRY_CHECK: true

@story645 Thanks for letting me know! There was indeed a bug, which I have now fixed (and I added a check in CI so hopefully I can catch this next time).

Please try this again, thank you

MichalChromcak · July 3, 2020, 11:26am

Hey @hamel, thanks a lot for putting the GHA solution together! I jumped on it directly, and the waiting time went from roughly 10 minutes to about 30s when accessing Binder once the image is available.

Still, do you have any idea why this flow works even though the status shows a red cross? In the log I can see “Successfully pushed ***/hcrystalball:ff5474687443” followed by “/create_docker_image.sh: line 89: python: command not found”. When I try it, it works, but for my colleagues it does not work on the first try. The Docker image with the latest SHA from master is available on Docker Hub and is public: https://hub.docker.com/r/heidelbergcementds/hcrystalball

hamel · July 3, 2020, 3:07pm

@MichalChromcak @story645 Embarrassingly, I made the bug fixes from earlier but forgot to update the release properly.

I have updated release 0.2 and tested that it works (and have also added testing of the release itself to CI).

Sorry about this, if you try things again it should work now.
