Tip: speed up Binder launches by pulling GitHub content in a Binder link with nbgitpuller

choldgraf · April 30, 2019, 12:44am

Something people often want is to de-couple the content of a repository from the environment that is needed to run it. This would allow you to update the content of a repo without needing to re-build the Binder needed for it.

One option is to use a tool called nbgitpuller. This is a tool for quickly pulling GitHub content into a JupyterHub. You can create links that, when clicked, will automatically pull content into a user’s workspace.

nbgitpuller lets you share a link with this structure:

<your-jupyterhub-url>/<user-server>/git-pull?repo=https://github.com/data-8/materials-fa17
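As a rough sketch, a link with this structure could be assembled in Python like so. The hub URL and the user-server path below are placeholders I made up for illustration, and the function name is hypothetical; the only thing taken from the structure above is the git-pull?repo=... part. The repo URL is percent-encoded so it survives as a single query value.

```python
from urllib.parse import urlencode


def hub_pull_link(hub_url: str, user_server: str, repo_url: str) -> str:
    """Build <hub-url>/<user-server>/git-pull?repo=<repo-url> (hypothetical helper)."""
    base = hub_url.rstrip("/") + "/" + user_server.strip("/")
    # urlencode percent-encodes the repo URL so it is passed as one query value.
    return f"{base}/git-pull?{urlencode({'repo': repo_url})}"


# "hub.example.org" and "user/alice" are placeholders, not a real deployment.
link = hub_pull_link(
    "https://hub.example.org",
    "user/alice",
    "https://github.com/data-8/materials-fa17",
)
print(link)
```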

We can take advantage of this in Binder to share a similar link with a BinderHub. Here’s how:

Step 0: Your repository structure

In this example, we’ll have two repositories. The environment repository will have all the Binder configuration files to define the environment you’d like. It’s what Binder will “build”. The content repository will only have the content you want to share, not the environment files.

Step 1: Prep your environment repository

First set up the repository with whatever environment you wish. Then, make
sure that it works with nbgitpuller by following these steps:

In a requirements.txt file, make sure this line is there:

...
nbgitpuller
...

In a postBuild file, make sure this line is there to activate nbgitpuller:

...
jupyter serverextension enable --py nbgitpuller --sys-prefix
...
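If you prefer to script this step, here is a minimal sketch that adds the two lines above only when they are missing. The helper names are hypothetical, and the demo runs against a temporary directory standing in for your environment repository; it does not touch file permissions or anything else in the repo.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> None:
    """Append `line` to the file at `path` unless an identical line already exists."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line not in existing:
        existing.append(line)
    path.write_text("\n".join(existing) + "\n")


def prep_environment_repo(repo_dir: str) -> None:
    """Make sure requirements.txt and postBuild mention nbgitpuller."""
    root = Path(repo_dir)
    ensure_line(root / "requirements.txt", "nbgitpuller")
    ensure_line(
        root / "postBuild",
        "jupyter serverextension enable --py nbgitpuller --sys-prefix",
    )


# Demo in a throwaway directory; calling it twice does not duplicate lines.
with tempfile.TemporaryDirectory() as demo_dir:
    prep_environment_repo(demo_dir)
    prep_environment_repo(demo_dir)
    requirements = (Path(demo_dir) / "requirements.txt").read_text()
    post_build = (Path(demo_dir) / "postBuild").read_text()

print(requirements, post_build, sep="")
```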

Step 2: Create an nbgitpuller Binder link

Next, create a custom Binder link that points to the content you want users to see when they click the link. To do so, use the same nbgitpuller syntax described above along with the BinderHub urlpath parameter.

https://mybinder.org/v2/gh/<your-username>/<your-environment-repo>/master?urlpath=git-pull?repo=<url-of-your-content-repo>
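Because this link nests one query string inside another, it is safer to percent-encode the inner git-pull?repo=... part before passing it as urlpath. Here is a small Python sketch of that, under the assumption that you are linking to mybinder.org and a master branch as in the template above; the function name and the username/repository values are placeholders.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def binder_pull_link(env_user, env_repo, content_repo_url,
                     ref="master", binder_url="https://mybinder.org"):
    """Build a Binder link whose urlpath is an nbgitpuller git-pull link."""
    # Inner link: what nbgitpuller sees once the server has started.
    inner = "git-pull?" + urlencode({"repo": content_repo_url})
    # Outer link: the whole inner link becomes one encoded urlpath value.
    outer = urlencode({"urlpath": inner})
    return f"{binder_url}/v2/gh/{env_user}/{env_repo}/{ref}?{outer}"


# Placeholder environment repo; the content repo is the data-8 example.
link = binder_pull_link(
    "your-username",
    "your-environment-repo",
    "https://github.com/data-8/materials-fa17",
)

# Round-trip check: decoding the outer query gives back the inner link.
decoded = parse_qs(urlsplit(link).query)["urlpath"][0]
print(link)
print(decoded)
```

The double encoding looks ugly, but it keeps the content repository URL from being read as part of the outer Binder query.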

For example, here’s a Binder link that uses my binder-sandbox repository to define the environment, and that pulls in the content from the data-8 Fall 2017 course:

You can use any URL you’d like for the “content repository”!

That’s it!

This should let you share Binder links that all have the same environment, but that serve users arbitrary content!

psychemedia · May 1, 2019, 6:27pm

This is really neat… but a little arcane, perhaps, and yet another thing to remember™?

I could imagine a separate UI for this - “Binder environments”, maybe? — where you have a form that lets you specify:

a build repository
a content repository.

The build repository is the one that gets run through repo2docker; the content repository is the one that gets gitpulled.

Adding more form elements and switches to the current page would just overcomplicate the logic of the page and its appearance to users?

A “Binder environments” page might also specify various default build repos (cf the Jupyter stacks, but you could maybe have base containers for different subject areas: stats, chemistry, astronomy, cartography etc)?

choldgraf · May 2, 2019, 2:18am

Totally - this post was totally a “Chris just realized this was possible and wanted to write down how to do it because otherwise he’d forget” post

It’d be cool to prototype UI for this kinda thing in an “unofficial” sense. E.g., you could whip together whatever HTML+JS you want that’d build the proper Binder URL to launch links like this (though you’d need to ensure that the build images had nbgitpuller)

also re: base images and stacks, you may also be interested in this issue:

would love your thoughts!

psychemedia · May 2, 2019, 11:38am

Could / should repo2docker inject nbgitpuller to the build ? MyBinder could detect it in the calling URL, but how would that propagate? Can repo2docker force additional requirements into a build eg via a command line argument?

The user defined base image for repo2docker is interesting, but what then happens to all the things that repo2docker might normally add in to the container on top of the base container. Would it continue to do that? Should it?

How would changing the workflow to use a different base container compare with using a custom Dockerfile in binder/?

choldgraf · May 2, 2019, 2:43pm

A few thoughts there:

Could / should repo2docker inject nbgitpuller to the build ? MyBinder could detect it in the calling URL, but how would that propagate? Can repo2docker force additional requirements into a build eg via a command line argument?

I think if it’s both lightweight and a common-enough use-case, we could make a case for it. For now I think it’s too uncommon to justify adding it in there by default, just my 2 cents tho.

The user defined base image for repo2docker is interesting, but what then happens to all the things that repo2docker might normally add in to the container on top of the base container. Would it continue to do that? Should it?

It would - the idea is just that you can configure what the starting point is, but the r2d machinery is the same on top of it.

How would changing the workflow to use a different base container compare with using a custom Dockerfile in binder/ ?

The goal of “start with a custom Docker image” is more for administrators to choose this rather than users. E.g. if I set up a BinderHub aimed at the GeoSciences, I can choose a base image that has a few very common packages installed, then either a lot of users don’t need to install them, or it’s a much shorter process.

psychemedia · May 2, 2019, 3:12pm

Right… the repos I branched here are variously flavoured in the builds for demo packages in different subject areas. Putting all the dependencies from the different subject areas into a single Binder image would have made for a very large image and dependency management hell…

choldgraf · May 2, 2019, 3:27pm

for sure - this I don’t think would ever be deployed for mybinder.org, it’s more for people deploying group-specific BinderHubs where a lot more assumptions can be made about the space of possible packages needed

yuvipanda · May 2, 2019, 5:09pm

This is great! Thanks for digging this up, Chris.

There’s an nbgitpuller link generator at https://jupyterhub.github.io/nbgitpuller/link, and it just got a canvas option. We could add a mybinder.org option?

psychemedia · May 2, 2019, 5:12pm

Ooh… that looks handy…

choldgraf · May 2, 2019, 5:30pm

I was thinking the same thing actually!

betatim · May 2, 2019, 8:28pm

I think we have consensus on what “custom base images in repo2docker” should look like: Make it possible to configure the base image · Issue #487 · jupyterhub/repo2docker · GitHub and the next few comments after the linked one (plus really the whole thread for several options that were considered, trade-offs, prototypes and alternatives). I like the converged-upon idea and would recommend we try to implement it before re-opening the discussion. Otherwise we spend forever talking and not so much doing. We aren’t yet a committee.

I’d be -1 on adding nbgitpuller to repo2docker because it is niche. A constant battle is the perception that repo2docker-created images contain “bloat” or that “for real use cases one needs a custom Dockerfile to remove bloat” etc etc. So we should make an effort to keep things slim (because they are!), which means only adding things to core repo2docker that are used very widely, even if (like nbgitpuller) they don’t actually increase the image size all that much. Instead, let’s have more documentation and “cookie cutter” repos specialised to these use cases.

Those are my thoughts on how to address this.

On a more positive note: having new, other, more user interfaces and user experiences for how to create your “binder link” is very cool. They can and should be hosted/built separately from BinderHub. The fact that it doesn’t need the central oversight committee to agree to any of it is a feature.

psychemedia · May 3, 2019, 9:05am

Just caught up with that thread: extensions / plugins for repo2docker, brilliant. Makes for easier community contributions. Here’s a related line of thinking from Simon Willison on datasette plugins.

betatim · May 3, 2019, 12:02pm

Thanks for the link to datasette. Simon seems to have settled on pluggy and likes it, so I am adding that to my list of things to check out. I really like the idea of (one day) having something like plugincompat.herokuapp.com for repo2docker.

psychemedia · May 3, 2019, 12:13pm

datasette also has a range of tools for packaging and deploying datasettes: datasette publish. There are various issues in the repo where Simon ruminated on various aspects of this, I think.

The use case is much more limited / constrained than repo2docker, but again, these are ways of doing things, and if different ecosystems converge on good practice or common utils, that’s handy for future projects (e.g. datasette uses click for its command line interface, IIRC).

Apols if this is a distraction; I try to make sense of Jupyter / where it might be developing, by trying to make sense of it in context of other things I don’t really understand either!

psychemedia · May 7, 2019, 7:02pm

@choldgraf Just been playing with this quickly and it’s absolutely bonkers in a brilliant way:-)

eg https://mybinder.org/v2/gh/ouseful-testing/binder-graphviz/master/?urlpath=git-pull?repo=https://github.com/hchasestevens/show_ast

E.g. it provides a mechanism for showing repo maintainers how their repo looks in mybinder, and it can also be used to demo the requirements for making it runnable outside of their repo and without requiring a PR on it.

Reminds me of URL hacking to chain different things across different APIs. Can it also accept a redirect to open into a specified notebook?

@yuvipanda Adding a tab for Binderhub to the nbgitpuller link generator would be really useful I think… especially if folk found out about it…

betatim · May 8, 2019, 8:20am

This is pretty wild. So wild we should publicise it more!

psychemedia · May 8, 2019, 11:08am

I’m wondering whether, as well as nbgitpuller, there could be a generic (curl, wget etc) pull that could pull e.g. data files from a URL?

betatim · May 8, 2019, 12:21pm

Not sure what I think about making it possible to fetch arbitrary URLs. For getting data, my personal view is that one should follow binder-examples/getting-data on GitHub (“How to get data into your Binder”).

One thing that feels uncomfortable to me with my mybinder.org-operator hat on is that if we let people construct URLs that make mybinder.org take action via its high bandwidth connection, we become a more attractive target for being the source of a DoS attack. You can trigger a mybinder.org launch from a very slow dial-up connection, and if that launch then starts a (very large) download you now have a way to amplify your impact. (This is probably also true for letting people pull from git repos and generally true if we let people perform outgoing network connections from within mybinder.org but it still feels like making it “too easy” to do :-/)

psychemedia · May 8, 2019, 12:28pm

Ah, right… yes… understood. Other factors…

psychemedia · June 10, 2019, 11:06am

Pondering this a bit more eg in context of this repo which is gitpulled into this Binder build (discussion), I wonder…

GitHub conventionally uses the gh-pages branch as a “reserved” branch for constructing GitHub Pages docs related to a particular repo.

The binder/ directory in a repo can be used to partition Binder build requirements in a repo, but there are a couple of problems associated with this:

a maintainer may not want to have the binder/ directory cluttering their package repo;
any updates to the repo will force a rebuild of the Binder image next time the repo is run on a particular Binder node. (With Binder federation, if there are N hosts in the federation, after updating a repo, is it possible that my next N attempts to run the repo on MyBinder may require a rebuild if I am directed to a different host each time?)

If by convention something like a binder-build branch was used to contain the build requirements for a repo, then the process for calling a build (by default) could be simplified.

Eg rather than having something like:

https://mybinder.org/v2/gh/colinleach/binder-box/master/?urlpath=git-pull?repo=https://github.com/colinleach/astro-Jupyter

we would have something like:

https://mybinder.org/v2/gh/colinleach/astro-Jupyter/binder-build/?urlpath=git-pull?repo=https://github.com/colinleach/astro-Jupyter

which could simplify to something that defaults to a build from binder-build branch (the “build” branch) and nbgitpull from master (the “content” branch):

https://mybinder.org/v2/gh/colinleach/astro-Jupyter?binder-build=True

Complications could be added to support changing the build branch, the nbgitpull branch, the commit/ID of a particular build, etc?
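To make the idea concrete, here is a sketch of how such a shorthand might expand into the full link. None of this exists in BinderHub: binder-build=True, the binder-build branch convention and the function below are all hypothetical. The sketch also assumes nbgitpuller's branch query parameter is used to pick the content branch.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def expand_binder_build(owner, repo,
                        build_branch="binder-build", content_branch="master"):
    """Expand the hypothetical shorthand into a build-branch + git-pull link."""
    content_repo = f"https://github.com/{owner}/{repo}"
    # Content comes from the same repo, but from the "content" branch.
    pull = "git-pull?" + urlencode({"repo": content_repo, "branch": content_branch})
    # The environment is built from the "build" branch of that repo.
    return (f"https://mybinder.org/v2/gh/{owner}/{repo}/{build_branch}?"
            + urlencode({"urlpath": pull}))


expanded = expand_binder_build("colinleach", "astro-Jupyter")

# Decode the nested query again to see what nbgitpuller would receive.
inner = parse_qs(urlsplit(expanded).query)["urlpath"][0]
params = parse_qs(urlsplit(inner).query)
print(expanded)
print(params)
```

Changing the build branch or the content branch then becomes a matter of overriding the two keyword arguments.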

It might overly complicate things further, but I could also imagine:

automatically injecting nbgitpuller into the Binder image and enabling it;
providing some sort of directive support so that if the content directory has a setup.py file the package from that content directory is installed.
