Tip: speed up Binder launches by pulling github content in a Binder link with nbgitpuller

Riffing on this thread alongside Binderhub button - 'pull from referrer' - #9 by manics (and maybe Binder template repositories) I wonder (feature creep ;-)…

Docs around @manics recent Launch from HTTP referrer, autodetect launch URL by manics · Pull Request #891 · jupyterhub/binderhub · GitHub PR suggests:

In a GitHub repo create a readme with a link to https://binder.example.org/autodetect, if it works the referrer will be parsed and converted into a link to launch the repo you came from.

So what if there was also a redirect saying: “(and) by default/convention look for a “binder-base” branch in the same repository; if it exists, build / pull that, and then top up with content from an nbgitpulled content repo”.

For example, running:

https://binder.example.org/autodetectwithbase

from https://github.com/user/example would:

  • autodetect https://github.com/user/example as referrer;
  • build/pull https://github.com/user/example/tree/binder-base
  • nbgitpull https://github.com/user/example into the binder image
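
The three steps above could be sketched roughly as follows. This is purely illustrative: `plan_launch`, the returned dictionary and the `binder-base` default are hypothetical names for discussion, and nothing like an `autodetectwithbase` endpoint exists in BinderHub.

```python
from urllib.parse import urlparse


def plan_launch(referrer, base_branch="binder-base"):
    """Split a GitHub referrer URL into an environment ref and a content repo.

    Hypothetical helper for discussion only; BinderHub does not provide this.
    """
    parts = urlparse(referrer)
    if parts.netloc != "github.com":
        raise ValueError("only github.com referrers are handled in this sketch")
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        raise ValueError("referrer does not point at a repository")
    user, repo = segments[0], segments[1]
    repo_url = f"https://github.com/{user}/{repo}"
    return {
        # Branch the Binder image would be built (or pulled) from.
        "environment": f"{repo_url}/tree/{base_branch}",
        # Repo whose content would be nbgitpulled into the running image.
        "content": repo_url,
    }


plan = plan_launch("https://github.com/user/example")
print(plan["environment"])
print(plan["content"])
```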

Complicating things further, there may also be a need to let users override the gitpulled branch name with an arbitrary one, and to let Binder autodetect a referral from a branch so that the content from that branch is what gets pulled in?


I’m worried there’s a bit too much magic here, at least to begin with :slight_smile:

As a compromise, perhaps there could be a new repo2docker buildpack, e.g. environmentrepo.giturl, that contains a link to the environment repository/branch? Or there could be a convention for specifying it in the README. Conceptually it’s a bit like a symlink to the files in the other environment repo; internally, of course, the repos would be handled separately. This requires the notebook repo owner to add this file, but I think it implements everything you’ve suggested without any complex logic.

@manics That “symlink” approach would work for me :slight_smile:

Just trying to think of a way where a user can easily say:

  • use this branch for the Binder image;
  • use this branch for the nbgitpull;
  • use the referrer as the location of the repo.

I guess things start to get even more complicated if you have the image builder branch and the content branch in different repos…!

However, if you enforce a convention, things get easier; eg easiest for magic to work might be:

  • content repo must be in master;
  • Binder build repo must be in binder-build and must include nbgitpuller;
  • both branches need to be branches of the same repo.

But being able to pop a link into a simple environment file to specify eg the Binder image branch would make sense (you wouldn’t want the link in the Binder build branch because that’s the one we’re trying not to change at all).

Thinking back on the template repo idea, if a template repo:

  • has a Binder image build branch;
  • has an autodetect path in the README binder button link;
  • has a reference that points back to the build branch using an absolute URL

a user could derive a repo from the template, update the content file, click the button, launch against the Binder image specified originally in the template repo.

Alternatively, you could clone a repo containing a content (master) branch and a build branch and specify the build branch relatively from within the content repo.

If all the metadata is encoded in the readme instead of (or as well as) an environmentrepo.giturl file, the readme for the notebook repo could be something like this:

# Notebook Repo

This repo contains just notebooks

Clicking this link will open this repository in binder by detecting the
referrer:

[open with mybinder](https://binder.example.org/autodetect)

This is a special metadata tag that tells binder to build the
environment for this notebook from a GitHub branch called
`branch` in `[example-user]/repository`:

environment-repo: https://github.com/[example-user]/repository/tree/branch

That would be interesting… even easier and it makes the convention more transparent?

Would it also make sense to allow a relative branch in eg https://github.com/[example-user]/repository/README.md or https://github.com/[example-user]/repository/tree/branch such as:

environment-repo: binder-base

that would point to the auto-detected REFERRER/tree/binder-base, for a similar reason to supporting referrer buttons: if you work from a cloned repo, you don’t need to change any absolute URLs that point to the original repo?
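
A minimal sketch of how such a value might be resolved, assuming the README carries a single `environment-repo:` line. The function name, the regular expression and the rule that anything not starting with `http` is a branch name are all assumptions made for illustration, not an existing Binder feature.

```python
import re

# Hypothetical metadata line, e.g. "environment-repo: binder-base".
ENV_LINE = re.compile(r"^environment-repo:\s*(\S+)\s*$", re.MULTILINE)


def resolve_environment(readme_text, referrer_repo):
    """Return the environment URL named in a README, or None if absent.

    Absolute URLs are returned unchanged; anything else is treated as a
    branch of the referring repository.
    """
    match = ENV_LINE.search(readme_text)
    if match is None:
        return None
    value = match.group(1)
    if value.startswith(("https://", "http://")):
        return value
    return f"{referrer_repo.rstrip('/')}/tree/{value}"


readme = "# Notebook Repo\n\nenvironment-repo: binder-base\n"
resolved = resolve_environment(readme, "https://github.com/user/example")
print(resolved)
```

With a relative value, a repo derived from a template keeps working without anyone editing the README.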

Another way could be to hide the metadata in a Binder-base button:

[![environment-repo: branch](https://mybinder.org/binder_base_logo.svg)](https://binder.example.org/autodetect)

or

[![environment-repo: https://github.com/[example-user]/repository/tree/branch](https://mybinder.org/binder_base_logo.svg)](https://binder.example.org/autodetect)

but I guess this goes back to overcomplicating things and in this case making the magic a bit more arcane, as well as reducing accessibility by co-opting the alt-text.

A relative branch definitely makes sense.

Thinking even further ahead, this could be generalized to combine multiple repo2docker repos together. Currently we’re talking about defining how to combine an environment repo with a notebook repo, but you could extend this to combining a base-environment repo, an addon-environment repo (or a jupyter-proxy-server implementation) and one or more notebook repos. Though in practice dealing with conflicting dependencies across repos may make this a really bad idea :rofl:.

The main downside of relying on a referrer is that people may choose to hide them for privacy.

Re: the hiding of referrers… sure, but that perhaps suggests a level of sophistication in the user, and for me a large part of the attraction of simple steps for simply combining things is that it can help get a novice started quickly…

Re: extending the idea even further: so are all bets off on ideas we can throw into this thread, then?! :wink:

I think we should restrict ourselves to using “magic” to supplement the user’s experience and not rely on it. For example pre-filling parts of the form to generate badges/links and not as a way to launch binders.

The reasoning being that the magic spell will sometimes but rarely fail which makes it hard to notice for the maintainers of the repository and because it is using “magic” the user won’t be able to rescue themselves. At the very least it would be a large amount of work to design things so that the user does get a good error message, link to explanation why this is happening to them and how to recover.

About 80000 binders are launched each week and maybe 100s (or 1000s??) of new repos are made “binder compatible”. Just because of the large difference in numbers I want to make sure we keep the “launching a binder” story as simple and foolproof as possible. If only because it means fewer hard-to-debug support questions :slight_smile:


I am not super excited about using different branches for environment and content. It seems too complicated to explain (given that explaining how binder works is already complex) :-/

Agreed, but I think a progression path does help and the idea is presumably not to force folk to separate things:

  1. start by using one repo with requirements.txt etc in top level
  2. if you have complex dependencies, or maybe want to separate out the Binder dependencies, start to use a binder/ directory;
  3. get fed up when a 20 minute build starts, again, because you corrected a typo in a notebook;
  4. separate content and environment.

I suspect there are also various entry paths and the proportions of users w/ particular skillsets may be different over each of them. For example:

  1. someone developing a package from scratch who wants to be able to demo it via MyBinder;
  2. someone writing a course who needs certain packages preinstalled;
  3. someone who uses a template repo to kickstart a project.

If I saw that someone else’s course repo 2 had the environment I want specified in dependency-installing files littered at the top level of the repo (not in a binder directory), it might be quite hard for me to set up a similar environment of my own from theirs? (Am I clutching at straws here?! :wink:)

That said, if I was a novice and couldn’t see how someone’s environment that I wanted to reuse was being created / invoked, then it would be a blocker to sharing knowledge / bootstrapping a cloned environment from the one used by that repo.

There’s also the issue of being efficient with time, compute, bandwidth and storage resource (I’m lazy and think computers should be too): why should I have to sit through the rebuild process again and again (time, compute and bandwidth), why should each Binder cluster have to keep creating new cached layers and pushing new images (storage)?

And a final consideration, possibly? Standardisation: if folk are using standardised images, they may come to be optimised and maintained, which is a benefit to everyone using them in that they don’t have to do the maintenance?

To address the slowness of rebuilds I think there are two things to try before building in something that is based on branches.

  1. work is happening to make rebuilds (where only a typo was fixed) faster (https://github.com/jupyter/repo2docker/pull/743 and https://github.com/jupyter/repo2docker/pull/718). It isn’t done and involves some tricky trade-offs but progress is being made :slight_smile:
  2. splitting environment and content is a great idea, just not yet via a built in mechanism based on special branch names. I think seeing people create, use and maintain a (set of) binder-base boxes that provide the env into which content is pulled via nbgitpuller is a great idea.
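
For reference, a link that launches a base box and then pulls content into it can be assembled by hand. The sketch below assumes the commonly used pattern of passing an nbgitpuller `git-pull` URL as Binder's `urlpath` query parameter; the function name and the `user/binder-base-box` repo are made up for the example, and the exact encoding is worth checking against the nbgitpuller link generator before relying on it.

```python
from urllib.parse import parse_qs, quote, urlencode, urlparse


def binder_nbgitpuller_link(env_repo, env_ref, content_repo, content_branch,
                            open_path="", binder_host="https://mybinder.org"):
    """Build a Binder URL that launches env_repo and git-pulls content_repo."""
    # Inner query handled by nbgitpuller's git-pull endpoint.
    params = {"repo": content_repo, "branch": content_branch}
    if open_path:
        params["urlpath"] = open_path
    git_pull = "git-pull?" + urlencode(params)
    # Percent-encode the whole inner URL so its ? and & travel as a single
    # value of Binder's own `urlpath` parameter.
    return (f"{binder_host}/v2/gh/{env_repo}/{env_ref}"
            f"?urlpath={quote(git_pull, safe='')}")


link = binder_nbgitpuller_link(
    "user/binder-base-box", "main",      # hypothetical environment repo
    "https://github.com/user/example",   # content repo
    "master",
    open_path="lab/tree/example/index.ipynb",
)
print(link)

# Decoding the outer query once recovers the inner nbgitpuller URL.
inner = parse_qs(urlparse(link).query)["urlpath"][0]
print(inner)
```

Because the environment repo rarely changes, Binder can reuse its cached image, and only the content pull happens at launch.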

I think if (2) takes off and sees wide use it is something we can discuss “making it part of repo2docker proper”. Some issues I think we need to get some experience with before making it built in:

  • should I be allowed to specify which SHA1/tag I want for the “env” and which for the “content”?
  • how do we pass the info along in a URL?
  • how to pass that along on the CLI for repo2docker?
  • how confused are people about multiple branches? (It took me a long time to understand how the whole gh-pages thing worked)

repo2docker and BinderHub have on purpose started with tackling the “easy” problems in this space (for example HPC and very large datasets aren’t at the top of the list of problems to solve, there used to be only one UI, Jupyter). And I think this is part of why they are successful at pleasing the broad masses.

Then we slowly work our way up to more complicated problems after having a good story based on what people were already doing (“no new file formats!” etc). This means there is lots of stuff you can’t do with repo2docker’s automagic env building.

Instead we provide an escape hatch called Dockerfile that allows people to do "whatever they want as long as they can write it down in a Dockerfile". And eventually the things people keep having to use the escape hatch for get integrated.

I think here we have a similar situation: base boxes and nbgitpuller are possible already, so when there is a large number of people constantly using it we should integrate it. Until then the (by construction) minority has to do it the not-quite-so-convenient way :-/

Yes, agreed… I think conventions evolve, and as and when they do stabilise and become widely adopted then it can become useful to standardise around them, or use code to “accept” the convention and integrate it formally.

That said, it may be useful to discuss possible ways the code might go, what sorts of intermediate conventions / patterns may be worth exploring more in the short term (eg patterns around nbgitpuller), and what loosely coupled solutions there may be that could simplify conventions that may be a bit fiddly for others to work with (so eg an nbgitpuller form to help exploit base boxes, template repos that are already set up to use base boxes, etc)?

In my edu setting, folk don’t want the grief of having to set up environments - they want to be provided with them because they are not interested in sysadmin; they are package users and work at that level.

It may be that this is not the best place for riffing around ideas, and that a gitter channel is more in keeping with possible flights of fancy, but I get confused about what chat is best kept where! :wink:


I like the forum for that, but yeah who knows where we should be doing that.

As part of the free flowing ideas I thought it would be good to mention that I think we are deeply into speculative territory here and that if someone were to make a PR implementing this … I’d vote no on merging it.

Someone should start making binder boxes and docs for explaining how to use them for settings like education where it makes sense to use them!

Hi, there. Nice tips. Looks like these posts are two years old. Still having the above issue:

  • the subpath version works
  • not the urlpath one

And what about launching JupyterLab?

PS. Two more things:

  • retrieving (curl…) the (often updated) notebooks in the environment using postBuild actually does the job without any convoluted :grimacing: url
  • it would actually be nice to use the content repo as the main repo (with full options, lab @ everything else) while first retrieving the environment from a stable binder build; rather than doing the converse

This last point is clearly subsumed by any of the more advanced proposals in this post. So what is the current state of this issue?

PPS. Actually, why not systematically use postBuild to git clone the repo with the content, and then just the usual syntax (no need to gitpull, then)?

I have used a similar route in the past, except I put the pull in the start file. For example, see here. nbgitpuller is a convenience tool. A lot of users of the MyBinder system for education are not overly comfortable with using git and GitHub. One way to look at it is that it is more for active teaching, when you want to combine things on the fly or mix and match in a less permanent way, with just a URL as the linking mechanism. You have realized that you have other options when you have more development time or want a permanent link between the two.
However, when you pull content from elsewhere without using nbgitpuller, someone looking for the content has to realize they need to look elsewhere. The nbgitpuller URL sort of makes the connections apparent to those with some knowledge of parsing the URL.

start is a better choice because you don’t have to remember to rebuild from the main repository periodically to bake the new content into the image that will get used to launch the session. See here about start as a configuration file. With start, all the content is fetched from the content-contributing repo just as the session from the main build is beginning. I think there is a time consideration here, though; even with a lot of notebooks, I don’t think I noticed an issue. Example here.

And of course you could use curl instead to pull from the content source repo. I’ve used it or git clone in different places. If you have a complex git system connecting repos, git clone is more robust and more maintainable.

Hi @fomightez, thanks for the info: having a look at the start option. And yes, I am actually git cloning (--single-branch) the content repo.

Hi, @fomightez. I followed your approach, adapting your start file. Everything works great EXCEPT the links from my newly copied index.ipynb are about:blank#blocked. I’ve checked the path/filename over and over. I allowed pop-ups for mybinder. There’s no switch I have to set to let these through? If it’s a browser setting, I’m hoping you came across this with your students. I’m new to these things and fumbling in the dark a bit. Thanks.

I can link into .txt and .md files in that folder, just not .ipynb ones. I’m using % for spaces in the filename.

It’s a simple matter of not escaping the URLs correctly.

Upon launch in your index.ipynb file, I see the following as markdown code for your first link:

[Lesson 00 b0](notebooks/Lesson%00%b0.ipynb)

That link won’t work because it isn’t a correct relative link. You can see what they should be easily by looking at the URL in GitHub for the corresponding notebook. If you view that notebook on GitHub here, you see the URL is the following:

https://github.com/sawula/datasci_mods_content/blob/300ae348e4b5657d467db42ebb00b6b80e92ac49/notebooks/Lesson%2000%20b0.ipynb

The important part to pay attention to is at the far-right end: Lesson%2000%20b0.ipynb.
That’s how you properly escape a URL with spaces. I suspect there are tools and sites that do this; however, usually just copying from GitHub is easy enough. Or at least looking at it to get the correct pattern.
So in your markdown in your notebook, replace what you have for what GitHub shows. In other words, for this example, you should have in your index.ipynb the following:

[Lesson 00 b0](notebooks/Lesson%2000%20b0.ipynb)

Then your links will work.
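
If you would rather generate the escaped links than copy them from GitHub, Python's standard library can do it. A small sketch; `notebook_link` is just a helper name made up for this example:

```python
from urllib.parse import quote


def notebook_link(title, relative_path):
    """Return a markdown link with the path percent-encoded.

    quote() leaves "/" alone by default, so only the spaces (and other
    unsafe characters) in each path segment are escaped.
    """
    return f"[{title}]({quote(relative_path)})"


print(notebook_link("Lesson 00 b0", "notebooks/Lesson 00 b0.ipynb"))
# prints: [Lesson 00 b0](notebooks/Lesson%2000%20b0.ipynb)
```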

I suspect it wasn’t the file type causing you issues, but the error in escaping the URL.

The properly-written launch URL for yours should be the one below:

https://mybinder.org/v2/gh/sawula/datasci_mods/main?urlpath=lab/tree/index.ipynb

Then the index notebook will already be opened.


Also in the future, please include a link to the repo you are discussing.
