Tip: embed custom github content in a Binder link with nbgitpuller

You mean the kaggle kernels Dockerfile? Looks more like a dare if you ask me.

If you all do this you should call it THE-KITCHEN-SINK

This is really cool! Has anyone managed to get nbgitpuller links working from within a running binder rather than built into the initial launch? For example in a tutorial setting, everyone launches the base binder and then pastes in links to several different repos without needing to know any git.

This approach definitely works from a dedicated jupyterhub (https://jupyterhub.github.io/nbgitpuller/link.html), but I’m seeing 403: Forbidden errors if I try to open links from within a binder session.

Could you make an example link @scottyhq? And explain again what you would like to do, I am not sure I get it :frowning:.

You start a new binder, then do some stuff in it, then the instructor says “now we need the content of repo X” and gives everyone a link to click that triggers an nbgitpuller action in the binder instance I am already running?

Exactly!

As a concrete example, we start @fmaussion’s really great tutorial : https://mybinder.org/v2/gh/OGGM/oggm-edu-r2d/master?urlpath=git-pull?repo=https://github.com/OGGM/oggm-edu%26amp%3Bbranch=master%26amp%3Burlpath=lab/tree/oggm-edu/notebooks/oggm-edu/welcome.ipynb%3Fautodecode

Which sends us to https://hub.gke.mybinder.org/user/oggm-oggm-edu-r2d-vurgegax/lab?autodecode

We work for a while and then want to try bringing in a new repo (w/o dealing with git and assuming we have all the required packages - https://github.com/ICESAT-2HackWeek/data-access).

Following https://jupyterhub.github.io/nbgitpuller/link.html we get a link that looks like this:
https://hub.gke.mybinder.org/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FICESAT-2HackWeek%2Fdata-access&urlpath=lab%2Ftree%2Fdata-access%2Fnotebooks%2FNSIDC+DAAC+ICESat-2+Customize+and+Access.ipynb

I’m guessing something isn’t compatible here with the /hub/user-redirect/
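For anyone who wants to reproduce the link above without the web form, here is a minimal sketch. The helper name `nbgitpuller_link` is made up for illustration; the `/hub/user-redirect/git-pull` path and the `repo` / `urlpath` parameters are taken from the link quoted above. It only rebuilds the same URL, so it does not get around the 403 seen on Binder.

```python
from urllib.parse import urlencode


def nbgitpuller_link(hub_url, repo, urlpath=None, branch=None):
    """Build a /hub/user-redirect/git-pull link (hypothetical helper)."""
    params = {"repo": repo}
    if branch:
        params["branch"] = branch
    if urlpath:
        params["urlpath"] = urlpath
    # urlencode percent-encodes ':' and '/' and turns spaces into '+'.
    return hub_url.rstrip("/") + "/hub/user-redirect/git-pull?" + urlencode(params)


link = nbgitpuller_link(
    "https://hub.gke.mybinder.org",
    "https://github.com/ICESAT-2HackWeek/data-access",
    urlpath="lab/tree/data-access/notebooks/"
    "NSIDC DAAC ICESat-2 Customize and Access.ipynb",
)
print(link)
```

On a dedicated JupyterHub the same call with that hub's base URL should give a working link, assuming nbgitpuller is installed there.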

@choldgraf
I really like this idea. As it currently works, it has an unexpected characteristic that I discovered while playing with the concept on an internal version. We have found it very convenient to be able to use git in jupyter to enable round-trip development by committing changes in the jupyter notebook/lab environment and pushing them back to the remote repo.

As you have this implemented today, the remote is the repo of the environment rather than the repo with the content. I did not expect this. I think that a user would always want any changes to be pushed upstream to the content repo, not the environment repo.

You can see this easily by cat .git/config from a Terminal window. Changes would be pushed to choldgraf/binder-sandbox instead of the probably more desirable data/materials-fa17. Yes, you would not expect somebody who did not have ownership rights to the content repo to try to push anything back, but if you owned it, or someone forked it, I think the remote content repo is the more desirable target.

The repo is cloned to a subdirectory, ~/materials-fa17
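A quick way to check which remote each checkout points at, as a rough sketch: the two paths are the ones mentioned in this thread, the function names are made up, and the parser only handles the simple `[remote "name"]` / `url = ...` layout that git normally writes.

```python
import re
from pathlib import Path

REMOTE_HEADER = re.compile(r'^\[remote "(?P<name>[^"]+)"\]$')
URL_LINE = re.compile(r"^url\s*=\s*(?P<url>\S+)$")


def remote_urls(config_text):
    """Return {remote name: url} from the text of a .git/config file."""
    remotes = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        header = REMOTE_HEADER.match(line)
        if header:
            current = header.group("name")
        elif line.startswith("["):
            current = None  # some other section, e.g. [core] or [branch "x"]
        elif current is not None:
            url = URL_LINE.match(line)
            if url:
                remotes[current] = url.group("url")
    return remotes


def checkout_remotes(checkout):
    """Read <checkout>/.git/config if it exists, else return {}."""
    config = Path(checkout).expanduser() / ".git" / "config"
    return remote_urls(config.read_text()) if config.is_file() else {}


# Paths assumed from the thread: the environment repo in the home
# directory and the pulled content repo in its own subdirectory.
for checkout in ("~", "~/materials-fa17"):
    print(checkout, checkout_remotes(checkout))
```

If the content repo really is cloned into its own subdirectory, the second line should show the content repo as `origin`, so pushing from inside that directory would go to the content repo.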

@manics Ah. Of course! Thanks.

Riffing on this thread alongside Binderhub button - 'pull from referrer' (and maybe Binder template repositories) I wonder (feature creep ;-)…

The docs in @manics’ recent PR, https://github.com/jupyterhub/binderhub/pull/891, suggest:

In a GitHub repo create a readme with a link to https://binder.example.org/autodetect, if it works the referrer will be parsed and converted into a link to launch the repo you came from.

So what if there was also a redirect saying: “(and) by default/convention look for a ‘binder-base’ branch in the same repository; if it exists, build / pull that, and then top up with content from an nbgitpulled content repo”.

For example, running:

https://binder.example.org/autodetectwithbase

from https://github.com/user/example would:

  • autodetect https://github.com/user/example as referrer;
  • build/pull https://github.com/user/example/tree/binder-base
  • nbgitpull https://github.com/user/example into the binder image
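To make the three steps concrete, here is a rough client-side sketch of the link such a redirect might produce. Nothing here exists in BinderHub: `launch_with_base` is a made-up name and `binder-base` is only the convention floated above. The `/v2/gh/<user>/<repo>/<ref>?urlpath=git-pull?repo=...` shape is borrowed from the mybinder.org link earlier in the thread.

```python
from urllib.parse import urlencode, urlsplit


def launch_with_base(referrer,
                     binder_url="https://binder.example.org",
                     base_branch="binder-base"):
    """Turn a GitHub referrer into a launch URL (hypothetical convention).

    The environment is built from `base_branch` of the referring repo and
    the repo's default branch is then pulled in via nbgitpuller.
    """
    parts = urlsplit(referrer)
    segments = [s for s in parts.path.split("/") if s]
    if parts.netloc != "github.com" or len(segments) < 2:
        raise ValueError("expected a https://github.com/<user>/<repo> referrer")
    user, repo = segments[0], segments[1]

    # Step 1: autodetect the content repo from the referrer.
    content_repo = f"https://github.com/{user}/{repo}"
    # Step 3: nbgitpuller query, nested inside Binder's urlpath parameter.
    urlpath = "git-pull?" + urlencode({"repo": content_repo})
    # Step 2: build/pull the base branch as the environment.
    return (f"{binder_url.rstrip('/')}/v2/gh/{user}/{repo}/{base_branch}?"
            + urlencode({"urlpath": urlpath}))


print(launch_with_base("https://github.com/user/example"))
```

The nbgitpuller query ends up encoded twice because it is a query string nested inside another query parameter.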

Complicating further, there may also be a need to allow users to over-ride the gitpulled branch name with an arbitrary one, as well as allowing Binder to autodetect a referral from a branch specifying content from that branch is the content to be pulled in?

I’m worried there’s a bit too much magic here, at least to begin with :slight_smile:

As a compromise, perhaps there could be a new repo2docker buildpack file, e.g. environmentrepo.giturl, that contains a link to the environment repository/branch? Or there could be a convention for specifying it in the README. Conceptually it’s a bit like a symlink to the files in the other environment repo; internally, of course, the repos would be handled separately. This requires the notebook repo owner to add this file, but I think it implements everything you’ve suggested without any complex logic.

@manics That “symlink” approach would work for me :slight_smile:

Just trying to think of a way where a user can easily say:

  • use this branch for the Binder image;
  • use this branch for the nbgitpull;
  • use the referrer as the location of the repo.

I guess things start to get even more complicated if you have the image builder branch and the content branch in different repos…!

However, if you enforce a convention, things get easier; eg easiest for magic to work might be:

  • content repo must be in master;
  • Binder build repo must be in binder-build and must include nbgitpuller;
  • both branches need to be branches of the same repo.

But being able to pop a link into a simple environment file to specify eg the Binder image branch would make sense (you wouldn’t want the link in the Binder build branch because that’s the one we’re trying not to change at all).

Thinking back on the template repo idea, if a template repo:

  • has a Binder image build branch;
  • has an autodetect path in the README binder button link;
  • has a reference that points back to the build branch using an absolute URL

a user could derive a repo from the template, update the content file, click the button, launch against the Binder image specified originally in the template repo.

Alternatively, you could clone a repo containing a content (master) branch and a build branch and specify the build branch relatively from within the content repo.

If all the metadata is encoded in the readme instead of (or as well as) an environmentrepo.giturl file, the readme for the notebook repo could be something like this:

# Notebook Repo

This repo contains just notebooks

Clicking this link will open this repository in binder by detecting the
referrer:

[open with mybinder](https://binder.example.org/autodetect)

This is a special metadata tag that tells binder to build the
environment for this notebook from a GitHub branch called
`branch` in `[example-user]/repository`:

environment-repo: https://github.com/[example-user]/repository/tree/branch

That would be interesting… even easier and it makes the convention more transparent?

Would it also make sense to allow a relative branch in eg https://github.com/[example-user]/repository/README.md or https://github.com/[example-user]/repository/tree/branch such as:

environment-repo: binder-base

that would point to the auto-detected REFERRER/tree/binder-base, for much the same reason as supporting referrer buttons: if you work from a cloned repo, you don’t need to change any absolute URLs that point back to the original repo?
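As a sketch of how the absolute and relative forms could both be resolved (purely illustrative: the `environment-repo:` tag is only a proposal in this thread, and `resolve_environment_repo` is a made-up name):

```python
import re

# One "environment-repo: <value>" line anywhere in the README.
ENV_LINE = re.compile(r"^environment-repo:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def resolve_environment_repo(readme_text, referrer):
    """Return the environment URL named in a README, or None if absent.

    An absolute URL is returned unchanged; a bare branch name is treated
    as relative and resolved against the referrer as REFERRER/tree/<branch>.
    """
    match = ENV_LINE.search(readme_text)
    if match is None:
        return None
    value = match.group(1)
    if value.startswith(("http://", "https://")):
        return value
    return referrer.rstrip("/") + "/tree/" + value


readme = "# Notebook Repo\n\nenvironment-repo: binder-base\n"
print(resolve_environment_repo(readme, "https://github.com/user/example"))
```

With the relative form, a fork resolves to its own `binder-base` branch without anyone editing the README.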

Another way could be to hide the metadata in a Binder-base button:

[![environment-repo: branch](https://mybinder.org/binder_base_logo.svg)](https://binder.example.org/autodetect)

or

[![environment-repo: https://github.com/[example-user]/repository/tree/branch](https://mybinder.org/binder_base_logo.svg)](https://binder.example.org/autodetect)

but I guess this goes back to overcomplicating things and in this case making the magic a bit more arcane, as well as reducing accessibility by co-opting the alt-text.

A relative branch definitely makes sense.

Thinking even further ahead, this could be generalized to combine multiple repo2docker repos together. Currently we’re talking about defining how to combine an environment repo with a notebook repo, but you could extend this to combining a base-environment repo, an addon-environment repo (or a jupyter-proxy-server implementation) and one or more notebook repos. Though in practice dealing with conflicting dependencies across repos may make this a really bad idea :rofl:.

The main downside of relying on a referrer is that people may choose to hide them for privacy.

Re: the hiding of referrers… sure, but that perhaps suggests a level of sophistication in the user, and for me a large part of the attraction of simple steps for simply combining things is that it can help get a novice started quickly…

Re: extending the idea even further: so are all bets off on ideas we can throw into this thread, then?! :wink:

I think we should restrict ourselves to using “magic” to supplement the user’s experience and not rely on it. For example pre-filling parts of the form to generate badges/links and not as a way to launch binders.

The reasoning is that the magic spell will fail only rarely, which makes the failures hard for the maintainers of the repository to notice, and because it is using “magic” the user won’t be able to rescue themselves. At the very least it would be a large amount of work to design things so that the user gets a good error message, a link to an explanation of why this is happening to them, and instructions on how to recover.

About 80000 binders are launched each week and maybe 100s (or 1000s??) of new repos are made “binder compatible”. Just because of the large difference in numbers I want to make sure we keep the “launching a binder” story as simple and foolproof as possible. If only because it means fewer hard-to-debug support questions :slight_smile:


I am not super excited about using different branches for environment and content. It seems too complicated to explain (given that explaining how binder works is already complex) :-/

Agreed, but I think a progression path does help and the idea is presumably not to force folk to separate things:

  1. start by using one repo with requirements.txt etc in top level
  2. if you have complex dependencies, or maybe want to separate out the Binder dependencies, start to use a binder/ directory;
  3. get fed up when a 20 minute build starts, again, because you corrected a typo in a notebook;
  4. separate content and environment.

I suspect there are also various entry paths and the proportions of users w/ particular skillsets may be different over each of them. For example:

  1. someone developing a package from scratch who wants to be able to demo it via MyBinder;
  2. someone writing a course who needs certain packages preinstalled;
  3. someone who uses a template repo to kickstart a project.

If I saw that someone else’s course repo (entry path 2) had the environment I want specified in dependency-installing files littered at the top level of the repo (not in a binder directory), it might be quite hard for me to set up a similar environment of my own from theirs? (Am I clutching at straws, here?! :wink:)

That said, if I was a novice and couldn’t see how someone’s environment that I wanted to reuse was being created / invoked, then it would be a blocker to sharing knowledge / bootstrapping a cloned environment from the one used by that repo.

There’s also the issue of being efficient with time, compute, bandwidth and storage resource (I’m lazy and think computers should be too): why should I have to sit through the rebuild process again and again (time, compute and bandwidth), why should each Binder cluster have to keep creating new cached layers and pushing new images (storage)?

And a final consideration, possibly? Standardisation: if folk are using standardised images, they may come to be optimised and maintained, which is a benefit to everyone using them in that they don’t have to do the maintenance?

To address the slowness of rebuilds I think there are two things to try before building in something that is based on branches.

  1. work is happening to make rebuilds (where only a typo was fixed) faster (https://github.com/jupyter/repo2docker/pull/743 and https://github.com/jupyter/repo2docker/pull/718). It isn’t done and involves some tricky trade-offs but progress is being made :slight_smile:
  2. splitting environment and content is a great idea, just not yet via a built in mechanism based on special branch names. I think seeing people create, use and maintain a (set of) binder-base boxes that provide the env into which content is pulled via nbgitpuller is a great idea.

I think if (2) takes off and sees wide use it is something we can discuss “making it part of repo2docker proper”. Some issues I think we need to get some experience with before making it built in:

  • should I be allowed to specify which SHA1/tag I want for the “env” and which for the “content”?
  • how do we pass the info along in a URL?
  • how to pass that along on the CLI for repo2docker?
  • how confused are people about multiple branches? (It took me a long time to understand how the whole gh-pages thing worked)

repo2docker and BinderHub have on purpose started with tackling the “easy” problems in this space (for example HPC and very large datasets aren’t at the top of the list of problems to solve, there used to be only one UI, Jupyter). And I think this is part of why they are successful at pleasing the broad masses.

Then we slowly work our way up to more complicated problems after having a good story based on what people were already doing (“no new file formats!” etc). This means there is lots of stuff you can’t do with repo2docker’s automagic env building.

Instead we provide an escape hatch called Dockerfile that allows people to do "whatever they want as long as they can write it down in a Dockerfile". And eventually the things people keep having to use the escape hatch for get integrated.

I think here we have a similar situation: base boxes and nbgitpuller are possible already, so when there is a large number of people constantly using it we should integrate it. Until then the (by construction) minority has to do it the not-quite-so-convenient way :-/

Yes, agreed… I think conventions evolve, and as and when they do stabilise and become widely adopted then it can become useful to standardise around them, or use code to “accept” the convention and integrate it formally.

That said, it may be useful to discuss possible ways the code might go, what sorts of intermediate conventions / patterns may be worth exploring more in the short term (eg patterns around nbgitpuller), and what loosely coupled solutions there may be that could simplify conventions that may be a bit fiddly for others to work with (eg an nbgitpuller form to help exploit base boxes, template repos that are already set up to use base boxes, etc).

In my edu setting, folk don’t want the grief of having to set up environments - they want to be provided with them because they are not interested in sysadmin; they are package users and work at that level.

It may be that this is not the best place for riffing around ideas, and that a gitter channel is more in keeping with possible flights of fancy, but I get confused about what chat is best kept where! :wink:

I like the forum for that, but yeah who knows where we should be doing that.

As part of the free flowing ideas I thought it would be good to mention that I think we are deeply into speculative territory here and that if someone were to make a PR implementing this … I’d vote no on merging it.

Someone should start making binder boxes and docs for explaining how to use them for settings like education where it makes sense to use them!