The Binder Federation

If you are wondering what happened, keep reading!

We flipped the switch on making mybinder.org a federation. This means that there are now two clusters that serve requests for mybinder.org. What changes for you as a user? Hopefully nothing. You will notice that if you visit mybinder.org (or any other link to it) you will be redirected to gke.mybinder.org or ovh.mybinder.org. Beyond that small change everything should keep working as before (if not please post here!).

We have been planning and working on making this happen for a long, long time. It is great to see it finally go live. A huge thank you to ovh.com, who are letting us use their infrastructure to host the second cluster in Europe and who put in engineering effort to help build this (:clap: @jagwar and @mael-le-gal)!

By going from one to two clusters we now have the tools to add a third, fourth, fifth, n-th cluster. This means that mybinder.org becomes more resilient to outages and (hopefully) easier to finance as we aren’t dependent on one big grant to pay for a single cluster. Our vision is to keep adding clusters that take anywhere from 1% to 1/n-th of the global traffic to mybinder.org.

If your institute or company wants to help run mybinder.org by hosting a small cluster, let us know. We can send you traffic :grinning:. 2% of mybinder.org’s traffic corresponds to about 4-10 user pods running at any one time which can fit on a single moderately large instance.
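To make the arithmetic behind that estimate concrete, here is a minimal sketch. It assumes the number of running pods scales linearly with the share of traffic a cluster receives, which is a simplification; the function names are made up for illustration.

```python
def implied_total_pods(pods_at_share, share_percent):
    """Total concurrent pods implied by seeing `pods_at_share` pods at `share_percent` % of traffic."""
    return pods_at_share * 100 / share_percent


def pods_for_share(total_pods, share_percent):
    """Concurrent pods a cluster should expect if it takes `share_percent` % of traffic."""
    return total_pods * share_percent / 100


# 4-10 pods at 2% of traffic implies a federation-wide range of concurrent pods.
low = implied_total_pods(4, 2)
high = implied_total_pods(10, 2)
print(f"Implied total: {low:.0f}-{high:.0f} pods")

# What a hypothetical cluster taking 5% of traffic would need to host.
print(f"At 5%: {pods_for_share(low, 5):.0f}-{pods_for_share(high, 5):.0f} pods")
```

Under this linear assumption, anyone sizing a new cluster can plug in the share of traffic they are willing to take.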

ps. I am not enough of a Star Wars/Star Trek fan, but somehow it feels like we need a cool name/logo for the Binder Federation :smiley:


That is amazing news!


We are now sending about 10% of our users to OVH.

We will keep stepping up the load, fixing problems and tuning the setup over the next few days.
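As a rough illustration of what sending about 10% of users to OVH could look like, here is a minimal sketch of a weighted random redirect. The weights, the function names and the example path are assumptions for illustration, not the actual mybinder.org redirector; only the two hostnames come from this thread.

```python
import random

# Hypothetical weights: roughly 10% of launches go to OVH, the rest to GKE.
CLUSTER_WEIGHTS = {
    "gke.mybinder.org": 90,
    "ovh.mybinder.org": 10,
}


def pick_cluster(weights=CLUSTER_WEIGHTS, rng=random):
    """Pick a cluster hostname at random, proportionally to its weight."""
    hosts = list(weights)
    return rng.choices(hosts, weights=[weights[h] for h in hosts], k=1)[0]


def redirect_url(path, weights=CLUSTER_WEIGHTS, rng=random):
    """Build the URL a visitor to mybinder.org would be redirected to."""
    return f"https://{pick_cluster(weights, rng)}{path}"


# Example: the same launch link lands on one of the two clusters.
print(redirect_url("/v2/gh/some-org/some-repo/master"))
```

Stepping up the load then amounts to changing the weights, and adding a third cluster is one more entry in the dictionary.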


Congrats on getting this started… fingers crossed that other organisations and institutions see why this sort of commons approach is a Good Thing and why they should contribute resource into it.

It would be great if research funding agencies for certain research calls included a phrase along the lines of: “in order to support public access to computational resources required to replicate, reproduce or engage with software produced as part of this project, you may apply for an additional $Nk to be added to your proposal to fund a public Binder federation node capable of running the project software”. Only written much more clearly and sensibly than that… (I’m not much of a policy wonk!)


This is a cool idea. One question while pondering it:

Should we try and do this on the basis of individual pieces of work or at a higher level? Small amounts are probably easier, however each transaction costs in admin. All else being equal $30k in one transaction means more money actually available to running the service compared to 30 x $1000. Even if the cost isn’t in straight up dollars but “only” in the fact that someone has to file a piece of paper 30 times instead of once.

If someone reading this knows someone in a position to be the third cluster in the federation I’d be happy to get on the phone with them to talk about ideas, pros, cons and generally figuring out how this could work.

Awesome news! I’m curious: what is a medium-sized machine, and why can it run only 4-10 pods? Is there any low-hanging fruit to make Binder boot up faster?

Binder launch time is typically almost all image pulling, so image size optimization is the biggest thing.


Via a tweet, @modernscientist wondered: “Any plans for mybinder@home à la Folding@home?”

Interesting idea…! Hmmm… I further imagine an associated BitTorrent setup for distributing large data files?!

Agreed. Institutionally, I could imagine a topslice arrangement. E.g. if you’re doing a project with computational stuff in it, topslice an additional $fixed (or a fixed % of the compute budget) to cover an institutional Binder Federation commitment. From the funder side, they could say “if you run a Binder Federation node, you may (should?) add … etc… to the project funding bid to cover open computation support”.

(I appreciate there may be lots of other HPC / compute etc initiatives, I’m just trying to role play “what if funders got behind Binderhub?” on the one hand, and what steps individual projects might do at the other end to try to support Binder activities which are probably more likely to be departmental or central infrastructure, rather than project, related internally.)


What do you mean by “it can only run 4-10 pods”? The amount of traffic sent to the OVH cluster is uncorrelated with the size of the cluster :slight_smile: We are only sending a small amount there to work out all the kinks in the system without impacting too many users while things are broken (and with a new system there are always things that are broken :slight_smile: ). For example, at the moment there are issues with the docker registry and some images/layers failing to be pushed.

On the GKE cluster we have n1-highmem-8 (8 vCPUs, 52 GB memory) instances and they are configured to fit about 58 pods.

In the OVH cluster the nodes have 4 cores and 15GB memory. We will have to see how many pods we can pack onto each.

The relationship between cluster size, node size and how many pods you can fit isn’t linear. You need a minimum size just to fit all the services you need to run the cluster and then after that it is linear (I guess).
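A minimal sketch of that "fixed overhead, then linear" guess, looking at memory only. The overhead and per-pod figures are made-up placeholders, not the real mybinder.org settings, and a real cluster also has to account for CPU and per-node limits.

```python
import math


def pods_that_fit(node_memory_gb, overhead_gb, per_pod_gb):
    """Estimate user pods per node: reserve a fixed overhead, then divide what is left."""
    usable = node_memory_gb - overhead_gb
    if usable <= 0:
        # The node is too small to even host the cluster services.
        return 0
    return math.floor(usable / per_pod_gb)


# Hypothetical numbers: 3 GB reserved for cluster services, 1 GB per user pod.
for memory_gb in (2, 15, 52):
    print(memory_gb, "GB node ->", pods_that_fit(memory_gb, 3, 1), "pods")
```

Below the overhead threshold a node fits nothing, and above it capacity grows in proportion to the remaining memory, which is why a few larger nodes can pack more pods than many small ones with the same total memory.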

The missing piece for a setup like this (and why we don’t use preemptible nodes that are cheaper) is: what do you do when the node/machine at home suddenly becomes unavailable? Already the biggest UX bug in binder is that stuff “just vanishes” when you get hit by the inactivity timeout. So to be able to take advantage of these opportunistic nodes you’d have to have a good idea of how to solve this problem. My guess is that we’d have to move away from the UI and the compute being in the same pod, and make the UI so it can “reconnect” to a new compute “thing” … and somehow magically transfer over all the state from the compute “thing” that just gave you a 5s warning before it will turn into a pumpkin. I think this qualifies as a hashtag-hard problem.


Agreed… and that could cause problems if you have been manually running a notebook / changing kernel state and then get swapped onto another compute node, because the new kernel will presumably be in its original state.


Another thought on this: there are actually two compute requirements aren’t there?

  1. build the image;
  2. run the image.

One way of making use of remote compute would be to try to build images. Push a build to N @home clients, get a checksum for the built image back, if there is consensus, accept one of the consensus checksum images back for the image hub. It could be slow, but it might be useful if you need to rebuild every image after eg a repo2docker update that changes the base image or build steps?
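The consensus step could be sketched roughly as below. Everything here is hypothetical: the client names, the checksum strings and the majority threshold are placeholders, and the sketch assumes that honest clients building the same repo end up with identical image checksums, which may not hold in practice unless the builds are made reproducible.

```python
from collections import Counter


def consensus_checksum(reported, min_agreement=0.5):
    """Return the checksum that more than `min_agreement` of clients agree on, else None.

    `reported` maps a client id to the checksum of the image that client built.
    """
    if not reported:
        return None
    counts = Counter(reported.values())
    checksum, votes = counts.most_common(1)[0]
    if votes / len(reported) > min_agreement:
        return checksum
    return None


# Three hypothetical @home clients built the same repo; two of them agree.
reports = {
    "client-a": "sha256:aaa",
    "client-b": "sha256:aaa",
    "client-c": "sha256:bbb",
}
winner = consensus_checksum(reports)
if winner is not None:
    accepted_from = [c for c, s in reports.items() if s == winner]
    print("accept image", winner, "from one of", accepted_from)
```

If no checksum reaches the threshold, the build would have to be retried or fall back to a trusted builder.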