Resources required to operate a BinderHub

This post documents the resources used by various BinderHub deployments, to make it easier for others who want to deploy one to estimate what they will need. @nuest recently asked on Gitter what kind of resources he should request for a hub at his institute, so I thought it would be a good idea to write up my response here.

If you operate a BinderHub (private or public), or know of one, and have an estimate of the resources it uses, please post it here.


https://mybinder.org

We operate a setup that autoscales, so the number of nodes (computers) in the cluster changes over the course of a day and a week. You can check the size of the cluster and the number of concurrent pods on our public Grafana dashboard.

On average we have around five user nodes, each with 8 CPUs and 52 GB of RAM. This pool of nodes can scale all the way down to zero.
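For anyone setting up something similar on GKE, the user pool can be created roughly like this. This is only a sketch: the cluster name, pool name, zone, and node cap are placeholders, and the label/taint keys follow the convention from the Zero to JupyterHub docs (check those docs for the current values).

```bash
# Sketch: an autoscaling pool of n1-highmem-8 nodes (8 vCPUs, 52 GB RAM)
# that can scale down to zero when no user pods are running.
# "binderhub" and "user-pool" are placeholder names.
gcloud container node-pools create user-pool \
  --cluster=binderhub \
  --zone=us-central1-a \
  --machine-type=n1-highmem-8 \
  --num-nodes=1 \
  --enable-autoscaling --min-nodes=0 --max-nodes=6 \
  --node-labels=hub.jupyter.org/node-purpose=user \
  --node-taints=hub.jupyter.org_dedicated=user:NoSchedule
```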

We also operate a “core node” with 4 CPUs and 26 GB of RAM. This node runs the support services, the BinderHub pods, and the JupyterHub pods. It is always up, and user pods are not allowed to run on it.
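The core pool is then just a small, non-autoscaling pool. Again a sketch with placeholder names; keeping user pods off this node is handled on the chart side (for example by requiring the user pods' node affinity to match node-purpose=user, see the chart's scheduling settings):

```bash
# Sketch: a single n1-highmem-4 core node (4 vCPUs, 26 GB RAM), no autoscaling.
# The node-purpose=core label is what the JupyterHub/BinderHub charts'
# default scheduling settings use to place the hub, binder, and support
# pods here rather than on the user pool.
gcloud container node-pools create core-pool \
  --cluster=binderhub \
  --zone=us-central1-a \
  --machine-type=n1-highmem-4 \
  --num-nodes=1 \
  --node-labels=hub.jupyter.org/node-purpose=core
```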

Our Kubernetes cluster is hosted on GKE. We have been very happy with the service: there are hardly ever any issues with the infrastructure. One thing we have become more and more convinced of is that operating your own Kubernetes cluster is a full-time job in itself, so outsourcing it is worth the cost.

We typically have hundreds of concurrent user pods running at any given moment and nearly 100,000 launches per week in total.


Thanks for this post!

I'm now looking to set up a JupyterHub/BinderHub combination on GCP for classes of anywhere from 10 to 100 people that need deep-learning capabilities, and was thus planning to use k8s and Zero to JupyterHub.

It seems that the startup costs are high (> 200/month for the “core” pods), but that past that, because of the culling, the user pods don't cost so much on a per-user basis.

Would this be an accurate characterization?
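(For context, the culling mentioned above is a setting in the JupyterHub Helm chart. Here is a minimal sketch of what it looks like in a values file; the key names are from the Zero to JupyterHub chart and may change between versions, and under the BinderHub chart they sit beneath a top-level jupyterhub: block, so check the current chart docs.)

```bash
# Sketch: idle-server culling settings for the JupyterHub Helm chart,
# appended to the config passed to `helm upgrade ... -f config.yaml`.
# The numbers are illustrative, not recommendations.
cat >> config.yaml <<'EOF'
cull:
  enabled: true   # cull idle single-user servers
  timeout: 3600   # seconds of inactivity before a server is culled
  every: 600      # how often (in seconds) the culler runs
EOF
```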

(I just checked n1-standard-2 on GCP, and 2 of these for the core pods cost roughly 90/month, so it would seem that's the cheapest one could get this down to…)

Depending on your needs, the ‘core’ pods may not need many resources at all. n1-highmem-4 is plenty to run the core pods handling hundreds of launches an hour on mybinder.org. If you have relatively low traffic, I bet you'd be fine with one n1-highmem-1 for all the core pods. It really takes very little to handle everything but the user servers and builds unless traffic is pretty high. Prometheus is by far the biggest resource user among mybinder.org's core pods, and that usage is proportional to the number of launches.
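If you want to check how much headroom your own core node has in practice, `kubectl top` (backed by the metrics server that GKE ships with) gives a quick read; the namespace below is a placeholder:

```bash
# Current CPU/memory usage per node, to compare against node capacity
kubectl top nodes

# Per-pod usage in the hub's namespace (replace "binderhub" with yours)
kubectl top pods -n binderhub
```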

I've run a hub for a workshop with ~50 users for a couple of weeks and used n1-highmem-2 for core, which left loads of headroom.

Thanks!

Shouldn't I have at least 2 nodes in case of failures? I can't find a highmem-1 any more, but highmem-2 would seem to fit the bill (and, as you say, there will be space to spare…)

Any thoughts on the n1-standards? I'm guessing the memory is too low? (even on n1-standard-2?)

In all of these scenarios, one is looking at about 90/month for core nodes with 2 of them, and about 45 with 1…

Just want to make sure that squares with others' experience here… and that I am not out to lunch :slight_smile:

I would do 2x n1-standard-1 or 1x n1-standard-2; 2 CPUs is plenty. Core nodes don't do a lot, so I don't feel the need for redundancy. mybinder.org only has one core node, and it's always the user nodes that have problems (this is why we separate core and user nodes).
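If you do split core and user pools, a quick way to confirm that user pods are landing where you expect (the namespace is a placeholder):

```bash
# The NODE column shows which node each pod was scheduled on
kubectl get pods -n binderhub -o wide
```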

You should also check out this blog post from @jhamman and the Pangeo team:

It goes into some nice detail about their setup.


Thanks a ton! I just happened to read it thanks to you on Twitter :slight_smile: :slight_smile:


Thanks for the advice! I'll go for one of those 2 options. Still a bit confused about uptime… I realize pods will get recreated on death, so from that point of view you are safe even on one node. Is GCP uptime/rebuild time so good that you don't worry at all about what happens if the node goes down?