Writing custom schedulers for Kubernetes

Just wondering: has anyone written (or tried to write) a custom Kubernetes scheduler? I’m interested in finding examples and guides on doing this for BinderHub, so any pointers and experience would be welcome.

I’ve found https://banzaicloud.com/blog/k8s-custom-scheduler/ which looks interesting.

Some background on why I am interested in this:

We currently use the default kube-scheduler with a custom config for user pods. The main effect this has is to schedule user pods on the “fullest” node (that has room to spare). This helps a lot with keeping our nodes busy and allows us to scale down nodes that are not needed. This in turn lets us spend less on compute.
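To make the "fullest node that still has room" idea concrete, here is a toy model in Python. It is only a sketch of the selection rule described above, not the kube-scheduler's actual scoring code; the node names and CPU numbers are made up for illustration.

```python
def pick_fullest_node(nodes, pod_cpu):
    """Pick the node that would be fullest after placing the pod.

    nodes: dict mapping node name -> (requested_cpu, allocatable_cpu)
    pod_cpu: CPU request of the pod to place
    Returns the chosen node name, or None if the pod fits nowhere.
    """
    # Keep only nodes where the pod still fits, scored by the
    # fraction of CPU that would be requested after placement.
    candidates = {
        name: (requested + pod_cpu) / allocatable
        for name, (requested, allocatable) in nodes.items()
        if requested + pod_cpu <= allocatable
    }
    if not candidates:
        return None
    # Highest fraction wins, i.e. the fullest node with room to spare.
    return max(candidates, key=candidates.get)


# Hypothetical cluster: node-c is already full, node-b is nearly empty.
nodes = {"node-a": (7.0, 8.0), "node-b": (3.0, 8.0), "node-c": (8.0, 8.0)}
print(pick_fullest_node(nodes, 0.5))
```

With these numbers the pod lands on node-a, so node-b stays nearly empty and remains a candidate for scale-down.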

One drawback of this strategy is that if a group of (say) 30 people launch the same repo at the same time, there is a high chance they all end up on the same node. As mybinder.org is used a lot for courses and workshops, chances are that all 30 of them will do things in unison, including running “heavy compute”. This is when we see a high load on the node they are on because, in the interest of keeping our node utilisation up, we over-commit our nodes.

Most of the time no one notices that ~90 users are sharing the 8 cores of a node, because humans spend a lot of time reading and thinking about what the computer just computed for them. The exception is classroom settings with some “heavy” compute.

You could try to optimize things by adding a label with the sha1 of the repo URL to pods, and adding an anti-affinity rule that prefers to schedule onto nodes not already running pods with the same label.
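A rough sketch of that suggestion, written as a Python function that builds the relevant part of a pod manifest as a plain dict. The label key `example.org/repo-sha1` and the default weight of 50 are assumptions made up for this example; the affinity field names are the standard Kubernetes pod anti-affinity fields. How this would be wired into BinderHub's pod creation is not shown.

```python
import hashlib


def repo_anti_affinity(repo_url, weight=50):
    """Build pod metadata and a soft anti-affinity rule keyed on the repo.

    The label key below is hypothetical; any valid label key would do.
    """
    label_key = "example.org/repo-sha1"
    # A sha1 hex digest is 40 characters, which fits the 63-character
    # limit on Kubernetes label values.
    digest = hashlib.sha1(repo_url.encode("utf-8")).hexdigest()
    return {
        "metadata": {"labels": {label_key: digest}},
        "spec": {
            "affinity": {
                "podAntiAffinity": {
                    # "preferred" makes this a soft rule: the scheduler
                    # tries to avoid nodes that already run a pod with the
                    # same label, but can still use them if needed.
                    "preferredDuringSchedulingIgnoredDuringExecution": [
                        {
                            "weight": weight,  # valid range is 1-100
                            "podAffinityTerm": {
                                "labelSelector": {
                                    "matchLabels": {label_key: digest}
                                },
                                # Spread per node, not per zone.
                                "topologyKey": "kubernetes.io/hostname",
                            },
                        }
                    ]
                }
            }
        },
    }


fragment = repo_anti_affinity("https://github.com/example/repo")
```

Because the rule is only a preference, a workshop of 30 people can still be packed onto one node if nothing else is available; it just stops being the default outcome.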

We need something like that in combination with a preference for nodes which are busy. Or some other strategy to make the least full node ineligible. Otherwise we will always schedule someone on the node that is closest to being removed from the pool, and it’ll never get removed.

If we can create a strategy from the existing predicates and priority functions that would be great.

Make the penalty for landing on the least busy node bigger than the same-repo anti-affinity preference, either with a dynamic tag (a node label) that carries a big weight, or simply with a taint. Only the dynamic tagging needs some custom code.
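The custom code for the dynamic part could be quite small. Below is a minimal sketch of the decision logic only: given pod counts per node, pick the least busy node and describe a soft taint for it. The taint key `example.org/scale-down-candidate` is invented for this example, and actually reading pod counts from the cluster and applying the taint through the Kubernetes API is left out.

```python
def plan_taints(pods_per_node):
    """Decide which node should carry the 'least busy' taint.

    pods_per_node: dict mapping node name -> number of user pods
    Returns a dict mapping node name -> True if it should be tainted.
    """
    if not pods_per_node:
        return {}
    # Least busy node; ties are broken by name so the result is stable.
    least_busy = min(pods_per_node, key=lambda n: (pods_per_node[n], n))
    return {name: name == least_busy for name in pods_per_node}


def soft_taint(key="example.org/scale-down-candidate"):
    """Taint entry in the shape used in a Node's spec.taints list.

    The key is hypothetical. PreferNoSchedule is a soft taint: the
    scheduler avoids the node but may still use it when nothing else fits.
    """
    return {"key": key, "value": "true", "effect": "PreferNoSchedule"}


# Hypothetical pod counts per node.
plan = plan_taints({"node-a": 40, "node-b": 3, "node-c": 90})
print(plan)
```

A loop running this periodically would remove the taint from nodes that are no longer the least busy and add it to the current one. Whether a soft taint outweighs the anti-affinity preference in practice would need testing.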