[Request for Implementation] JupyterHub-aware Kubernetes cluster autoscaler

JupyterHubs used for teaching have a significant cost advantage when running on Kubernetes: autoscaling. You pay only for the compute you use, and ideally the cluster scales up and down automatically as needed.

In practice, this causes a few issues.

Too many users, too fast

In one case, about 450 users logged on within 8 minutes, right after a lecture. Assuming an (optimistic) maximum of about 80 users per node, that’s about 6 full nodes. To add capacity for all of them as cost-effectively as possible, our cluster would need to:

  1. Start 6 new nodes, and wait for them to boot & join the Kubernetes cluster
  2. Pull the big user image (~11 GB in this case) to each new node
  3. Start the new users’ servers on these nodes

No current cloud provider can actually do this in that time frame.

With z2jh’s placeholder pods, this situation improves significantly - you can have ‘headroom’ of about 2 emptyish nodes. So instead of needing to create 6 nodes in that time period, you only need to create 4.

Unfortunately no cloud provider is fast enough for this either!
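To make the arithmetic above concrete, here is a minimal sketch. The function name `nodes_needed` is just something I made up for illustration, and 80 users per node is the optimistic estimate from above, not a measured limit.

```python
import math


def nodes_needed(users: int, users_per_node: int, headroom_nodes: int = 0) -> int:
    """Nodes that must be created from scratch to absorb a burst of users.

    headroom_nodes is the number of (mostly empty) nodes already kept
    warm, e.g. by placeholder pods.
    """
    total = math.ceil(users / users_per_node)
    # Never return a negative count if the headroom already covers the burst.
    return max(total - headroom_nodes, 0)


print(nodes_needed(450, 80))                     # no headroom: 6
print(nodes_needed(450, 80, headroom_nodes=2))   # two warm nodes: 4
```

Either way, those new nodes have to boot, join the cluster, and pull the image inside the same 8 minute window.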

So we have to overprovision, constantly running capacity we don’t need. This significantly reduces the cost benefits of autoscaling on a Kubernetes cluster.

Even with the placeholder pods, we’re nowhere near using the cluster efficiently! While there are other issues with scale down, needing this much headroom at such short notice is the cause of many of them.

Possible solutions?

The cluster autoscaler doesn’t know anything particular about JupyterHub or users, and only adds new nodes when the cluster is completely full, that is, when user pods and placeholder pods together fill it. With the JupyterHub use case, we can do better. Some approaches are:

  1. Time-based scale-up. Often, you know when your big classes are, and can scale up explicitly just before classes start. The schedule could even be a Google Calendar that others can update. We can try scaling down after classes too, but that’s probably a problem for another time. The schedule could also be derived from historical usage data, but that’s not foolproof.
  2. ???
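As a rough sketch of what the time-based approach could compute: given a list of expected login bursts, how many nodes of headroom do we want right now? Everything here is hypothetical (`ExpectedBurst`, `desired_headroom_nodes`, the 30 minute lead time, the 15 minute hold, the example date). Something else would still have to apply the result, for example by changing the number of placeholder pods.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ExpectedBurst:
    at: datetime          # when the logins are expected, e.g. end of a lecture
    expected_users: int   # rough size of the burst


def desired_headroom_nodes(
    now: datetime,
    bursts: list[ExpectedBurst],
    users_per_node: int = 80,                      # optimistic estimate from this post
    lead_time: timedelta = timedelta(minutes=30),  # assumed time to boot nodes + pull image
    hold: timedelta = timedelta(minutes=15),       # assumed time to keep headroom after the burst
) -> int:
    """Nodes of headroom wanted at `now`, sized for the largest nearby burst."""
    wanted = 0
    for burst in bursts:
        if burst.at - lead_time <= now <= burst.at + hold:
            wanted = max(wanted, math.ceil(burst.expected_users / users_per_node))
    return wanted


# Hypothetical schedule: one burst of 450 users at 10:00.
schedule = [ExpectedBurst(at=datetime(2024, 1, 15, 10, 0), expected_users=450)]

print(desired_headroom_nodes(datetime(2024, 1, 15, 8, 0), schedule))   # well before: 0
print(desired_headroom_nodes(datetime(2024, 1, 15, 9, 45), schedule))  # inside the lead time: 6
```

The schedule could just as well come from a shared calendar or from historical data. The point is only that the headroom decision gets made ahead of the burst instead of after the cluster is already full.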

I’m sure there are more solutions here that I can’t think of yet :smiley: Thoughts on what else we could do here?
