[Request for Implementation] JupyterHub aware kubernetes cluster autoscaler

JupyterHubs used for teaching have a significant cost advantage when running on Kubernetes - autoscaling. You can pay only for compute that you use, and ideally it will automatically scale up / down when needed.

In practice, this causes a few issues.

Too many users, too fast

In this example, about 450 users log on within 8 minutes, right after a lecture. Assuming an (optimistic) maximum of about 80 users per node, that’s about 6 full nodes. If we were to add capacity for all of them as cost-effectively as possible, our cluster would need to:

  1. Start 6 new nodes, wait for them to boot & join the kubernetes cluster
  2. Pull the big user image (~11 GB in this case) to each new node
  3. Start new users on these nodes

No current cloud provider can actually do this in that time frame.

With z2jh’s placeholder pods, this situation improves significantly - you can have ‘headroom’ of about 2 emptyish nodes. So instead of needing to create 6 nodes in that time period, you only need to create 4.

Unfortunately no cloud provider is fast enough for this either!
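The node counts above are simple arithmetic, sketched here in Python. The function names are made up for illustration; the numbers (450 users, 80 users per node, 2 nodes of placeholder headroom) are the ones from this post.

```python
import math


def nodes_needed(users: int, users_per_node: int) -> int:
    """Whole nodes required to host `users` at `users_per_node` each."""
    return math.ceil(users / users_per_node)


def nodes_to_create(users: int, users_per_node: int, headroom_nodes: int = 0) -> int:
    """Nodes that still have to be started, given already-running headroom."""
    return max(0, nodes_needed(users, users_per_node) - headroom_nodes)


# 450 / 80 = 5.625, so 6 full nodes without any headroom.
print(nodes_needed(450, 80))
# With about 2 emptyish nodes held open by placeholder pods, 4 remain.
print(nodes_to_create(450, 80, headroom_nodes=2))
```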

So we have to over provision and run capacity we don’t need at all times. This reduces the cost benefits of autoscaling on a kubernetes cluster significantly.

Even with the placeholder pods, we’re nowhere near using the cluster efficiently! While there are other issues with scale down, needing this much headroom at such short notice causes a bunch of them.

Possible solutions?

The cluster autoscaler doesn’t know anything particular about JupyterHub or users, and only adds new nodes when the cluster is completely full, counting user pods and placeholder pods together. With the JupyterHub use case, we can do better. Some approaches are:

  1. Time-based scale-up. Often, you know when your big classes are, and can scale up explicitly just before classes start. The schedule could even live in a Google Calendar that others can update. We can try scaling down after classes too, but that’s probably left to a different time. The schedule could also be determined by looking at historical user data, but that’s not very foolproof.
  2. ???

I’m sure there are more solutions here that I can’t think of yet :smiley: Thoughts on what else we could do here?

Is there already a standard pre-class scale-up command that the teacher can run manually?

You could:

  1. Set the autoscaling minimum count to your desired count
  2. Resize the nodepool to that number

For Google Cloud, these should be fairly simple gcloud commands that can be run manually.

But that won’t pull the images, right? Changing the number of placeholders will, but I guess then the dance is a bit complicated, because we have to scale the placeholders down just as people start joining, to avoid doubling an already large number.

I think there is an “image prepuller” which runs on every node in the cluster. It is a pod that does nothing but “use” the image that the user pods use. This means the image should be on the node a few minutes after it starts up.

Maybe instead of scaling up the autoscaler minimum, which is a cloud-provider-specific command, we can provide a script/UI in the hub which changes the number of placeholder pods in the placeholder pod replicaset. This would then trigger a scale-up.

This would be provider agnostic which is nice. Placeholder pods have a lower priority than user pods so they would get evicted in order for a user pod to run. But you’d have to reduce the number of placeholder pods again once the class has “started” which might be tricky to know/automate.

Once you’ve built a UI or API for setting the number of placeholders in the hub you could automate it by creating a hub service that runs every minute and:

  1. counts the number of user pods
  2. sets number_of_placeholders = expected_class_size - current_number_of_users

Yes - that sounds very useful …

If you have the continuous pre-puller turned on (prePuller.continuous.enabled), spawning new nodes will automatically pull the image.

Thanks Yuvi and Tim - I hadn’t registered that a manual node scale will, by default, pull the images onto the node.

Timely post for me, Yuvi. We recently ran into this on our first GCP deployment, where the prof asks everyone in the class to log in. We found that if an admin clicks the “Start All” button a few minutes before, the time it takes a user to log in is greatly reduced.

I am making no claims that this is a solution, or even a good thing to do. We are experimenting :slight_smile:

More specific thoughts on what an MVP would look like, based on my current experiences.

The code should function as a reconciliation loop. On each iteration, it should:

  1. Read a data source that indicates what the minimum capacity of the cluster should be at the current time
  2. If the cluster capacity is lower than what the data source requires, we should set the autoscaler minimum node count to desired capacity and actively request enough nodes to match that. This will make sure the autoscaler doesn’t optimize away possible empty nodes since they aren’t currently in use
  3. If the current cluster capacity is higher than what the data source requires, we set the autoscaler minimum node count to the desired capacity, but we don’t try to automatically remove any nodes explicitly. We leave it to the autoscaler, which might take a while but will not disrupt anyone’s workflow.
  4. Go to step 1
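The loop above could be sketched roughly as follows. This is only an illustration under stated assumptions: `FakeCluster` is a hypothetical in-memory stand-in for whatever cloud provider API would really set the autoscaler minimum and resize the node pool, and `data_source` is any callable that returns the desired node count for the current time.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FakeCluster:
    """In-memory stand-in for a cloud provider API (hypothetical)."""
    node_count: int
    autoscaler_min: int = 0

    def set_autoscaler_min(self, n: int) -> None:
        self.autoscaler_min = n

    def request_nodes(self, n: int) -> None:
        # A real implementation would resize the node pool and wait for it.
        self.node_count = n


def reconcile_once(cluster: FakeCluster, desired_nodes: int) -> None:
    # Steps 2 and 3 both set the autoscaler minimum to the desired capacity.
    cluster.set_autoscaler_min(desired_nodes)
    if cluster.node_count < desired_nodes:
        # Step 2: below target, so actively request the missing nodes.
        cluster.request_nodes(desired_nodes)
    # Step 3: at or above target, remove nothing; the autoscaler scales down.


def run(cluster: FakeCluster, data_source: Callable[[], int],
        interval_seconds: float = 60.0, iterations: Optional[int] = None) -> None:
    done = 0
    while iterations is None or done < iterations:
        reconcile_once(cluster, data_source())  # step 1, then steps 2-3
        done += 1
        if iterations is None or done < iterations:
            time.sleep(interval_seconds)  # step 4: go round again


cluster = FakeCluster(node_count=4)
run(cluster, data_source=lambda: 11, interval_seconds=0, iterations=1)
print(cluster)
```

Keeping `reconcile_once` separate from the sleep loop means each iteration can be tested without waiting and without a real cluster.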

For the MVP, I want the data source to be a google calendar. I made an example here. It indicates that weekends should have a desired node count of 4, and a time around data8 classes M-W-F should have a desired node count of 11. We can grant write access to the calendar to whoever needs to make these determinations, and it also leaves a nice record of why something scaled up. It should also make unit testing easier!

It makes me sad we’re doing time-based autoscaling instead of just having the system dynamically figure it out. But such is life! And we can extend the ‘data source’ here in the future to things that are far more dynamic.

I think this approach should work well. Instead of having placeholder-pods be a single static value, we have two targets to meet:

  • minimum placeholder pods (gives scale-up lead time)
  • minimum user capacity

And the service always sets the placeholder pod replica count to max(min_pods, min_user_capacity - current_users)
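As a quick illustration of how that formula behaves while a class logs in, here is a minimal Python sketch. The function name and the example values (10 minimum placeholders, capacity for 450 users) are made up for illustration.

```python
def placeholder_replicas(min_placeholders: int, min_user_capacity: int,
                         current_users: int) -> int:
    """Placeholder replica count: max(min_pods, min_user_capacity - current_users)."""
    return max(min_placeholders, min_user_capacity - current_users)


# As users arrive, placeholders shrink, but never below the scale-up floor.
for users in (0, 200, 445, 500):
    print(users, placeholder_replicas(10, 450, users))
```

Because of the max(), the count never goes negative once more users than expected show up; it simply stays at the minimum.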

Once you have that, getting minimum_user_capacity from a time-based source shouldn’t be a big deal.

Here’s a sketch of everything except loading the source of the configuration from somewhere, but it’s at least organized in such a way that it expects the target value to be able to change each iteration: minrk/jupyterhub-placeholder-scaler (GitHub)

I got a little carried away today, and I think that sketch now does the whole thing:

  • two configurable targets: min placeholders, min capacity
  • read and parse .ics calendars whose event descriptions contain lines like min_capacity = 100, giving short-term overrides of these values over time, so you can set your cluster capacity in Google Calendar.
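To give a feel for the description parsing, here is a rough, standalone Python sketch. It is not the code from the repository, which may parse things differently. `min_capacity = 100` is the example from this post; the `min_placeholders` key name is an assumption made here for the second target.

```python
import re

# Accept lines such as "min_capacity = 100"; everything else is ignored.
# "min_placeholders" is an assumed key name, not confirmed by the post.
_OVERRIDE = re.compile(r"^\s*(min_capacity|min_placeholders)\s*=\s*(\d+)\s*$")


def parse_overrides(description: str) -> dict:
    """Collect integer overrides from an event description, one per line."""
    overrides = {}
    for line in description.splitlines():
        match = _OVERRIDE.match(line)
        if match:
            overrides[match.group(1)] = int(match.group(2))
    return overrides


print(parse_overrides("Big lecture, expect a login rush\nmin_capacity = 100\n"))
```

Ignoring lines that do not match lets the event description carry a human-readable note on why the capacity was raised.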

Predictably, ~half the time was spent dealing with weird timezone stuff in calendar files and I’m sure it’s still wrong sometimes.

That was a fun day! I couldn’t do a lot else, since my Internet was out all day (still is), so I could only work on things that would reasonably fit in my phone’s tethered cell connection.