The main new features in the upcoming 0.8 release of the JupyterHub Helm chart give you more control over scheduling and scaling. Now that Kubernetes 1.11 enables pod priorities, the chart can run a number of low-priority “placeholder” pods that request the same resources as user pods, with one important difference: when a ‘real’ user pod launches and the cluster is full, a placeholder is evicted immediately because of its low priority. Having pods that can’t find anywhere to run is what causes the autoscaler to request a new node. The goal is to prompt the autoscaler to request a new node slightly before any real user needs it, reducing the launch delay seen by users, which can be up to several minutes if they are unlucky.
Here is a snapshot of a scale-up event at around 07:00 on mybinder.org, before placeholders were deployed:
We can see there’s a huge increase in launch time and even a few failures as users wait for the new node to warm up! This is precisely what the placeholders are meant to help with. The question becomes: how many placeholders do I need?
First, some numbers on the deployment at mybinder.org:
- we tend to have ~200-400 users at a time
- we typically launch 75-150 pods in a given 10 minute span
- current resource allocations put ~100 users on each node
We started with 10 placeholders as an initial guess.
So what should it look like if the placeholders are doing their job? What is the ‘ideal’ sequence of events for placeholders?
- Starting with a full cluster, we have all our placeholder pods running, taking up resources.
- A new user shows up and doesn’t fit. Kubernetes evicts one low-priority placeholder pod to make room.
- Now that there is a pod with nowhere to fit (and one we don’t care about starting promptly, since it’s just a placeholder!), the autoscaler requests a new node to make room for it.
- While that new node is warming up, users keep showing up, kicking placeholders off of the nodes, one after another.
- The new node should become ready before the first ‘real’ user pod can’t be scheduled on the existing nodes.
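The sequence above can be sketched as a toy model. This is only an illustration, not how the Kubernetes scheduler or autoscaler is implemented: it assumes users arrive at a perfectly steady rate, that the first arrival at a full cluster triggers the scale-up, and that the new node takes a fixed time to become ready. The function name and its parameters are hypothetical.

```python
import math


def simulate_scale_up(placeholders, launches_per_minute, node_startup_minutes):
    """Toy model: count real users who must wait for the new node.

    Assumes the cluster is full at t=0, users arrive at a steady rate,
    and the new node is ready exactly node_startup_minutes later.
    """
    # Users who arrive before the new node is ready (the first one at t=0).
    arrivals = math.ceil(launches_per_minute * node_startup_minutes)
    remaining = placeholders
    delayed = 0
    for _ in range(arrivals):
        if remaining > 0:
            remaining -= 1  # evict one placeholder; the user starts right away
        else:
            delayed += 1  # nothing left to evict; the user waits for the node
    return delayed


# Illustrative numbers only: 10 launches per minute, a 2 minute node startup.
print(simulate_scale_up(placeholders=10, launches_per_minute=10, node_startup_minutes=2))
print(simulate_scale_up(placeholders=25, launches_per_minute=10, node_startup_minutes=2))
```

In this model, no user waits as long as the placeholder count is at least the number of launches that happen while the node starts up.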
So what does that mean for us? It means that the number of placeholder pods should be roughly the number of users who typically launch during the time it takes for a new node to be requested and ready. How do we figure this out?
Let’s look at a recent scale-up event with our 10 placeholder pods:
We see the following info:
- all 10 placeholder pods are evicted within 15 seconds around 09:57
- placeholder pods start coming back at 09:58, and finish coming back at 09:59
Because there was a span where no placeholder pods were running, it is very likely that some users faced a delay due to the new node. We can see this by looking at launch times around this event:
Note: the launch time increase is observed slightly after the autoscale event, since launch times are registered when the launch finishes, not when it starts.
And indeed, there was a significant rise in launch time. There is a useful point to make here: even though launch time still rises, it’s already much better than before the placeholders (these autoscale events were only one day apart, before and after enabling the placeholders). Any number of placeholders will reduce the cost of a scale-up event, so you don’t have to pick exactly the right number. Still, we want to pick a number such that we’ve ~always got at least one placeholder pod running, because any time a placeholder is running, that should mean that no users are waiting for a spot to open up.
Based on the launch rate of ~100/10 minutes:
and the time to launch the node of slightly over 2 minutes, we can get a new estimate:
2 minutes * 100 launches / 10 minutes = 20 placeholders. To be safe, we will try increasing the placeholder count to 25 and see how it goes!
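The same arithmetic can be wrapped in a small helper so you can plug in your own numbers. This is a minimal sketch: the function name and the `safety_factor` parameter are assumptions for illustration, and the factor of 1.25 is simply the value that turns the estimate of 20 into the 25 chosen above, not a recommended constant.

```python
import math


def estimate_placeholders(launches, window_minutes, node_startup_minutes,
                          safety_factor=1.0):
    """Estimate placeholders as the launches expected during node startup.

    launches / window_minutes is the observed launch rate, and
    safety_factor is a hypothetical multiplier for extra headroom.
    """
    launches_per_minute = launches / window_minutes
    expected = launches_per_minute * node_startup_minutes
    return math.ceil(expected * safety_factor)


# Numbers from this post: ~100 launches per 10 minutes, ~2 minutes per node.
print(estimate_placeholders(100, 10, 2))
# With an assumed 25% margin on top of the estimate.
print(estimate_placeholders(100, 10, 2, safety_factor=1.25))
```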
Of course, if you have cloud budget to burn, you could use placeholders to always have a fully empty node by setting the placeholder count to the maximum number of users that fit on a node (100 for mybinder.org). That way you can be pretty sure a node is always ready for your users, but it also means you are always paying for a node you aren’t using.
To enable placeholder pods on your deployment:
    scheduling:
      podPriority:
        enabled: true
      userPlaceholder:
        enabled: true
        replicas: 10
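If you compute the placeholder count from your own launch data, a small script can emit the matching config fragment. The sketch below only formats a string. The keys are the ones from the snippet above, while the function name is hypothetical and the value 25 is the count we plan to try on mybinder.org, not a general recommendation.

```python
def placeholder_values(replicas):
    """Return a values fragment enabling placeholders with the given count."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return (
        "scheduling:\n"
        "  podPriority:\n"
        "    enabled: true\n"
        "  userPlaceholder:\n"
        "    enabled: true\n"
        f"    replicas: {replicas}\n"
    )


print(placeholder_values(25))
```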