Planning placeholders with jupyterhub helm chart 0.8 (tested on mybinder.org)

#1

The main new features of the upcoming 0.8 release of the jupyterhub helm chart are related to improvements and control in scheduling and scaling. With Kubernetes 1.11 enabling pod priorities, the jupyterhub chart can run a number of low-priority “placeholder” pods that request the same resources as user pods, with one important difference: when a ‘real’ user pod launches and resources are full, a placeholder will get evicted immediately, due to its low priority. Having pods that can’t find anywhere to run is what causes the autoscaler to request a new node. The goal here is to prompt the autoscaler to request a new node slightly before any real user needs it, reducing the delay in launch seen by users, which can be up to several minutes if they are unlucky.

Here is a snapshot of a scale-up event at around 07:00 on mybinder.org, before placeholders were deployed:

We can see there’s a huge increase in launch time and even a few failures as users wait for the new node to warm up! This is precisely what the placeholders are meant to help with. The question becomes: how many placeholders do I need?

We recently deployed placeholders on mybinder.org, and I’ll use our grafana charts to see if we can pick a reasonable number.

First, some numbers on the deployment at mybinder.org:

  • we tend to have ~200-400 users at a time
  • we typically launch 75-150 pods in a given 10 minute span
  • current resource allocations put ~100 users on each node

We started with 10 placeholders as an initial guess.

So what should it look like if the placeholders are doing their job? What is the ‘ideal’ sequence of events for placeholders?

  • Starting with a full cluster, we have all our placeholder pods running, taking up resources.
  • A new user shows up and doesn’t fit. Kubernetes evicts one low-priority placeholder pod to make room.
  • Now that there is a pod with nowhere to fit (and one we don’t care about starting promptly, since it’s just a placeholder!), the autoscaler requests a new node to make room for it.
  • While that new node is warming up, users keep showing up, kicking placeholders off of the nodes one after another.
  • The new node should become ready before the first ‘real’ user pod can’t be scheduled on the existing nodes.
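
The sequence above can be sketched as a toy model in Python. This is only an illustration: it assumes users arrive at a constant rate and that each arriving user evicts exactly one placeholder, and the function name and the example numbers (10 launches per minute, 2 minutes for a node to become ready) are assumptions for the sketch, not measurements.

```python
def scale_up_outcome(placeholders, launches_per_minute, node_ready_minutes):
    """Toy model of one scale-up event.

    Returns (buffer_minutes, waiting_users):
    - buffer_minutes: how long the placeholders absorb new users
    - waiting_users: users who arrive after the placeholders run out
      but before the new node is ready
    """
    buffer_minutes = placeholders / launches_per_minute
    arrivals = launches_per_minute * node_ready_minutes
    waiting_users = max(0, arrivals - placeholders)
    return buffer_minutes, waiting_users


# Hypothetical numbers: 10 launches per minute, node ready after 2 minutes.
for count in (0, 10, 20):
    buffer_minutes, waiting = scale_up_outcome(count, 10, 2)
    print(f"{count} placeholders: {buffer_minutes} min of buffer, {waiting} users waiting")
```

In this model, users only stop waiting once the buffer is at least as long as the node start-up time.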

So what does that mean for us? It means that the number of placeholder pods should be roughly the number of users who typically launch during the time it takes for a new node to be requested and ready. How do we figure this out?

Let’s look at a recent scale-up event with our 10 placeholder pods:

We see the following info:

  1. all 10 placeholder pods are evicted within 15 seconds around 09:57
  2. placeholder pods start coming back at 09:58, and finish coming back at 09:59

Because there was a span where no placeholder pods were running, it is very likely that some users faced a delay due to the new node. We can see this by looking at launch times around this event:

Note: the increase in launch time is observed slightly after the autoscale event, since launch times are registered when the launch finishes, not when it starts.

And indeed, there was a significant rise in launch time. We can make a useful point here: even though there is still a rise in launch time, it’s already much better than before the placeholders (these autoscale events were only one day apart, before and after enabling the placeholders). Any number of placeholders will reduce the cost of a scale-up event, so you don’t have to pick exactly the right number. Still, we want to pick a number such that we’ve ~always got at least one placeholder pod running, because any time a placeholder is running, that should mean that no users are waiting for a spot to open up.

Based on the launch rate of ~100/10 minutes:

and the time to launch the node of slightly over 2 minutes, we can get a new estimate:

2 minutes * 100 launches / 10 minutes = 20 placeholders. To be safe, we will try increasing the placeholder count to 25 and see how it goes!
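
The back-of-the-envelope estimate above can be written as a small Python helper. This is just a sketch: the function name and the safety_factor parameter are made up for illustration and are not part of the helm chart.

```python
import math


def estimate_placeholders(launches, window_minutes, node_ready_minutes,
                          safety_factor=1.0):
    """Estimate how many placeholders cover one node start-up.

    launches / window_minutes is the observed launch rate, and
    node_ready_minutes is how long a new node takes to become ready.
    safety_factor is a hypothetical knob for adding some headroom.
    """
    launch_rate = launches / window_minutes
    return math.ceil(launch_rate * node_ready_minutes * safety_factor)


# Numbers from this post: ~100 launches per 10 minutes, ~2 minutes per node.
print(estimate_placeholders(100, 10, 2))                      # 20
print(estimate_placeholders(100, 10, 2, safety_factor=1.25))  # 25
```

Plug in your own launch rate and node start-up time from your monitoring to get a starting point, then adjust based on what you observe.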

Of course, if you have cloud budget to burn, you could use placeholders to always have a fully empty node by setting the placeholder count to the maximum number of users that fit on a node (100 for mybinder.org). That way you can be pretty sure a node is always ready for your users, but it also means you are always paying for a node you aren’t using.

To enable placeholder pods on your deployment:

    scheduling:
      podPriority:
        enabled: true
      userPlaceholder:
        enabled: true
        replicas: 10

#2

Nice work!

Before reading this post, my answer to “how long does it take a new node to be fully ready” would have been ~8 minutes: around 2 minutes for the node to be provisioned, and the rest waiting for the first few images to be fully pulled from the registry. So maybe we need even more placeholders? Luckily we can just watch mybinder.org :smiley:

Thinking about having 20 (or even more) placeholder pods: can we increase the resources “hogged” by a placeholder pod to N times what a user pod gets as a way to reduce the number of placeholder pods (taking up space in kubectl get pods output and such)? Then we tune the resources allocated to the pod instead of the number of pods.


#3

We could do that, but then it wouldn’t consume e.g. the pod count resource.


#4

I think the placeholders may improve the responsiveness of new nodes because they are all only requesting the probably-already-present-but-tiny-if-not pause image. One of the biggest costs of a new node on Binder is the fact that even in just two minutes, ~20-30 users could be waiting to launch 20 different 8GB images as soon as the node becomes ready. This means that there’s instantly tons of contention on docker pulls on the new node. Placeholders allow the node to be fully ready and waiting to pull as soon as the first user shows up on it.

It should also be noted that the new node is still going to result in a slowdown of spawns as it warms up, since it has to pull fresh images. But at least starting from 0 should be better than starting with a backlog of 10-30 simultaneously queued pulls.


#5

Short summary
Enabling user-placeholders requires you to also enable podPriority and use a cluster autoscaler. At that point, activating the user scheduler as well allows you to scale down better.

Why and How?

The purpose of the user placeholders is to let the cluster autoscaler add a node ahead of time, so that actual users do not have to wait for this process. It therefore only makes sense to enable user-placeholders if you have a cluster autoscaler. Now, if you have a cluster autoscaler to scale up, you should also enable the userScheduler, which allows you to scale down more efficiently!

    scheduling:
      podPriority:
        enabled: true
      userPlaceholder:
        enabled: true
        replicas: 10
      userScheduler:
        enabled: true

The userScheduler’s task is to make sure the user pods are scheduled so that they fill up one node at a time. Without it, the default scheduler will instead spread the pods out across the available nodes. Why is filling up nodes one at a time better when a cluster autoscaler is active? Because it allows nodes to free up more quickly, and that in turn allows the cluster autoscaler to remove nodes, saving money!
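
To illustrate the difference between packing and spreading, here is a toy Python simulation. It is not the actual algorithm of the userScheduler or the default scheduler, just a simplified model where every pod takes one slot; the function name and numbers are assumptions for the sketch.

```python
def place_pods(num_pods, num_nodes, capacity, strategy):
    """Place pods one at a time and return the pod count per node.

    strategy "pack" picks the fullest node that still has room,
    strategy "spread" picks the emptiest node.
    """
    nodes = [0] * num_nodes
    for _ in range(num_pods):
        candidates = [i for i, used in enumerate(nodes) if used < capacity]
        if not candidates:
            raise ValueError("cluster is full")
        if strategy == "pack":
            chosen = max(candidates, key=lambda i: nodes[i])
        else:
            chosen = min(candidates, key=lambda i: nodes[i])
        nodes[chosen] += 1
    return nodes


# Hypothetical cluster: 3 nodes with room for 4 user pods each, 6 users.
packed = place_pods(6, 3, 4, "pack")
spread = place_pods(6, 3, 4, "spread")
print("pack:  ", packed, "empty nodes:", packed.count(0))
print("spread:", spread, "empty nodes:", spread.count(0))
```

With packing, one node stays empty and is a candidate for removal by the cluster autoscaler; with spreading, every node has users on it.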


#6

That would come with multiple drawbacks, so to avoid showing the user-placeholder pods in kubectl queries, I would suggest using selectors to filter them out instead. The user-placeholder pods have the label component: user-placeholder, which allows us to single them out.

Example:

    # not user-placeholder
    kubectl get pods -l component!=user-placeholder

    # actual users only
    kubectl get pods -l component=singleuser-server

Drawbacks of a single large user-placeholder pod:

  • It is somewhat harder to do in practice: you would have to change the pod definition and multiply values in helm that are strings (example: how do you multiply the strings “2GiB”, “2G”, “2”?).
  • A large placeholder may be too big to fit on any node at all, in which case the CA cannot scale up for it.
  • A large placeholder may not fit on two partially used nodes, while it would fit if it was split up.
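
The last drawback can be shown with a small Python sketch. It uses a simple first-fit pass over nodes, with free capacity measured in hypothetical "user-sized" units; this is an illustration, not how the Kubernetes scheduler actually decides.

```python
def first_fit(free_per_node, pod_sizes):
    """Return True if a first-fit pass (largest pods first) places every pod.

    free_per_node: free capacity on each node, in user-sized units
    pod_sizes: size of each pod to place, in the same units
    """
    free = list(free_per_node)
    for size in sorted(pod_sizes, reverse=True):
        for i, room in enumerate(free):
            if size <= room:
                free[i] -= size
                break
        else:
            # No node had enough room for this pod.
            return False
    return True


# Two partially used nodes, each with room for 3 more users (made-up numbers).
free = [3, 3]
print(first_fit(free, [5]))      # one big placeholder worth 5 users: False
print(first_fit(free, [1] * 5))  # five user-sized placeholders: True
```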

With this in mind, I think sticking with user-sized placeholders is better. That said, I think it could make sense to allow custom resource specifications for different user-placeholders. This could be relevant if you have multiple types of nodes, for example GPU nodes vs. CPU-only nodes.


#7

Thanks for the thoughts on “one big placeholder” vs “user sized placeholders”. I think user sized placeholders make a lot of sense and will augment my kubectl to hide them :stuck_out_tongue: