These docs suggest that, with more students, you need larger VMs (more CPUs and memory), but I wasn’t sure why. I think this is because I don’t understand the autoscaling model.
I wouldn’t say it’s needed, generally, but it may be desirable. It usually doesn’t have much of a cost effect, though.
One reason for picking bigger nodes can be quotas. A cloud account often has a limit on the number of VMs it can run at once (and, separately, on the total CPUs and RAM). So if you have a limit of 16 VMs but 256 CPUs, a cluster of 1-CPU nodes tops out at 16 CPUs, while a cluster of 16-CPU nodes tops out at 256.
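To make the quota effect concrete, here’s a quick back-of-the-envelope in Python (the 16-VM / 256-CPU quota is just the example above):

```python
# Max CPUs reachable under both a VM-count quota and a CPU quota.
vm_quota, cpu_quota = 16, 256

for cpus_per_node in (1, 4, 16):
    reachable = min(vm_quota * cpus_per_node, cpu_quota)
    print(f"{cpus_per_node:>2}-CPU nodes: at most {reachable} CPUs")
#  1-CPU nodes: at most 16 CPUs
#  4-CPU nodes: at most 64 CPUs
# 16-CPU nodes: at most 256 CPUs
```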
Another reason is that you don’t want too many autoscale events, since launching a new node is slow. It’s a trade-off: the smaller your nodes, the less “unused overhead” capacity you are paying for but not using; the bigger your nodes, the fewer slowdowns you hit when new capacity is needed. It’s also just easier to manage an 8-node cluster than a 32-node cluster with the same total CPU/RAM.
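A rough sketch of that trade-off (the 100-CPU demand figure is an arbitrary number I picked):

```python
import math

# For a fixed demand, bigger nodes mean fewer nodes and fewer scale-up events,
# but each scale-up adds a bigger chunk of capacity you pay for before it's full.
demand_cpus = 100

for cpus_per_node in (1, 4, 16):
    nodes = math.ceil(demand_cpus / cpus_per_node)
    spare = nodes * cpus_per_node - demand_cpus   # headroom you pay for right now
    print(f"{cpus_per_node:>2} CPUs/node: {nodes:>3} nodes, "
          f"{spare} spare CPUs, up to {cpus_per_node} idle CPUs after each scale-up")
```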
Is it easy to predict how many pods will fit on a node?
I’m guessing that there are default values used by the scheduler to decide if the node has enough spare memory and CPU to accommodate another pod - so if I know those values, I can predict how many pods fit on a node.
Yes, Kubernetes nodes have “capacity” in a few fields:
- memory
- cpu
- max pod count (usually 110, regardless of node size)
- other resources can be limited per node, such as attached volumes, GPUs, etc., which vary by cloud provider
and typically only one of these capacities will be the limiting factor (probably RAM). If you guarantee 500MB RAM per user on an 8GB node, you will get up to 16 users on there, depending on what else is reserving resources.
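If you want to check against a live cluster rather than guessing, something like this works (using the official `kubernetes` Python client; the 500MB guarantee is the number from the example above, and “allocatable” is what the scheduler actually packs against, a bit less than raw capacity because of system reservations):

```python
from kubernetes import client, config

UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "K": 10**3, "M": 10**6, "G": 10**9}

def to_bytes(quantity: str) -> int:
    """Convert a Kubernetes memory quantity like '7543116Ki' or '8Gi' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)  # plain bytes

mem_guarantee = 500 * 10**6  # 500MB per user, matching the example above

config.load_kube_config()
for node in client.CoreV1Api().list_node().items:
    allocatable = to_bytes(node.status.allocatable["memory"])
    print(f"{node.metadata.name}: ~{allocatable // mem_guarantee} user pods by RAM guarantee")
```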
Kubernetes cluster autoscalers allocate a new node when a pod is requested but its “guaranteed resources” can’t be reserved on the existing nodes. Scheduling only takes resource guarantees (requests) into account; resource limits and actual usage are not considered at all. So load is completely irrelevant to scheduling, only reservations of resources. That makes it easy to calculate what Kubernetes will assign to a node, but harder to do ‘real’ load-based scheduling.
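An over-simplified model of that fit test, just to show what does and doesn’t count (all numbers illustrative):

```python
# Only requests (guarantees) count against allocatable. Limits and actual usage
# never appear here -- a node full of idle-but-reserved pods still looks "full".
def needs_new_node(existing_nodes, pod_request_mb):
    """existing_nodes: list of (allocatable_mb, sum_of_requests_mb) tuples."""
    fits_somewhere = any(
        reserved + pod_request_mb <= allocatable
        for allocatable, reserved in existing_nodes
    )
    return not fits_somewhere

# One node with ~7500MB allocatable:
print(needs_new_node([(7500, 13 * 512)], 512))  # False: a 14th 512MB pod still fits
print(needs_new_node([(7500, 14 * 512)], 512))  # True: the 15th triggers a scale-up
```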
In typical cases, most JupyterHub users spend a whole lot of time idle (reading or writing code, not running it), which means CPU guarantees can be quite low, even zero. As @choldgraf mentioned, vastly oversubscribing CPUs is generally fine, since the cost of 200% CPU usage is a bit of slowness, while the cost of 101% memory usage can be all kinds of failures.
The exception is if it is highly likely that many users will actually be running CPU-intensive work at the same time (for example: an in-the-room machine learning workshop). In that sort of case, CPU reservations start to make a lot more sense.
The approach I typically take (sketched in code right after this list):
- assign a RAM guarantee based on ‘average’ usage. Often small, e.g. 500MB
- assign a RAM limit beyond which you don’t want to allow users to go (e.g. 2GB), limiting their ability to cause trouble for other users on the same node. You don’t want the guarantee and limit to be too far apart, since a big gap makes it easier for users to cause trouble by overbooking.
- pick a node memory size such that my cluster will be ~5-15 nodes when everybody is active (mem_per_node ~= mem_guarantee * n_concurrent_users / 10) (reasoning: limits idle capacity to 10-20%, plus it’s just a manageable size for a cluster)
- based on users per node, pick a CPU:memory ratio according to how much you expect users to actually be running code at the same time
- set a CPU limit accordingly (1 is fine for a lot of deployments, I often do 2)
- think about how many users running at those limits it would take to cause problems on the node, and maybe re-evaluate some combination of limits, guarantees, or node type.
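Here’s that heuristic written down as a rough Python sketch (the ~10-node target and the flavor list are assumptions you’d adjust for your cloud; `plan_nodes` is just a name I made up, not anything from a library):

```python
import math

def plan_nodes(mem_guarantee_gb, mem_limit_gb, n_concurrent_users,
               target_nodes=10, node_flavors_gb=(4, 8, 16, 32)):
    """Rough version of the sizing heuristic above; a starting point, not a rule."""
    mem_per_node = mem_guarantee_gb * n_concurrent_users / target_nodes
    flavor = min(node_flavors_gb, key=lambda gb: abs(gb - mem_per_node))
    users_per_node = int(flavor // mem_guarantee_gb)
    return {
        "mem_per_node_target_gb": mem_per_node,
        "node_flavor_gb": flavor,
        "users_per_node": users_per_node,
        "n_nodes": math.ceil(n_concurrent_users / users_per_node),
        # last step: how overcommitted is RAM if everyone hits their limit at once?
        "worst_case_mem_overcommit": round(users_per_node * mem_limit_gb / flavor, 2),
    }
```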
Let’s take an example where I want to reserve 512MB per user with a 2GB upper limit, and I’ll have a max of ~100 users at the same time. The third step above gives mem_per_node = 512MB × 100 / 10 ≈ 5GB, so picking from node flavors, I could fit everyone on 13 × 4GB nodes or 7 × 8GB nodes. If I pick 8GB nodes on GKE, that’s ~16 users/node. On GKE, I can choose from three ratios: 1 core per 2, 4, or 8 users, or even mix my own CPU:RAM if I want. If I had 1 core per 4 users, I’d probably set the CPU limit at 1.
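Plugging those example numbers into the `plan_nodes` sketch above reproduces both options:

```python
>>> plan_nodes(mem_guarantee_gb=0.5, mem_limit_gb=2, n_concurrent_users=100)
{'mem_per_node_target_gb': 5.0, 'node_flavor_gb': 4, 'users_per_node': 8, 'n_nodes': 13, 'worst_case_mem_overcommit': 4.0}
>>> plan_nodes(mem_guarantee_gb=0.5, mem_limit_gb=2, n_concurrent_users=100,
...            node_flavors_gb=(8,))  # force the 8GB flavor instead
{'mem_per_node_target_gb': 5.0, 'node_flavor_gb': 8, 'users_per_node': 16, 'n_nodes': 7, 'worst_case_mem_overcommit': 4.0}
```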
One of the nice things about Kubernetes is that it’s easy to change these resource guarantees over time as needs change. So if you aren’t confident in your estimates of resource usage, start conservative (a high guarantee, equal to the limit), monitor actual load for a while (e.g. `kubectl top pod` and `kubectl top node`, or metrics in prometheus/grafana), and lower the reservations over time until they match what folks actually use. Changing node flavor isn’t quite as easy, but it is not usually a big task to add a new node pool of a different flavor, then cordon and scale down the old one. We’ve re-evaluated this periodically on mybinder.org, including this analysis of current load with estimates of how bad it could get if folks actually used their limits instead of fitting within the guarantees.
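For reference, that conservative starting point might look something like this if you drive KubeSpawner directly in `jupyterhub_config.py` (Zero to JupyterHub exposes the same knobs as `singleuser.memory` and `singleuser.cpu` in the helm chart values); the numbers are placeholders:

```python
# jupyterhub_config.py -- start with guarantee == limit, then lower the
# guarantees once `kubectl top` / grafana show what users actually consume.
c.KubeSpawner.mem_guarantee = "2G"
c.KubeSpawner.mem_limit = "2G"
c.KubeSpawner.cpu_guarantee = 1.0
c.KubeSpawner.cpu_limit = 1.0
```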
Yes, starting a new node can take up to a few minutes, depending on the size of user images, disk speed, etc. The placeholder pods mitigate this by requesting a new node before you strictly need it. The earlier you request the node, the less likely users are to see a performance hit when they start needing it, but it also means you start paying for the node before you ‘need’ it, on the assumption that a user will need it ‘soon.’
A whole additional hitch in cost calculations on cloud services is sustained- and committed-use discounts. These discounts mean there can be a benefit to having a fixed baseline capacity that you never scale down, because the more stable your usage, the lower the cost.
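As a toy example of why a stable baseline can pay off (the 30% discount and the usage curve are made-up numbers; real discounts vary by provider and commitment length):

```python
# Toy comparison: pure autoscaling at on-demand rates vs. a committed baseline
# (discounted) plus autoscaled burst on top. Every number here is invented.
on_demand_rate = 1.0        # cost per node-hour, normalized
committed_discount = 0.30   # pretend 1-year committed-use discount
usage = [4] * 8 + [10] * 8 + [6] * 8   # nodes needed in each hour of a day

pure_autoscale = sum(n * on_demand_rate for n in usage)

baseline = min(usage)       # nodes you commit to and never scale down
committed = baseline * len(usage) * on_demand_rate * (1 - committed_discount)
burst = sum((n - baseline) * on_demand_rate for n in usage)

print(f"pure autoscale: {pure_autoscale:.0f}  vs  baseline+burst: {committed + burst:.0f}")
# pure autoscale: 160  vs  baseline+burst: 131
```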