Background for JupyterHub / Kubernetes cost calculations?

Hi,

As usual - please forgive my ignorance, but can I ask for help about calculating JupyterHub / Kubernetes costs?

I realized as I read through the docs that I didn’t understand the relationship between pods per node, machine types, and cost:

These docs suggest that, with more students, you need larger VMs (more CPUs and memory), but I wasn’t sure why. I think this is because I don’t understand the autoscaling model.

I assume the default Kubernetes / JupyterHub autoscaler first fills up the existing nodes with as many pods as fit in the node, and then makes a new node. Is that correct?

Is it easy to predict how many pods will fit on a node?

I assume too that it’s generally cheaper to have fewer larger nodes with more pods per node (depending on node price obviously). So, why have smaller nodes for smaller courses? Won’t larger nodes be cheaper even with smaller courses?

Thanks for any insight,

Cheers,

Matthew

Wanna check out some of the resources in the JupyterHub for Kubernetes costs section and see if those answer some questions?

Aha - yes - sorry - I should have said I saw that page too.

As I explore more, I think the primary answer to my question is:

  • The key thing predicting price is the amount of {RAM, disk capacity} per user that you allocate, not per node. Choosing the node type, after that, is about finding a suitable number of CPUs per user, relative to the memory per user.

I’m guessing that there are default values used by the scheduler to decide whether a node has enough spare memory and CPU to accommodate another pod - so if I know those values, I can predict how many pods fit on a node.
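Thinking out loud, I imagine the arithmetic looks roughly like this (the overhead number below is a pure guess on my part):

```python
# My guess at the back-of-envelope arithmetic: how many user pods fit on a
# node if spare memory is what the scheduler checks.
def users_per_node(node_ram_gb, mem_per_pod_gb, system_overhead_gb=0.5):
    # system_overhead_gb is a pure guess for whatever Kubernetes itself reserves
    allocatable = node_ram_gb - system_overhead_gb
    return int(allocatable // mem_per_pod_gb)

print(users_per_node(node_ram_gb=8, mem_per_pod_gb=0.5))  # -> 15, if my guess is right
```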

I guess also that nodes with a lot of memory give faster start-up times, because making a pod on an existing node is cheaper than making a new node.

Cheers,

Matthew

1 Like

Both of those things are true :+1: In general, RAM is the bottleneck*, and adding a node is almost always slower than adding a pod.

* This is partially because if a machine hits its RAM limit, everything stops working. If its CPUs are maxed out, things just get queued until cycles are available, so you have to be more careful about provisioning RAM than CPU.

1 Like

Yes and no. The default Kubernetes scheduler spreads pods across the available nodes. The scheduler changes we make in zero2jupyterhub (by now the default, I think) mean that a fuller-but-not-at-the-limit node is more likely to receive a new pod than a less full one. The idea is to pack nodes to the limit and only then move on to the next node.

I’m not sure if you “need to” use larger nodes or if it just becomes possible to use them. Larger nodes are nicer because they reduce the number of times you have to wait for a new node to boot, and they let you “smooth out” fluctuations in the CPU used by pods and in the number of running pods a bit more. This gets especially interesting if you over-subscribe a node’s CPU, which I’d highly recommend, because most users who launch a notebook spend most of their time reading and thinking, not running code. What the ratio of “running code” to “thinking” is depends on the use case. It is easier to over-subscribe a larger node than a small one, simply because there is more headroom: the maximum CPU you promise to an individual user is a smaller fraction of the total on a larger node than on a smaller one.
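To make the over-subscription point concrete, a back-of-envelope sketch (all numbers are made up for illustration, not recommendations):

```python
# CPU over-subscription on a single node, back of the envelope.
node_cpus = 8
users_per_node = 16
cpu_limit_per_user = 1.0       # ceiling each user can burst to
cpu_guarantee_per_user = 0.05  # what the scheduler actually reserves

reserved = users_per_node * cpu_guarantee_per_user  # 0.8 cores reserved
worst_case = users_per_node * cpu_limit_per_user    # 16 cores if everyone runs code at once
print(f"reserved {reserved} of {node_cpus} cores, "
      f"{worst_case / node_cpus:.0f}x over-subscribed at the limits")
```

As long as only a fraction of users are actually executing code at any moment, the node never gets near that worst case.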

This depends on your cloud vendor. I think for Google Cloud and Azure the cost is per CPU and RAM. So two nodes with 2 CPUs each cost exactly the same as one node with 4 CPUs.

If you don’t have the scale to fill and use a larger node, you end up paying for resources you don’t use. That makes it more expensive than a smaller node, which leaves less idle capacity when you have fewer users.
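A toy comparison with made-up prices, showing why node size alone doesn’t change the bill for the same total capacity:

```python
# Toy cost comparison: if the provider bills per vCPU-hour and per GB-hour,
# node size doesn't change the bill for the same total capacity.
# Prices below are invented for illustration only.
PRICE_PER_VCPU_HOUR = 0.03
PRICE_PER_GB_HOUR = 0.004

def hourly_cost(n_nodes, cpus_per_node, ram_gb_per_node):
    return n_nodes * (cpus_per_node * PRICE_PER_VCPU_HOUR
                      + ram_gb_per_node * PRICE_PER_GB_HOUR)

print(round(hourly_cost(2, 2, 8), 4))    # two 2-CPU / 8GB nodes   -> 0.184
print(round(hourly_cost(1, 4, 16), 4))   # one 4-CPU / 16GB node   -> 0.184, same bill
```

The cost difference only shows up through utilisation: idle capacity on a half-empty big node is what you end up paying extra for.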

2 Likes

These docs suggest that, with more students, you need larger VMs (more CPUs and memory), but I wasn’t sure why. I think this is because I don’t understand the autoscaling model.

I wouldn’t say it’s needed, generally, but it may be desirable. It doesn’t generally have a cost effect, though.

One reason for picking bigger nodes can be quotas. A cloud account often has a limit on the number of VMs it can run at once (and, separately, on the total number of CPUs and amount of RAM). So if you have a limit of 16 VMs but 256 CPUs, your 1-CPU-per-node cluster tops out at 16 CPUs, but your 16-CPU-per-node cluster tops out at 256.
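In code form, with the same hypothetical quota numbers:

```python
# Quota arithmetic from the example above (quota values are hypothetical).
vm_quota = 16    # max VMs the account may run at once
cpu_quota = 256  # max total CPUs

def max_cluster_cpus(cpus_per_node):
    return min(vm_quota * cpus_per_node, cpu_quota)

print(max_cluster_cpus(1))   # 16  -> the VM quota is the binding limit
print(max_cluster_cpus(16))  # 256 -> now the CPU quota is the binding limit
```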

Another reason is that you don’t want too many autoscale events, since launching is slow when that happens. The smaller your nodes, the smaller the “unused overhead” capacity that you are paying for but not using. The bigger your nodes, the fewer slowdowns you have when new capacity is needed. It’s also just easier to manage an 8-node cluster than a 32-node cluster with the same CPU/RAM limits.

Is it easy to predict how many pods will fit on a node?

I’m guessing that there are default values used by the scheduler to decide if the node has enough spare memory and CPU to accommodate another pod - so if I know those values, I can predict how many pods fit on a node.

Yes, kubernetes nodes have “capacity” in a few fields:

  • memory
  • cpu
  • max pod count (usually 110, regardless of node size)
  • other resources can be limited per node, such as attached volumes, GPUs, etc. which can vary according to cloud provider

and typically only one of these capacities will be the limiting factor (probably RAM). If you guarantee 500MB RAM per user on an 8GB node, you will get up to 16 users on there, depending on what else is reserving resources.
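Sketched as a calculation (illustrative numbers; the real allocatable values come from the node itself, e.g. via kubectl describe node):

```python
# Which capacity runs out first? Numbers are illustrative only.
node = {"memory_mb": 8192, "cpu_millicores": 2000, "max_pods": 110}
guarantee = {"memory_mb": 500, "cpu_millicores": 50}  # per user pod

fits = {
    "memory": node["memory_mb"] // guarantee["memory_mb"],
    "cpu": node["cpu_millicores"] // guarantee["cpu_millicores"],
    "pods": node["max_pods"],
}
print(fits)                     # {'memory': 16, 'cpu': 40, 'pods': 110}
print(min(fits, key=fits.get))  # -> 'memory' is the limiting factor here
```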

Kubernetes cluster scalers allocate a new node when a pod is requested but can’t reserve its “guaranteed resources” on the existing nodes. Capacity only takes resource guarantees into account; resource limits and actual usage are not considered in scheduling at all. So load is completely irrelevant to scheduling, only reservations of resources. That makes it easy to calculate what Kubernetes will assign to a node, but harder to do “real” load-based scheduling.
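The rule itself is simple enough to write down; a simplified sketch (not the real scheduler code):

```python
# Simplified version of the rule described above: a pod fits on a node if its
# *requests* (guarantees) fit within allocatable minus what is already requested.
# Actual usage never enters into it.
def pod_fits(node_allocatable_mb, already_requested_mb, pod_request_mb):
    return already_requested_mb + pod_request_mb <= node_allocatable_mb

# If no node returns True, the pod is unschedulable and the cluster autoscaler
# reacts to that by adding a node.
print(pod_fits(7500, 7200, 500))  # False -> this launch triggers a scale-up
```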

In typical cases, most JupyterHub users spend a whole lot of time being idle (reading or writing code, not running it), which means CPU guarantees can be quite low to zero. As @choldgraf mentioned, vastly oversubscribing CPUs is generally fine since the cost of 200% CPU usage is a bit of slowness, while the cost of 101% memory usage can be all kinds of failures.

The exception is if it is highly likely that you are going to have many users actually running CPU-intensive work at the same time (for example: an in-the-room machine learning workshop). In this sort of case, CPU reservations start to make a lot more sense.

The approach I typically take:

  1. assign a RAM guarantee based on ‘average’ usage. Often small, e.g. 500MB (a config sketch for steps 1, 2, and 5 follows this list)
  2. assign a RAM limit beyond which you don’t want to allow users to go, limiting their ability to cause trouble for other users on the same node. e.g. 2GB. You don’t want the guarantee and limit to be too far from each other, since that makes it easier for users to cause trouble by overbooking.
  3. pick a node memory size such that my cluster will be ~5-15 nodes when everybody is active (mem_per_node ~= mem_guarantee * n_concurrent_users / 10) (reasoning: limits idle capacity to 10-20%, plus it’s just a manageable size for a cluster)
  4. based on users per node, pick a CPU/memory ratio based on how much you expect users to actually be running code at the same time
  5. set a CPU limit accordingly (1 is fine for a lot of deployments, I often do 2)
  6. think about how many users taking up the given limits it would take to cause problems on the node and maybe re-evaluate some combination of limits, guarantees, or node type.
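For steps 1, 2, and 5, the settings look something like this in a plain jupyterhub_config.py (values are examples, not recommendations; with zero-to-jupyterhub the same knobs live, if I remember right, under singleuser.memory and singleuser.cpu in the chart config):

```python
# jupyterhub_config.py sketch for steps 1, 2, and 5. Example values only.
c = get_config()  # noqa -- provided when JupyterHub loads this file

c.Spawner.mem_guarantee = "512M"  # step 1: what the scheduler reserves per user
c.Spawner.mem_limit = "2G"        # step 2: hard ceiling per user
c.Spawner.cpu_guarantee = 0.05    # small reservation, since most users are idle
c.Spawner.cpu_limit = 2           # step 5: burst ceiling
```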

Let’s take an example where I want to reserve 512MB per user with a 2GB upper limit and I’ll have a max of ~100 users at the same time. Step 3 gives mem_per_node = 512MB * 100 / 10 = 5GB, so picking from node flavors, I could fit on 13 4GB nodes or 7 8GB nodes. If I pick 8GB nodes on GKE, that’s ~16 users/node. On GKE, I can choose from three ratios: 1 core per 2, 4, or 8 users, or even mix my own CPU:RAM if I want. If I had 1 core per 4 users, I’d probably set the CPU limit at 1.
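Or as a quick script (same numbers as above, treating 512MB as roughly 0.5GB and ignoring per-node system overhead):

```python
# The sizing example above, worked through in code.
import math

mem_guarantee_gb = 0.5  # treating 512MB as ~0.5GB
n_concurrent_users = 100
total_guaranteed_gb = mem_guarantee_gb * n_concurrent_users  # 50 GB

# step 3: aim for roughly 10 nodes when everybody is active
target_mem_per_node_gb = total_guaranteed_gb / 10
print(f"target memory per node: ~{target_mem_per_node_gb:.0f}GB")  # -> ~5GB

for node_gb in (4, 8):
    n_nodes = math.ceil(total_guaranteed_gb / node_gb)
    users_per_node = int(node_gb / mem_guarantee_gb)
    print(f"{node_gb}GB nodes: {n_nodes} nodes, ~{users_per_node} users per node")
# 4GB nodes: 13 nodes, ~8 users per node
# 8GB nodes: 7 nodes, ~16 users per node
```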

One of the nice things about Kubernetes is that it’s easy to change these resource guarantees over time as needs change. So if you aren’t confident in your estimates of resource usage, start conservative (high guarantee == limit), then monitor actual load for a while (e.g. kubectl top pod and kubectl top node, or metrics in prometheus/grafana) and lower the reservations over time until they match what folks actually use. Changing node flavor isn’t quite as easy, but it is not usually a big task to add a new node pool of a different flavor and cordon and scale down the old one. We’ve re-evaluated this periodically on mybinder.org, including this analysis of current load with estimates of how bad it could get if folks actually used their limits instead of fitting within the guarantees.

Yes, starting a new node can take up to a few minutes, depending on the size of user images, disk speed, etc. The placeholder pods mitigate this by requesting a new node before you strictly need it. The earlier you request the node, the less likely users are to take a performance hit when they start needing it, but it also means you start paying for the node before you ‘need’ it, because you are assuming a user will need it ‘soon’.
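For reference, the placeholder count is an ordinary chart setting; expressed here as a Python dict you could dump to YAML for your helm config (key names are from memory, so double-check against the zero-to-jupyterhub docs):

```python
# Hedged sketch of the zero-to-jupyterhub values that control placeholder pods.
# Key names are from memory; verify against the z2jh documentation.
import yaml  # pip install pyyaml

helm_values = {
    "scheduling": {
        "userPlaceholder": {
            "enabled": True,
            "replicas": 4,  # keep ~4 users' worth of capacity warm ahead of demand
        }
    }
}
print(yaml.safe_dump(helm_values))
```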

A whole additional hitch in cost calculations on cloud services is sustained-use and committed-use discounts. These discounts mean there can be a benefit to having a fixed baseline capacity that you don’t scale down: the more stable your usage, the lower the cost.

3 Likes