What is the minimum VM spec I can expect to get away with for the non-user parts of JupyterHub / Kubernetes?
I mean - let’s say I have a user pool with some standard usable machines on it, and a default pool on which I am running the always-on parts of the system - such as the hub, the proxy, the image puller, and the scheduler. What is the minimum spec I can expect to get away with for the machines supporting the always-on components?
For example, let’s say I have a small class - of 40 students or so - so the maximum number of users hitting the system at the same time will be somewhere less than 40. Can I get away with an always-on machine type in the range of a Google f1-micro (20% of 1 CPU, 0.6 GB memory) or g1-small (50% of 1 CPU, 1.7 GB memory)?
For the core nodes where the hub and configurable-http-proxy run, it kind of depends on your spawner (assuming KubeSpawner with z2jh) and how you’re doing auth, networking, storage, and database, right? For reference, our core nodes are 8 CPU x 32 GB RAM and we never have resource issues with them, because the hub doesn’t really go above 1 CPU. I’ve got 2K notebook pods in a testing environment right now and the hub’s CPU usage is averaging about 0.5 with RAM < 512 MB. The scale testing I’m doing is very mechanical, though, i.e. there isn’t a lot of random user creation or deletion. Each node (core, user, whatever) also gets a standard set of vendor (IBM Cloud Kubernetes Service) and system pods for networking, helm (tiller), logging, and statsd metrics for monitoring. So our core nodes are under-utilized, and the limiting factor on our user nodes is the CPU/RAM guarantees for the singleuser notebook pods (oh, and keep in mind you may have to kill bitcoin miners on your user nodes, depending on how open your hub is).
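To make that last point concrete, here is a back-of-envelope user-pool sizing sketch in Python. The per-pod guarantees are hypothetical placeholders, not numbers from this thread:

```python
# Hypothetical sizing arithmetic: the user pool must cover the sum of the
# per-pod resource guarantees, since Kubernetes reserves guarantees up front.
users = 40            # peak simultaneous users (from the question above)
cpu_guarantee = 0.5   # CPU reserved per notebook pod (assumed)
mem_guarantee = 1.0   # GiB reserved per notebook pod (assumed)

user_pool_cpu = users * cpu_guarantee
user_pool_mem = users * mem_guarantee

print(f"user pool must schedule >= {user_pool_cpu} CPUs and {user_pool_mem} GiB")
# -> user pool must schedule >= 20.0 CPUs and 40.0 GiB
```

The core pool is sized independently of this; it only has to cover the hub, proxy, scheduler, and whatever vendor/system pods land on it.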
For my university, I’m running a standard zero-to-jupyterhub Kubernetes setup with Globus authentication.
I guess the question is: where am I going to notice resource limits if I go too low on my core spec? I suppose this will be when I have multiple simultaneous login requests, such as when a large class is starting? Do you have any maximum load estimates? From what you’ve said, do you think I can get away with 1 CPU / 4 GB VMs for the core? And where would I look for trouble?
First off, there is a lot in this thread that is probably already answered in this other thread in more detail, by people more knowledgeable than myself.
Having said that, and speaking generally, your core node specs depend on what’s running on the core nodes (are you going to be using sqlite locally on the hub node or a remote database like postgresql?), including logging, monitoring, and other system pods, and how many resources those require. I probably wouldn’t go as bare-bones as a 1 CPU x 4 GB core node, but you could start there if you want, do some load testing, and then scale up if you are hitting issues. For reference, here are the CPU and RAM usage graphs from the core node in our testing cluster, from when I pushed 3K users/pods onto it yesterday and then drained them all down last night before bed:
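As a rough illustration of what “do some load testing” can look like, here is a minimal Python sketch that asks the hub to spawn servers through the JupyterHub REST API. The hub URL, token, and `loadtest-NNN` usernames are placeholders I made up; a real run would also need to create the users first and handle 429s and pending spawns:

```python
import requests

HUB_API = "http://127.0.0.1:8081/hub/api"   # placeholder hub API URL
API_TOKEN = "replace-with-an-admin-token"   # placeholder token

def start_server(session, username):
    """POST to the hub's REST API to start `username`'s default server.

    Returns the HTTP status code: 201 (started), 202 (spawn pending),
    or 429 (throttled by concurrent_spawn_limit).
    """
    resp = session.post(f"{HUB_API}/users/{username}/server")
    return resp.status_code

def main(n_users=40):
    with requests.Session() as s:
        s.headers["Authorization"] = f"token {API_TOKEN}"
        for i in range(n_users):
            name = f"loadtest-{i:03d}"
            print(f"{name}: HTTP {start_server(s, name)}")

if __name__ == "__main__":
    main()
```

While something like this runs, watch CPU, memory, and API response times on the core node to see where it starts to hurt.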
The big hill is the hub’s usage as those 3K pods were added over the course of the day and then purged starting around 9pm last night. The rest of the steady-state lines are other system pods on the core node.
Similarly, for memory usage, the big hill is the hub; after that come the user-scheduler and autoscaler, which remain fairly constant. The interesting thing about the hub’s memory usage is that it stayed high even after I had purged users and pods (which finished around midnight last night), so it seems the hub is not clearing out some cache (I know the hub caches User objects, but it should be clearing those out when the user is deleted; I need to investigate that with py-spy). Memory usage on the hub only dropped after the hub was restarted around 5:40 AM, when we rolled out a new user image build in this cluster.
As for load indications during a high load event, for the hub I watch CPU usage and API response rate. Here is a graph of the latter during my load testing yesterday:
Unsurprisingly, response times go up as pods are added to the system, because those pods are all POSTing their activity back to the hub every 5 minutes (by default). The big spike in response times at the end of last night was when I was purging users/pods, which involves an arbitrary 10-second slow_stop_timeout (by default), so it took me about 2 hours to delete 3K pods/users. The error rate is mostly 429 errors from the concurrent_spawn_limit. This is expected and good design in the hub, and my client-side script handles it with an exponential backoff retry.
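For reference, a minimal sketch of the kind of client-side exponential backoff described above; the function names and parameters here are my own invention, not from the actual load-testing script:

```python
import time

def backoff_delays(max_retries=5, base=1.0, cap=60.0):
    """Yield exponentially growing delays: base, 2*base, 4*base, ... capped at `cap`."""
    for attempt in range(max_retries):
        yield min(cap, base * (2 ** attempt))

def spawn_with_retry(spawn_once, max_retries=5, base=1.0):
    """Call spawn_once() until it stops returning HTTP 429.

    429 means the hub's concurrent_spawn_limit is doing its job,
    so we back off and retry rather than treating it as a failure.
    """
    status = None
    for delay in backoff_delays(max_retries, base):
        status = spawn_once()
        if status != 429:
            return status
        time.sleep(delay)  # hub asked us to slow down; wait, then retry
    return status
```

Adding random jitter to the delays is also common, so thousands of throttled clients don’t all retry in lockstep.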
For the user node specs, it probably all depends on the resource guarantees and limits you put on the notebook pods and what your users are going to be doing. For example, we only noticed bitcoin-mining activity (which we weren’t killing) because users were complaining that their notebooks were running very slowly. When we investigated, it turned out a handful of user pods were hogging all of the CPU and starving out their neighbors. If your hub will be closed then you might not need to worry about that, but it doesn’t mean you can’t think about putting limits on your user pods.
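As a sketch of what those guarantees and limits look like in a KubeSpawner setup (z2jh normally sets these through the `singleuser.cpu` and `singleuser.memory` Helm values rather than directly; the numbers here are illustrative, not recommendations):

```python
# jupyterhub_config.py fragment -- illustrative values only.
c.KubeSpawner.cpu_guarantee = 0.5     # Kubernetes reserves this much CPU per pod
c.KubeSpawner.cpu_limit = 1.0         # pod is throttled above this
c.KubeSpawner.mem_guarantee = "512M"  # memory reserved per pod
c.KubeSpawner.mem_limit = "1G"        # pod is OOM-killed above this
```

The limits are what stop one runaway (or mining) pod from starving its neighbors; the guarantees are what drive how many user nodes you need.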