Minimum specs for JupyterHub infrastructure VMs?

First off, there is a lot in this thread that is probably answered already in this other thread [1] in more detail by more knowledgeable people than myself. :slight_smile:

Having said that, and speaking generally, your core node specs depend on what’s running on the core nodes, including logging, monitoring, and other system pods, and how much CPU and memory those require (for example, are you going to use sqlite locally on the hub node or a remote database like postgresql?). I probably wouldn’t go so bare-bones on the core node as a 1x4 (1 CPU, 4 GB RAM) node, but you could start there if you want, do some load testing, and then scale up if you hit issues. For reference, here are the CPU and RAM usage graphs from the core node in our testing cluster when I had pushed 3K users/pods onto it yesterday and then drained those all down last night before bed:

The big hill is the hub’s CPU usage as those 3K pods were added over the course of the day and then purged starting around 9 PM last night. The rest of the steady-state lines are the other system pods on the core node.

Similarly, for memory usage the big hill is the hub, and after that come the user-scheduler and autoscaler, which remain fairly constant. The interesting thing about the hub’s memory usage is that it stayed high even after I had purged users and pods, which finished around midnight last night, so it seems the hub is holding onto some cache (I know the hub caches User objects, but it should be clearing those out when the user is deleted; I need to investigate that with py-spy). Memory usage on the hub only dropped after the hub was restarted around 5:40 AM, when we rolled out a new user image build in this cluster.
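
Going back to the sqlite-vs-postgresql question above: moving the hub off its default local sqlite file onto an external database is a single config setting. Here is a minimal `jupyterhub_config.py` sketch, assuming PostgreSQL and using made-up host/credential names:

```python
# jupyterhub_config.py -- hub database backend (sketch; placeholder credentials)

# Default: a local sqlite file next to the hub process.
# c.JupyterHub.db_url = 'sqlite:///jupyterhub.sqlite'

# Remote PostgreSQL instead (hypothetical host, user, and database names):
c.JupyterHub.db_url = 'postgresql://jupyterhub:secret@db.example.internal:5432/jupyterhub'
```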

As for load indicators during a high-load event, for the hub I watch CPU usage and API response rates. Here is a graph of the latter during my load testing yesterday:

Unsurprisingly, response times go up as pods are added to the system, because those pods are all POSTing their activity back to the hub every 5 minutes (by default). The big spike in response times at the end of last night was when I was purging users/pods, which hits an arbitrary 10-second slow_stop_timeout (by default), so it took me about 2 hours to delete 3K pods/users. The error rate is mostly 429 errors caused by the concurrent_spawn_limit. That is expected and good design in the hub, and my client-side script handles it with an exponential backoff retry (a sketch of that follows below).
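
For what it’s worth, here is a minimal sketch of the kind of 429 handling my load script does; the hub URL, token, and use of the requests library here are placeholders/assumptions, not the actual script:

```python
import time
import requests

HUB_API = "https://hub.example.com/hub/api"        # placeholder hub URL
HEADERS = {"Authorization": "token <api-token>"}   # placeholder API token

def spawn_with_backoff(username, max_retries=8):
    """POST to the hub's spawn endpoint, backing off on 429 responses."""
    delay = 1  # seconds
    for attempt in range(max_retries):
        resp = requests.post(f"{HUB_API}/users/{username}/server", headers=HEADERS)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # concurrent_spawn_limit hit: honor Retry-After if the hub sent one,
        # otherwise fall back to our own exponential backoff
        retry_after = resp.headers.get("Retry-After")
        wait = int(retry_after) if retry_after else delay
        time.sleep(wait)
        delay = min(delay * 2, 60)  # cap the backoff
    raise RuntimeError(f"gave up spawning {username} after {max_retries} retries")
```

The Retry-After handling is optional; a plain exponential backoff capped at some maximum works fine too.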

For the user node specs, it probably all depends on the resource guarantees and limits you put on the notebook pods and what your users are going to be doing. For example, the way we noticed we were not killing bitcoin mining activity was that users complained their notebooks were running very slowly; when we investigated, it turned out a handful of user pods were hogging all of the CPU and starving out their neighbors. If your hub will be closed (not open to the public) then you might not need to worry about that, but that doesn’t mean you can’t think about putting limits on your user pods (see the sketch below).
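
To make that concrete, here is a hedged sketch of per-user guarantees and limits as KubeSpawner/Spawner config, assuming a Kubernetes setup like ours; the numbers are made up and depend entirely on your workload:

```python
# jupyterhub_config.py -- per-user resource guarantees and limits (sketch; example numbers only)

# What each user pod is guaranteed (becomes the Kubernetes resource request):
c.KubeSpawner.cpu_guarantee = 0.5
c.KubeSpawner.mem_guarantee = '1G'

# Hard ceiling per user pod, so one notebook can't starve its neighbors:
c.KubeSpawner.cpu_limit = 2
c.KubeSpawner.mem_limit = '4G'
```

In a Zero to JupyterHub deployment the equivalent knobs live under singleuser.cpu and singleuser.memory in the Helm chart values.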

[1] Background for JupyterHub / Kubernetes cost calculations?
