Hub receives thousands of /hub/error/503 messages when a user's Jupyter pod is OOMKilled

We’ve recently updated our host node AMIs to Amazon Linux 2023, and our EKS to 1.32, so it looks like cgroupv2 is in full effect.

Basically, we’re seeing a single user generate anywhere from roughly 2k to 10k log entries when their pod experiences an OOMKilled event. All of these /hub/error/503?url= entries are for various API calls (/api/kernels, /api/terminals, /ai/chat, etc.), i.e. effectively any polling the user had running. From what I can tell, that means the underlying errors are ECONNRESET or ECONNREFUSED, which makes sense since the pod is being forcibly restarted. These bursts understandably overwhelm the hub and sometimes impact service for other users, since the hub itself is forced to restart when its liveness probes fail as a result.
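
In case it helps anyone reproduce this, here's a minimal sketch of how the OOM kills can be confirmed from the pods' last terminated state, using the official kubernetes Python client (the namespace and label selector assume Z2JH defaults and may differ for you):

```python
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() if running in-cluster
v1 = client.CoreV1Api()

# Namespace and label selector are the Z2JH defaults -- adjust for your deployment.
pods = v1.list_namespaced_pod("jhub", label_selector="component=singleuser-server")

for pod in pods.items:
    for status in pod.status.container_statuses or []:
        last = status.last_state.terminated
        if last and last.reason == "OOMKilled":
            # finished_at should roughly line up with the /hub/error/503 burst
            print(pod.metadata.name, status.name, last.finished_at, last.exit_code)
```

If the finished_at timestamps line up with the 503 bursts, that confirms the restarts and the log flood are the same event.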

We had never observed user pods hitting OOM before - usually the kernel process would be killed, which saved the pod from being restarted. With cgroupv2, that no longer seems possible. Really just wondering if anyone else has experienced this, or if there's any advice on how we should start picking this apart. Thanks!

Hub: 3.1.2
Lab: 4.1.8
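
For context, the user pod memory requests/limits come from the spawner; here's a minimal sketch of the relevant jupyterhub_config.py settings, assuming KubeSpawner (which Z2JH uses) with placeholder values:

```python
# Sketch only -- the values are placeholders, not our real limits.
c.KubeSpawner.mem_guarantee = "1G"  # becomes the pod's memory request
c.KubeSpawner.mem_limit = "4G"      # becomes the pod's memory limit (what the OOM kill enforces)
```

In a Z2JH deployment these correspond to singleuser.memory.guarantee and singleuser.memory.limit in the Helm values.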

Do you have any monitoring (e.g. Prometheus, or AWS CloudWatch)? If so, try comparing pod and node memory usage between the old and new clusters. That should tell you whether memory usage of the user pods has increased, whether it's the same but pods were previously (and incorrectly) being allowed to exceed their limits, or whether something else on the node is using more memory.
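
For example, if you have Prometheus scraping the usual cAdvisor metrics, something along these lines (the Prometheus address, namespace, and container label are assumptions, not a known-good config) would give you peak per-pod memory to compare between clusters:

```python
import requests

# Adjust the address and labels to whatever your kube-prometheus-stack
# (or CloudWatch equivalent) actually exposes.
PROM_URL = "http://prometheus.monitoring.svc:9090"

# Peak working-set memory per user pod over the last 24h (standard cAdvisor metric).
query = (
    'max_over_time(container_memory_working_set_bytes{'
    'namespace="jhub", container="notebook"}[24h])'
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=30)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    pod = result["metric"].get("pod", "?")
    peak_bytes = float(result["value"][1])
    print(f"{pod}: {peak_bytes / 2**30:.2f} GiB peak")
```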

You could also try spinning up a standalone JupyterLab pod on EKS and seeing whether a particular configuration triggers the OOM.
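
For example, a throwaway pod with explicit memory limits could be created with the kubernetes Python client along these lines (image, namespace, and memory values are placeholders -- match them to whatever your singleuser pods actually use so the test is representative):

```python
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="lab-oom-test", namespace="jhub"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="notebook",
                image="jupyter/base-notebook:latest",  # placeholder image
                resources=client.V1ResourceRequirements(
                    requests={"memory": "1Gi"},
                    limits={"memory": "4Gi"},  # set this to the limit you suspect is being hit
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="jhub", body=pod)
```

If that pod gets OOMKilled as well, the issue is the workload versus the limit rather than anything JupyterHub-specific.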