We recently updated our host node AMIs to Amazon Linux 2023 and our EKS clusters to 1.32, so cgroup v2 is now in full effect.
Basically we're having an issue where a single user will generate anywhere from roughly 2k to 10k log lines when their pod experiences an OOMKilled event. All of these `/hub/error/503?url=` logs are for various API calls (`/api/kernels`, `/api/terminals`, `/ai/chat`, etc.), effectively any polling the user had running. From what I can tell, that means they're surfacing as ECONNRESET or ECONNREFUSED errors, which makes sense since the pod is being forcibly restarted. These events understandably overwhelm the hub, sometimes impacting service for other users (the hub itself ends up being restarted if its liveness probes fail under the load).
We had never observed user pods hitting OOM before: usually the kernel process would be OOM-killed on its own, sparing the rest of the pod. With cgroup v2, that no longer seems to happen. Really just wondering if anyone else has experienced this, or if there's any advice on how we should start picking this apart. Thanks!
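For context on why the behavior changed: on cgroup v2, the kubelet (since around Kubernetes 1.28, as I understand it) sets `memory.oom.group=1` on container cgroups, which tells the OOM killer to terminate every process in the cgroup together rather than just the offending one. That would explain why the kernel process used to die alone under cgroup v1 but now the whole pod goes down. A minimal sketch you could run inside a user pod to confirm, assuming the standard cgroup v2 file layout (the `cgroup_root` parameter is only there so this can be exercised outside a real pod):

```python
import os


def oom_behavior(cgroup_root="/sys/fs/cgroup"):
    """Report whether this cgroup is subject to group OOM kills, and how
    close current memory usage is to the cgroup limit."""

    def read(name):
        with open(os.path.join(cgroup_root, name)) as f:
            return f.read().strip()

    # "1" means the OOM killer takes out the entire cgroup at once.
    group_kill = read("memory.oom.group") == "1"
    current = int(read("memory.current"))
    # memory.max is the literal string "max" when no limit is set.
    raw_max = read("memory.max")
    limit = None if raw_max == "max" else int(raw_max)
    fraction = None if limit is None else current / limit
    return {"group_kill": group_kill, "limit": limit, "usage_fraction": fraction}
```

If `memory.oom.group` reads `1`, the whole-pod kill is expected behavior. I believe Kubernetes 1.32 added a `singleProcessOOMKill` kubelet configuration option to restore per-process OOM kills on cgroup v2, though I haven't tried it myself, so treat that as something to verify against the kubelet docs for your version.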
Hub: 3.1.2
Lab: 4.1.8