I have a Z2JH-based deployment on AWS, and am trying to figure out some better practices around maintaining uptime for users. In particular, it happens somewhat regularly that a node will crash (due to a user trying to load too much data, for instance), and EKS will not be able to gracefully recover. So I guess I am looking for some strategies for how to handle this case:
- If there is a crash, what is the least-destructive way to restart a node? Most places I have seen suggest either using `kubectl delete node` or logging into the node instance and restarting kubelet, but I’m having a hard time finding anything authoritative.
- If a node crashes, users whose pods existed on that node won’t be able to log in using a different node due to a mismatch in the persistent volume affinity. Is there a way around this restriction? It seems kind of antithetical to the point of k8s that a particular resource is so tied to a particular machine.
- Am I doing anything wrong or dumb that would explain why this has become a somewhat regular occurrence?
Any advice would be appreciated. Thanks!