Reliability practices around Z2JH deployments

ian-r-rose · August 20, 2019, 7:31pm

I have a Z2JH-based deployment on AWS, and am trying to figure out some better practices for maintaining uptime for users. In particular, a node crashes somewhat regularly (due to a user trying to load too much data, for instance), and EKS is not able to recover gracefully. So I guess I am looking for some strategies for handling this case:

1. If there is a crash, what is the least-destructive way to restart a node? Most places I have seen suggest either using kubectl delete node or logging into the node instance and restarting kubelet, but I’m having a hard time coming up with anything authoritative.
2. If a node crashes, users whose pods existed on that node won’t be able to log in using a different node due to a mismatch in the persistent volume affinity. Is there a way around this restriction? It seems kind of antithetical to the point of k8s that a particular resource is so tied to a particular machine.
3. Am I doing anything wrong or dumb that this has become a somewhat regular occurrence?
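
To make the persistent volume affinity question more concrete, here is a minimal sketch of how one might check whether a given PV's required node affinity is satisfied by a given node's labels. The function names (`kubectl_json`, `term_matches`, `pv_allows_node`) are hypothetical helpers written for illustration, not part of Z2JH or kubectl. The sketch assumes the JSON shape returned by `kubectl get pv <name> -o json` and `kubectl get node <name> -o json`, and it only handles `matchExpressions` with the `In`, `NotIn`, `Exists`, and `DoesNotExist` operators.

```python
import json
import subprocess


def kubectl_json(*args):
    """Run `kubectl <args> -o json` and return the parsed output.

    Requires kubectl on PATH and access to the cluster.
    """
    out = subprocess.run(
        ["kubectl", *args, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def term_matches(term, labels):
    """True if every matchExpression in one nodeSelectorTerm holds for the labels."""
    exprs = term.get("matchExpressions", [])
    if not exprs:
        # Kubernetes treats an empty term as matching nothing.
        # (matchFields is not handled in this sketch.)
        return False
    for expr in exprs:
        key, op = expr["key"], expr["operator"]
        values = expr.get("values", [])
        if op == "In":
            ok = labels.get(key) in values
        elif op == "NotIn":
            ok = labels.get(key) not in values
        elif op == "Exists":
            ok = key in labels
        elif op == "DoesNotExist":
            ok = key not in labels
        else:
            raise ValueError(f"operator not handled in this sketch: {op}")
        if not ok:
            return False
    return True


def pv_allows_node(pv, node_labels):
    """True if the PV has no required node affinity, or if any term matches.

    Terms are ORed together; expressions inside a term are ANDed.
    """
    terms = (
        pv.get("spec", {})
        .get("nodeAffinity", {})
        .get("required", {})
        .get("nodeSelectorTerms", [])
    )
    if not terms:
        return True
    return any(term_matches(t, node_labels) for t in terms)
```

With cluster access, this could be used as `pv_allows_node(kubectl_json("get", "pv", "<pv-name>"), kubectl_json("get", "node", "<node-name>")["metadata"]["labels"])`, where the two names are placeholders. On EKS with EBS-backed volumes the affinity is typically expressed against a zone label rather than a hostname, so a check like this may show that the constraint is about the availability zone rather than a single machine.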

Any advice would be appreciated. Thanks!
