Core component resilience/reliability

I’m setting up JupyterHub on K8s in a large-scale enterprise environment and anticipate thousands of concurrent users.

Is there any documentation on improving the resilience/reliability of the core components (proxy, hub, and spawner)?

At a bare minimum, I’d like to have one backup pod for each core component.

I’m also curious about what the common failure points are when traffic exceeds a certain threshold.

Thank you!