Improved chp health endpoint?

minrk · May 26, 2021, 2:33pm

A large number of write-after-end errors can be caused by two things:

1. Single-user servers that have shut down while the proxy is still waiting for the Hub to notify it (e.g. caused by internal idle culling such as shutdown_no_activity_timeout, and/or crashes). These may cause load on CHP (this is what prompted "Improvements when things go wrong", Pull Request #290 in jupyterhub/configurable-http-proxy), apparently due to expensive log statements in what should not be considered a particularly significant event. These events should not be considered a health problem.
2. A cluster/container network issue where CHP can’t talk to anything. This can reasonably be considered a health indicator, though how exactly to measure it is not clear.

Since you restarted both the hub and proxy, I’m not 100% sure the proxy pod was unhealthy. If, for instance, the source of the problem was the Hub failing to notice that a server had died, leaving the proxy to handle lots of failed requests (e.g. from a left-open JupyterLab tab or other app that tries to reconnect), the direct fix may have been removing the route. In that case, restarting the Hub pod, which reconciles the proxy routing table with running pods, may have in fact been the fix. Restarting only the proxy pod wouldn’t do this, as the hub pod would re-establish the same routing table as when the proxy pod went down.
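To make the reconciliation idea concrete, here is a minimal sketch of spotting routes that no running server accounts for. It assumes you have already fetched the routing table (CHP serves it as JSON from `GET /api/routes` on its API port) and separately built the set of URL prefixes for servers you believe are running. The function name, the sample data, and the way `active_prefixes` is obtained are all hypothetical; this is not how the Hub itself does it.

```python
def find_stale_routes(routes, active_prefixes):
    """Return route paths present in the proxy but not backed by a running server.

    routes: dict mapping route path -> route info, shaped like CHP's
            GET /api/routes response.
    active_prefixes: set of route paths for servers believed to be running
            (how you build this set is up to you; it is an assumption here).
    """
    stale = []
    for path in routes:
        if path == "/":
            # The default route points at the Hub itself, not a user server.
            continue
        if path not in active_prefixes:
            stale.append(path)
    return sorted(stale)


# Hypothetical sample data for illustration only.
routes = {
    "/": {"target": "http://hub:8081"},
    "/user/alice/": {"target": "http://10.0.0.5:8888"},
    "/user/bob/": {"target": "http://10.0.0.6:8888"},
}
active_prefixes = {"/user/alice/"}

print(find_stale_routes(routes, active_prefixes))  # ['/user/bob/']
```

A non-empty result while the proxy is logging lots of failed requests would point at the Hub's view of running servers being out of date, rather than at the proxy pod itself.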

So there’s two things here:

1. the cost of the proxy sending requests to inaccessible endpoints (improved in CHP 4.3.0), and
2. the reason for the endpoints staying in the routing table in the first place, after the pods become unavailable/stopped, which could be due to:
   a. KubeSpawner failing to notice the container stopping, or
   b. jupyter_server/notebook server failing to finish exiting, so the container is stuck ‘running’ but not actually working (should we have a liveness probe here?), or
   c. JupyterHub failing to clean up after the Spawner (unlikely without evidence)

That might explain what you experienced without the proxy pod ever actually being ‘unhealthy’. I think 2.b. is what we’ve been seeing on mybinder.org, so a liveness probe on user pods might be the fix you are looking for.
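As a rough sketch of what a liveness probe on user pods could look like: the helper below builds a Kubernetes `livenessProbe` spec as a plain dict. The helper name and the default values are made up for illustration, and the commented-out line assumes KubeSpawner's `extra_container_config` option passes the probe through to the notebook container unchanged. A TCP check only confirms that the port still accepts connections; an HTTP check against the server would be stricter, but would need the per-user base URL.

```python
def user_pod_liveness_probe(port=8888, initial_delay=60, period=30, failures=3):
    """Build a Kubernetes livenessProbe spec (plain dict) for a single-user container.

    All defaults are illustrative guesses, not recommended values.
    """
    return {
        # Probe succeeds if a TCP connection to the server port can be opened.
        "tcpSocket": {"port": port},
        # Give the server time to start before the first check.
        "initialDelaySeconds": initial_delay,
        # How often to check, and how many consecutive failures trigger a restart.
        "periodSeconds": period,
        "failureThreshold": failures,
    }


probe = user_pod_liveness_probe()

# In jupyterhub_config.py, where `c` is provided by JupyterHub, this might be
# wired in as follows (assumption: the dict is merged into the container spec):
# c.KubeSpawner.extra_container_config = {"livenessProbe": probe}
```

With a probe like this, Kubernetes would restart a container whose server has stopped listening, which should give KubeSpawner and the Hub a visible state change to react to.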
