Improved chp health endpoint?

A large number of write-after-end errors can be caused by two things:

  • single-user servers shutting down before the Hub has notified the proxy (e.g. caused by internal idle culling such as shutdown_no_activity_timeout, and/or crashes). These may cause load on CHP, apparently due to expensive log statements for what should not be a particularly significant event (this is what prompted “Improvements when things go wrong”, Pull Request #290 by minrk in jupyterhub/configurable-http-proxy). These events should not be considered a health problem.
  • a cluster/container network issue where CHP can’t talk to anything. This can reasonably be considered a health indicator, but how exactly to measure it is not clear.

Since you restarted both the Hub and the proxy, I’m not 100% sure the proxy pod was unhealthy. Suppose, for instance, that the source of the problem was the Hub failing to notice that a server had died, so the proxy kept forwarding lots of failed requests (e.g. from a left-open JupyterLab tab or another app that tries to reconnect). Then the direct fix would have been removing the route. In that case, restarting the Hub pod, which reconciles the proxy routing table with running pods, may in fact have been the fix. Restarting only the proxy pod wouldn’t do this, as the Hub pod would re-establish the same routing table as when the proxy pod went down.
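To check whether stale routes were the culprit, you could compare the proxy’s routing table against the servers you know are alive. Here is a minimal sketch, not how the Hub itself does reconciliation: it assumes the CHP REST API is reachable (for example at http://127.0.0.1:8001) and authenticated with the proxy’s auth token, and that the Hub’s own route carries a `hub` flag in its route data. The function names and the `live_targets` set are hypothetical; you would need to build that set yourself, e.g. from the list of running pods.

```python
import json
import urllib.request


def fetch_routes(api_url, token):
    """Fetch CHP's routing table as {route_path: route_data}.

    Assumes the CHP REST API at `api_url` and a valid proxy auth token.
    """
    req = urllib.request.Request(
        f"{api_url}/api/routes",
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def find_stale_routes(routes, live_targets):
    """Return route paths whose target is not among the known-live targets."""
    stale = []
    for path, data in sorted(routes.items()):
        # Assumption: the Hub's own route is marked with a "hub" flag; skip it.
        if data.get("hub"):
            continue
        if data.get("target") not in live_targets:
            stale.append(path)
    return stale
```

Calling `find_stale_routes(fetch_routes(api_url, token), live_targets)` would list the routes that still point at servers that are gone. If that list is non-empty while the proxy is busy with failed requests, it points at the routing table rather than the proxy pod’s health.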

So there are two things here:

  1. the cost of the proxy sending requests to inaccessible endpoints (improved in CHP 4.3.0), and
  2. the reason for the endpoints staying in the routing table in the first place, after the pods become unavailable/stopped, which could be due to:
    a. KubeSpawner failing to notice the container stopping, or
    b. jupyter_server/notebook server failing to finish exiting, so the container is stuck ‘running’ but not actually working (should we have a liveness probe here?), or
    c. JupyterHub failing to clean up after the Spawner (unlikely without evidence)

That might explain what you experienced without the proxy pod ever actually being ‘unhealthy’. I think 2.b. is what we’ve been seeing on mybinder.org, so a liveness probe on user pods might be the fix you are looking for.
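For reference, a liveness probe on user pods could look roughly like the sketch below. It only builds a Kubernetes probe spec as a plain dict. The helper name, the port 8888, and the timing values are assumptions for illustration, and passing the result through KubeSpawner’s `extra_container_config` is one possible way to apply it that I have not verified against your deployment.

```python
def liveness_probe(port=8888, path=None, period_seconds=30, failure_threshold=3):
    """Build a Kubernetes livenessProbe spec for a single-user server container.

    With `path` set, the kubelet makes an HTTP GET against the server;
    otherwise it only checks that the TCP port accepts connections.
    """
    probe = {
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
        "timeoutSeconds": 5,
    }
    if path is not None:
        probe["httpGet"] = {"path": path, "port": port}
    else:
        probe["tcpSocket"] = {"port": port}
    return probe


# Hypothetical use in jupyterhub_config.py (assumes KubeSpawner):
# c.KubeSpawner.extra_container_config = {"livenessProbe": liveness_probe()}
```

A TCP check only catches a server that has stopped listening. An HTTP check is more likely to catch a server that is stuck but still holding the port, though the path has to include the server’s base URL, which differs per user.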
