Improved chp health endpoint?

Hi, we’re using z2jh 0.10.6 with jupyterhub 1.2.2, kubespawner 0.15 and configurable-http-proxy (chp) 4.2.2 and yesterday the chp went sideways which caused the hub to fail to spawn user notebooks during a large user event. This started around 2:50 CT. We had ~651 active notebook singleuser-server pods before the hub started crashing:


This graph shows the restarts on the hub starting around that same time:


We had plenty of user node capacity so that wasn’t the problem (no auto-scaler kicking in or anything like that). We eventually figured out the proxy was having issues, that correlates with errors in the logs from the chp starting around the same time:

Here is an example of a common error seen in the chp logs:

May 24 15:54:08 proxy-7fc546c64c-pfbjw chp error 20:54:08.752 [ConfigProxy] error: Uncaught Exception: write after end
May 24 15:54:08 proxy-7fc546c64c-pfbjw chp error 20:54:08.753 [ConfigProxy] error: Error [ERR_STREAM_WRITE_AFTER_END]: write after end
    at writeAfterEnd (_stream_writable.js:266:14)
    at Socket.Writable.write (_stream_writable.js:315:5)
    at IncomingMessage.<anonymous> (/srv/configurable-http-proxy/lib/configproxy.js:458:30)
    at IncomingMessage.emit (events.js:314:20)
    at (_stream_readable.js:508:10)
    at flow (_stream_readable.js:1008:34)
    at resume_ (_stream_readable.js:989:3)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)

And this:

20:10:04.132 [ConfigProxy] error: 503 GET /user/5f3cb2ffdbb8380013400afe/api/metrics/v1 socket hang up 

This graph also shows the rate of 500+ errors from the proxy during that window:

Things got better after ~3:55pm CT when we restarted the proxy deployment and then we also restarted the hub for good measure.

We think the active user pods were idling out after an hour of inactivity (we have c.NotebookApp.shutdown_no_activity_timeout set to 1 hour). The hub was restarting because we have consecutiveFailureLimit set to 20 in the z2jh helm chart values file.

The main concern (and point of this post :smiley:) is that even with the liveness and readiness probes enabled on the proxy by default in z2jh the liveness probe was not detecting any problems. Looking at the actual probe endpoint it looks pretty shallow but I’m also not very familiar with inner workings of the proxy app.

I’m wondering if there is anything we can do to improve that health endpoint based on the errors seen in the chp logs?

It’s also unclear to us if the metrics endpoint is causing excessive traffic and that might be causing things to fail sideways.

We might try upgrading chp to the latest version (4.3.2) but I’m not sure if that will make any difference. We generally don’t have any problems with the proxy (it’s kind of an afterthought) so we haven’t put a ton of focus on what changes go into it from release to release.

Any thoughts on things to investigate or if there is something that can be pulled from the logs would be appreciated.

1 Like

For this:

It looks like that’s already handled with avoid write-after-end by checking res.writableEnded · jupyterhub/configurable-http-proxy@2706a23 · GitHub (Improvements when things go wrong by minrk · Pull Request #290 · jupyterhub/configurable-http-proxy · GitHub) was one of the things I saw in the CHANGELOG for 4.3.0 and made me think we should upgrade).

1 Like

Wow what a beautiful write up, thank you for sharing this @mriedem ! :heart:

A large number of write-after-end can be caused by two things:

  • single-user servers being shutdown, waiting for the Hub to notify the proxy (e.g. caused by internal idle culling and/or crashes, e.g. shutdown_no_activity_timeout). These may cause load on CHP (this is what prompted Improvements when things go wrong by minrk · Pull Request #290 · jupyterhub/configurable-http-proxy · GitHub), apparently due to expensive log statements in what should not be considered a particularly significant event. These events should not be considered a health problem.
  • cluster/container network issue where CHP can’t talk to anything - this can reasonably be considered a health indicator. How exactly to measure this is not clear.

Since you restarted both the hub and proxy, I’m not 100% sure the proxy pod was unhealthy. If, for instance, the Hub failing to notice that a server had died, and proxying lots of failed requests (e.g. from a left-open jupyterlab tab or other app that tries to reconnect) was the source of the problem, the direct fix may have been removing the route. In that case, restarting the Hub pod, which reconciles the proxy routing table with running pods, may have in fact been the fix. Restarting only the proxy pod wouldn’t do this, as the hub pod would re-establish the same routing table as when the proxy pod went down.

So there’s two things here:

  1. the cost of the proxy sending requests to inaccessible endpoints (improved in CHP 4.3.0), and
  2. the reason for the endpoints staying in the routing table in the first place, after the pods become unavailable/stopped, which could be due to:
    a. KubeSpawner failing to notice the container stopping, or
    b. jupyter_server/notebook server failing to finish exiting, so the container is stuck ‘running’ but not actually working (should we have a liveness probe here?), or
    c. JupyterHub failing to cleanup after the Spawner (unlikely without evidence)

That might be able to explain what you experienced without the proxy pod ever actually being ‘unhealthy’. I think 2.b. is what we’ve been seeing on, a liveness probe on user pods that might be the fix you are looking for.

1 Like