Hi, we’re using z2jh 0.10.6 with JupyterHub 1.2.2, kubespawner 0.15, and configurable-http-proxy (chp) 4.2.2. Yesterday the chp went sideways, which caused the hub to fail to spawn user notebook servers during a large user event. This started around 2:50pm CT. We had ~651 active singleuser-server notebook pods before the hub started crashing:
This graph shows the restarts on the hub starting around that same time:
We had plenty of user node capacity, so that wasn’t the problem (no auto-scaler kicking in or anything like that). We eventually figured out that the proxy was having issues, which correlates with errors in the chp logs starting around the same time. Here is an example of a common error seen in the chp logs:
```
May 24 15:54:08 proxy-7fc546c64c-pfbjw chp error 20:54:08.752 [ConfigProxy] error: Uncaught Exception: write after end
May 24 15:54:08 proxy-7fc546c64c-pfbjw chp error 20:54:08.753 [ConfigProxy] error: Error [ERR_STREAM_WRITE_AFTER_END]: write after end
    at writeAfterEnd (_stream_writable.js:266:14)
    at Socket.Writable.write (_stream_writable.js:315:5)
    at IncomingMessage.<anonymous> (/srv/configurable-http-proxy/lib/configproxy.js:458:30)
    at IncomingMessage.emit (events.js:314:20)
    at IncomingMessage.Readable.read (_stream_readable.js:508:10)
    at flow (_stream_readable.js:1008:34)
    at resume_ (_stream_readable.js:989:3)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)
```
And this:

```
20:10:04.132 [ConfigProxy] error: 503 GET /user/5f3cb2ffdbb8380013400afe/api/metrics/v1 socket hang up
```
This graph also shows the rate of 500+ errors from the proxy during that window:
Things got better after ~3:55pm CT, when we restarted the proxy deployment and then restarted the hub for good measure.
We think the active user pods were idling out after an hour of inactivity (we have c.NotebookApp.shutdown_no_activity_timeout set to 1 hour). The hub was restarting because we have consecutiveFailureLimit set to 20 in the z2jh helm chart values file.
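For reference, that part of our values file looks roughly like this (a minimal sketch, not our full config):

```yaml
hub:
  # After this many consecutive spawn failures the hub exits,
  # which is what was driving the hub pod restarts above.
  consecutiveFailureLimit: 20
```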
The main concern (and point of this post) is that even with the liveness and readiness probes enabled on the proxy by default in z2jh, the liveness probe was not detecting any problems. Looking at the actual probe endpoint, it looks pretty shallow, but I’m also not very familiar with the inner workings of the proxy app.
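For anyone who hasn’t looked, the probes on the proxy pod amount to roughly this (a sketch from the chart’s proxy deployment; the exact path, port, and timings depend on the chart version):

```yaml
# Sketch of the z2jh proxy pod's probes; timings omitted since
# they vary by chart version.
livenessProbe:
  httpGet:
    path: /_chp_healthz   # returns 200 as long as the node process can answer
    port: http            # the proxy's public port (8000)
readinessProbe:
  httpGet:
    path: /_chp_healthz
    port: http
```

So it only checks that the process can answer an HTTP request on its own health route; it doesn’t exercise proxying to any backend, which would explain why it stayed green through all of this.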
I’m wondering if there is anything we can do to improve that health endpoint based on the errors seen in the chp logs?
It’s also unclear to us whether the metrics endpoint (the /user/<id>/api/metrics/v1 requests seen above) is generating excessive traffic through the proxy and contributing to the failures.
We might try upgrading chp to the latest version (4.3.2), but I’m not sure if that will make any difference. We generally don’t have any problems with the proxy (it’s kind of an afterthought), so we haven’t put a ton of focus on what changes go into it from release to release.
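If we do upgrade, the plan would be to override the image tag in our values file rather than wait for a new chart release; a sketch, assuming the usual proxy.chp.image override:

```yaml
proxy:
  chp:
    image:
      name: jupyterhub/configurable-http-proxy
      tag: 4.3.2   # override the 4.2.2 that ships with z2jh 0.10.6
```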
Any thoughts on things to investigate or if there is something that can be pulled from the logs would be appreciated.