Problems with spawning servers: Traefik time-out in case of (not so) many connections

Good morning,

Our TLJH platform had a problem today: a moderate number of users (around 70) tried to use it simultaneously, and some of them could not start their server (the spawn timed out and ended with a Bad Gateway message). However, server resource usage (CPU, RAM and disk access) was well below the available maximum.
In the past, we have already reached more than 112 simultaneously active servers (I removed the limit by setting c.JupyterHub.active_server_limit = 0).
Searching the jupyterhub service logs, here is what I find for my user id (955) on our platform tljh.mydomain.fr (my client IP is replaced here by xxx.xxx.xxx.xxx):

nov. 06 09:11:47 tljh python3[2372829]: [I 2024-11-06 09:11:47.111 JupyterHub log:189] 302 GET /hub/spawn/955 -> /hub/spawn-pending/955 (955@xxx.xxx.xxx.xxx) 1008.20ms
nov. 06 09:11:47 tljh python3[2372829]: [I 2024-11-06 09:11:47.122 JupyterHub log:189] 200 GET /hub/spawn-pending/955 (955@xxx.xxx.xxx.xxx) 3.36ms
nov. 06 09:11:49 tljh python3[2372829]: [I 2024-11-06 09:11:49.057 JupyterHub log:189] 200 POST /hub/api/users/955/activity (955@127.0.0.1) 44.00ms
nov. 06 09:11:53 tljh python3[2372829]: [I 2024-11-06 09:11:53.554 JupyterHub proxy:285] Adding user 955 to proxy /user/955/ => http://127.0.0.1:49963
nov. 06 09:11:53 tljh python3[2372829]: [I 2024-11-06 09:11:53.561 JupyterHub proxy:135] Waiting for /user/955 to register with traefik
nov. 06 09:12:21 tljh python3[2372829]:     TimeoutError: Traefik route for /user/955 configuration not available
nov. 06 09:12:22 tljh python3[2372829]: [I 2024-11-06 09:12:22.168 JupyterHub log:189] 200 GET /hub/api/users/955/server/progress (955@xxx.xxx.xxx.xxx) 35005.72ms
nov. 06 09:14:48 tljh python3[2372829]: [W 2024-11-06 09:14:48.711 JupyterHub proxy:423] Deleting stale route /user/955/
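
To put a number on how long the hub waited for the route, here is a minimal Python sketch that pairs each "Waiting for ... to register with traefik" line with the matching TimeoutError line from the excerpt above. The function name route_wait_seconds and the regular expressions are my own; they assume the journald prefix format shown above (month, day, HH:MM:SS) and that both lines fall on the same day.

```python
import re
from datetime import datetime

# Time of day from the journald prefix, e.g. "nov. 06 09:11:53 tljh ..."
PREFIX_TIME = re.compile(r"^\S+\s+\d+\s+(\d{2}:\d{2}:\d{2})\s")
WAITING = re.compile(r"Waiting for (/user/\S+) to register with traefik")
TIMEOUT = re.compile(
    r"TimeoutError: Traefik route for (/user/\S+) configuration not available"
)


def route_wait_seconds(lines):
    """Map each route to the seconds between 'Waiting for' and 'TimeoutError'."""
    started = {}
    waits = {}
    for line in lines:
        prefix = PREFIX_TIME.match(line)
        if not prefix:
            continue
        stamp = datetime.strptime(prefix.group(1), "%H:%M:%S")
        waiting = WAITING.search(line)
        timeout = TIMEOUT.search(line)
        if waiting:
            started[waiting.group(1)] = stamp
        elif timeout and timeout.group(1) in started:
            route = timeout.group(1)
            waits[route] = (stamp - started.pop(route)).total_seconds()
    return waits


# The two relevant lines from the excerpt above.
sample = [
    "nov. 06 09:11:53 tljh python3[2372829]: [I 2024-11-06 09:11:53.561 "
    "JupyterHub proxy:135] Waiting for /user/955 to register with traefik",
    "nov. 06 09:12:21 tljh python3[2372829]:     TimeoutError: Traefik route "
    "for /user/955 configuration not available",
]
print(route_wait_seconds(sample))
```

For my user this gives 28 seconds between the two lines. Feeding it the whole journal (for example the saved output of journalctl -u jupyterhub) would show whether every failed spawn waited about as long.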

It appears that Traefik did not register the route to user 955's server before the hub stopped waiting for it.
The traefik logs for user 955 do not seem to indicate the cause of this problem:

nov. 06 09:14:43 tljh traefik[2352198]: {"BackendAddr":"127.0.0.1:15001","BackendName":"backend__2F","BackendURL":{"Scheme":"http","Opaque":"","User":null,"Host":"127.0.0.1:15001","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"ClientAddr":"xxx.xxx.xxx.xxx:52077","ClientHost":"xxx.xxx.xxx.xxx","ClientPort":"52077","ClientUsername":"-","DownstreamContentSize":11,"DownstreamStatus":502,"DownstreamStatusLine":"502 Bad Gateway","Duration":392155,"FrontendName":"frontend__2F","OriginContentSize":11,"OriginDuration":223507,"OriginStatus":502,"OriginStatusLine":"502 Bad Gateway","Overhead":168648,"RequestAddr":"tljh.mydomain.fr","RequestContentSize":0,"RequestCount":103999,"RequestHost":"tljh.mydomain.fr","RequestLine":"GET /hub/spawn-pending/955 HTTP/2.0","RequestMethod":"GET","RequestPath":"/hub/spawn-pending/955","RequestPort":"-","RequestProtocol":"HTTP/2.0","RetryAttempts":0,"StartLocal":"2024-11-06T09:14:43.18722761Z","StartUTC":"2024-11-06T09:14:43.18722761Z","level":"info","msg":"","request_Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7","request_Accept-Encoding":"gzip, deflate, br, zstd","request_Accept-Language":"fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7","request_Cache-Control":"max-age=0","request_Cookie":"REDACTED","request_If-None-Match":"\"dfdabbe9d5bc873247c388fcb911fd3511983f19\"","request_Priority":"u=0, i","request_Referer":"https://tljh.mydomain.fr/hub/home","request_Sec-Ch-Ua":"\"Chromium\";v=\"130\", \"Google Chrome\";v=\"130\", \"Not?A_Brand\";v=\"99\"","request_Sec-Ch-Ua-Mobile":"?0","request_Sec-Ch-Ua-Platform":"\"Windows\"","request_Sec-Fetch-Dest":"document","request_Sec-Fetch-Mode":"navigate","request_Sec-Fetch-Site":"same-origin","request_Sec-Fetch-User":"?1","request_Upgrade-Insecure-Requests":"1","request_User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36","time":"2024-11-06T09:14:43Z"}
nov. 06 09:14:43 tljh traefik[2352198]: {"BackendAddr":"127.0.0.1:15001","BackendName":"backend__2F","BackendURL":{"Scheme":"http","Opaque":"","User":null,"Host":"127.0.0.1:15001","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"ClientAddr":"xxx.xxx.xxx.xxx:52077","ClientHost":"xxx.xxx.xxx.xxx","ClientPort":"52077","ClientUsername":"-","DownstreamContentSize":11,"DownstreamStatus":502,"DownstreamStatusLine":"502 Bad Gateway","Duration":701573,"FrontendName":"frontend__2F","OriginContentSize":11,"OriginDuration":438636,"OriginStatus":502,"OriginStatusLine":"502 Bad Gateway","Overhead":262937,"RequestAddr":"tljh.mydomain.fr","RequestContentSize":0,"RequestCount":104000,"RequestHost":"tljh.mydomain.fr","RequestLine":"GET /favicon.ico HTTP/2.0","RequestMethod":"GET","RequestPath":"/favicon.ico","RequestPort":"-","RequestProtocol":"HTTP/2.0","RetryAttempts":0,"StartLocal":"2024-11-06T09:14:43.203124895Z","StartUTC":"2024-11-06T09:14:43.203124895Z","level":"info","msg":"","request_Accept":"image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8","request_Accept-Encoding":"gzip, deflate, br, zstd","request_Accept-Language":"fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7","request_Cookie":"REDACTED","request_Priority":"u=1, i","request_Referer":"https://tljh.mydomain.fr/hub/spawn-pending/955","request_Sec-Ch-Ua":"\"Chromium\";v=\"130\", \"Google Chrome\";v=\"130\", \"Not?A_Brand\";v=\"99\"","request_Sec-Ch-Ua-Mobile":"?0","request_Sec-Ch-Ua-Platform":"\"Windows\"","request_Sec-Fetch-Dest":"image","request_Sec-Fetch-Mode":"no-cors","request_Sec-Fetch-Site":"same-origin","request_User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36","time":"2024-11-06T09:14:43Z"}
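
Since these access-log lines are hard to read by eye, here is a small Python sketch that counts 502 responses per backend address and request path. It assumes each line is a journald prefix followed by one JSON object, as in the two lines above; the helper names parse_access_line and summarize_errors are hypothetical, and the sample lines are shortened to the three fields actually used.

```python
import json
from collections import Counter


def parse_access_line(line):
    """Return the JSON payload of one journald line, or None if there is none."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None  # e.g. a log line that was truncated or wrapped


def summarize_errors(lines, status=502):
    """Count responses with the given status per (backend address, path)."""
    counts = Counter()
    for line in lines:
        entry = parse_access_line(line)
        if entry and entry.get("DownstreamStatus") == status:
            counts[(entry.get("BackendAddr"), entry.get("RequestPath"))] += 1
    return counts


# Shortened versions of the two access-log lines above.
sample = [
    'nov. 06 09:14:43 tljh traefik[2352198]: {"BackendAddr":"127.0.0.1:15001",'
    '"DownstreamStatus":502,"RequestPath":"/hub/spawn-pending/955"}',
    'nov. 06 09:14:43 tljh traefik[2352198]: {"BackendAddr":"127.0.0.1:15001",'
    '"DownstreamStatus":502,"RequestPath":"/favicon.ico"}',
]
for (backend, path), count in summarize_errors(sample).items():
    print(backend, path, count)
```

Run over the full Traefik journal, this would show whether the 502s all come from the same backend (here 127.0.0.1:15001) or also from individual user servers.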

Does anyone have an idea of the cause of the problem, or of a way to get more information to determine it?

Thanks in advance.

PS:
Here are the versions of the installed applications:
IPython : 8.12.0
ipykernel : 6.22.0
ipywidgets : 7.7.5
jupyter_client : 7.0.6
jupyter_core : 5.3.0
jupyter_server : 1.23.6
jupyterlab : 3.6.3
nbclient : 0.7.3
nbconvert : 7.3.1
nbformat : 5.8.0
notebook : 6.5.4
qtconsole : not installed
traitlets : 5.9.0


Can you share pip freeze for the hub env? This sounds like jupyterhub-traefik-proxy is very out of date. It should be at least 1.0, and 2.0 if you are using up-to-date tljh (also 2.0).
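
If the full pip freeze is too long to post, a short Python check like the following should report just the relevant versions. It is only a sketch: it assumes the distributions are named jupyterhub, jupyterhub-traefik-proxy and the-littlest-jupyterhub, and that it is run with the hub environment's interpreter (typically /opt/tljh/hub/bin/python3 on a TLJH install) rather than the user environment's.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as assumed above; adjust if your install differs.
PACKAGES = ("jupyterhub", "jupyterhub-traefik-proxy", "the-littlest-jupyterhub")


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


versions = {name: installed_version(name) for name in PACKAGES}
for name, found in versions.items():
    print(f"{name}: {found or 'not installed'}")
```

The versions listed in the PS above look like they come from the user environment, so they do not tell us what the hub itself is running.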