Getting occasional errors in my JupyterHub instances

SETUP & CONFIG:
I am running two JupyterHub v4.0.2 instances on two separate servers, load balanced (round robin) by HAProxy on a third server. Both instances run inside their own Anaconda environments and use SystemdSpawner with PAM authentication. The configuration is as follows:

import os
import sys
import logging
c = get_config()  #noqa
c.JupyterHub.concurrent_spawn_limit = 0
c.EventLog.handlers = [
    logging.FileHandler('/var/log/jupyterhub/event.log'),
]
c.EventLog.allowed_schemas = [
    'hub.jupyter.org/server-action'
]
c.JupyterHub.bind_url = 'http://:8000'
c.Authenticator.admin_users = {'root'}
c.JupyterHub.ip = '0.0.0.0'
c.JupyterHub.shutdown_on_logout = True
c.Spawner.start_timeout = 120
c.JupyterHub.services = [
    {
        'name': 'idle-culler',
        'command': [sys.executable, '-m', 'jupyterhub_idle_culler', '--timeout=3600', '--api-page-size=200'],
    }
]
c.JupyterHub.load_roles = [
    {
        "name": "list-and-cull", # name the role
        "services": [
            "idle-culler", # assign the service to this role
        ],
        "scopes": [
            # declare what permissions the service should have
            "list:users", # list users
            "read:users:activity", # read user last-activity
            "admin:servers", # start/stop servers
        ],
    }
]
c.JupyterHub.spawner_class = 'systemdspawner.SystemdSpawner'
c.SystemdSpawner.mem_limit = '700M'
c.SystemdSpawner.cpu_limit = 1.0
c.SystemdSpawner.user_workingdir = '/home/{USERNAME}'
c.SystemdSpawner.isolate_tmp = True
c.SystemdSpawner.disable_user_sudo = True
c.Spawner.args = ['--allow-root']
c.Spawner.default_url = '/lab'
c.Authenticator.delete_invalid_users = True

LOAD BALANCER CONFIG:

frontend https_frontend
    mode http
    bind :80
    bind :443 ssl crt /etc/haproxy/key.pem
    http-request redirect scheme https unless { ssl_fc }
    default_backend app

backend app
    balance roundrobin
    mode http
    cookie HA_cookie maxidle 10m maxlife 1h insert indirect nocache dynamic
    dynamic-cookie-key fdfhauhfe923732
    server s1 192.168.3.16:8000 check
    server s2 192.168.3.17:8000 check

EXTENSIONS I AM USING:
nbconvert,
r-irkernel,
octave_kernel

ERRORS I AM GETTING (THESE ARE FROM BOTH SERVERS)

  1. UNKNOWN ERROR FROM CONFIG PROXY EVERY 5 SECONDS IN THE LOGS (when this error appears frequently, the website gets very slow, resulting in a 504 Gateway Timeout)
Apr 02 11:03:42 Lab1 jupyterhub[1835598]: 11:03:42.626 [ConfigProxy] error: Uncaught Exception: read ECONNRESET
Apr 02 11:03:42 Lab1 jupyterhub[1835598]: 11:03:42.626 [ConfigProxy] error: Error: read ECONNRESET
  2. OCCASIONAL 404
Apr 01 09:21:50 Lab1 jupyterhub[636410]: [I 2024-04-01 09:21:50.294 JupyterHub log:191] 302 GET /user/user1/api/contents?content=1&1711943519689 -> /hub/user/user1/api/contents?content=1&1711943519689 (@192.168.3.12) 0.39ms
Apr 01 09:21:50 Lab1 jupyterhub[636410]: [W 2024-04-01 09:21:50.304 JupyterHub web:1869] 404 GET /hub/user/user1/api/contents?content=1&1711943519689 (192.168.3.12): No access to resources or resources not found
Apr 01 09:21:50 Lab1 jupyterhub[636410]: [W 2024-04-01 09:21:50.305 JupyterHub log:191] 404 GET /hub/user/user1/api/contents?content=1&1711943519689 (user1@192.168.3.12) 1.71ms
Apr 01 09:21:50 Lab1 jupyterhub[636410]: [I 2024-04-01 09:21:50.317 JupyterHub log:191] 302 GET /user/user1/api/contents?content=1&1711943519712 -> /hub/user/user1/api/contents?content=1&1711943519712 (@192.168.3.12) 0.42ms

THINGS I AM SURE ARE NOT THE PROBLEM:

  1. The errors are not due to directory permissions.
  2. Not due to insufficient memory or CPU.

TEMPORARY SOLUTION I FOUND:

conda update --all

This appeared to resolve the 404 errors, but I am not sure whether it actually fixed the underlying issue; I am just not seeing the 404s for now.

CAN SOMEONE HELP ME DEBUG THIS ISSUE?

If you’re using load balancing without sticky sessions, a user may be sent to one JupyterHub for some requests but to the other JupyterHub for other requests. I’m not familiar enough with HAProxy configuration to judge whether your configuration is sufficient (once a user logs in to one JupyterHub, are they guaranteed to always be sent to that hub in the future whilst their single-user server is active?).
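For reference, here is a minimal sketch of one common way to pin sessions in HAProxy, using static per-server cookie values rather than the dynamic-cookie approach in your config (server names and addresses copied from your backend; adjust as needed):

backend app
    balance roundrobin
    mode http
    # static cookie value per server: a browser that holds the cookie
    # keeps returning to the same hub for as long as the cookie is valid
    cookie HA_cookie insert indirect nocache
    server s1 192.168.3.16:8000 check cookie s1
    server s2 192.168.3.17:8000 check cookie s2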

Is this from an HAProxy health check? Maybe try an HTTP check instead of a TCP check?
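If you want to try that, here is a rough sketch, assuming JupyterHub’s /hub/health endpoint is reachable on the backend addresses from your config:

backend app
    mode http
    # HTTP health check against JupyterHub's health endpoint instead of a bare TCP connect
    option httpchk GET /hub/health
    http-check expect status 200
    server s1 192.168.3.16:8000 check
    server s2 192.168.3.17:8000 check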

It’s probably worth removing one of your JupyterHub servers from HAProxy and focusing on fixing all the issues without the additional complexity of load balancing.

Another option: give your JupyterHub servers separate domain names (hub-1.example.org, hub-2.example.org), and either split your users between the two servers, or redirect them randomly.
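A rough sketch of what host-based routing could look like in HAProxy (hub-1.example.org and hub-2.example.org are placeholder names; the backend addresses are copied from your original config):

frontend https_frontend
    mode http
    bind :443 ssl crt /etc/haproxy/key.pem
    # route by Host header to a dedicated backend per hub
    use_backend hub1 if { hdr(host) -i hub-1.example.org }
    use_backend hub2 if { hdr(host) -i hub-2.example.org }

backend hub1
    mode http
    server s1 192.168.3.16:8000 check

backend hub2
    mode http
    server s2 192.168.3.17:8000 check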

I should also mention there are ways to properly “load-balance” your user servers across multiple servers with a single JupyterHub, for example with Kubernetes (see the Zero to JupyterHub with Kubernetes documentation).

I can guarantee that sticky sessions are working fine. The load balancer doesn’t seem to be the problem either, as I’ve tested the application without the LB and still get the same ECONNRESET. I tried changing the health check to HTTP too, but it made no difference. I am currently running a single server with the LB. We are not going with Kubernetes for now.

Getting ECONNRESET at regular intervals of 10 seconds is not itself causing an issue. But at random times, roughly once a day, the web server (without the LB) gets slow: browsing hub1.example.com takes a long time to load (>2 min) and eventually times out. At the time of this issue the logs are:

Is there any way I can find out more about this error? Increasing the log level to ‘DEBUG’ doesn’t help because it didn’t increase the level for the configproxy. Help is needed.
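For reference, the configurable-http-proxy has its own debug flag, separate from the Hub’s log level. A minimal sketch for jupyterhub_config.py, assuming the default ConfigurableHTTPProxy is in use:

c = get_config()  #noqa
# debug logging for the Hub itself
c.JupyterHub.log_level = 'DEBUG'
# enable debug-level logging in the configurable-http-proxy process
c.ConfigurableHTTPProxy.debug = True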

I am including the other part of the error at the time of the crash below, as I am restricted to only one image per reply.

Can you show us your JupyterHub logs with debugging turned on?

What does your system resource consumption look like when the slowdown occurs?

Do you see the same slowdown if you access JupyterHub directly instead of via HAProxy?