Trouble with systemd limits in JupyterHub

Hi,
We have a cluster running JupyterHub. We offer users the option to either submit Jupyter as a Slurm job, for which we use ProfilesSpawner, or to run a local development server natively on the node. However, we have observed that on the node the JupyterHub instance does not obey the systemd limits we set.

I checked with the systemd checker script available in systemdspawner and it reported memory and CPU limiting as enabled, which should be the case (it is enabled as a per-user slice).

The config is as follows:

import batchspawner
c.JupyterHub.cleanup_servers = False

c.Authenticator.allow_all = True
c.Spawner.env_keep = ['PATH', 'PYTHONPATH', 'CONDA_ROOT', 'CONDA_DEFAULT_ENV', 'VIRTUAL_ENV', 'LANG', 'LC_ALL', 'JUPYTERHUB_SINGLEUSER_APP']

c.Spawner.start_timeout = 120

c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'

c.Spawner.cmd = ['jupyter-labhub']
c.Spawner.http_timeout = 120
c.SlurmSpawner.batch_script = '''#!/bin/bash
#SBATCH --output={{homedir}}/Jupyter/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=jupyterhub
#SBATCH --chdir={{homedir}}
#SBATCH --export={{keepvars}}
#SBATCH --constraint=bookworm
#SBATCH --get-user-env=L
{% if partition  %}#SBATCH --partition={{partition}}
{% endif %}{% if runtime    %}#SBATCH --time={{runtime}}
{% endif %}{% if memory     %}#SBATCH --mem={{memory}}
{% endif %}{% if gres       %}#SBATCH --gres={{gres}}
{% endif %}{% if nprocs     %}#SBATCH --cpus-per-task={{nprocs}}
{% endif %}{% if reservation%}#SBATCH --reservation={{reservation}}
{% endif %}{% if options    %}#SBATCH {{options}}{% endif %}

set -euo pipefail
trap 'echo SIGTERM received' TERM

module load git
module load jupyterhub/1.1

{{prologue}}
which jupyterhub-singleuser
{{cmd}}
echo "jupyterhub-singleuser ended gracefully"
{{epilogue}}
'''

### SystemdSpawner config

c.SystemdSpawner.mem_limit = '16G'
c.SystemdSpawner.cpu_limit = 4.0
c.SystemdSpawner.disable_user_sudo = True
c.ProfilesSpawner.ip = '0.0.0.0'

c.ProfilesSpawner.profiles = [
 
 ('Local server - Use it !*ONLY FOR DEVELOPMENT*! 16GB RAM, 8 CPUs', 'local_limited', 'systemdspawner.SystemdSpawner', {'ip':'0.0.0.0', 'limits':{'mem_limit':'16G', 'cpu_limit':'4.0'}}),
 ('mycluster - 1 CPU core, 4GB RAM, No GPU, 8 hours', 'mycluster1c4gb0gpu8h', 'batchspawner.SlurmSpawner', dict(req_nprocs='1', req_partition='default', req_runtime='8:00:00', req_memory='4G', req_gpu='0')),

 ('mycluster - 8 CPU core, 20GB RAM, No GPU, 48 hours', 'cluster8c20gb0gpu48h', 'batchspawner.SlurmSpawner', dict(req_nprocs='8', req_partition='default', req_runtime='48:00:00', req_memory='20G', req_gpu='0')),
 ('mycluster - 16 CPU core, 32GB RAM, No GPU, 48 hours', 'cluster8c32gb0gpu48h', 'batchspawner.SlurmSpawner', dict(req_nprocs='16', req_partition='default', req_runtime='48:00:00', req_memory='32G', req_gpu='0')),
 ('mycluster - 4 CPU core, 60GB RAM, No GPU, 48 hours', 'mycluster4c60gb0gpu48h', 'batchspawner.SlurmSpawner', dict(req_nprocs='4', req_partition='default', req_runtime='48:00:00', req_memory='60G', req_gpu='0')),

]

I don’t see anything in the logs apart from a few errors:

[I 2024-11-13 15:18:26.905 JupyterHub app:3352] Running JupyterHub version 5.2.0
[I 2024-11-13 15:18:26.905 JupyterHub app:3382] Using Authenticator: jupyterhub.auth.PAMAuthenticator-5.2.0
[I 2024-11-13 15:18:26.905 JupyterHub app:3382] Using Spawner: wrapspawner.wrapspawner.ProfilesSpawner
[I 2024-11-13 15:18:26.905 JupyterHub app:3382] Using Proxy: jupyterhub.proxy.ConfigurableHTTPProxy-5.2.0
[I 2024-11-13 15:18:26.907 JupyterHub app:1837] Loading cookie_secret from /etc/jupyterhub-test/jupyterhub_cookie_secret
[I 2024-11-13 15:18:26.954 JupyterHub proxy:556] Generating new CONFIGPROXY_AUTH_TOKEN
[W 2024-11-13 15:18:26.990 JupyterHub spawner:179]
    The shared database session at Spawner.db is deprecated, and will be removed.
    Please manage your own database and connections.

    Contact JupyterHub at https://github.com/jupyterhub/jupyterhub/issues/3700
    if you have questions or ideas about direct database needs for your Spawner.

[W 2024-11-13 15:18:27.003 JupyterHub spawner:179]
    The shared database session at Spawner.db is deprecated, and will be removed.
    Please manage your own database and connections.

    Contact JupyterHub at https://github.com/jupyterhub/jupyterhub/issues/3700
    if you have questions or ideas about direct database needs for your Spawner.


[I 2024-11-13 15:18:27.026 JupyterHub app:3059] user1 still running
[I 2024-11-13 15:18:27.026 JupyterHub app:3422] Initialized 2 spawners in 0.045 seconds
[I 2024-11-13 15:18:27.029 JupyterHub metrics:373] Found 2 active users in the last ActiveUserPeriods.twenty_four_hours
[I 2024-11-13 15:18:27.030 JupyterHub metrics:373] Found 2 active users in the last ActiveUserPeriods.seven_days
[I 2024-11-13 15:18:27.030 JupyterHub metrics:373] Found 4 active users in the last ActiveUserPeriods.thirty_days
[W 2024-11-13 15:18:27.030 JupyterHub proxy:625] Found proxy pid file: /etc/jupyterhub-test/jupyterhub-proxy.pid
[W 2024-11-13 15:18:27.030 JupyterHub proxy:642] Proxy still running at pid=1274690
[W 2024-11-13 15:18:29.031 JupyterHub proxy:662] Stopped proxy at pid=1274690
[I 2024-11-13 15:18:29.032 JupyterHub proxy:752] Starting proxy @ https://x.x.x.x:443/
[E 2024-11-13 15:18:29.039 JupyterHub proxy:949] api_request to proxy failed: HTTP 403: Forbidden
[E 2024-11-13 15:18:29.039 JupyterHub app:3921]
    Traceback (most recent call last):
      File "/mnt/nfs/clustersw/Debian/bookworm/jupyterhub/1.1/lib/python3.11/site-packages/jupyterhub/app.py", line 3919, in launch_instance_async
        await self.start()
      File "/mnt/nfs/clustersw/Debian/bookworm/jupyterhub/1.1/lib/python3.11/site-packages/jupyterhub/app.py", line 3706, in start
        await self.proxy.get_all_routes()
      File "/mnt/nfs/clustersw/Debian/bookworm/jupyterhub/1.1/lib/python3.11/site-packages/jupyterhub/proxy.py", line 989, in get_all_routes
        resp = await self.api_request('', client=client)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/mnt/nfs/clustersw/Debian/bookworm/jupyterhub/1.1/lib/python3.11/site-packages/jupyterhub/proxy.py", line 953, in api_request
        result = await exponential_backoff(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/mnt/nfs/clustersw/Debian/bookworm/jupyterhub/1.1/lib/python3.11/site-packages/jupyterhub/utils.py", line 249, in exponential_backoff
        ret = await maybe_future(pass_func(*args, **kwargs))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/mnt/nfs/clustersw/Debian/bookworm/jupyterhub/1.1/lib/python3.11/site-packages/jupyterhub/proxy.py", line 938, in _wait_for_api_request
        return await client.fetch(req)
               ^^^^^^^^^^^^^^^^^^^^^^^
    tornado.httpclient.HTTPClientError: HTTP 403: Forbidden
15:18:29.181 [ConfigProxy] info: Proxying https://x.x.x.x:443 to (no default)
15:18:29.182 [ConfigProxy] info: Proxy API at http://127.0.0.1:8001/api/routes
15:18:29.188 [ConfigProxy] error: Uncaught Exception: listen EADDRINUSE: address already in use x.x.x.x:443
15:18:29.188 [ConfigProxy] error: Error: listen EADDRINUSE: address already in use x.x.x.x:443
    at Server.setupListenHandle [as _listen2] (node:net:1897:16)
    at listenInCluster (node:net:1945:12)
    at doListen (node:net:2109:7)
    at process.processTicksAndRejections (node:internal/process/task_queues:83:21)
15:18:29.188 [ConfigProxy] error: Uncaught Exception: listen EADDRINUSE: address already in use 127.0.0.1:8001
15:18:29.189 [ConfigProxy] error: Error: listen EADDRINUSE: address already in use 127.0.0.1:8001
    at Server.setupListenHandle [as _listen2] (node:net:1897:16)
    at listenInCluster (node:net:1945:12)
    at doListen (node:net:2109:7)
    at process.processTicksAndRejections (node:internal/process/task_queues:83:21)

How do you conclude that the systemd limits are not being imposed on single-user servers? What tests are you running to check that?

I log in as my user on JupyterHub and run a process that consumes 100 GB of RAM. I get 100 GB of RAM, which is far above the 16 GB limit I set.

The program is a simple, nonsensical tensor multiplier that generates random tensors and multiplies them to approximate the amount of RAM I want to consume.
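For reference, the stress test was along these lines (a minimal sketch, not the exact script; the function name and sizes are illustrative):

```python
# Minimal sketch of the RAM-consuming test described above.
# consume_memory and its parameters are illustrative, not the actual script.

def consume_memory(target_mb: int, chunk_mb: int = 64) -> int:
    """Allocate roughly target_mb MiB in chunk_mb-sized chunks; return bytes held."""
    chunks = []
    held = 0
    while held < target_mb * 1024 * 1024:
        # bytearray(n) is zero-filled, so the pages are actually touched
        chunks.append(bytearray(chunk_mb * 1024 * 1024))
        held += chunk_mb * 1024 * 1024
    return held
```

With a working 16G `MemoryMax`, asking for ~100 GiB this way should get the process OOM-killed; without the limit it simply succeeds, which is how the missing limit shows up.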


Just to verify a few things: once you spawn a single-user server, open a terminal and execute cat /proc/self/cgroup. This command gives you the current cgroup(s) of the process; you will see a path after two colons. Prepend /sys/fs/cgroup/ to that path to check the memory settings of the cgroup. For instance, look at /sys/fs/cgroup/<path you get>/{memory.max,memory.high}. Check whether these values are set to 16G or not.
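Concretely, that check can be scripted like this (a sketch assuming cgroup v2, the unified hierarchy; cgroup v1 uses per-controller paths instead):

```shell
# Resolve the current process's cgroup and print its memory/CPU limits.
# On cgroup v2, /proc/self/cgroup contains a single "0::<path>" line.
cg=$(cut -d: -f3- /proc/self/cgroup | head -n1)
for f in memory.max memory.high cpu.max; do
    p="/sys/fs/cgroup${cg}/${f}"
    if [ -r "$p" ]; then
        echo "$f: $(cat "$p")"
    else
        echo "$f: not readable at $p"
    fi
done
```

If the limits are applied, memory.max should show 17179869184 (16G) rather than `max`.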

You can also check the systemd command created by JupyterHub to ensure that all options are properly passed to systemd.

Ah I see, the output is:
max
max
for both memory.max and memory.high

Similarly, cpu.max is set to
max 100000

Well, that's a start. What is the full systemd command that JupyterHub launched? I think you should be able to get it using the ps aux command.

I assume you are running JupyterHub as a systemd service? Can you execute the systemd-cgls command to check the hierarchy of cgroups?

I had a look at the systemd process:

MemoryAccounting=yes
DefaultMemoryLow=0
DefaultMemoryMin=0
MemoryMin=0
MemoryLow=0
MemoryHigh=infinity
MemoryMax=infinity
MemorySwapMax=infinity
MemoryLimit=infinity
ManagedOOMMemoryPressure=auto
ManagedOOMMemoryPressureLimit=0

CPUAccounting is also enabled for the running systemd unit, but it again says I can use everything.
This is for my specific instance, but it is the same for all instances.

Regarding what command is launched, from what I could see:

th=pathto/jupyterhub/1.1/bin/jupyter-labhub ; argv[]=pathto/jupyterhub/1.1/bin/jupyter-labhub ; ignore_errors=no ; start_time=[Wed 2024-11-13 10:56:49 CET] ; stop_time=[n/a] ; pid=1143502 ; code=(null) ; status=0/0

From ps aux:

pathto/jupyterhub/1.1/bin/python3.11 -m ipykernel_launcher -f myuserHome/.local/share/jupyter/runtime/kernel-0a7971a0-d7b6-4ddb-abbb-8b22568e64fb.json

Yes, it falls under
system.slice

─jupyter-myUser-singleuser.service (#2260442)
    → user.invocation_id: 6fc1d6a79d0143878621ba7063c22036
    → trusted.invocation_id: 6fc1d6a79d0143878621ba7063c22036
    ├─1143502 pathto/bookworm/jupyterhub/1.1/bin/python3.11 pathto/bookworm/jupyterhub/1.1/bin/jupyter-labhub
    ├─1144881 /bin/bash -l
    ├─1632933 pathto/jupyterhub/1.1/bin/python3.11 -m ipykernel_launcher -f myHome/.local/share/jupyter/runtime/kernel-0a7971a0-d7b6-4ddb-abbb-8b22568e64fb.json
    └─1632950 /bin/bash -l

Probably this is the issue. SystemdSpawner does not have any traitlet called limits, if I am not wrong. Can you try using mem_limit and cpu_limit without putting them inside a limits dict?

I tried this and it is still the same.

Now my config reads:

 ('Local server - Use it !*ONLY FOR DEVELOPMENT*! 16GB RAM, 8 CPUs', 'local_limited', 'systemdspawner.SystemdSpawner', {'ip':'0.0.0.0', 'mem_limit':'16G', 'cpu_limit':'4.0'}),

However the slice still shows:

systemctl show jupyter-myUser-singleuser.service | grep Memory
MemoryCurrent=107849744384
MemoryAvailable=infinity
EffectiveMemoryNodes=0-1
MemoryAccounting=yes

I saw that the SystemdSpawner docs say to use
c.JupyterHub.spawner_class = 'systemd'

However, I am using
c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
which then calls SystemdSpawner, but I am unsure whether I instead have to set the systemd spawner class as a child of ProfilesSpawner.

When debugging systems it’s usually easier to simplify things as much as possible, since you may have multiple interacting problems across different components. Can you try running SystemdSpawner only?

Hi,
That is a good suggestion, and I did this now. Apart from secrets, my config looks like:

import batchspawner
c.JupyterHub.hub_connect_ip = 'x.x.x.x'
c.JupyterHub.hub_ip = 'x.x.x.x'
c.JupyterHub.cleanup_servers = False
c.Authenticator.allow_all = True

c.Spawner.env_keep = ['PATH', 'PYTHONPATH', 'CONDA_ROOT', 'CONDA_DEFAULT_ENV', 'VIRTUAL_ENV', 'LANG', 'LC_ALL', 'JUPYTERHUB_SINGLEUSER_APP']
c.Spawner.start_timeout = 120


c.JupyterHub.spawner_class = 'systemdspawner.SystemdSpawner'
c.Spawner.cmd = ['jupyter-labhub']


c.Spawner.http_timeout = 120
                                
c.SystemdSpawner.mem_limit = '16G'
c.SystemdSpawner.cpu_limit = 4.0
c.SystemdSpawner.disable_user_sudo = True
c.ProfilesSpawner.ip = '0.0.0.0'
c.Authenticator.admin_users = {"adminUser"}

It is as simple as possible. I start JupyterHub with:

jupyterhub --ip x.x.x.x --port 443 --ssl-key my.key --ssl-cert mycert.pem -f /etc/jupyterhub-test/jupyterhub_test_config.py &> /var/log/JupyterHub.log &

systemctl show jupyter-myUser-singleuser.service
MemoryAccounting=yes
DefaultMemoryLow=0
DefaultMemoryMin=0
MemoryMin=0
MemoryLow=0
MemoryHigh=infinity
MemoryMax=17179869184
MemorySwapMax=infinity
MemoryLimit=infinity

CPUAccounting=yes
CPUWeight=[not set]
StartupCPUWeight=[not set]
CPUShares=[not set]
StartupCPUShares=[not set]
CPUQuotaPerSecUSec=4s


This is working fine: Jupyter now kills my process if I consume too much RAM. The question is how I get it to work with ProfilesSpawner.
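As a side note on the numbers above: SystemdSpawner translates cpu_limit into a systemd CPUQuota, so cpu_limit = 4.0 becomes CPUQuota=400%, which systemctl reports as CPUQuotaPerSecUSec=4s (4 CPU-seconds per wall-clock second), and '16G' parses to the MemoryMax=17179869184 shown. A quick conversion sketch (function names are illustrative, not part of any API):

```python
# Map SystemdSpawner-style limits to the properties seen in `systemctl show`.
# Function names are illustrative, not part of SystemdSpawner's API.

def cpu_limit_to_cpuquota(cpu_limit: float) -> str:
    """cpu_limit counts CPU cores; systemd's CPUQuota is a percentage."""
    return f"{int(cpu_limit * 100)}%"

def mem_limit_to_bytes(mem_limit: str) -> int:
    """Parse a '16G'-style limit into bytes (binary units, as systemd uses)."""
    units = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
    return int(mem_limit[:-1]) * units[mem_limit[-1].upper()]

# cpu_limit=4.0  -> CPUQuota=400% -> CPUQuotaPerSecUSec=4s
# mem_limit='16G' -> MemoryMax=17179869184
```

Checking the `systemctl show` output against these conversions is a quick way to confirm the spawner config actually reached the unit.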

For some reason, if I do
import systemdspawner
and then define my profiles:

 ('Local server - Use it !*ONLY FOR DEVELOPMENT*! 16GB RAM, 8 CPUs', 'local_limited', 'systemdspawner.SystemdSpawner', {'ip':'0.0.0.0', 'mem_limit':'16G', 'cpu_limit':4.0,'disable_user_sudo': True}),

This works. Indeed, the problem was mem_limit not being functional as a key inside the limits dict, plus a missing proper restart. I really appreciate all the help you gave in troubleshooting this!
