Batchspawner works - wrapspawner doesn't

I'm trying to upgrade a server that is only used as a backup, so I don't work with JupyterHub very often. We use JupyterHub and Slurm on a single server to control jobs. I was asked to update CUDA on the system, so I figured it would be a good time to upgrade the other pieces as well, since it had been three years since I built it.

Previously, with Slurm 20.02.1 and JupyterHub 1.1.0, everything worked using wrapspawner with profiles. Unfortunately, I don't seem to be able to find the old versions of wrapspawner and batchspawner that I had installed.

The upgraded system is running JupyterHub 1.5.0 and Slurm 22.05.6 (though I don't think Slurm is the issue). wrapspawner is 1.0.2.dev0 and batchspawner is 1.2.0 (also not the problem, as shown below).

The system is CentOS 7. JupyterHub is deployed from a virtualenv created in /opt, and the remaining pieces are installed per the instructions on the JupyterHub website. I added batchspawner and wrapspawner via pip install from their GitHub source.

NOTE: I had to replace http with weblink in the output below to get this topic through, because new users are only allowed to share 2 links.

If I only use batchspawner, with a Slurm configuration like the following (showing only the relevant section of jupyterhub_config.py), a JupyterLab interface spawns and everything is fine.


c = get_config()

c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'
c.Spawner.environment.update({"JUPYTER_PREFER_ENV_PATH": "0"})

import batchspawner

c.Spawner.http_timeout = 180
c.Spawner.start_timeout = 300
c.SlurmSpawner.req_partition = 'faculty'
c.SlurmSpawner.req_ngpus = '0'
c.SlurmSpawner.req_nprocs = '4'
c.SlurmSpawner.req_memory = '8gb'
c.SlurmSpawner.req_runtime = '2:00:00'

c.SlurmSpawner.batch_script = '''#!/bin/bash
#SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=jupyterhub-spawner
#SBATCH --chdir=/home/{username}
#SBATCH --cpus-per-task={nprocs}
#SBATCH --mem={memory}
#SBATCH --time={runtime}
#SBATCH --partition={partition}
#SBATCH --gpus={ngpus}
#SBATCH {options}

export PATH=/opt/jupyterhub/bin:/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/jupyterhub/lib:/opt/jupyterhub/lib64:/usr/lib64:/usr/lib
env
which jupyterhub-singleuser
{cmd}
'''
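For context on how I understand the {...} placeholders above: batchspawner fills them in from the req_* settings and per-user values before submitting the script with sbatch. Here is a rough stand-alone illustration of that substitution idea, using plain Python str.format rather than batchspawner's actual templating code:

# Rough illustration only -- batchspawner has its own templating; this just
# shows how the req_* values map onto the {nprocs}/{memory}/... fields above.
template = """#!/bin/bash
#SBATCH --cpus-per-task={nprocs}
#SBATCH --mem={memory}
#SBATCH --time={runtime}
#SBATCH --partition={partition}
"""

print(template.format(nprocs="4", memory="8gb",
                      runtime="2:00:00", partition="faculty"))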


If I now add the code to use wrapspawner with profiles, the Slurm job does get launched, but JupyterHub seems to lose track of it and eventually times out.


import batchspawner

c.Spawner.http_timeout = 180
c.Spawner.start_timeout = 300
c.SlurmSpawner.req_partition = 'faculty'
c.SlurmSpawner.req_ngpus = '0'
c.SlurmSpawner.req_nprocs = '4'
c.SlurmSpawner.req_memory = '8gb'
c.SlurmSpawner.req_runtime = '2:00:00'

c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
c.ProfilesSpawner.profiles = [
    ('Small - 8 cores, 128 GB, 2 hours', 'local8c128g2h', 'batchspawner.SlurmSpawner',
        dict(req_nprocs='8', req_runtime='2:00:00', req_memory='128gb')),
    ('Medium - 12 cores, 256 GB, 4 hours', 'local12c256gb4h', 'batchspawner.SlurmSpawner',
        dict(req_nprocs='12', req_runtime='4:00:00', req_memory='256gb')),
    ('Large(max) - 24 cores, 300 GB, 24 hours', 'local24c300g24h', 'batchspawner.SlurmSpawner',
        dict(req_nprocs='24', req_runtime='24:00:00', req_memory='300gb')),
    ('Small GPU(default 4 CPU) - 2 GPU 64 GB 2 hours', 'local2gpu64g2h', 'batchspawner.SlurmSpawner',
        dict(req_ngpus='2', req_runtime='2:00:00', req_memory='64gb')),
    ('Medium GPU(default 4 CPU) - 3 GPU 128 GB 8 hours', 'local3gpu128g8h', 'batchspawner.SlurmSpawner',
        dict(req_ngpus='3', req_runtime='8:00:00', req_memory='128gb')),
    ('Large GPU(default 4 CPU) - 4 GPU 300 GB 12 hours', 'local4gpu300g12h', 'batchspawner.SlurmSpawner',
        dict(req_ngpus='4', req_runtime='12:00:00', req_memory='300gb')),
    ('Full CPU and GPU - 24 CPU 4 GPU 300 GB 24 hours', 'local24c4gpu300g24h', 'batchspawner.SlurmSpawner',
        dict(req_nprocs='24', req_ngpus='4', req_runtime='24:00:00', req_memory='300gb'))
]
c.ProfilesSpawner.ip = '0.0.0.0'

c.SlurmSpawner.batch_script = '''#!/bin/bash
#SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=jupyterhub-spawner
#SBATCH --chdir=/home/{username}
#SBATCH --cpus-per-task={nprocs}
#SBATCH --mem={memory}
#SBATCH --time={runtime}
#SBATCH --partition={partition}
#SBATCH --gpus={ngpus}
#SBATCH {options}

export PATH=/opt/jupyterhub/bin:/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/jupyterhub/lib:/opt/jupyterhub/lib64:/usr/lib64:/usr/lib
env
which jupyterhub-singleuser
{cmd}
'''
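For clarity, the way I read each entry in c.ProfilesSpawner.profiles (based on the wrapspawner README; please correct me if I have this wrong) is (display name, unique key, spawner class, dict of config overrides). Annotated with the first profile above:

# My reading of one profile entry -- sketch for explanation only.
example_profile = (
    'Small - 8 cores, 128 GB, 2 hours',   # label shown in the spawn form dropdown
    'local8c128g2h',                       # key stored for the user's choice -- should be unique per profile
    'batchspawner.SlurmSpawner',           # spawner class to launch for this profile
    dict(req_nprocs='8',                   # config overrides applied to that spawner,
         req_runtime='2:00:00',            # on top of the c.SlurmSpawner.* defaults
         req_memory='128gb'),
)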


The log file in my home directory shows that a notebook was launched:


[I 2023-02-01 17:09:09.710 SingleUserNotebookApp notebookapp:1592] Authentication of /metrics is OFF, since other authentication is disabled.
[I 2023-02-01 17:09:10.485 LabApp] JupyterLab extension loaded from /opt/jupyterhub/lib64/python3.8/site-packages/jupyterlab
[I 2023-02-01 17:09:10.485 LabApp] JupyterLab application directory is /opt/jupyterhub/share/jupyter/lab

[I 2023-02-01 17:09:10.494 SingleUserNotebookApp mixins:576] Starting jupyterhub-singleuser server version 1.5.0
[I 2023-02-01 17:09:10.500 SingleUserNotebookApp notebookapp:2328] Serving notebooks from local directory: /home/me
[I 2023-02-01 17:09:10.500 SingleUserNotebookApp notebookapp:2328] Jupyter Notebook 6.5.2 is running at:
[I 2023-02-01 17:09:10.500 SingleUserNotebookApp notebookapp:2328] weblink://gen-datasci:60073/user/me
[I 2023-02-01 17:09:10.500 SingleUserNotebookApp notebookapp:2329] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 2023-02-01 17:09:10.507 SingleUserNotebookApp mixins:556] Updating Hub with activity every 300 seconds


However, it eventually times out with the following message in the web interface:
Spawn failed: Server at weblink://gen-datasci:0/user/me/ didn't respond in 180 seconds


jupyterhub log shows:

Feb 01 17:09:08 gen-datasci jupyterhub[216606]: batchspawner-singleuser jupyterhub-singleuser --ip=0.0.0.0 --SingleUserNotebookApp.default_url=/lab
Feb 01 17:09:08 gen-datasci sudo[216621]: root : TTY=unknown ; PWD=/ ; USER=me ; COMMAND=/bin/sbatch --parsable
Feb 01 17:09:08 gen-datasci jupyterhub[216606]: [I 2023-02-01 17:09:08.464 JupyterHub batchspawner:295] Job submitted. cmd: sudo -E -u me sbatch --parsable output: 44
Feb 01 17:09:08 gen-datasci sudo[216627]: root : TTY=unknown ; PWD=/ ; USER=me ; COMMAND=/bin/squeue -h -j 44 -o %T %B
Feb 01 17:09:09 gen-datasci sudo[216652]: root : TTY=unknown ; PWD=/ ; USER=me ; COMMAND=/bin/squeue -h -j 44 -o %T %B
Feb 01 17:09:09 gen-datasci jupyterhub[216606]: [W 2023-02-01 17:09:09.308 JupyterHub base:170] Rolling back dirty objects IdentitySet([<Server(:0)>])
Feb 01 17:09:09 gen-datasci jupyterhub[216606]: [I 2023-02-01 17:09:09.322 JupyterHub log:189] 302 POST /hub/spawn/me?next=%2Fhub%2Fuser%2Fme%2Flogout -> /hub/spawn-pending/me?next=%2Fhub%2Fuser%2Fme%2Flog
Feb 01 17:09:09 gen-datasci jupyterhub[216606]: [I 2023-02-01 17:09:09.352 JupyterHub pages:402] me is pending spawn
Feb 01 17:09:09 gen-datasci jupyterhub[216606]: [I 2023-02-01 17:09:09.368 JupyterHub log:189] 200 GET /hub/spawn-pending/me?next=%2Fhub%2Fuser%2Fme%2Flogout (me@::ffff:127.0.0.1) 42.87ms
Feb 01 17:09:09 gen-datasci jupyterhub[216606]: [I 2023-02-01 17:09:09.485 JupyterHub log:189] 200 POST /hub/api/batchspawner (me@127.0.0.1) 33.73ms
Feb 01 17:09:09 gen-datasci jupyterhub[216606]: [I 2023-02-01 17:09:09.611 JupyterHub batchspawner:463] Notebook server job 44 started at gen-datasci:60073
Feb 01 17:09:10 gen-datasci jupyterhub[216606]: [I 2023-02-01 17:09:10.499 JupyterHub log:189] 200 GET /hub/api (@127.0.0.1) 1.35ms
Feb 01 17:09:10 gen-datasci jupyterhub[216606]: [I 2023-02-01 17:09:10.537 JupyterHub log:189] 200 POST /hub/api/users/me/activity (me@127.0.0.1) 28.47ms
Feb 01 17:09:18 gen-datasci sudo[216681]: root : TTY=unknown ; PWD=/ ; USER=me ; COMMAND=/bin/squeue -h -j 44 -o %T %B
Feb 01 17:09:18 gen-datasci jupyterhub[216606]: [W 2023-02-01 17:09:18.391 JupyterHub base:1040] User me is slow to become responsive (timeout=10)
Feb 01 17:09:39 gen-datasci sudo[216691]: root : TTY=unknown ; PWD=/ ; USER=me ; COMMAND=/bin/squeue -h -j 44 -o %T %B
Feb 01 17:10:09 gen-datasci sudo[216719]: root : TTY=unknown ; PWD=/ ; USER=me ; COMMAND=/bin/squeue -h -j 44 -o %T %B
Feb 01 17:10:39 gen-datasci sudo[216728]: root : TTY=unknown ; PWD=/ ; USER=me ; COMMAND=/bin/squeue -h -j 44 -o %T %B
Feb 01 17:11:09 gen-datasci sudo[216746]: root : TTY=unknown ; PWD=/ ; USER=me ; COMMAND=/bin/squeue -h -j 44 -o %T %B
Feb 01 17:11:39 gen-datasci sudo[216755]: root : TTY=unknown ; PWD=/ ; USER=me ; COMMAND=/bin/squeue -h -j 44 -o %T %B

Feb 01 17:12:09 gen-datasci sudo[216772]: root : TTY=unknown ; PWD=/ ; USER=me ; COMMAND=/bin/squeue -h -j 44 -o %T %B

Feb 01 17:12:24 gen-datasci jupyterhub[216606]: [W 2023-02-01 17:12:24.823 JupyterHub user:767] me's server never showed up at weblink://gen-datasci:0/user/me/ after 180 seconds. Giving up
Feb 01 17:12:24 gen-datasci sudo[216781]: root : TTY=unknown ; PWD=/ ; USER=me ; COMMAND=/bin/squeue -h -j 44 -o %T %B
Feb 01 17:12:24 gen-datasci jupyterhub[216606]: [I 2023-02-01 17:12:24.904 JupyterHub batchspawner:477] Stopping server job 44
Feb 01 17:12:24 gen-datasci jupyterhub[216606]: [I 2023-02-01 17:12:24.904 JupyterHub batchspawner:355] Cancelling job 44: sudo -E -u me scancel 44
Feb 01 17:12:24 gen-datasci sudo[216789]: root : TTY=unknown ; PWD=/ ; USER=me ; COMMAND=/bin/scancel 44
Feb 01 17:12:25 gen-datasci sudo[216805]: root : TTY=unknown ; PWD=/ ; USER=me ; COMMAND=/bin/squeue -h -j 44 -o %T %B
Feb 01 17:12:26 gen-datasci sudo[216819]: root : TTY=unknown ; PWD=/ ; USER=me ; COMMAND=/bin/squeue -h -j 44 -o %T %B
Feb 01 17:12:27 gen-datasci sudo[216828]: root : TTY=unknown ; PWD=/ ; USER=me ; COMMAND=/bin/squeue -h -j 44 -o %T %B
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: [E 2023-02-01 17:12:27.282 JupyterHub gen:630] Exception in Future <Task finished name='Task-33' coro=<BaseHandler.spawn_single_user..finish_user_spawn() done, defin
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: Traceback (most recent call last):
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: File "/opt/jupyterhub/lib64/python3.8/site-packages/tornado/gen.py", line 625, in error_callback
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: future.result()
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: File "/opt/jupyterhub/lib64/python3.8/site-packages/jupyterhub/handlers/base.py", line 907, in finish_user_spawn
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: await spawn_future
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: File "/opt/jupyterhub/lib64/python3.8/site-packages/jupyterhub/user.py", line 748, in spawn
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: await self._wait_up(spawner)
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: File "/opt/jupyterhub/lib64/python3.8/site-packages/jupyterhub/user.py", line 795, in _wait_up
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: raise e
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: File "/opt/jupyterhub/lib64/python3.8/site-packages/jupyterhub/user.py", line 762, in _wait_up
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: resp = await server.wait_up(
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: File "/opt/jupyterhub/lib64/python3.8/site-packages/jupyterhub/utils.py", line 236, in wait_for_http_server
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: re = await exponential_backoff(
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: File "/opt/jupyterhub/lib64/python3.8/site-packages/jupyterhub/utils.py", line 184, in exponential_backoff
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: raise TimeoutError(fail_message)
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: TimeoutError: Server at weblink://gen-datasci:0/user/me/ didn't respond in 180 seconds
Feb 01 17:12:27 gen-datasci jupyterhub[216606]: [I 2023-02-01 17:12:27.291 JupyterHub log:189] 200 GET /hub/api/users/me/server/progress (me@::ffff:127.0.0.1) 197804.03ms
Feb 01 17:13:57 gen-datasci jupyterhub[216606]: 17:13:57.409 [ConfigProxy] info: 200 GET /api/routes
Feb 01 17:13:57 gen-datasci jupyterhub[216606]: [I 2023-02-01 17:13:57.410 JupyterHub proxy:347] Checking routes


So what I see is that Slurm launched the JupyterLab interface on port 60073, but for some reason JupyterHub is expecting it on port 0. I've been searching and found one other person who had a similar issue but fixed it by adding the c.ProfilesSpawner.ip = '0.0.0.0' line. I have that line, and it is clearly not making a difference.
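My rough mental model of where port 0 can come from (this is a hedged guess from the symptom, not from reading the wrapspawner source): the wrapper has to copy the child spawner's ip/port back onto itself after the child starts, otherwise the Hub keeps its default port of 0 and polls gen-datasci:0 until it times out. Something like this made-up minimal wrapper, where the child handling is purely illustrative and not wrapspawner's real API:

# Purely illustrative wrapper -- NOT wrapspawner's actual implementation.
# The point is the marked line: if a wrapper never propagates the child's
# ip/port, JupyterHub waits on <host>:0 and eventually times out.
from jupyterhub.spawner import LocalProcessSpawner, Spawner


class MinimalWrapper(Spawner):
    async def start(self):
        # Real wrapspawner builds the child from the selected profile; a
        # LocalProcessSpawner here just keeps this sketch self-contained.
        self.child = LocalProcessSpawner(user=self.user, hub=self.hub)
        result = await self.child.start()
        # The crucial step: copy the child's ip/port back onto the wrapper.
        self.ip, self.port = self.child.ip, self.child.port
        return result

    async def poll(self):
        return await self.child.poll()

    async def stop(self, now=False):
        return await self.child.stop(now=now)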

I just don’t seem to have enough experience to track down the issue. Has anyone seen this?

I found an issue thread on the wrapspawner GitHub repository from last July indicating that, at one point, a code branch broke wrapspawner in exactly the manner I was seeing. I checked the wrapspawner version currently on GitHub and it was older, so I removed my existing install, downloaded the current source again, and installed that one. Now things are working. I'm not sure exactly what happened, but I'm glad I figured it out.
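In case it helps anyone hitting the same thing, here is a quick sanity check I would run inside the hub's virtualenv to see exactly which builds are installed (just a small sketch using the standard library, nothing specific to either project):

# Print the installed versions of the relevant packages (Python 3.8+).
from importlib.metadata import version

for pkg in ("jupyterhub", "batchspawner", "wrapspawner"):
    print(pkg, version(pkg))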