Integrate JupyterHub with Slurm, batchspawner-singleuser issue

Hi,

I’ve been going round in circles and could do with some help. I’m completely new to the world of JupyterHub/Jupyter Notebooks, and at times out of my depth. I’m looking to integrate JupyterHub with Slurm to support a computing teaching project.

I’m currently running a JupyterHub Server on one machine and a Slurm master, also acting as compute node on another. Both machines are running ‘Debian GNU/Linux 12 (bookworm)’.

From reading around, my understanding of the desired flow is the following:

  1. User logs in to the JupyterHub server via their web browser.

  2. User clicks “Start Server” on the spawn page, optionally selecting resources
    (cores, memory, etc.). This sends a request to the JupyterHub server.

  3. JupyterHub Spawns the Job: The SlurmSpawner on the JupyterHub server takes
    the user’s choices, combines them with the batch_script template, and executes
    sbatch to submit the job to the SLURM scheduler.

  4. SLURM Schedules the Job: The SLURM scheduler receives the job and places it
    in the queue.

  5. Job Execution: Once the job reaches the top of the queue, SLURM allocates a
    compute node and runs the batch_script. The script then starts the
    jupyterhub-singleuser server.

  6. Server Communication: The newly started jupyterhub-singleuser server on the
    compute node communicates back to the JupyterHub server’s API (10.136.9.2:8082)
    to announce that it’s ready.

  7. User is Redirected: The JupyterHub server receives the “ready” signal and
    redirects the user’s browser to the newly created single-user notebook.

The final step for the user is seeing their Jupyter environment running.
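To make sure I understand step 3, here's a toy illustration (not batchspawner's actual code) of how I picture the spawner filling in the batch_script template before handing it to sbatch, using str.format-style {placeholders}; the values are made-up examples:

```python
# Toy sketch of template substitution: the spawner combines the user's
# choices with the batch_script template, producing the script sbatch runs.
template = """#!/bin/bash
#SBATCH --chdir={homedir}
#SBATCH {options}
srun {cmd} --port={port}
"""

script = template.format(
    homedir="/home/hs243",                        # example values only
    options="--partition=jupyter --time=01:00:00",
    cmd="jupyterhub-singleuser",
    port=48505,
)
print(script)
```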

However, what I see in the browser is:

Your server is starting up.
You will be redirected automatically when it’s ready for you.

Server requested
Cluster job running… waiting to connect
Spawn failed: Server at http://phy-jhubcomp:48505/user/hs243/api didn’t respond in 30 seconds

The problem showed up in this line from the SLURM log:

slurmstepd-PHY-JHUBCOMP: error: execve(): batchspawner-singleuser: No such file or directory

so although the SLURM job started successfully, the spawner is attempting to run a command called batchspawner-singleuser, and the system cannot find that file in any directory listed in the compute node's $PATH environment variable.
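As I understand it, execve() failing with "No such file or directory" is exactly a failed $PATH lookup. Python's shutil.which performs the same lookup, which is how I've been sanity-checking what the job environment can and can't see:

```python
# shutil.which searches $PATH the same way the shell (and execve via srun's
# PATH search) does: it returns the full path if found, or None if not.
import shutil

print(shutil.which("sh"))  # a command that is on PATH resolves to its full path
print(shutil.which("batchspawner-singleuser-nonexistent"))  # None: not on PATH
```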

I had this line in my JupyterHub config:

c.BatchSpawner.cmd = ['jupyterhub-singleuser']

and an AI assistant told me the following:

However, the batchspawner library often internally wraps the command defined in c.BatchSpawner.cmd with a helper script called batchspawner-singleuser. This helper script is designed to set up the environment and then execute the actual jupyterhub-singleuser server.
The error means that this helper script, which should be installed in your environment, is missing or inaccessible to the user submitting the job.

I tried removing the reference to it using:

c.BatchSpawner.cmd = ['false']

but the AI advised retaining it, so I commented that line out instead.

At this point I’ve wasted a lot of time on the fact that batchspawner sets a default command if one isn’t explicitly set; simply commenting the line out, or setting it to None, does not override the library’s internal default. I’ve tried forcing the command executed by the srun wrapper inside my script to be the correct one, bypassing the batchspawner-singleuser wrapper entirely.

Since batchspawner seems to ignore my attempts to remove its command, I’ve tried to make an explicit c.BatchSpawner.batch_script the definitive source for running the server, completely bypassing whatever srun command the spawner defaults to, but it keeps defaulting to:

srun batchspawner-singleuser jupyterhub-singleuser

so I’ve been banging my head against a recurring problem that appears to come down to the batchspawner library injecting the non-existent batchspawner-singleuser wrapper into my execution command, overriding my script’s logic.
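If I’ve read the batchspawner source correctly, the wrapper name isn’t hard-coded: it comes from a configurable trait (batchspawner_singleuser_cmd), so in principle it could be pointed at the full path instead of relying on the job’s $PATH. I haven’t confirmed this works on my versions, so treat this as speculative:

```python
# Speculative config fragment: batchspawner appears to expose the wrapper
# name as a trait, so the full path could be given rather than relying on
# the compute node's $PATH.
c.SlurmSpawner.batchspawner_singleuser_cmd = (
    "/opt/jupyterhub_envs/shared_conda_env/bin/batchspawner-singleuser"
)
```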

I’ve installed batchspawner-singleuser on the compute node:

$ which batchspawner-singleuser
/opt/jupyterhub_envs/shared_conda_env/bin/batchspawner-singleuser

but despite explicitly setting the full path to it, I was forced to create a symlink to it in /usr/local/bin. That generated errors of its own, since the script needs other pieces from the install location, so I’m now trying a wrapper script in /usr/local/bin, but that also continues to generate errors:

$ cat /usr/local/bin/batchspawner-singleuser-wrapper
#!/bin/bash

# Activate the shared Conda environment
source /opt/conda/bin/activate /opt/jupyterhub_envs/shared_conda_env

# Run the actual batchspawner-singleuser script with all passed arguments
exec /opt/jupyterhub_envs/shared_conda_env/bin/batchspawner-singleuser "$@"

I now feel I’m disappearing down a rabbit hole and getting nowhere.

Is the problem really due to issues with batchspawner-singleuser installed on the compute node, or is my JupyterHub config file poorly constructed? Here’s the key part of it:

from batchspawner import SlurmSpawner

c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'

c.Spawner.start_timeout = 300  # Sets timeout to 5 minutes (300 seconds)

c.JupyterHub.hub_api_url = 'http://10.136.9.2:8082/hub/api'

#The default batch_script does not keep everything from environment and hence drops
#the PATH variable (and others), but by using req_keepvars_extra one can re-add ALL
#(or at least the PATH variable).
c.SlurmSpawner.req_keepvars_extra = 'ALL'

# This is the full, explicit SLURM batch script. It's the most reliable way to ensure all commands are executed.
c.BatchSpawner.batch_script = """#!/bin/bash
#SBATCH --output={homedir}/jupyterhub_slurm_logs/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=spawner-jupyterhub
#SBATCH --chdir={homedir}
#SBATCH --export=ALL
#SBATCH --get-user-env=L
#SBATCH {options}

#echo "--- Environment check ---"
#printenv
#echo "--- Environment check finished ---"

echo "--- Initial PATH ---"
printenv PATH

source /opt/conda/bin/activate /opt/jupyterhub_envs/shared_conda_env

echo "--- Post-Conda PATH ---"
printenv PATH

echo "--- Check executable location ---"
which jupyterhub-singleuser

# The single-user server command is now explicitly called.
# echo "--- Starting single-user server ---"
#srun jupyterhub-singleuser --ip=0.0.0.0 --port={port} --hub-api-url={hub_api_url} --hub-api-token={hub_api_token}

#srun /opt/jupyterhub_envs/shared_conda_env/bin/jupyterhub-singleuser --ip=0.0.0.0 --port={port} --hub-api-url={hub_api_url} --hub-api-token={hub_api_token}

# Was using one below
srun /usr/local/bin/batchspawner-singleuser-wrapper --hub-api-url=http://10.136.9.2:8082/hub/api \
  /opt/jupyterhub_envs/shared_conda_env/bin/jupyterhub-singleuser \
  --ip=0.0.0.0 \
  --port={port} \
  --hub-api-url={hub_api_url} \
  --hub-api-token={hub_api_token}

#srun /opt/jupyterhub/bin/batchspawner-singleuser /opt/jupyterhub_envs/shared_conda_env/bin/jupyterhub-singleuser --ip=0.0.0.0 --port={port} --hub-api-url={hub_api_url} --hub-api-token={hub_api_token}

#srun /usr/local/bin/batchspawner-singleuser --hub-api-url=http://10.136.9.2:8082/hub/api /opt/jupyterhub_envs/shared_conda_env/bin/jupyterhub-singleuser --ip=0.0.0.0 --port={port} --hub-api-url={hub_api_url} --hub-api-token={hub_api_token}

#srun bash -c "/opt/jupyterhub_envs/shared_conda_env/bin/batchspawner-singleuser --hub-api-url=http://10.136.9.2:8082/hub/api /opt/jupyterhub_envs/shared_conda_env/bin/jupyterhub-singleuser --ip=0.0.0.0 --port={port} --hub-api-url={hub_api_url} --hub-api-token={hub_api_token}"

#echo "--- Activating Conda Environment ---"
#bash -c "source /opt/conda/bin/activate /opt/jupyterhub_envs/shared_conda_env && echo '--- PATH After Activation ---' && printenv PATH && echo '--- Starting single-user server ---' && /opt/jupyterhub_envs/shared_conda_env/bin/batchspawner-singleuser --hub-api-url=http://10.136.9.2:8082/hub/api /opt/jupyterhub_envs/shared_conda_env/bin/jupyterhub-singleuser --ip=0.0.0.0 --port={port} --hub-api-url={hub_api_url} --hub-api-token={hub_api_token}"

echo "jupyterhub-singleuser ended gracefully"
"""

#c.BatchSpawner.cmd = ['jupyterhub-singleuser']
#c.BatchSpawner.cmd = None
c.BatchSpawner.cmd = ['/opt/jupyterhub_envs/shared_conda_env/bin/jupyterhub-singleuser']

# Spawner options presented to the user.
c.BatchSpawner.batch_spawner_config = {
    'singleuser': {
        'options': {
            'batch_cores': {
                'name': 'Number of Cores',
                'default': 1,
                'choices': [1, 2]
            },
            'batch_mem': {
                'name': 'Memory (GB)',
                'default': 2,
                'choices': [2]
            },
            'batch_time': {
                'name': 'Time Limit (HH:MM:SS)',
                'default': '01:00:00',
                'choices': ['01:00:00', '02:00:00', '04:00:00']
            },
            'batch_partition': {
                'name': 'SLURM Partition',
                'default': 'jupyter',
                'choices': ['jupyter']
            }
        },
        'options_to_cli': {
            'batch_cores': '--cpus-per-task',
            'batch_mem': '--mem={0}G',
            'batch_time': '--time',
            'batch_partition': '--partition'
        }
    }
}

c.BatchSpawner.batch_input_path = '/usr/bin/sbatch'
c.JupyterHub.internal_url = 'http://127.0.0.1:8001'
c.JupyterHub.extra_service_settings = {
    'batch_spawner_path': {
        'homedir_template': '/home/{username}'
    }
}
c.JupyterHub.log_level = 'DEBUG'
c.BatchSpawner.hub_api_token = 'xxxxxxxxxxxxxxxxxxxxx'

Apologies for all the different srun commands; those are from my various attempts to get out of this doom loop.

Again, apologies for the length of the post, but any help would be greatly appreciated. I can send more of the config file if required.

yours,

hardip


I created a wrapper script on the compute node:

# cat /usr/local/bin/batchspawner-singleuser-wrapper
#!/bin/bash

# Activate the shared Conda environment
source /opt/conda/bin/activate /opt/jupyterhub_envs/shared_conda_env

# Run the actual batchspawner-singleuser script with all passed arguments
exec /opt/jupyterhub_envs/shared_conda_env/bin/batchspawner-singleuser "$@"

In jupyterhub_config.py have:

srun /usr/local/bin/batchspawner-singleuser-wrapper --hub-api-url=http://10.136.9.2:8082/hub/api \
  /opt/jupyterhub_envs/shared_conda_env/bin/jupyterhub-singleuser \
  --ip=0.0.0.0 \
  --port={port} \
  --hub-api-url={hub_api_url} \
  --hub-api-token={hub_api_token}

When I connect to the JupyterHub server in the browser I get:

Server requested
Pending in queue...
Cluster job running... waiting to connect
Spawn failed: Server at http://phy-jhubcomp:37819/user/hs243/api didn't respond in 30 seconds

And log file created in my account shows:

$ cat jupyterhub_slurmspawner_98.log
Traceback (most recent call last):
  File "/usr/local/bin/batchspawner-singleuser", line 10, in <module>
    sys.exit(main())
  File "/opt/jupyterhub_envs/shared_conda_env/lib/python3.9/site-packages/batchspawner/singleuser.py", line 47, in main
    run_path(cmd_path, run_name="__main__")
  File "/opt/jupyterhub_envs/shared_conda_env/lib/python3.9/runpy.py", line 278, in run_path
    importer = get_importer(path_name)
  File "/opt/jupyterhub_envs/shared_conda_env/lib/python3.9/pkgutil.py", line 415, in get_importer
    path_item = os.fsdecode(path_item)
  File "/opt/jupyterhub_envs/shared_conda_env/lib/python3.9/os.py", line 822, in fsdecode
    filename = fspath(filename)  # Does type-checking of `filename`.
TypeError: expected str, bytes or os.PathLike object, not NoneType
srun: error: PHY-JHUBCOMP: task 0: Exited with exit code 1
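I’m guessing at the internals here, but the traceback suggests cmd_path was None when it reached run_path. If the batchspawner-singleuser entry point resolves its first argument with a $PATH lookup like shutil.which (an assumption on my part), then passing the --hub-api-url option as the first argument, as my srun line does, would yield None and reproduce exactly this failure:

```python
# Minimal reproduction of the suspected failure mode: a PATH lookup on an
# option flag returns None, and run_path(None) raises the same TypeError
# seen in the SLURM log.
import runpy
from shutil import which

first_arg = "--hub-api-url=http://10.136.9.2:8082/hub/api"
print(which(first_arg))  # None: an option flag is not an executable on PATH

try:
    runpy.run_path(None)
except TypeError as err:
    print("TypeError:", err)  # matches the traceback's TypeError
```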

Any idea of the way forward on this?