❗ JupyterHub + Slurm + Docker: Stuck on `/spawn-pending`, container is running and healthy

:white_check_mark: Background

We’re deploying a GPU multi-user computing service using JupyterHub + Slurm + Docker, with the following architecture:

  • JupyterHub: 5.2.1, running under systemd, config: /home/admin/jupyterhub_config.py
  • Spawner: custom DockerSlurmSpawner, inherits from SlurmSpawner, reads .jupyter_port for port info
  • Slurm: single-node cluster, GPU partition debug, users submit batch jobs via Slurm
  • Docker: image based on CUDA 12.2 + Ubuntu 22.04 with conda + JupyterLab preinstalled
  • Scripts: host-side start_notebook.sh; container-side start-lab-inner.sh
  • Networking: port 8888 exposed inside the container, mapped to a random host port (e.g. 8824)
  • User flow: all users submit via Slurm, which starts a container and auto-launches JupyterLab inside it

:bug: Problem

  • The container starts successfully
  • JupyterLab is listening on 8888 and logs show correct startup
  • .jupyter_port contains correct value (PORT=8888)
  • But JupyterHub gets stuck on /hub/spawn-pending/USERNAME
  • Custom Spawner explicitly calls self.notify("ready"), but user server never transitions to “ready”
  • curl from host to container port fails (connection refused)

:brick: Key Configurations

:wrench: Custom Spawner (DockerSlurmSpawner)

class DockerSlurmSpawner(SlurmSpawner):
    async def start(self):
        ...
        await super().start()
        ...
        for _ in range(60):
            if os.path.exists(port_file):
                with open(port_file) as f:
                    self.port = int(f.read().strip().split("=")[-1])
                break
        self.ip = "127.0.0.1"
        self.base_url = f"/user/{self.user.name}/"
        await exponential_backoff(self._check_if_running, timeout=60)
        self.ready = True
        self.notify("ready")
        return {
            "ip": self.ip,
            "port": self.port,
        }

    async def _check_if_running(self):
        url = f"http://127.0.0.1:{self.port}/user/{self.user.name}/api"
        ...

:rocket: Host-side Launch Script: start_notebook.sh

#!/bin/bash
USER=$(whoami)
PORT=$(comm -23 <(seq 8800 8899) <(ss -tln | awk 'NR>1{print $4}' | sed 's/.*://') | shuf | head -n 1)
echo "PORT=$PORT" > "/home/$USER/.jupyter_port"

docker run -d --rm \
  --name jupyter-${USER}-${SLURM_JOB_ID:-manual} \
  -e NB_USER="$USER" -e NB_UID="$(id -u)" -e NB_GID="$(id -g)" \
  -e TOKEN="$JUPYTERHUB_API_TOKEN" -e PORT="$PORT" \
  -e JUPYTERHUB_SERVICE_PREFIX="/user/${USER}/" \
  -v "/home/$USER:/home/$USER" \
  -p "$PORT:8888" \
  jupyterlab-gpu:latest

:counterclockwise_arrows_button: Inside-container script: start-lab-inner.sh

#!/bin/bash
...
echo "PORT=$PORT" > "$HOME/.jupyter_port"
...
exec bash -c "jupyter lab --ip=0.0.0.0 --port=$PORT \
  --ServerApp.token="$TOKEN" \
  --ServerApp.allow_root=True \
  --ServerApp.base_url="/user/${NB_USER}/""

:clipboard: Symptoms

  • Container state is healthy:

    docker ps
    # 0.0.0.0:8879->8888/tcp
    
  • JupyterLab logs:

    [I ServerApp] Jupyter Server running at:
    http://127.0.0.1:8888/user/phd22_01/lab?token=...
    
  • .jupyter_port:

    PORT=8888
    
  • curl from host to mapped port fails:

    curl -H "Authorization: token <TOKEN>" http://127.0.0.1:8824/user/phd22_01/api
    # -> Failed to connect
    
  • JupyterHub log:

    [I JupyterHub] User logged in: phd22_01
    [I batchspawner] Job submitted. ID=132
    [W JupyterHub] User phd22_01 is slow to start (timeout=10)
    [I JupyterHub] Redirecting to /hub/spawn-pending/phd22_01
    

:magnifying_glass_tilted_left: What I Tried

  • :white_check_mark: Container starts, logs are correct
  • :white_check_mark: Port and token are consistent
  • :white_check_mark: notify("ready") explicitly called
  • :white_check_mark: Host port mapped correctly
  • :cross_mark: Cannot curl from the host to the container's published port (see the check sketched below)
  • :cross_mark: Stuck on /hub/spawn-pending page forever
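
For reference, the failing connectivity check can be reproduced roughly as follows (container name, user, token, and port are illustrative values taken from the logs in this post; docker port shows the actual published mapping to compare against .jupyter_port):

    # which host port Docker actually published for this container
    docker port jupyter-phd22_01-132
    # e.g. 8888/tcp -> 0.0.0.0:8824

    # what the port file tells the spawner to use
    cat /home/phd22_01/.jupyter_port

    # the same request the Hub's readiness check performs, issued from the Hub's machine
    curl -H "Authorization: token $JUPYTERHUB_API_TOKEN" \
      http://127.0.0.1:8824/user/phd22_01/api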

:folded_hands: Questions

  1. Why does JupyterHub fail to detect the server is ready, even when container is healthy and running?
  2. Is there any misconfiguration in how self.ip, self.port, base_url are passed?
  3. Is notify("ready") enough, or does something else need to be triggered?
  4. Is this a Docker port binding / networking isolation issue?
  5. Should we switch to host networking for testing?

:heart_hands: Thank you!

Any advice, experience, or troubleshooting tips would be greatly appreciated! I will update the post once it’s resolved for others’ reference.

Full Configurations and Logs

1. Custom Spawner: DockerSlurmSpawner.py

import os
import asyncio
import docker
from traitlets import Unicode
from batchspawner import SlurmSpawner
from tornado.httpclient import AsyncHTTPClient, HTTPRequest
from jupyterhub.utils import exponential_backoff


class DockerSlurmSpawner(SlurmSpawner):
    log_file = Unicode(config=True, help="Spawner log file")

    async def start(self):
        self.log_file = f"/home/{self.user.name}/spawner-debug.log"
        with open(self.log_file, "a") as f:
            f.write(f"[START] User: {self.user.name}, starting SlurmSpawner
")

        self.log.info(f"🔁 SlurmSpawner spawning: user={self.user.name}")
        await super().start()

        port_file = f"/home/{self.user.name}/.jupyter_port"
        for _ in range(60):
            if os.path.exists(port_file):
                try:
                    with open(port_file) as f:
                        for line in f:
                            if line.startswith("PORT="):
                                self.port = int(line.strip().split("=")[1])
                                break
                    if self.port:
                        break
                except Exception as e:
                    self.log.error(f"[ERROR] Failed to read port file: {e}")
                    with open(self.log_file, "a") as f:
                        f.write(f"[ERROR] Failed to read port file: {e}
")
            await asyncio.sleep(0.5)

        if not self.port:
            msg = f"[ERROR] Could not read port from {port_file}"
            self.log.error(msg)
            with open(self.log_file, "a") as f:
                f.write(f"[ERROR] {msg}
")
            raise RuntimeError(msg)

        self.ip = "127.0.0.1"
        self.base_url = f"/user/{self.user.name}/"
        self.user.server.port = self.port
        self.user.server.ip = self.ip

        with open(self.log_file, "a") as f:
            f.write(f"[INFO] Set IP={self.ip}, PORT={self.port}, BASE_URL={self.base_url}
")

        await exponential_backoff(
            lambda: self._check_if_running(),
            timeout=60,
            fail_message="[ERROR] Container did not become available in time"
        )

        self.ready = True
        self.notify("ready")
        self.log.info(f"✅ Notebook started: http://{self.ip}:{self.port}{self.base_url}")
        with open(self.log_file, "a") as f:
            f.write(f"[SUCCESS] Notebook available: http://{self.ip}:{self.port}{self.base_url}
")

        return {
            "ip": self.ip,
            "port": self.port,
            "base_url": self.base_url,
            "token": self.api_token,
        }

    def _read_port_from_file(self):
        port_file = f"/home/{self.user.name}/.jupyter_port"
        if os.path.exists(port_file):
            with open(port_file) as f:
                for line in f:
                    if line.startswith("PORT="):
                        return int(line.strip().split("=")[1])
        return None

    async def _check_if_running(self):
        try:
            port = self._read_port_from_file()
            if not port:
                raise RuntimeError("Port not found")

            url = f"http://127.0.0.1:{port}{self.user.server.base_url}api"
            self.log.info(f"🔍 Checking server availability: {url}")
            with open(self.log_file, "a") as f:
                f.write(f"[CHECK] Request: {url}
")
                f.write(f"[CHECK] Token: {self.api_token[:8]}...
")

            client = AsyncHTTPClient()
            req = HTTPRequest(
                url=url,
                headers={"Authorization": f"token {self.api_token}"},
                request_timeout=3,
            )
            resp = await client.fetch(req)

            with open(self.log_file, "a") as f:
                f.write(f"[CHECK OK] Response code: {resp.code}
")
            return True

        except Exception as e:
            msg = f"[CHECK FAIL] Server status check failed: {e}"
            self.log.warning(msg)
            with open(self.log_file, "a") as f:
                f.write(msg + "
")
            return False

    async def poll(self):
        container_name = f"jupyter-{self.user.name}-{self.job_id}"

        def check_container():
            try:
                client = docker.from_env()
                container = client.containers.get(container_name)
                container.reload()
                status = container.status
                with open(self.log_file, "a") as f:
                    f.write(f"[POLL] Container status: {status}
")
                return None if status == "running" else 1
            except docker.errors.NotFound:
                return 1
            except Exception as e:
                msg = f"[POLL ERROR] Poll failed: {e}"
                with open(self.log_file, "a") as f:
                    f.write(msg + "
")
                return 1

        return await asyncio.get_event_loop().run_in_executor(None, check_container)
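
For completeness, a sketch of how the spawner is assumed to be registered in jupyterhub_config.py (the module path and timeout values are assumptions; the actual config file is not reproduced here):

# jupyterhub_config.py -- sketch; the module path below is hypothetical
c.JupyterHub.spawner_class = "dockerslurmspawner.DockerSlurmSpawner"
c.JupyterHub.hub_connect_ip = "172.16.8.73"   # address the single-user servers use to reach the Hub API (from the log summary)
c.Spawner.start_timeout = 300                 # allow for Slurm queueing + container startup
c.Spawner.http_timeout = 120                  # allow for JupyterLab to start answering /api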

2. Container Startup Script: start_notebook.sh

#!/bin/bash
set -e

USER=$(whoami)
IMAGE_REPO=jupyterlab-gpu
IMAGE_TAG=${IMAGE_TAG:-latest}
CONTAINER_NAME=jupyter-${USER}-${SLURM_JOB_ID:-manual}
PORT=$(comm -23 <(seq 8800 8899) <(ss -tln | awk 'NR>1{print $4}' | sed 's/.*://') | shuf | head -n 1)

echo "PORT=$PORT" > "/home/${USER}/.jupyter_port"
chmod 644 "/home/${USER}/.jupyter_port"

# Create home directory if missing
if [ ! -d "/home/$USER" ]; then
  sudo mkdir -p "/home/$USER"
  sudo chown "$(id -u)":"$(id -g)" "/home/$USER"
  sudo chmod 755 "/home/$USER"
fi

docker run -d \
  --name "$CONTAINER_NAME" \
  --gpus all \
  --shm-size=2g \
  -e NB_USER="$USER" \
  -e NB_UID="$(id -u)" \
  -e NB_GID="$(id -g)" \
  -e TOKEN="${JUPYTERHUB_API_TOKEN}" \
  -e PORT="$PORT" \
  -e JUPYTERHUB_API_URL="http://172.16.8.73:8082/hub/api" \
  -e JUPYTERHUB_BASE_URL="/" \
  -e JUPYTERHUB_SERVICE_PREFIX="/user/${USER}/" \
  -v "/home/$USER:/home/$USER" \
  -w "/home/$USER" \
  -p "$PORT:8888" \
  "$IMAGE_REPO:$IMAGE_TAG"

3. JupyterHub Startup Log (Sanitized)

For full trace and debug lines including HTTP requests, proxy bindings, and user login flow, see the attached systemd log (extracted using journalctl -u jupyterhub -b -n 300 -o short-iso).

:memo: Summary:

  • Hub starts cleanly on 0.0.0.0:8088 with internal API 172.16.8.73:8082
  • Proxy is external (127.0.0.1:8001)
  • User phd22_01 logs in and submits a Slurm job
  • Container jupyter-phd22_01-132 starts and becomes healthy on port 8888
  • .jupyter_port contains PORT=8888, but JupyterHub tries to poll 127.0.0.1:<random_port>/user/.../api which fails
  • As a result, the spawn hangs on spawn-pending

:information_source: See the main thread for context, architecture, and config.

Reply:

Typically this means the URL the Hub is using to connect to the single-user server is not correct.

Yes, you likely don’t want to set any of these yourself (self.ip, self.port, self.base_url, self.user.server.*).

But you may want self.ip = '0.0.0.0', which sets the bind IP; it must not be 127.0.0.1, because then the server cannot be reached from outside the container.
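
In config terms that would be, for example (a sketch, not taken from the actual config):

# jupyterhub_config.py
c.Spawner.ip = "0.0.0.0"   # bind IP; 127.0.0.1 would be unreachable from outside the container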

notify("ready") is not a thing, so this will only raise.

I’m not sure where this comes from (perhaps an LLM coding tool?), but this is not what start() is expected to return. It should return a URL as a string: the connect URL for the container, as seen from the hub’s machine. With containers there are many answers to this question (a minimal sketch follows the list below), and it depends on many things, e.g.:

  • the hub is in a container on the same network (use container name and internal port)
  • the hub is outside the container network on the same machine (use localhost and forwarded port)
  • the hub is outside the container network on a different machine (use node hostname and forwarded port)
  • etc.
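
A minimal sketch of the shape start() should have for the second case above (hub on the Docker host, outside the container network). It assumes the port file path from the post and that the file holds the published host port rather than the in-container 8888; retry/polling logic is omitted for brevity:

import os
from batchspawner import SlurmSpawner

class DockerSlurmSpawner(SlurmSpawner):
    async def start(self):
        await super().start()  # submit the Slurm job that launches the container

        # Read the *published* host port (not the in-container 8888) written by start_notebook.sh
        port_file = f"/home/{self.user.name}/.jupyter_port"
        with open(port_file) as f:
            host_port = int(f.read().strip().split("=")[-1])

        # The hub runs on the Docker host, outside the container network,
        # so the connect URL uses localhost and the forwarded port:
        return f"http://127.0.0.1:{host_port}"
        # Hub in a container on the same Docker network: f"http://<container-name>:8888"
        # Hub on a different machine:                     f"http://<node-hostname>:{host_port}"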

The container should launch jupyterhub-singleuser, not jupyter lab. It also shouldn’t specify base_url or token (and probably not ip or port) on the command line.
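
For illustration, a sketch of what start-lab-inner.sh could reduce to, assuming the host-side docker run forwards the Hub-provided JUPYTERHUB_* variables into the container:

#!/bin/bash
# Sketch only. jupyterhub-singleuser picks up JUPYTERHUB_API_TOKEN, JUPYTERHUB_API_URL,
# JUPYTERHUB_SERVICE_PREFIX, JUPYTERHUB_SERVICE_URL, etc. from its environment,
# so no token/base_url/ip/port arguments are needed on the command line.
exec jupyterhub-singleuser

On the host side, start_notebook.sh would then pass those variables through to docker run (e.g. -e JUPYTERHUB_API_TOKEN -e JUPYTERHUB_API_URL -e JUPYTERHUB_SERVICE_PREFIX) instead of the hand-rolled TOKEN and PORT variables.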