❗ JupyterHub + Slurm + Docker: Stuck on `/spawn-pending`, container is running and healthy

:white_check_mark: Background

We’re deploying a GPU multi-user computing service using JupyterHub + Slurm + Docker, with the following architecture:

  • JupyterHub: 5.2.1, running under systemd, config: /home/admin/jupyterhub_config.py
  • Spawner: custom DockerSlurmSpawner, inherits from SlurmSpawner, reads .jupyter_port for port info
  • Slurm: single-node cluster, GPU partition debug, users submit batch jobs via Slurm
  • Docker: image based on CUDA 12.2 + Ubuntu 22.04 with conda + JupyterLab preinstalled
  • Scripts: host-side start_notebook.sh; container-side start-lab-inner.sh
  • Networking: port 8888 exposed inside the container, mapped to a random host port (e.g. 8824)
  • User flow: all users submit via Slurm, which starts a container and auto-launches JupyterLab inside it

:bug: Problem

  • The container starts successfully
  • JupyterLab is listening on 8888 and logs show correct startup
  • .jupyter_port contains correct value (PORT=8888)
  • But JupyterHub gets stuck on /hub/spawn-pending/USERNAME
  • Custom Spawner explicitly calls self.notify("ready"), but user server never transitions to “ready”
  • curl from host to container port fails (connection refused)

:brick: Key Configurations

:wrench: Custom Spawner (DockerSlurmSpawner)

class DockerSlurmSpawner(SlurmSpawner):
    async def start(self):
        ...
        await super().start()
        ...
        for _ in range(60):
            if os.path.exists(port_file):
                with open(port_file) as f:
                    self.port = int(f.read().strip().split("=")[-1])
                break
        self.ip = "127.0.0.1"
        self.base_url = f"/user/{self.user.name}/"
        await exponential_backoff(self._check_if_running, timeout=60)
        self.ready = True
        self.notify("ready")
        return {
            "ip": self.ip,
            "port": self.port,
        }

    async def _check_if_running(self):
        url = f"http://127.0.0.1:{self.port}/user/{self.user.name}/api"
        ...

:rocket: Host-side Launch Script: start_notebook.sh

#!/bin/bash
USER=$(whoami)
PORT=$(comm -23 <(seq 8800 8899) <(ss -tln | awk 'NR>1{print $4}' | sed 's/.*://') | shuf | head -n 1)
echo "PORT=$PORT" > "/home/$USER/.jupyter_port"

docker run -d --rm \
  --name jupyter-${USER}-${SLURM_JOB_ID:-manual} \
  -e NB_USER="$USER" -e NB_UID="$(id -u)" -e NB_GID="$(id -g)" \
  -e TOKEN="$JUPYTERHUB_API_TOKEN" -e PORT="$PORT" \
  -e JUPYTERHUB_SERVICE_PREFIX="/user/${USER}/" \
  -v "/home/$USER:/home/$USER" \
  -p "$PORT:8888" \
  jupyterlab-gpu:latest

:counterclockwise_arrows_button: Inside-container script: start-lab-inner.sh

#!/bin/bash
...
echo "PORT=$PORT" > "$HOME/.jupyter_port"
...
exec bash -c "jupyter lab --ip=0.0.0.0 --port=$PORT \
  --ServerApp.token="$TOKEN" \
  --ServerApp.allow_root=True \
  --ServerApp.base_url="/user/${NB_USER}/""

:clipboard: Symptoms

  • Container state is healthy:

    docker ps
    # 0.0.0.0:8879->8888/tcp
    
  • JupyterLab logs:

    [I ServerApp] Jupyter Server running at:
    http://127.0.0.1:8888/user/phd22_01/lab?token=...
    
  • .jupyter_port:

    PORT=8888
    
  • curl from host to mapped port fails:

    curl -H "Authorization: token <TOKEN>" http://127.0.0.1:8824/user/phd22_01/api
    # -> Failed to connect
    
  • JupyterHub log:

    [I JupyterHub] User logged in: phd22_01
    [I batchspawner] Job submitted. ID=132
    [W JupyterHub] User phd22_01 is slow to start (timeout=10)
    [I JupyterHub] Redirecting to /hub/spawn-pending/phd22_01
    

:magnifying_glass_tilted_left: What I Tried

  • :white_check_mark: Container starts, logs are correct
  • :white_check_mark: Port and token are consistent
  • :white_check_mark: notify("ready") explicitly called
  • :white_check_mark: Host port mapped correctly
  • :cross_mark: Cannot curl from the host to the container's published port (see the check sketched below)
  • :cross_mark: Stuck on /hub/spawn-pending page forever
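
For reference, the failing connectivity check can be reproduced roughly as follows (container name, user, token, and port are illustrative values taken from the logs in this post; docker port shows the actual published mapping to compare against .jupyter_port):

    # which host port Docker actually published for this container
    docker port jupyter-phd22_01-132
    # e.g. 8888/tcp -> 0.0.0.0:8824

    # what the port file tells the spawner to use
    cat /home/phd22_01/.jupyter_port

    # the same request the Hub's readiness check performs, issued from the Hub's machine
    curl -H "Authorization: token $JUPYTERHUB_API_TOKEN" \
      http://127.0.0.1:8824/user/phd22_01/api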

:folded_hands: Questions

  1. Why does JupyterHub fail to detect the server is ready, even when container is healthy and running?
  2. Is there any misconfiguration in how self.ip, self.port, base_url are passed?
  3. Is notify("ready") enough, or does something else need to be triggered?
  4. Is this a Docker port binding / networking isolation issue?
  5. Should we switch to host networking for testing?

:heart_hands: Thank you!

Any advice, experience, or troubleshooting tips would be greatly appreciated! I will update the post once it’s resolved for others’ reference.

Full Configurations and Logs

1. Custom Spawner: DockerSlurmSpawner.py

import os
import asyncio
import docker
from traitlets import Unicode
from batchspawner import SlurmSpawner
from tornado.httpclient import AsyncHTTPClient, HTTPRequest
from jupyterhub.utils import exponential_backoff


class DockerSlurmSpawner(SlurmSpawner):
    log_file = Unicode(config=True, help="Spawner log file")

    async def start(self):
        self.log_file = f"/home/{self.user.name}/spawner-debug.log"
        with open(self.log_file, "a") as f:
            f.write(f"[START] User: {self.user.name}, starting SlurmSpawner
")

        self.log.info(f"🔁 SlurmSpawner spawning: user={self.user.name}")
        await super().start()

        port_file = f"/home/{self.user.name}/.jupyter_port"
        for _ in range(60):
            if os.path.exists(port_file):
                try:
                    with open(port_file) as f:
                        for line in f:
                            if line.startswith("PORT="):
                                self.port = int(line.strip().split("=")[1])
                                break
                    if self.port:
                        break
                except Exception as e:
                    self.log.error(f"[ERROR] Failed to read port file: {e}")
                    with open(self.log_file, "a") as f:
                        f.write(f"[ERROR] Failed to read port file: {e}
")
            await asyncio.sleep(0.5)

        if not self.port:
            msg = f"[ERROR] Could not read port from {port_file}"
            self.log.error(msg)
            with open(self.log_file, "a") as f:
                f.write(f"[ERROR] {msg}
")
            raise RuntimeError(msg)

        self.ip = "127.0.0.1"
        self.base_url = f"/user/{self.user.name}/"
        self.user.server.port = self.port
        self.user.server.ip = self.ip

        with open(self.log_file, "a") as f:
            f.write(f"[INFO] Set IP={self.ip}, PORT={self.port}, BASE_URL={self.base_url}
")

        await exponential_backoff(
            lambda: self._check_if_running(),
            timeout=60,
            fail_message="[ERROR] Container did not become available in time"
        )

        self.ready = True
        self.notify("ready")
        self.log.info(f"✅ Notebook started: http://{self.ip}:{self.port}{self.base_url}")
        with open(self.log_file, "a") as f:
            f.write(f"[SUCCESS] Notebook available: http://{self.ip}:{self.port}{self.base_url}
")

        return {
            "ip": self.ip,
            "port": self.port,
            "base_url": self.base_url,
            "token": self.api_token,
        }

    def _read_port_from_file(self):
        port_file = f"/home/{self.user.name}/.jupyter_port"
        if os.path.exists(port_file):
            with open(port_file) as f:
                for line in f:
                    if line.startswith("PORT="):
                        return int(line.strip().split("=")[1])
        return None

    async def _check_if_running(self):
        try:
            port = self._read_port_from_file()
            if not port:
                raise RuntimeError("Port not found")

            url = f"http://127.0.0.1:{port}{self.user.server.base_url}api"
            self.log.info(f"🔍 Checking server availability: {url}")
            with open(self.log_file, "a") as f:
                f.write(f"[CHECK] Request: {url}
")
                f.write(f"[CHECK] Token: {self.api_token[:8]}...
")

            client = AsyncHTTPClient()
            req = HTTPRequest(
                url=url,
                headers={"Authorization": f"token {self.api_token}"},
                request_timeout=3,
            )
            resp = await client.fetch(req)

            with open(self.log_file, "a") as f:
                f.write(f"[CHECK OK] Response code: {resp.code}
")
            return True

        except Exception as e:
            msg = f"[CHECK FAIL] Server status check failed: {e}"
            self.log.warning(msg)
            with open(self.log_file, "a") as f:
                f.write(msg + "
")
            return False

    async def poll(self):
        container_name = f"jupyter-{self.user.name}-{self.job_id}"

        def check_container():
            try:
                client = docker.from_env()
                container = client.containers.get(container_name)
                container.reload()
                status = container.status
                with open(self.log_file, "a") as f:
                    f.write(f"[POLL] Container status: {status}
")
                return None if status == "running" else 1
            except docker.errors.NotFound:
                return 1
            except Exception as e:
                msg = f"[POLL ERROR] Poll failed: {e}"
                with open(self.log_file, "a") as f:
                    f.write(msg + "
")
                return 1

        return await asyncio.get_event_loop().run_in_executor(None, check_container)
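
For completeness, a sketch of how the spawner is assumed to be registered in jupyterhub_config.py (the module path and timeout values are assumptions; the actual config file is not reproduced here):

# jupyterhub_config.py -- sketch; the module path below is hypothetical
c.JupyterHub.spawner_class = "dockerslurmspawner.DockerSlurmSpawner"
c.JupyterHub.hub_connect_ip = "172.16.8.73"   # address the single-user servers use to reach the Hub API (from the log summary)
c.Spawner.start_timeout = 300                 # allow for Slurm queueing + container startup
c.Spawner.http_timeout = 120                  # allow for JupyterLab to start answering /api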

2. Container Startup Script: start_notebook.sh

#!/bin/bash
set -e

USER=$(whoami)
IMAGE_REPO=jupyterlab-gpu
IMAGE_TAG=${IMAGE_TAG:-latest}
CONTAINER_NAME=jupyter-${USER}-${SLURM_JOB_ID:-manual}
PORT=$(comm -23 <(seq 8800 8899) <(ss -tln | awk 'NR>1{print $4}' | sed 's/.*://') | shuf | head -n 1)

echo "PORT=$PORT" > "/home/${USER}/.jupyter_port"
chmod 644 "/home/${USER}/.jupyter_port"

# Create home directory if missing
if [ ! -d "/home/$USER" ]; then
  sudo mkdir -p "/home/$USER"
  sudo chown "$(id -u)":"$(id -g)" "/home/$USER"
  sudo chmod 755 "/home/$USER"
fi

docker run -d \
  --name "$CONTAINER_NAME" \
  --gpus all \
  --shm-size=2g \
  -e NB_USER="$USER" \
  -e NB_UID="$(id -u)" \
  -e NB_GID="$(id -g)" \
  -e TOKEN="${JUPYTERHUB_API_TOKEN}" \
  -e PORT="$PORT" \
  -e JUPYTERHUB_API_URL="http://172.16.8.73:8082/hub/api" \
  -e JUPYTERHUB_BASE_URL="/" \
  -e JUPYTERHUB_SERVICE_PREFIX="/user/${USER}/" \
  -v "/home/$USER:/home/$USER" \
  -w "/home/$USER" \
  -p "$PORT:8888" \
  "$IMAGE_REPO:$IMAGE_TAG"

3. JupyterHub Startup Log (Sanitized)

For full trace and debug lines including HTTP requests, proxy bindings, and user login flow, see the attached systemd log (extracted using journalctl -u jupyterhub -b -n 300 -o short-iso).

:memo: Summary:

  • Hub starts cleanly on 0.0.0.0:8088 with internal API 172.16.8.73:8082
  • Proxy is external (127.0.0.1:8001)
  • User phd22_01 logs in and submits a Slurm job
  • Container jupyter-phd22_01-132 starts and becomes healthy on port 8888
  • .jupyter_port contains PORT=8888, but JupyterHub tries to poll 127.0.0.1:<random_port>/user/.../api which fails
  • As a result, the spawn hangs on spawn-pending

:information_source: See the main thread for context, architecture, and config.

Reply:

Typically this means the URL the Hub is using to connect to the single-user server is not correct.

Yes, you likely don’t want to set any of these yourself (self.ip, self.port, self.base_url, self.user.server.*).

But you may want self.ip = '0.0.0.0', which sets the bind IP; it must not be 127.0.0.1, because then the server cannot be reached from outside the container.
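
In config terms that would be, for example (a sketch, not taken from the actual config):

# jupyterhub_config.py
c.Spawner.ip = "0.0.0.0"   # bind IP; 127.0.0.1 would be unreachable from outside the container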

notify("ready") is not a thing, so this will only raise.

I’m not sure where this comes from (perhaps an LLM coding tool?), but this is not what start() is expected to return. It should return a URL as a string: the connect URL for the container, as seen from the hub’s machine. With containers there are many answers to this question (a minimal sketch follows the list below), and it depends on many things, e.g.:

  • the hub is in a container on the same network (use container name and internal port)
  • the hub is outside the container network on the same machine (use localhost and forwarded port)
  • the hub is outside the container network on a different machine (use node hostname and forwarded port)
  • etc.
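
A minimal sketch of the shape start() should have for the second case above (hub on the Docker host, outside the container network). It assumes the port file path from the post and that the file holds the published host port rather than the in-container 8888; retry/polling logic is omitted for brevity:

import os
from batchspawner import SlurmSpawner

class DockerSlurmSpawner(SlurmSpawner):
    async def start(self):
        await super().start()  # submit the Slurm job that launches the container

        # Read the *published* host port (not the in-container 8888) written by start_notebook.sh
        port_file = f"/home/{self.user.name}/.jupyter_port"
        with open(port_file) as f:
            host_port = int(f.read().strip().split("=")[-1])

        # The hub runs on the Docker host, outside the container network,
        # so the connect URL uses localhost and the forwarded port:
        return f"http://127.0.0.1:{host_port}"
        # Hub in a container on the same Docker network: f"http://<container-name>:8888"
        # Hub on a different machine:                     f"http://<node-hostname>:{host_port}"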

The container should launch jupyterhub-singleuser, not jupyter lab. It also shouldn’t specify base_url or token (and probably not ip or port) on the command line.
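
For illustration, a sketch of what start-lab-inner.sh could reduce to, assuming the host-side docker run forwards the Hub-provided JUPYTERHUB_* variables into the container:

#!/bin/bash
# Sketch only. jupyterhub-singleuser picks up JUPYTERHUB_API_TOKEN, JUPYTERHUB_API_URL,
# JUPYTERHUB_SERVICE_PREFIX, JUPYTERHUB_SERVICE_URL, etc. from its environment,
# so no token/base_url/ip/port arguments are needed on the command line.
exec jupyterhub-singleuser

On the host side, start_notebook.sh would then pass those variables through to docker run (e.g. -e JUPYTERHUB_API_TOKEN -e JUPYTERHUB_API_URL -e JUPYTERHUB_SERVICE_PREFIX) instead of the hand-rolled TOKEN and PORT variables.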