[I JupyterHub] User logged in: phd22_01
[I batchspawner] Job submitted. ID=132
[W JupyterHub] User phd22_01 is slow to start (timeout=10)
[I JupyterHub] Redirecting to /hub/spawn-pending/phd22_01
What I Tried
Container starts, logs are correct
Port and token are consistent
notify("ready") explicitly called
Host port mapped correctly
Cannot curl from host to container
Stuck on /hub/spawn-pending page forever
Questions
Why does JupyterHub fail to detect that the server is ready, even when the container is healthy and running?
Is there any misconfiguration in how self.ip, self.port, base_url are passed?
Is notify("ready") enough, or does something else need to be triggered?
Is this a Docker port binding / networking isolation issue?
Should we switch to host networking for testing?
Thank you!
Any advice, experience, or troubleshooting tips would be greatly appreciated! I will update the post once it’s resolved for others’ reference.
import os
import asyncio
import docker
from traitlets import Unicode
from batchspawner import SlurmSpawner
from tornado.httpclient import AsyncHTTPClient, HTTPRequest
from jupyterhub.utils import exponential_backoff


class DockerSlurmSpawner(SlurmSpawner):
    log_file = Unicode(config=True, help="Spawner log file")

    async def start(self):
        self.log_file = f"/home/{self.user.name}/spawner-debug.log"
        with open(self.log_file, "a") as f:
            f.write(f"[START] User: {self.user.name}, starting SlurmSpawner\n")
        self.log.info(f"🔁 SlurmSpawner spawning: user={self.user.name}")
        await super().start()
        port_file = f"/home/{self.user.name}/.jupyter_port"
        for _ in range(60):
            if os.path.exists(port_file):
                try:
                    with open(port_file) as f:
                        for line in f:
                            if line.startswith("PORT="):
                                self.port = int(line.strip().split("=")[1])
                                break
                    if self.port:
                        break
                except Exception as e:
                    self.log.error(f"[ERROR] Failed to read port file: {e}")
                    with open(self.log_file, "a") as f:
                        f.write(f"[ERROR] Failed to read port file: {e}\n")
            await asyncio.sleep(0.5)
        if not self.port:
            msg = f"[ERROR] Could not read port from {port_file}"
            self.log.error(msg)
            with open(self.log_file, "a") as f:
                f.write(f"[ERROR] {msg}\n")
            raise RuntimeError(msg)
        self.ip = "127.0.0.1"
        self.base_url = f"/user/{self.user.name}/"
        self.user.server.port = self.port
        self.user.server.ip = self.ip
        with open(self.log_file, "a") as f:
            f.write(f"[INFO] Set IP={self.ip}, PORT={self.port}, BASE_URL={self.base_url}\n")
        await exponential_backoff(
            lambda: self._check_if_running(),
            timeout=60,
            fail_message="[ERROR] Container did not become available in time"
        )
        self.ready = True
        self.notify("ready")
        self.log.info(f"✅ Notebook started: http://{self.ip}:{self.port}{self.base_url}")
        with open(self.log_file, "a") as f:
            f.write(f"[SUCCESS] Notebook available: http://{self.ip}:{self.port}{self.base_url}\n")
        return {
            "ip": self.ip,
            "port": self.port,
            "base_url": self.base_url,
            "token": self.api_token,
        }

    def _read_port_from_file(self):
        port_file = f"/home/{self.user.name}/.jupyter_port"
        if os.path.exists(port_file):
            with open(port_file) as f:
                for line in f:
                    if line.startswith("PORT="):
                        return int(line.strip().split("=")[1])
        return None

    async def _check_if_running(self):
        try:
            port = self._read_port_from_file()
            if not port:
                raise RuntimeError("Port not found")
            url = f"http://127.0.0.1:{port}{self.user.server.base_url}api"
            self.log.info(f"🔍 Checking server availability: {url}")
            with open(self.log_file, "a") as f:
                f.write(f"[CHECK] Request: {url}\n")
                f.write(f"[CHECK] Token: {self.api_token[:8]}...\n")
            client = AsyncHTTPClient()
            req = HTTPRequest(
                url=url,
                headers={"Authorization": f"token {self.api_token}"},
                request_timeout=3,
            )
            resp = await client.fetch(req)
            with open(self.log_file, "a") as f:
                f.write(f"[CHECK OK] Response code: {resp.code}\n")
            return True
        except Exception as e:
            msg = f"[CHECK FAIL] Server status check failed: {e}"
            self.log.warning(msg)
            with open(self.log_file, "a") as f:
                f.write(msg + "\n")
            return False

    async def poll(self):
        container_name = f"jupyter-{self.user.name}-{self.job_id}"

        def check_container():
            try:
                client = docker.from_env()
                container = client.containers.get(container_name)
                container.reload()
                status = container.status
                with open(self.log_file, "a") as f:
                    f.write(f"[POLL] Container status: {status}\n")
                return None if status == "running" else 1
            except docker.errors.NotFound:
                return 1
            except Exception as e:
                msg = f"[POLL ERROR] Poll failed: {e}"
                with open(self.log_file, "a") as f:
                    f.write(msg + "\n")
                return 1

        return await asyncio.get_event_loop().run_in_executor(None, check_container)
2. Container Startup Script: start_notebook.sh
#!/bin/bash
set -e
USER=$(whoami)
IMAGE_REPO=jupyterlab-gpu
IMAGE_TAG=${IMAGE_TAG:-latest}
CONTAINER_NAME=jupyter-${USER}-${SLURM_JOB_ID:-manual}
PORT=$(comm -23 <(seq 8800 8899) <(ss -tln | awk 'NR>1{print $4}' | sed 's/.*://') | shuf | head -n 1)
echo "PORT=$PORT" > "/home/${USER}/.jupyter_port"
chmod 644 "/home/${USER}/.jupyter_port"
# Create home directory if missing
if [ ! -d "/home/$USER" ]; then
  sudo mkdir -p "/home/$USER"
  sudo chown "$(id -u)":"$(id -g)" "/home/$USER"
  sudo chmod 755 "/home/$USER"
fi
docker run -d \
  --name "$CONTAINER_NAME" \
  --gpus all \
  --shm-size=2g \
  -e NB_USER="$USER" \
  -e NB_UID="$(id -u)" \
  -e NB_GID="$(id -g)" \
  -e TOKEN="${JUPYTERHUB_API_TOKEN}" \
  -e PORT="$PORT" \
  -e JUPYTERHUB_API_URL="http://172.16.8.73:8082/hub/api" \
  -e JUPYTERHUB_BASE_URL="/" \
  -e JUPYTERHUB_SERVICE_PREFIX="/user/${USER}/" \
  -v "/home/$USER:/home/$USER" \
  -w "/home/$USER" \
  -p "$PORT:8888" \
  "$IMAGE_REPO:$IMAGE_TAG"
3. JupyterHub Startup Log (Sanitized)
For full trace and debug lines including HTTP requests, proxy bindings, and user login flow, see the attached systemd log (extracted using journalctl -u jupyterhub -b -n 300 -o short-iso).
Summary:
Hub starts cleanly on 0.0.0.0:8088 with internal API 172.16.8.73:8082
Proxy is external (127.0.0.1:8001)
User phd22_01 logs in and submits a Slurm job
Container jupyter-phd22_01-132 starts and becomes healthy on port 8888
.jupyter_port contains PORT=8888, but JupyterHub tries to poll 127.0.0.1:<random_port>/user/.../api which fails
As a result, the spawn hangs on spawn-pending
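One way to narrow down the mismatch between the port file and the port the hub polls is to probe the published port directly from the hub host, independently of JupyterHub. The following is only a sketch: the helper names read_port and port_is_open are hypothetical, and it assumes the port file uses the PORT=<number> format written by start_notebook.sh.

```python
import socket


def read_port(path):
    """Return the integer after 'PORT=' in the file, or None if absent."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("PORT="):
                    return int(line.strip().split("=", 1)[1])
    except FileNotFoundError:
        pass
    return None


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If port_is_open("127.0.0.1", read_port("/home/phd22_01/.jupyter_port")) is False on the hub host while the container reports healthy, the problem is likely in the port mapping or bind address rather than in the spawner's readiness logic.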
See the main thread for context, architecture, and config.
But you may want self.ip = '0.0.0.0', which sets the bind IP. It mustn’t be 127.0.0.1, because a server bound to 127.0.0.1 can’t be connected to from outside the container.
notify("ready") is not a thing, so this will only raise.
I’m not sure where this comes from (perhaps an LLM coding tool?), but this is not what start is expected to return. It should return a URL as a string. The URL should be the connect URL for the container, as seen from the hub’s machine. With containers, there are many answers to this question, and it depends on many things, e.g.:
the hub is in a container on the same network (use container name and internal port)
the hub is outside the container network on the same machine (use localhost and forwarded port)
the hub is outside the container network on a different machine (use node hostname and forwarded port)
etc.
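The three cases above could be sketched roughly like this. The function name connect_url, the scenario labels, and all parameter names are made up for illustration; the string it produces is the kind of value start would return instead of the dict.

```python
def connect_url(hub_location, *, container_name="", node_hostname="",
                internal_port=8888, forwarded_port=0):
    """Build the URL the hub should use to reach the single-user server.

    hub_location is a hypothetical label for where the hub runs relative
    to the container:
      "same-network": hub is a container on the same Docker network
      "same-host":    hub is outside the container network, same machine
      "other-host":   hub is outside the container network, other machine
    """
    if hub_location == "same-network":
        # Container name resolves on the shared network; use the internal port.
        return f"http://{container_name}:{internal_port}"
    if hub_location == "same-host":
        # Reach the container through the port published on localhost.
        return f"http://127.0.0.1:{forwarded_port}"
    if hub_location == "other-host":
        # Reach the container through the port published on the compute node.
        return f"http://{node_hostname}:{forwarded_port}"
    raise ValueError(f"unknown hub_location: {hub_location!r}")
```

For example, connect_url("same-host", forwarded_port=8842) gives "http://127.0.0.1:8842". With Slurm, the job usually lands on a different node than the hub, so the "other-host" case is probably the one that applies here.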
The container should launch jupyterhub-singleuser, not jupyter lab. It also shouldn’t specify base_url or token, or probably ip or port, on the command line.
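As a rough sketch of what that could look like, the snippet below builds a docker run command that forwards every JUPYTERHUB_* variable from the job's environment into the container and runs jupyterhub-singleuser with no extra flags. It assumes those variables are already present in the Slurm job environment; the function name singleuser_docker_argv is hypothetical, and port publishing and volume options are deliberately left out.

```python
def singleuser_docker_argv(environ, image):
    """Return a `docker run` argv that passes JUPYTERHUB_* variables through.

    `-e NAME` without a value tells docker to copy NAME from the calling
    environment, so token and URLs never appear on the command line.
    """
    env_args = []
    for name in sorted(environ):
        if name.startswith("JUPYTERHUB_"):
            env_args += ["-e", name]
    # No --ip, --port, base_url, or token arguments: jupyterhub-singleuser
    # reads its settings from the JUPYTERHUB_* environment variables.
    return ["docker", "run", "-d", *env_args, image, "jupyterhub-singleuser"]
```

Calling singleuser_docker_argv(os.environ, "jupyterlab-gpu:latest") inside the job would produce the argument list to hand to subprocess.run; the image must have jupyterhub installed for the command to exist.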