SwarmSpawner spawns a GPU-enabled image but the container does not see the GPUs


I am trying to figure out how to correctly configure my SwarmSpawner so that it can spawn a GPU-enabled single-user image.

I have a swarm of GPU-capable worker nodes. Everything works fine, but when the GPU-enabled image is spawned, it does not see the GPUs and nvidia-smi is not available.
When I run the Docker container manually on the node, it works: I can see the GPU and nvidia-smi is available. I believe the problem comes from the configuration of DockerSpawner or SwarmSpawner.

The image I use comes from the iot-salzburg/gpu-jupyter repository on GitHub (GPU-Jupyter: leverage the flexibility of JupyterLab through the power of your NVIDIA GPU to run Tensorflow and PyTorch code in collaborative notebooks).
I pull the image cschranz/gpu-jupyter:v1.6_cuda-12.0_ubuntu-22.04.
If I run this command on a worker node manually:

docker run --gpus all -d -it -p 80:8888 -v $(pwd)/data:/home/jovyan/work -e GRANT_SUDO=yes -e JUPYTER_ENABLE_LAB=yes --user root cschranz/gpu-jupyter:v1.6_cuda-12.0_ubuntu-22.04

I can access the Jupyter interface and nvidia-smi command exists and has the correct output.

What am I missing?

Here are the docker-compose.yaml and jupyterhub_config.py files:

version: "3.9"

services:
  jupyterhub:
    hostname: jupyterhub
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: on-failure
        delay: 30s
        window: 300s
      # Ensure that we execute on a Swarm manager
      placement:
        constraints:
          - node.role == manager
    networks:
      - jupyterhub-net
    ports:
      - "8080:8000"
    environment:
      DOCKER_NETWORK_NAME: jupyterhub_network
    configs:
      - source: jupyter-conf
        target: /srv/jupyterhub/jupyterhub_config.py
        mode: 0500
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /etc/hostname:/etc/hostname

configs:
  jupyter-conf:
    external: true
    name: jupyterhub_config.py

networks:
  jupyterhub-net:
    driver: overlay
    attachable: true
    name: jupyterhub_network

and the config file:

import docker
import os
import shlex
from shutil import chown
import socket

from dockerspawner import SwarmSpawner
from docker.types import Mount
from ldap3 import Server, Connection, ALL

worker_hostname = ""

class CustomSwarmSpawner(SwarmSpawner):
    """Custom SwarmSpawner class"""

    worker_hostname = ""

    def get_worker_hostname(self):
        return self.worker_hostname

    @property
    def mounts(self):
        """Define mounts for the container."""
        if len(self.volume_binds):
            driver = self.mount_driver_config
            return [
                Mount(
                    target=vol["bind"],
                    source=host_loc,
                    type="bind",
                    read_only=vol["mode"] == "ro",
                    driver_config=driver,
                )
                for host_loc, vol in self.volume_binds.items()
            ]
        return []

    def _options_form_default(self):
        cpu_options = [1, 10, 20, 30, 40, 48]  # Possible values for CPUs
        memory_options = [1, 2, 4, 8, 16, 32, 64]  # Possible values for memory in GB
        form_template = """
        <div class="form-group">
          <label for="args">Choose the server type</label>
          <select id="args" class="form-control" name="args">
            <option value="lab">JupyterLab</option>
            <option value="notebook">Classic Notebook</option>
          </select>
        </div>
        <div class="form-group">
            <label for="stack">Choose an image</label>
            <select id="stacks" class="form-control" name="stack">
                <option value="cschranz/gpu-jupyter:v1.6_cuda-12.0_ubuntu-22.04">Tensorflow/Pytorch GPU</option>
                <option value="jupyter/minimal-notebook:x86_64-python-3.11.6">Minimal</option>
                <option value="quay.io/jupyter/datascience-notebook">Datascience</option>
                <option value="quay.io/jupyter/r-notebook">R</option>
            </select>
        </div>
        <div class="form-group">
            <label for="hostname">Choose the machine on which to launch the image</label>
            <select id="hostname" class="form-control" name="hostname">
                <option value="host1">host1 - GPU: RTX A5000 / CPU: 40</option>
                <option value="host2">host2 - GPU: RTX A5000 / CPU: 40</option>
                <option value="host3">host3 - GPU: RTX A5000 / CPU: 40</option>
                <option value="host4">host4 - GPU: RTX 3090 / CPU: 48</option>
                <option value="host5">host5 - GPU: RTX 2080 / CPU: 48</option>
                <option value="host6">host6 - GPU: RTX 2080 Super / CPU: 40</option>
            </select>
        </div>
        <label for="cpu_limit">Number of CPUs:</label>
        <select class="form-control" name="cpu_limit" id="cpu_limit">
            {cpu_options}
        </select>
        <label for="memory_limit">RAM (GiB):</label>
        <select class="form-control" name="memory_limit" id="memory_limit">
            {memory_options}
        </select>
        """
        # Populate the CPU options
        cpu_select = "\n".join([f"<option>{cpu}</option>" for cpu in cpu_options])
        # Populate the memory options
        memory_select = "\n".join([f"<option>{memory}</option>" for memory in memory_options])

        # Render the final form
        options_form = form_template.format(cpu_options=cpu_select, memory_options=memory_select)
        return options_form

    def options_from_form(self, formdata):
        """Override to parse the form submission"""
        options = {}
        # Extract the selected Docker image, node, and whether to launch
        # JupyterLab or the classic notebook
        options['stack'] = formdata.get('stack', [''])[0].strip()
        options['hostname'] = formdata.get('hostname', [''])[0].strip()
        options['mem_limit'] = formdata.get('memory_limit', [''])[0].strip() + "G"
        options['cpu_limit'] = int(formdata.get('cpu_limit', [''])[0].strip())
        lab_or_notebook = formdata.get('args', [''])[0].strip()
        if lab_or_notebook == "lab":
            self.default_url = '/lab'
        else:
            self.default_url = '/tree'
        self.log.info(f"Selected image: {options['stack']} on machine: {options['hostname']}")
        self.image = options['stack']
        self.extra_placement_spec = {'constraints': ['node.hostname==' + options['hostname']]}
        self.worker_hostname = options['hostname']
        ## TODO: check availability of resources
        self.mem_limit = options['mem_limit']
        self.cpu_limit = options['cpu_limit']
        return options

def pre_spawn_custom_hook(spawner):
    """This function is run before the single-user image is launched.

    * Retrieve the user's uid and gid from the LDAP server
    * Mount the local user home directory into /home/jovyan
    """
    server = Server("", get_info=ALL)
    conn = Connection(server, auto_bind=True)
    conn.search(
        'ou=Users,dc=XX,dc=XX,dc=XX',
        f'(&(objectclass=posixAccount)(uid={spawner.user.name}))',
        attributes=['cn', 'givenName', 'uidNumber', 'gidNumber'],
    )

    spawner.environment = {
        'GRANT_SUDO': 'yes',
        'NB_USER': spawner.user.name,
        'NB_UID': conn.entries[0].uidNumber.value,
        'NB_GID': conn.entries[0].gidNumber.value,
    }

    mounts = [
        Mount(
            type='bind',
            source=os.path.join('/home', spawner.get_worker_hostname(), spawner.user.name),
            target=f'/home/{spawner.user.name}',
            read_only=False,
        )
    ]

    spawner.extra_container_spec = {
        'mounts': mounts,
        'hostname': spawner.get_worker_hostname(),
        'user': '0',
    }

    spawner.notebook_dir = f'/home/{spawner.user.name}'

# JupyterHub Configuration
c = get_config()

# JupyterHub settings
c.JupyterHub.port = 8000
c.JupyterHub.hub_ip = 'jupyterhub'
c.JupyterHub.hub_connect_ip = 'jupyterhub'
c.JupyterHub.hub_port = 8081
c.JupyterHub.cleanup_servers = False
c.JupyterHub.allow_named_servers = True
c.JupyterHub.admin_access = True
c.JupyterHub.spawner_class = CustomSwarmSpawner

# Authenticator settings
c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'
c.Authenticator.admin_users = {'cretin'}
c.LDAPAuthenticator.server_port = 389
c.LDAPAuthenticator.server_address = 'X.X.X.X'
c.LDAPAuthenticator.lookup_dn = True
c.LDAPAuthenticator.use_ssl = False
c.LDAPAuthenticator.user_search_base = 'ou=Users,dc=XX,dc=XX,dc=XX'
c.LDAPAuthenticator.user_attribute = 'uid'
c.LDAPAuthenticator.lookup_dn_user_dn_attribute = 'uid'

# DockerSpawner settings
c.SwarmSpawner.pull_policy = 'ifnotpresent'
c.SwarmSpawner.remove_containers = True
c.SwarmSpawner.http_timeout = 300
c.SwarmSpawner.start_timeout = 300
c.SwarmSpawner.pre_spawn_hook = pre_spawn_custom_hook

# Logs
c.SwarmSpawner.debug = True
c.DockerSpawner.debug = True

# Network settings
network_name = os.environ['DOCKER_NETWORK_NAME']
c.SwarmSpawner.network_name = network_name
c.SwarmSpawner.extra_host_config = {'network_mode': network_name}

# Proxy settings
c.ConfigurableHTTPProxy.should_start = True

# Device requests for GPU
c.SwarmSpawner.extra_host_config = {
    "device_requests": [
        docker.types.DeviceRequest(options={"--gpus": "all"})
    ],
}

# Monitoring settings
c.ResourceUseDisplay.track_cpu_percent = True
c.JupyterHub.authenticate_prometheus = False

# Timeout settings
c.Spawner.start_timeout = 90
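A note on the device_requests attempt in the config above: for a plain DockerSpawner container, `docker run --gpus all` corresponds to a device request in the low-level Docker Engine API (the structure `docker.types.DeviceRequest` builds). The sketch below shows that raw payload as a plain dict, so it can be inspected without a Docker daemon; the field names (`Driver`, `Count`, `Capabilities`) come from the Engine API's `DeviceRequest` object. Keep in mind that SwarmSpawner launches swarm services rather than plain containers, so host_config settings like this may simply be ignored, which would explain the behaviour described in the question.

```python
# Raw Docker Engine API form of `docker run --gpus all`, shown as a plain
# dict (no Docker daemon or docker-py install needed to inspect it):
gpu_device_request = {
    "Driver": "nvidia",
    "Count": -1,                 # -1 means "all available GPUs"
    "Capabilities": [["gpu"]],   # a list of AND-ed capability lists
}

# This is the shape that would go into a container's HostConfig:
host_config = {"device_requests": [gpu_device_request]}
print(host_config["device_requests"][0]["Count"])  # -> -1
```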

I think this is related to the NVIDIA driver capabilities; it might be resolved by providing an environment variable with compute,utility as its value.

Thanks a lot for your answer, and sorry for my delayed one!

It took me some time, but I was finally able to make it work after much trial and error.

If anyone is interested, here is what I changed:

SwarmSpawner.environment = {
    'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility',
    'NVIDIA_VISIBLE_DEVICES': 'all',
    'GRANT_SUDO': 'yes',
}

On the worker nodes, install and configure the NVIDIA Container Toolkit (nvidia-container-toolkit). At the end, run:

nvidia-ctk runtime configure --runtime=docker

This changes the file /etc/docker/daemon.json.

Also add "default-runtime": "nvidia" to the daemon.json file.

Here is what my /etc/docker/daemon.json looks like at the end:

{
    "live-restore": false,
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
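The manual edit above (adding "default-runtime" after nvidia-ctk has written the "runtimes" section) can be sketched in a few lines of Python. `set_default_runtime` is a hypothetical helper written for illustration, not part of the NVIDIA toolkit; it assumes a daemon.json dict shaped like the one shown above.

```python
import json

# Hypothetical helper: set "default-runtime" in a daemon.json structure like
# the one written by `nvidia-ctk runtime configure --runtime=docker`.
def set_default_runtime(daemon_config: dict, runtime: str = "nvidia") -> dict:
    updated = dict(daemon_config)
    # Refuse to point at a runtime that the config does not declare:
    if runtime not in updated.get("runtimes", {}):
        raise ValueError(f"runtime {runtime!r} is not declared in 'runtimes'")
    updated["default-runtime"] = runtime
    return updated

# Shape produced by nvidia-ctk before the manual edit:
cfg = {"runtimes": {"nvidia": {"args": [], "path": "nvidia-container-runtime"}}}
print(json.dumps(set_default_runtime(cfg), indent=4))
```

After writing the change back to /etc/docker/daemon.json, the Docker daemon still has to be restarted for the new default runtime to take effect.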

If you’re specifying GPU capabilities when spawning, keep an eye on your Docker version: see moby/moby issue #44378 on GitHub ("docker service create doesn't work when network and generic-resource are both attached").

Yes, all servers are at v25.0.3, thanks!