Jupyterhub with docker + nvidia GPU

We cannot access a GPU in our docker container spawned with jupyterhub…

1.) The situation:

  • We are using a headless Ubuntu 18.04 LTS server. It has the following software installed:

  • miniconda 4.7.10 on a per-user basis (a separate installation under each user's home directory, so that each user has their own environments and so forth). It may be important to mention that I have installed different conda environments on the host for the user who spawns the container. These environments are accessible within the Jupyter notebook through different installed kernels, which one can select when running the notebook (a sketch of such a kernel registration follows after the Dockerfile below). For our application we use a kernel/environment other than “base”.

  • We have installed jupyterhub version 1.0.0

  • the following NVIDIA packages are installed on the headless server:

    libnvidia-cfg1-430:amd64 430.40
    libnvidia-compute-430:amd64 430.40
    libnvidia-container-tools 1.0.5-1
    libnvidia-container1:amd64 1.0.5-1
    nvidia-compute-utils-430 430.40
    nvidia-container-toolkit 1.0.5-1
    nvidia-dkms-430 430.40
    nvidia-headless-418:amd64 430.40
    nvidia-headless-430 430.40
    nvidia-headless-no-dkms-430 430.40
    nvidia-kernel-common-430 430.40
    nvidia-kernel-source-430 430.40
    nvidia-utils-418:amd64 430.40
    nvidia-utils-430 430.40

  • Docker version 19.03.2, build 6a30dfc:

    dpkg -l | grep docker
    docker-ce 5:19.03.2~3-0~ubuntu-bionic
    docker-ce-cli 5:19.03.2~3-0~ubuntu-bionic

  • the SystemUserSpawner, as we want the users that are logged into JupyterHub to have access to their home directories

  • We have 4 Tesla GPUs that we want to make available inside 4 Docker containers on the JupyterHub. This means at most 4 containers run with a GPU enabled (one of the Teslas each) and the other containers have no GPU. Here is the info for the GPUs:

      nvidia-smi
      Tue Sep 17 09:04:50 2019
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
      |-------------------------------+----------------------+----------------------+
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |===============================+======================+======================|
      |   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
      | N/A   32C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
      +-------------------------------+----------------------+----------------------+
      |   1  Tesla P100-PCIE...  Off  | 00000000:82:00.0 Off |                    0 |
      | N/A   33C    P0    31W / 250W |      0MiB / 16280MiB |      0%      Default |
      +-------------------------------+----------------------+----------------------+
      |   2  Tesla P100-PCIE...  Off  | 00000000:85:00.0 Off |                    0 |
      | N/A   36C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
      +-------------------------------+----------------------+----------------------+
      |   3  Tesla P100-PCIE...  Off  | 00000000:86:00.0 Off |                    0 |
      | N/A   34C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
      +-------------------------------+----------------------+----------------------+
    
  • Our Dockerfile:

FROM jupyter/tensorflow-notebook

USER root

RUN apt-get update && apt-get install -y --no-install-recommends \
        gnupg2 curl ca-certificates && \
    curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - && \
    echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
    echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list && \
    apt-get purge --autoremove -y curl && \
    rm -rf /var/lib/apt/lists/*

ENV CUDA_VERSION 10.1.243
ENV CUDA_PKG_VERSION 10-1=$CUDA_VERSION-1

# For libraries in the cuda-compat-* package: https://docs.nvidia.com/cuda/eula/index.html#attachment-a
RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-cudart-$CUDA_PKG_VERSION \
        cuda-compat-10-1 && \
    ln -s cuda-10.1 /usr/local/cuda && \
    rm -rf /var/lib/apt/lists/*

# Required for nvidia-docker v1
RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
    echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf

ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64

# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV NVIDIA_REQUIRE_CUDA "cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411"

ENV NCCL_VERSION 2.4.8

RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-libraries-$CUDA_PKG_VERSION \
        cuda-nvtx-$CUDA_PKG_VERSION \
        libnccl2=$NCCL_VERSION-1+cuda10.1 && \
    apt-mark hold libnccl2 && \
    rm -rf /var/lib/apt/lists/*

RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-nvml-dev-$CUDA_PKG_VERSION \
        cuda-command-line-tools-$CUDA_PKG_VERSION \
        cuda-libraries-dev-$CUDA_PKG_VERSION \
        cuda-minimal-build-$CUDA_PKG_VERSION \
        libnccl-dev=$NCCL_VERSION-1+cuda10.1 && \
    rm -rf /var/lib/apt/lists/*

ENV LIBRARY_PATH /usr/local/cuda/lib64/stubs

ENV CUDNN_VERSION 7.6.3.30
LABEL com.nvidia.cudnn.version="${CUDNN_VERSION}"

RUN apt-get update && apt-get install -y --no-install-recommends \
        libcudnn7=$CUDNN_VERSION-1+cuda10.1 \
        libcudnn7-dev=$CUDNN_VERSION-1+cuda10.1 && \
    apt-mark hold libcudnn7 && \
    rm -rf /var/lib/apt/lists/*

# Switch back to jovyan to avoid accidental container runs as root
USER $NB_UID
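
As mentioned in the setup list above, the per-user conda environments are exposed as selectable kernels in the notebook. A sketch of how such a kernel registration could look, run with the Python interpreter of the environment that should be exposed (the names tf-gpu and “Python (tf-gpu)” are only placeholders):

# Register the currently running Python (i.e. the conda environment it belongs
# to) as a Jupyter kernel for the invoking user; kernel name and display name
# below are placeholders.
from ipykernel.kernelspec import install

install(user=True, kernel_name="tf-gpu", display_name="Python (tf-gpu)")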

2.) My question:

Later, we want to be able to provide one GPU to each of up to 4 containers. At this time I have set up a container that is spawnable with JupyterHub and that should have the GPU enabled:

# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility

But when I spawn it and open a terminal from the Jupyter menu, running nvidia-smi produces no output, unlike on the headless host. What is the reason for this?

If I manually start a container with GPU support, nvidia-smi works inside it:

docker run --gpus all hhn/gpu-tensorflow-notebook:latest nvidia-smi

this gives the exact same output from within the container as nvidia-smi on the host (see the output above)! Therefore I did not install nvidia-docker or nvidia-docker2, as I think they are deprecated?
Am I missing the installation of those or other packages?
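
One way to narrow this down from the host is to ask the Docker daemon which runtimes it actually has registered. A small sketch using the Docker SDK for Python (assuming the docker package is installed on the host, e.g. via pip install docker):

# Query the local Docker daemon for its registered runtimes. The "nvidia"
# entry usually only appears after nvidia-docker2 / nvidia-container-runtime
# has been registered in /etc/docker/daemon.json; the --gpus flag of
# Docker 19.03 works without it.
import docker

client = docker.from_env()
info = client.info()

print("Runtimes:       ", list(info.get("Runtimes", {})))
print("Default runtime:", info.get("DefaultRuntime"))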

Furthermore, a

curl -s localhost:3476/docker/cli

leads to no output…

With verbose -v I get:

curl -v -s localhost:3476/docker/cli
*   Trying ::1...
* TCP_NODELAY set
* connect to ::1 port 3476 failed: Connection refused
*   Trying 127.0.0.1...
* TCP_NODELAY set
* connect to 127.0.0.1 port 3476 failed: Connection refused
* Failed to connect to localhost port 3476: Connection refused
* Closing connection 0

When I spawn the container and start a Jupyter notebook, I can check whether a GPU is present by running the following code in a code cell:

import math
import numpy as np
from numba import cuda
import matplotlib.pyplot as plt
%matplotlib inline

Then I query the number of visible GPUs with:

len(cuda.gpus)

This in turn leads to the error:

CudaSupportError: Error at driver init: 
CUDA driver library cannot be found.
If you are sure that a CUDA driver is installed,
try setting environment variable NUMBA_CUDA_DRIVER
with the file path of the CUDA driver shared library.

I think I am missing something, right?
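
A slightly broader check in the same notebook could look like the following sketch, which reports what both Numba and TensorFlow can see (the TensorFlow call assumes TF >= 1.14; older versions would use tf.test.is_gpu_available() instead):

# Report the GPU view of Numba and, if importable, TensorFlow from inside
# the spawned container.
from numba import cuda

print("Numba finds a CUDA driver:", cuda.is_available())
if cuda.is_available():
    cuda.detect()  # prints one block of information per detected device

try:
    import tensorflow as tf
    print("TensorFlow GPUs:", tf.config.experimental.list_physical_devices("GPU"))
except Exception as exc:
    print("TensorFlow check failed:", exc)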

On the other hand, when I check the cuda installation within the spawned container, I get:

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

If you need some more data from my machine, please let me know and I will provide it asap.
Thanks and regards,
rschauf

Maybe @mrocklin would be interested in this question?

3.) Solution:

As any container could be started manually on the server with GPU support via docker run ... on the command line, and nvidia-smi was accessible inside it and showed results as expected, it quickly became obvious that the problem had to do with the JupyterHub configuration and not with Docker itself.
This assumption was correct, as I found a configuration setting here that solved the problem:

c.DockerSpawner.extra_host_config = {'runtime': 'nvidia'}

After adding this line to jupyterhub_config.py and restarting the JupyterHub service, all GPUs were visible within the container and nvidia-smi showed all available GPU devices as expected.

(I also installed nvidia-docker2 on the server before adding this line to the JupyterHub config, but I am not sure whether this was necessary or contributed to the solution. Just to be fair, it is mentioned here…)
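
To come back to the goal from section 2 (one Tesla per container, for up to four containers), the relevant part of jupyterhub_config.py could be extended roughly as in the following sketch. The round-robin assignment is purely illustrative (it does not track which GPUs are actually free) and assumes the nvidia runtime is registered as described above:

# jupyterhub_config.py (sketch)
import itertools

# Run spawned containers with the NVIDIA runtime (the line that solved the issue).
c.DockerSpawner.extra_host_config = {'runtime': 'nvidia'}

# Hand out the four Teslas in round-robin fashion by overriding
# NVIDIA_VISIBLE_DEVICES per container; the nvidia runtime reads this variable
# and exposes only the listed device(s) to the container.
_gpu_ids = itertools.cycle(['0', '1', '2', '3'])

def assign_gpu(spawner):
    spawner.environment['NVIDIA_VISIBLE_DEVICES'] = next(_gpu_ids)

c.Spawner.pre_spawn_hook = assign_gpu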

4.) Final Question:

The only question left is: what exactly is the purpose of this line, and where is it documented so that I can read a little further? Does anyone have a hint?

Thanks and regards,
rschauf

Have a look at the NVIDIA docs; it looks like the NVIDIA package replaces the default Docker runtime:
https://docs.nvidia.com/dgx/nvidia-container-runtime-upgrade/index.html

Hi @RSchauf,

As to your question about the extra_host_config, this is passed (after some processing) to the docker client library as the create_host_config parameter, as documented here: https://docker-py.readthedocs.io/en/1.2.2/hostconfig/

Best,
Steen
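
For reference, a minimal sketch of what that host config amounts to when using docker-py directly (a recent version of the Docker SDK for Python is assumed; very old releases such as the 1.2.2 documented above may not accept the runtime parameter yet):

# Start the notebook image with the nvidia runtime via the low-level docker-py API.
import docker

api = docker.APIClient(base_url="unix://var/run/docker.sock")

host_config = api.create_host_config(runtime="nvidia")  # same effect as extra_host_config above
container = api.create_container(
    image="hhn/gpu-tensorflow-notebook:latest",
    command="nvidia-smi",
    host_config=host_config,
)
api.start(container)
api.wait(container)
print(api.logs(container).decode())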