We cannot access a GPU in our docker container spawned with jupyterhub…
1.) The situation:
We are using a headless Ubuntu Server 18.04 LTS. It has the following software installed:
- miniconda 4.7.10, installed per user (a separate installation under each user's home directory, so that each user has their own environments and so on). It may be important to mention that I have different conda environments installed on the host for the users who spawn the containers. These environments are accessible within the Jupyter notebook through different installed kernels, which one may select when running the notebook. For our application we use a kernel/environment other than "base".
- We have installed jupyterhub version 1.0.0
- The following NVIDIA packages are installed on the headless server:
libnvidia-cfg1-430:amd64 430.40
libnvidia-compute-430:amd64 430.40
libnvidia-container-tools 1.0.5-1
libnvidia-container1:amd64 1.0.5-1
nvidia-compute-utils-430 430.40
nvidia-container-toolkit 1.0.5-1
nvidia-dkms-430 430.40
nvidia-headless-418:amd64 430.40
nvidia-headless-430 430.40
nvidia-headless-no-dkms-430 430.40
nvidia-kernel-common-430 430.40
nvidia-kernel-source-430 430.40
nvidia-utils-418:amd64 430.40
nvidia-utils-430 430.40
- Docker version 19.03.2, build 6a30dfc:
dpkg -l | grep docker
docker-ce 5:19.03.2~3-0~ubuntu-bionic
docker-ce-cli 5:19.03.2~3-0~ubuntu-bionic
- SystemUserSpawner, as we want to give the users who are logged into JupyterHub access to their home directories
- We have 4 Tesla GPUs that we want to make available inside 4 Docker containers on the JupyterHub. This means at most 4 containers run with a GPU enabled (one of the Teslas each) and the other containers have no GPU. Here is the info for the GPUs:
nvidia-smi
Tue Sep 17 09:04:50 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   32C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:82:00.0 Off |                    0 |
| N/A   33C    P0    31W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:85:00.0 Off |                    0 |
| N/A   36C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   34C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
- Our Dockerfile:
FROM jupyter/tensorflow-notebook
USER root
RUN apt-get update && apt-get install -y --no-install-recommends \
gnupg2 curl ca-certificates && \
curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - && \
echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list && \
apt-get purge --autoremove -y curl && \
rm -rf /var/lib/apt/lists/*
ENV CUDA_VERSION 10.1.243
ENV CUDA_PKG_VERSION 10-1=$CUDA_VERSION-1
# For libraries in the cuda-compat-* package: https://docs.nvidia.com/cuda/eula/index.html#attachment-a
RUN apt-get update && apt-get install -y --no-install-recommends \
cuda-cudart-$CUDA_PKG_VERSION \
cuda-compat-10-1 && \
ln -s cuda-10.1 /usr/local/cuda && \
rm -rf /var/lib/apt/lists/*
# Required for nvidia-docker v1
RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV NVIDIA_REQUIRE_CUDA "cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411"
ENV NCCL_VERSION 2.4.8
RUN apt-get update && apt-get install -y --no-install-recommends \
cuda-libraries-$CUDA_PKG_VERSION \
cuda-nvtx-$CUDA_PKG_VERSION \
libnccl2=$NCCL_VERSION-1+cuda10.1 && \
apt-mark hold libnccl2 && \
rm -rf /var/lib/apt/lists/*
RUN apt-get update && apt-get install -y --no-install-recommends \
cuda-nvml-dev-$CUDA_PKG_VERSION \
cuda-command-line-tools-$CUDA_PKG_VERSION \
cuda-libraries-dev-$CUDA_PKG_VERSION \
cuda-minimal-build-$CUDA_PKG_VERSION \
libnccl-dev=$NCCL_VERSION-1+cuda10.1 \
&& \
rm -rf /var/lib/apt/lists/*
ENV LIBRARY_PATH /usr/local/cuda/lib64/stubs
ENV CUDNN_VERSION 7.6.3.30
LABEL com.nvidia.cudnn.version="${CUDNN_VERSION}"
RUN apt-get update && apt-get install -y --no-install-recommends \
libcudnn7=$CUDNN_VERSION-1+cuda10.1 \
libcudnn7-dev=$CUDNN_VERSION-1+cuda10.1 \
&& \
apt-mark hold libcudnn7 && \
rm -rf /var/lib/apt/lists/*
# Switch back to jovyan to avoid accidental container runs as root
USER $NB_UID
2.) My question:
Later, we want to be able to provide one GPU to each of the 4 containers. For now, I have set up a container that is spawnable with JupyterHub and that should have the GPU enabled:
# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
But when I spawn it and open a terminal from the Jupyter menu, running nvidia-smi produces no output like the one on the headless host. What is the reason for this?
If I manually spawn a container with a GPU, nvidia-smi works within it:
docker run --gpus all hhn/gpu-tensorflow-notebook:latest nvidia-smi
gives the exact same output from within the container as nvidia-smi on the host, see the output above! Therefore I did not install nvidia-docker or nvidia-docker2, as I think they are deprecated?
Am I missing the installation of those or even other packages?
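For reference, this is roughly what I imagine is missing in our jupyterhub_config.py. It is a minimal sketch only, assuming that dockerspawner's extra_host_config is forwarded to the Docker host config of the spawned container and that the installed docker Python SDK is new enough to know device_requests (the API behind --gpus); the image name and the GPU index are just placeholders from our setup:

# jupyterhub_config.py -- sketch only, untested: the goal is to make the
# spawned container equivalent to `docker run --gpus all ...`.
# `c` is the config object that JupyterHub provides when loading this file.
from docker.types import DeviceRequest  # needs a docker SDK that supports API 1.40+

c.JupyterHub.spawner_class = "dockerspawner.SystemUserSpawner"
c.SystemUserSpawner.image = "hhn/gpu-tensorflow-notebook:latest"

# extra_host_config is passed on to the Docker host config of the container;
# device_requests asks for GPUs the same way the --gpus CLI flag does.
c.SystemUserSpawner.extra_host_config = {
    "device_requests": [
        DeviceRequest(count=-1, capabilities=[["gpu"]]),  # -1 = all GPUs
        # later, for one GPU per container, something like:
        # DeviceRequest(device_ids=["0"], capabilities=[["gpu"]]),
    ],
}

If this is the wrong direction, I guess the alternative would be to install nvidia-container-runtime, register it as the "nvidia" runtime in /etc/docker/daemon.json and set extra_host_config = {"runtime": "nvidia"} instead, so that the NVIDIA_VISIBLE_DEVICES variables from the image take effect, but I am not sure which of the two is the intended way with the new toolkit.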
Furthermore, a
curl -s localhost:3476/docker/cli
produces no output…
With verbose -v I get:
curl -v -s localhost:3476/docker/cli
* Trying ::1...
* TCP_NODELAY set
* connect to ::1 port 3476 failed: Connection refused
* Trying 127.0.0.1...
* TCP_NODELAY set
* connect to 127.0.0.1 port 3476 failed: Connection refused
* Failed to connect to localhost port 3476: Connection refused
* Closing connection 0
When I spawn the container and start a Jupyter notebook, I can check whether a GPU is present by running the following code in a code cell:
import math
import numpy as np
from numba import cuda
import matplotlib.pyplot as plt
%matplotlib inline
Then I query the number of visible GPUs with:
len(cuda.gpus)
This in turn leads to the error:
CudaSupportError: Error at driver init:
CUDA driver library cannot be found.
If you are sure that a CUDA driver is installed,
try setting environment variable NUMBA_CUDA_DRIVER
with the file path of the CUDA driver shared library.
I think I am missing something, right?
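To narrow this down, I would probably also run a quick check from a code cell to see whether the host's driver library (libcuda) is visible inside the container at all; as far as I understand, numba needs the driver library injected from the host, not the CUDA toolkit installed in the image. This is just my own diagnostic sketch:

# Diagnostic sketch (my own, not part of the image): check whether the NVIDIA
# driver library has been made available inside the container.
import ctypes.util
import subprocess

# The loader only finds libcuda if the host driver was injected into the container.
print("libcuda found by the loader:", ctypes.util.find_library("cuda"))

# List every libcuda entry in the ldconfig cache inside the container.
ldcache = subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout
print([line.strip() for line in ldcache.splitlines() if "libcuda" in line])

If both of these come back empty, the container was apparently started without the GPU hook, which would match the nvidia-smi behaviour above.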
On the other hand, when I check the cuda installation within the spawned container, I get:
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
If you need some more data from my machine, please let me know, I will provide it asap.
Thanks and regards
rschauf