I want to use a vGPU with JupyterHub in a Rancher Kubernetes environment. I installed CUDA 12.2 and cuDNN 8.9 on the worker node that has the vGPU, and the vGPU is already licensed.
When I try to list GPUs from JupyterHub, I cannot see any GPU.
I've installed the gpu-operator on the Rancher cluster.
I use a profile list. When I choose "DL (GPU Count: 1)", the pod starts on the machine that has the vGPU. I chose a CUDA-enabled PyTorch image and also tried a TensorFlow image, but neither worked.
Could you help me solve this problem? Thanks in advance.
Profile list section:
profileList:
  - display_name: "ML"
    description: "CPU"
    slug: ml
    default: True
  - display_name: "DL (GPU Count: 1)"
    description: "GPU"
    slug: dl-gpu
    kubespawner_override:
      node_selector:
        kubernetes.io/hostname: 'serverGPU1'
      image: quay.io/jupyter/pytorch-notebook:cuda12-python-3.11
      environment:
        NVIDIA_DRIVER_CAPABILITIES: 'compute,utility'
        NVIDIA_VISIBLE_DEVICES: 'all'
        GRANT_SUDO: 'yes'
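As far as I understand, KubeSpawner can also request the GPU explicitly from the scheduler via extra_resource_limits; my profile only pins the node. A GPU-requesting override would look roughly like this (a sketch, not my current config):

```yaml
kubespawner_override:
  extra_resource_limits:
    nvidia.com/gpu: "1"
```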
GPU: NVIDIA Tesla T4
gpu-operator Helm chart version: gpu-operator-v23.6.1
JupyterHub Helm chart version: 3.2.1
torch: 2.3.0+cu121
I configured the RKE2 containerd runtime with this command:
sudo nvidia-ctk runtime configure --runtime=containerd
The command added these lines to the RKE2 containerd config.toml file:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
When I describe the node, I can see nvidia.com/gpu there:
Capacity:
  cpu:                4
  ephemeral-storage:  81902Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16309852Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  81586447911
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16309852Ki
  nvidia.com/gpu:     1
  pods:               110
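To rule out JupyterHub itself, I believe a standalone test pod along these lines (the pod name and image tag are just illustrative) should be able to run nvidia-smi if the runtime and device plugin are working:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"
```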
nvidia device plugin logs:
NVIDIA_DRIVER_ROOT=/
CONTAINER_DRIVER_ROOT=/host
Starting nvidia-device-plugin
Starting FS watcher.
Starting OS watcher.
Starting Plugins.
Loading configuration.
Updating config with default resource matching patterns.
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/host"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
Retreiving plugins.
Detected NVML platform: found NVML library
Detected non-Tegra platform: /sys/devices/soc0/family file not found
Starting GRPC server for 'nvidia.com/gpu'
Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Registered device plugin for 'nvidia.com/gpu' with Kubelet
I run this code in the notebook to check whether a GPU is available, and the result is always False.
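The check is essentially the standard torch.cuda probe; a minimal version looks like this (the report_gpu helper name is just illustrative):

```python
import torch

def report_gpu() -> bool:
    # torch.cuda.is_available() returns False when the container was not
    # granted a GPU device, even if the CUDA libraries themselves are present
    available = torch.cuda.is_available()
    print(f"torch version: {torch.__version__}")
    print(f"CUDA available: {available}")
    if available:
        print(f"Device: {torch.cuda.get_device_name(0)}")
    return available

report_gpu()
```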