NVIDIA GPU does not work in JupyterHub on a Rancher cluster

I want to use a vGPU with JupyterHub in a Rancher Kubernetes environment. I installed CUDA 12.2 and cuDNN 8.9 on the worker node that has the vGPU, and the vGPU is already licensed.

When I try to list GPUs from JupyterHub, I cannot see any GPU.

I’ve installed the gpu-operator on the Rancher cluster.

I'm using a profile list. When I choose “DL (GPU Count: 1)”, the pod starts on the machine that has the vGPU. I’ve chosen a CUDA-enabled PyTorch image and also tried a TensorFlow image, but neither worked.

Could you help me solve this problem? Thank you in advance. :slight_smile:

Profile list section:

  profileList:
  - display_name: "ML"
    description: "CPU"
    slug: ml
    default: True
  - display_name: "DL (GPU Count: 1)"
    description: "GPU"
    slug: dl-gpu
    kubespawner_override:
      node_selector:
        kubernetes.io/hostname: 'serverGPU1'
      image: quay.io/jupyter/pytorch-notebook:cuda12-python-3.11
      environment:
        NVIDIA_DRIVER_CAPABILITIES: 'compute,utility'
        NVIDIA_VISIBLE_DEVICES: 'all'
        GRANT_SUDO: 'yes'
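
One thing I notice in my own config: the override pins the pod to the GPU node but never requests the GPU resource itself. In case that is what's missing, a sketch of the same profile with an explicit limit (using KubeSpawner's `extra_resource_limits` option; I haven't verified that this is the fix) would look like:

  - display_name: "DL (GPU Count: 1)"
    description: "GPU"
    slug: dl-gpu
    kubespawner_override:
      node_selector:
        kubernetes.io/hostname: 'serverGPU1'
      image: quay.io/jupyter/pytorch-notebook:cuda12-python-3.11
      # Assumption: explicitly requesting the device is what makes the
      # container runtime mount the GPU into the notebook container.
      extra_resource_limits:
        nvidia.com/gpu: "1"
      environment:
        NVIDIA_DRIVER_CAPABILITIES: 'compute,utility'
        NVIDIA_VISIBLE_DEVICES: 'all'
        GRANT_SUDO: 'yes'
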
GPU: Nvidia Tesla T4
gpu-operator helm chart version: gpu-operator-v23.6.1
jupyterhub helm chart version: 3.2.1
torch: 2.3.0+cu121

I configured the RKE2 containerd runtime with this command:

sudo nvidia-ctk runtime configure --runtime=containerd

This command added these lines to the containerd config.toml file:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
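
I'm not sure whether pods also need to opt in to this runtime explicitly, since registering “nvidia” does not make it the default handler in the config above. If they do, I assume it would go through a RuntimeClass roughly like this (the gpu-operator may already create one named nvidia):

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # must match the runtime name registered in config.toml above

with the workload then setting runtimeClassName: nvidia in its pod spec.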

I described the node, and I can see nvidia.com/gpu there:

Capacity:
  cpu:                4
  ephemeral-storage:  81902Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16309852Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  81586447911
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16309852Ki
  nvidia.com/gpu:     1
  pods:               110

nvidia device plugin logs:

NVIDIA_DRIVER_ROOT=/
CONTAINER_DRIVER_ROOT=/host
Starting nvidia-device-plugin
Starting FS watcher.
Starting OS watcher.
Starting Plugins.
Loading configuration.
Updating config with default resource matching patterns.
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/host"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
Retreiving plugins.
Detected NVML platform: found NVML library
Detected non-Tegra platform: /sys/devices/soc0/family file not found
Starting GRPC server for 'nvidia.com/gpu'
Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Registered device plugin for 'nvidia.com/gpu' with Kubelet

I'm running this code to check whether the GPU is available, but the result is always False. :frowning:

[screenshot: notebook cell checking GPU availability, with output False]

Have you tried getting a GPU to work on a single manually created pod without JupyterHub? It should be easier to play around with the K8s/NVIDIA configuration, and once it’s working you can add it to the Z2JH config.
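
For example, a minimal smoke-test pod could look roughly like this (the pod name and image tag are just placeholders; adjust the runtimeClassName line to match how your containerd is set up):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test        # hypothetical name
spec:
  restartPolicy: Never
  # runtimeClassName: nvidia  # uncomment if "nvidia" is not the default runtime
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # this request is what triggers GPU injection

If nvidia-smi in that pod's logs shows the Tesla T4, the node-level NVIDIA setup is fine and the remaining problem is on the JupyterHub/KubeSpawner side (for example, a missing GPU resource limit in the profile override).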