I want to use a vGPU with JupyterHub in a Rancher Kubernetes environment. I installed CUDA 12.2 and cuDNN 8.9 on the worker node that has the vGPU, and the vGPU is already licensed.
When I try to list GPUs from JupyterHub, I cannot see any GPU.
I've installed the gpu-operator on the Rancher cluster.
I use a profile list. When I choose "DL (GPU Count: 1)", the pod starts on the machine that has the vGPU. I chose a CUDA-enabled PyTorch image and also tried a TensorFlow image, but neither worked.
Could you help me solve this problem? Thanks in advance.
Profile list section:
profileList:
  - display_name: "ML"
    description: "CPU"
    slug: ml
    default: True
  - display_name: "DL (GPU Count: 1)"
    description: "GPU"
    slug: dl-gpu
    kubespawner_override:
      node_selector:
        kubernetes.io/hostname: 'serverGPU1'
      image: quay.io/jupyter/pytorch-notebook:cuda12-python-3.11
      environment:
        NVIDIA_DRIVER_CAPABILITIES: 'compute,utility'
        NVIDIA_VISIBLE_DEVICES: 'all'
        GRANT_SUDO: 'yes'
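As far as I understand, KubeSpawner can also request the GPU explicitly from the scheduler via extra_resource_limits; my profile only pins the node. A GPU-requesting override would look roughly like this (a sketch, not my current config):

```yaml
kubespawner_override:
  extra_resource_limits:
    nvidia.com/gpu: "1"
```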
GPU: NVIDIA Tesla T4
gpu-operator Helm chart version: gpu-operator-v23.6.1
JupyterHub Helm chart version: 3.2.1
torch: 2.3.0+cu121
I configured the RKE2 containerd runtime with this command:
sudo nvidia-ctk runtime configure --runtime=containerd
The command added these lines to the RKE2 containerd config.toml file:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
When I describe the node, I can see nvidia.com/gpu there:
Capacity:
  cpu:                4
  ephemeral-storage:  81902Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16309852Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  81586447911
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16309852Ki
  nvidia.com/gpu:     1
  pods:               110
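To rule out JupyterHub itself, I believe a standalone test pod along these lines (the pod name and image tag are just illustrative) should be able to run nvidia-smi if the runtime and device plugin are working:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"
```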
nvidia device plugin logs:
NVIDIA_DRIVER_ROOT=/
CONTAINER_DRIVER_ROOT=/host
Starting nvidia-device-plugin
Starting FS watcher.
Starting OS watcher.
Starting Plugins.
Loading configuration.
Updating config with default resource matching patterns.
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/host"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
Retreiving plugins.
Detected NVML platform: found NVML library
Detected non-Tegra platform: /sys/devices/soc0/family file not found
Starting GRPC server for 'nvidia.com/gpu'
Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Registered device plugin for 'nvidia.com/gpu' with Kubelet
I run this code in the notebook to check whether a GPU is available, and the result is always False.
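The check is essentially the standard torch.cuda probe; a minimal version looks like this (the report_gpu helper name is just illustrative):

```python
import torch

def report_gpu() -> bool:
    # torch.cuda.is_available() returns False when the container was not
    # granted a GPU device, even if the CUDA libraries themselves are present
    available = torch.cuda.is_available()
    print(f"torch version: {torch.__version__}")
    print(f"CUDA available: {available}")
    if available:
        print(f"Device: {torch.cuda.get_device_name(0)}")
    return available

report_gpu()
```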