I installed an NVIDIA GPU and MicroK8s 1.29 on Ubuntu 24.04.
administer@ultimate-force:~$ nvidia-smi
Thu Mar 20 11:42:47 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:C1:00.0 Off |                  N/A |
|  0%   31C    P8             11W /  280W |      16MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2171      G   /usr/lib/xorg/Xorg                              9MiB |
|    0   N/A  N/A      4131      G   /usr/bin/gnome-shell                            3MiB |
+-----------------------------------------------------------------------------------------+
administer@ultimate-force:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0
However, I could not enable the GPU add-on with microk8s enable gpu, so I installed the NVIDIA GPU Operator with Helm instead:
administer@ultimate-force:~$ microk8s helm3 install gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.enabled=false --set toolkit.env[0].name=CONTAINERD_CONFIG --set toolkit.env[0].value=/var/snap/microk8s/current/args/containerd-template.toml --set toolkit.env[1].name=CONTAINERD_SOCKET --set toolkit.env[1].value=/var/snap/microk8s/common/run/containerd.sock --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS --set toolkit.env[2].value=nvidia --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT --set-string toolkit.env[3].value=true
NAME: gpu-operator
LAST DEPLOYED: Wed Mar 19 13:10:20 2025
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
The GPU and CUDA work fine when I check:
administer@ultimate-force:~$ microk8s kubectl logs -n gpu-operator -l app=nvidia-operator-validator -c nvidia-operator-validator
all validations are successful
administer@ultimate-force:~$ microk8s kubectl apply -f - <<EOF
> apiVersion: v1
> kind: Pod
> metadata:
>   name: cuda-vector-add
> spec:
>   restartPolicy: OnFailure
>   containers:
>     - name: cuda-vector-add
>       image: "k8s.gcr.io/cuda-vector-add:v0.1"
>       resources:
>         limits:
>           nvidia.com/gpu: 1
> EOF
pod/cuda-vector-add created
administer@ultimate-force:~$ microk8s kubectl logs cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
Then I deployed JupyterHub successfully. I can sign in to JupyterHub and install TensorFlow with:
python3 -m pip install 'tensorflow[and-cuda]'
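The cuda-vector-add pod above requested nvidia.com/gpu: 1 explicitly, so it may be worth confirming that the JupyterHub user pod also has the GPU exposed before looking at TensorFlow itself. The following is a minimal sketch using only the Python standard library, meant to be run inside the notebook container. The function name gpu_visibility_report is hypothetical, and the three checks are heuristics under the assumption that the NVIDIA container toolkit mounts device nodes and nvidia-smi into GPU pods. It is not an official diagnostic.

```python
import glob
import os
import shutil


def gpu_visibility_report():
    """Collect basic signs that an NVIDIA GPU is exposed to this container.

    Hypothetical helper: these are heuristics, not an official check.
    """
    return {
        # Device nodes such as /dev/nvidia0 usually appear when a GPU is mounted.
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        # nvidia-smi is normally injected into GPU containers by the toolkit.
        "nvidia_smi_path": shutil.which("nvidia-smi"),
        # Often set by the runtime or device plugin; may legitimately be absent.
        "NVIDIA_VISIBLE_DEVICES": os.environ.get("NVIDIA_VISIBLE_DEVICES"),
    }


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

If all three values come back empty, the notebook pod probably was not scheduled with a GPU, which would be a separate problem from anything TensorFlow prints.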
However, when I import TensorFlow in python3, I get these errors and warnings:
$ python3
Python 3.12.8 (main, Jan 14 2025, 02:29:13) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2025-03-20 09:31:17.182480: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1742463077.197775 74 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1742463077.202298 74 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1742463077.214892 74 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742463077.214919 74 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742463077.214922 74 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742463077.214925 74 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-03-20 09:31:17.218726: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
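The log above contains E and W lines from TensorFlow's native logging but no Python traceback, so the import itself may have completed despite the messages. A minimal sketch to tell the cases apart follows. The helper name visible_gpus is hypothetical; tf.config.list_physical_devices is the TensorFlow call used to list devices, and the script assumes it is run in the same environment where TensorFlow was installed.

```python
import importlib


def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if the import fails.

    Hypothetical helper for illustration only.
    """
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError:
        # A real import failure raises here instead of only logging messages.
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow could not be imported")
    elif gpus:
        print("TensorFlow sees:", gpus)
    else:
        print("TensorFlow imported, but no GPU is visible")
```

A non-empty list would suggest the logged messages are noise rather than a failed import. An empty list would point toward the GPU not being visible to TensorFlow in that pod.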
Any ideas on what is going wrong?