Hi,
After several other attempts, I have set up TensorFlow and JupyterHub along the lines of the note "JupyterHub with JupyterLab Install using Conda".
As this is a prototype system (with an NVIDIA RTX 3080) for a GPU server (with 2x A100 GPUs), GPU support within the Jupyter notebooks is essential. However, it is not working. After setting environment variables, the remaining error is:
E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
In my view, the problem boils down to the following: on the command line, in the conda environment tensorflow2-gpu2, both the CPU and the GPU are detected:
(tensorflow2-gpu2) admin-nb@jupyter-test:~$ python3
Python 3.9.12 (main, Jun 1 2022, 11:38:51)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib
2022-09-01 13:55:16.780984: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
>>> device_lib.list_local_devices()
2022-09-01 13:55:18.653413: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-01 13:55:18.675498: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3792740000 Hz
2022-09-01 13:55:18.676053: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555a2f539da0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-09-01 13:55:18.676082: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2022-09-01 13:55:18.677208: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-09-01 13:55:19.591403: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555a2f953d60 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-09-01 13:55:19.591426: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce RTX 3080, Compute Capability 8.6
2022-09-01 13:55:19.591638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:05:00.0 name: NVIDIA GeForce RTX 3080 computeCapability: 8.6
coreClock: 1.71GHz coreCount: 68 deviceMemorySize: 9.78GiB deviceMemoryBandwidth: 707.88GiB/s
2022-09-01 13:55:19.591658: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-09-01 13:55:19.592499: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2022-09-01 13:55:19.592518: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2022-09-01 13:55:19.593350: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-09-01 13:55:19.593483: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-09-01 13:55:19.594213: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-09-01 13:55:19.594603: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2022-09-01 13:55:19.596076: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2022-09-01 13:55:19.596269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-09-01 13:55:19.596287: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-09-01 13:55:20.194900: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-09-01 13:55:20.194921: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2022-09-01 13:55:20.194925: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2022-09-01 13:55:20.195255: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 9070 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:05:00.0, compute capability: 8.6)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 16710234344042336047
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 14476011734076499730
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 16669855104248591829
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 9510955264
locality {
bus_id: 1
links {
}
}
incarnation: 11115347933796639039
physical_device_desc: "device: 0, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:05:00.0, compute capability: 8.6"
]
In a notebook on JupyterHub, which uses the same conda environment, only the CPU is detected:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 11793224412206947809,
name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 15586750839120726384
physical_device_desc: "device: XLA_CPU device"]
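To narrow down what differs between the two cases, a check like the one below could be run once in the shell and once in a notebook cell, and the two outputs compared. It is only a sketch that uses the standard library: the helper name collect_gpu_env is hypothetical, and the choice of what to compare (interpreter path, CUDA_VISIBLE_DEVICES, LD_LIBRARY_PATH, and whether the /dev/nvidia* device nodes are visible and accessible to the process) is an assumption about what might differ, not a confirmed cause.

```python
import glob
import os
import sys


def collect_gpu_env():
    """Gather settings that may differ between a login shell and a notebook kernel.

    Hypothetical helper for illustration; it does not import TensorFlow.
    """
    # NVIDIA device nodes as seen by this process (empty list if none are visible).
    device_nodes = sorted(glob.glob("/dev/nvidia*"))
    return {
        # Which interpreter and environment prefix is actually running.
        "python": sys.executable,
        "prefix": sys.prefix,
        # User ID of the process (the hub may spawn kernels as another user).
        "uid": os.getuid() if hasattr(os, "getuid") else None,
        # Environment variables that influence CUDA device and library lookup.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
        # For each device node, whether this process may read and write it.
        "device_nodes": {
            path: os.access(path, os.R_OK | os.W_OK) for path in device_nodes
        },
    }


if __name__ == "__main__":
    for key, value in collect_gpu_env().items():
        print(f"{key}: {value!r}")
```

If the notebook reports a different interpreter, an empty or unexpected CUDA_VISIBLE_DEVICES, or no accessible /dev/nvidia* nodes while the shell does, that would point to the way the hub starts the kernel rather than to the conda environment itself.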
Any ideas? Any help is appreciated!