How to share a GPU with multiple pods? Insufficient nvidia.com/gpu

Hello Community,

My JupyterHub runs on Kubernetes and I use the NVIDIA/k8s-device-plugin so that the pods can access the GPU.

I have the following problem, or maybe I’m misunderstanding something.
In the config for the GPU profile I have set extra_resource_limits:

extra_resource_limits:
    nvidia.com/gpu: '1'
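
For context, the surrounding part of the profile looks roughly like this (a trimmed sketch assuming the Zero to JupyterHub profileList format; the profile name and description are placeholders):

singleuser:
  profileList:
    - display_name: "GPU server"
      description: "Notebook server with access to one GPU"
      kubespawner_override:
        extra_resource_limits:
          nvidia.com/gpu: '1'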

However, two or more users cannot use the GPU at the same time, and the other pods get the message “Insufficient nvidia.com/gpu”.

Is it possible for multiple users to use the GPU at the same time?
If so, how?

Thank you very much for your help.

According to the K8s docs:

You can specify GPU limits without specifying requests, because Kubernetes will use the limit as the request value by default.

So it sounds like the limit is also treated as the request.
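
With that limit in place, each single-user pod claims a whole advertised GPU device. Roughly what the spawned pod’s container resources end up as (illustrative only; the container name and image are placeholders):

containers:
  - name: notebook
    image: jupyter/base-notebook
    resources:
      limits:
        nvidia.com/gpu: '1'   # the request defaults to the same value, so the
                              # scheduler reserves one whole advertised GPU per pod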

GPU sharing isn’t straightforward. This video clarifies where things are currently and what is planned for the future: https://youtu.be/Q2GuTUO170w?si=cmuG1rl_0WfM5eDq

Check 16 minutes and 15 seconds in for the relevant part about GPU sharing.

Thanks for the tip. I found the time-slicing feature; I just had to enable it in the NVIDIA/k8s-device-plugin.

If anyone has the same problem, this is the config for the plugin that I’m using now. replicas can be set to a different number.

version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: "envvar"
    deviceIDStrategy: "uuid"
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 20    # each physical GPU is advertised as 20 schedulable nvidia.com/gpu devices
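
One way to hand this config to the plugin is to wrap it (trimmed here to the sharing part) in a ConfigMap and point the plugin’s Helm chart at it. The names below are placeholders, and I’m assuming the chart’s config.name value here, so check the plugin’s docs for your deployment method:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config    # placeholder name
  namespace: nvidia-device-plugin      # adjust to wherever the plugin runs
data:
  config: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 20

Once the plugin picks this up and restarts, the node should advertise a capacity of nvidia.com/gpu: 20, so up to 20 pods requesting one GPU each can be scheduled on a single physical card.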

Thank you for sharing this @LeafLikeApple!!

If a user is currently the sole user of the actual GPU, do you know if that user gets to use all time slices and gets ~full performance during that time?


Unfortunately, I don’t know for sure, and I can’t test it because I don’t have the experience to put the GPU under a targeted load.
There are two sharing modes (time-slicing and MPS) in the plugin.

"In the case of time-slicing, CUDA time-slicing is used to allow workloads sharing a GPU to interleave with each other. However, nothing special is done to isolate workloads that are granted replicas from the same underlying GPU, and each workload has access to the GPU memory and runs in the same fault-domain as of all the others (meaning if one workload crashes, they all do).

In the case of MPS, a control daemon is used to manage access to the shared GPU. In contrast to time-slicing, MPS does space partitioning and allows memory and compute resources to be explicitly partitioned and enforces these limits per workload."

Which mode to use probably depends on the scenario. We don’t have many GPU users here yet, so I assume time slicing is the better fit in this case.
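
For reference, if someone wants to try MPS instead, my understanding is that in recent plugin versions the sharing section changes shape roughly like this (untested on my side, so treat it as a sketch):

version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 10    # with MPS, my understanding is that memory and compute are split evenly across the replicas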

I’m interested in this as well. Does anyone have more information on this?