If a user is currently the sole user of the physical GPU, do you know whether that user gets all the time slices and therefore ~full performance during that time?
Unfortunately, I don’t know this for certain, and I can’t test it myself because I don’t have hands-on experience with that GPU setup.
The plugin supports two sharing modes: time-slicing and MPS.
"In the case of time-slicing, CUDA time-slicing is used to allow workloads sharing a GPU to interleave with each other. However, nothing special is done to isolate workloads that are granted replicas from the same underlying GPU, and each workload has access to the GPU memory and runs in the same fault-domain as of all the others (meaning if one workload crashes, they all do).
In the case of MPS, a control daemon is used to manage access to the shared GPU. In contrast to time-slicing, MPS does space partitioning and allows memory and compute resources to be explicitly partitioned and enforces these limits per workload."
Which mode to use probably depends on the scenario. We don’t have many GPU users here yet, so I assume time-slicing is the better fit for now.
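For reference, time-slicing is enabled through the plugin's sharing configuration. A minimal sketch, assuming the documented ConfigMap format of the NVIDIA k8s-device-plugin (the replica count of 4 is just an example value):

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
    # Each physical GPU is advertised as 4 schedulable replicas;
    # workloads sharing a replica set interleave via CUDA time-slicing.
    - name: nvidia.com/gpu
      replicas: 4
```

Switching to MPS would use a `sharing.mps` section of the same shape instead of `sharing.timeSlicing`; check the plugin documentation for the version-specific details before relying on this.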