GPUs can not be shared, but GPUs must be shared

stv0g · May 3, 2020, 12:31pm

We are using the gpushare-scheduler-extender from Alibaba in our Z2JH cluster.

It works reasonably well and allows oversubscription of GPU memory. However, it relies on nvidia-docker2, which does not (yet) support NVIDIA's MPS and therefore can't enforce any limits on resource usage.

My workaround for this issue is to run a periodic Kubernetes job that terminates any pods that violate their quotas. This can be implemented by running nvidia-smi, which reports the PID of each GPU task. These PIDs can then be traced back to a Kubernetes pod by looking at the cgroup names in /proc/<pid>/cgroup (see also Heptio Labs' pid2pod).
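A rough sketch of the PID-to-pod lookup in Python might look like the following. It is only an illustration, not what we run in production. It assumes the job runs with access to the host PID namespace (so the PIDs from nvidia-smi match /proc), and that the pod UID shows up in the cgroup path as `pod<uid>`, with dashes under the cgroupfs driver or underscores under the systemd driver. The function names are made up for this example. Comparing the result against each pod's quota and deleting the offenders is left out.

```python
import re
import subprocess
from typing import Dict, Optional

# Assumption: the cgroup path contains "pod<uid>", where the UID uses dashes
# (cgroupfs driver) or underscores (systemd driver). Other layouts will not match.
POD_UID_RE = re.compile(
    r"pod([0-9a-f]{8}(?:[-_][0-9a-f]{4}){3}[-_][0-9a-f]{12})"
)


def pod_uid_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Extract a pod UID from the contents of /proc/<pid>/cgroup, if present."""
    match = POD_UID_RE.search(cgroup_text)
    if match is None:
        return None
    # Normalize the systemd-style underscores back to dashes.
    return match.group(1).replace("_", "-")


def parse_compute_apps(csv_text: str) -> Dict[int, int]:
    """Parse 'pid, used_memory' CSV rows into {pid: MiB}, summing across GPUs."""
    usage: Dict[int, int] = {}
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        # Skip blank rows and rows where a value is not numeric (e.g. "[N/A]").
        if len(fields) != 2 or not all(field.isdigit() for field in fields):
            continue
        pid, mib = int(fields[0]), int(fields[1])
        usage[pid] = usage.get(pid, 0) + mib
    return usage


def gpu_memory_by_pod() -> Dict[str, int]:
    """Return {pod_uid: MiB of GPU memory in use} for processes on this node."""
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    totals: Dict[str, int] = {}
    for pid, mib in parse_compute_apps(output).items():
        try:
            with open(f"/proc/{pid}/cgroup") as handle:
                uid = pod_uid_from_cgroup(handle.read())
        except OSError:
            continue  # The process exited between the two lookups.
        if uid is not None:
            totals[uid] = totals.get(uid, 0) + mib
    return totals
```

The UIDs returned by `gpu_memory_by_pod()` can then be matched against `metadata.uid` of the pods on the node to find the ones exceeding their requested GPU memory.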

However, implementing this is probably not worth the time, since Alibaba is already working on MPS support.

There is also a nice Medium article from Alibaba about the implementation that gives some more background.
