GPUs can not be shared, but GPUs must be shared

This was originally going to be a lightning talk, but I can’t be there so I think that writing is more efficient.

Some resources are easy to share: CPUs can easily be oversubscribed, and memory can be oversubscribed but not over-used. But some resources are different, in particular GPUs. On a cluster they are extremely expensive, and they cannot be shared securely in a multi-tenant environment (if we’re wrong on this, please correct us).

For months people asked us for GPUs, but we couldn’t efficiently put them into Jupyter - the efficiency of interactive use would be too low. For research maybe we could solve this by throwing money at it, but for a 500-person course that’s not possible. Eventually, one of my colleagues came up with the catchphrase: “GPUs can not be shared, but they must be shared”. It made me realize that we have to think far outside the box in order to tackle this problem. GPUs can not be shared, so we have to redefine what sharing even means.

So normal CPU-style multiprocess timesharing doesn’t work. What are the other options?

We can share GPUs at the level of notebooks: have some sort of batch queue to which entire notebooks are sent to run on the GPU. Or we can share GPUs at the level of cells: try to send single cells to the GPU in some sort of batch/sharing system. Or is there something else?

One option is to share GPUs by whole notebooks: make a system to submit whole notebooks to a queue. It runs the notebook top to bottom, saves the output (and maybe the state), and you can examine those results non-interactively and iterate. Huge disadvantage: non-interactive. But this is not so different from interactively debugging on dedicated machines and then batch-running once the code works, so it’s a model similar to what people are used to. I’m working on an extension to make this easier (and, incidentally, make notebooks more like scripts):
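The queue side of this scheme can be sketched in a few lines (this is not the extension mentioned above; `execute_notebook` is a placeholder for actually running the notebook on the GPU node, e.g. via `jupyter nbconvert --execute`):

```python
import queue
import threading

# A single worker "owns" the GPU and drains a FIFO queue of submitted
# notebooks, so notebooks run serially top to bottom and users pick up
# the saved results afterwards.
jobs = queue.Queue()
results = {}

def execute_notebook(path):
    # Placeholder for executing the notebook on the GPU node.
    return f"ran {path}"

def gpu_worker():
    while True:
        path = jobs.get()
        if path is None:  # sentinel: shut the worker down
            break
        results[path] = execute_notebook(path)

worker = threading.Thread(target=gpu_worker)
worker.start()

for nb in ["train.ipynb", "eval.ipynb"]:
    jobs.put(nb)
jobs.put(None)
worker.join()
print(sorted(results))  # → ['eval.ipynb', 'train.ipynb']
```

The point is only that the GPU is never contended: exactly one notebook touches it at a time, and interactivity is traded away for utilization.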

The other option I can think of is cell-level execution: a cell magic that serializes all notebook state, sends it to a remote execution environment (some sort of batch queue here, hopefully so fast that it’s not visible to users), runs the cell, re-serializes the state, brings it back, and reloads the new state into the notebook. Advantages: convenient, almost transparent. Disadvantages: not all state can be serialized. There is a proof of concept by a colleague (I haven’t tested it myself). Russell Neches’s lightning talk is also relevant here: github: ryneches/SummonUndead (can only post two links, sorry)
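A minimal sketch of the serialize–ship–execute–merge cycle (this is not the colleague’s proof of concept; `remote_gpu_exec` is a stand-in for the batch-queue worker on the GPU node, and only picklable state survives the round trip):

```python
import pickle

def remote_gpu_exec(state_blob, cell_source):
    # Stand-in for the remote GPU worker: rebuild the namespace,
    # run the cell in it, and ship the (picklable) state back.
    ns = pickle.loads(state_blob)
    exec(cell_source, ns)
    ns = {k: v for k, v in ns.items() if not k.startswith("__")}
    return pickle.dumps(ns)

# Local notebook state before the "GPU cell"
state = {"x": 2}
blob = remote_gpu_exec(pickle.dumps(state), "y = x * 21")
state.update(pickle.loads(blob))
print(state["y"])  # → 42
```

The real difficulty is exactly the caveat above: open files, GPU handles, and many library objects cannot be pickled, so transparency breaks down at the edges.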

I heard a rumor from Open Source Summit in Edinburgh about some way to dynamically add GPUs to and remove them from containers. This would allow some sort of API (context manager or cell magic) to get the GPU just for the cells that need it. I haven’t seen the container API to do this, and I wonder whether there will be problems with the code releasing the GPU.

Are there any more options? None are that good. What do you do?

Other options: throw money at the problem (or let users throw money at it), buy slower, cheaper GPUs for interactive use, or allow full GPUs but with a really short culling time.


I don’t think this is very feasible, but one way for this to work well would be for libraries to acknowledge this and acquire/release GPUs on a per-call basis much more proactively. A GPU API that allows being kicked off the GPU by someone else and seamlessly restored as needed (as the OS does all the time with CPU state) would go a long way. I have no idea how to do this in practice, but code blocks like:

with gpu:

would ensure that the GPU is not claimed while the notebook is idle, allowing better multi-tenant behavior.
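To make the context-manager idea concrete, here is a hypothetical sketch; `GpuLease` and `FakeAllocator` are invented names, and in reality the allocator would be the cluster’s scheduler rather than an in-process object:

```python
import contextlib

class GpuLease(contextlib.AbstractContextManager):
    # Acquire a device on entry, hand it back on exit, so the GPU
    # is only held while the block actually runs.
    def __init__(self, allocator):
        self.allocator = allocator
        self.device = None
    def __enter__(self):
        self.device = self.allocator.acquire()  # may block in a queue
        return self.device
    def __exit__(self, *exc):
        self.allocator.release(self.device)
        self.device = None
        return False

class FakeAllocator:
    # Stand-in for the real multi-tenant scheduler.
    def __init__(self):
        self.in_use = False
    def acquire(self):
        assert not self.in_use, "GPU already claimed"
        self.in_use = True
        return "gpu0"
    def release(self, device):
        self.in_use = False

alloc = FakeAllocator()
with GpuLease(alloc) as dev:
    assert alloc.in_use       # GPU held only inside the block
assert not alloc.in_use       # released while the notebook is idle
```

The hard part the sketch hides is the "kick me out and put me back" behavior: a lease that can be revoked mid-block requires the library to checkpoint and restore GPU state, which is exactly the missing piece.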

Implementing the same multi-tenant “it’s okay to kick me out, as long as you put me back” at a lower level would make it work more like OS context switches, but I don’t know who has the expertise for doing something like that.

@lresende @kevin-bates: does Kernel gateway open up for potential GPU sharing? Do you have any experience with such?

Hi @consideRatio - thanks for the ping.

Our experiences with GPUs are also limited. Enterprise Gateway (not Kernel Gateway - I know, it’s confusing) provides the ability to run the kernel on a GPU node, while another kernel could be run on less resource-sensitive nodes within the same Notebook “session”. So, in that sense it introduces a finer-grained approach, but not so much if a given Hub user is only running one notebook anyway (except for the fact that only the kernel is consuming the resource - and not content management and the rest of the NB server ops).

We’re also not familiar with the sharing aspect of GPUs; I figured that would be managed by the underlying libraries. However, if there’s a way to express this in Kubernetes YAML, then EG also allows customization of the kernel pod’s YAML relative to each kernelspec. As a result, you could create kernelspecs that request differing quantities of GPUs. EG does nothing directly related to the management of GPUs itself, though.
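For what it’s worth, a GPU request in a pod spec is expressed through the device plugin’s extended resource name, so a kernelspec’s pod customization might include a fragment like this (assuming the NVIDIA device plugin is installed; it only hands out whole devices, not fractions):

```yaml
# Illustrative pod-spec fragment: request one full GPU for this kernel pod.
resources:
  limits:
    nvidia.com/gpu: 1
```

Different kernelspecs could carry different limits here, which is the finer-grained knob EG exposes.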

I hope that helps.

This is a good point about Enterprise Gateway, and something I recently thought about too: assigning a GPU to a whole single-user notebook server requires the server to be shut down when the GPU needs to be reclaimed, but it’s more acceptable to have a kernel with a short runtime limit. I’d like to try this out sometime. I haven’t looked at the gateways much, but I assume it won’t be too hard…

Relatedly, one of my coworkers is now working on a way to bring GPUs into and out of containers dynamically: gpuplug. It’s a proof of concept for now, and we need to see how good the Python library support is (the hard part…). If it works, it will be like the context manager example above.


We are using the gpushare-scheduler-extender from Alibaba in our Z2JH cluster.

It works reasonably well and allows oversubscription of GPU memory. However, it relies on nvidia-docker2, which currently does not (yet) support NVIDIA’s MPS and therefore can’t enforce any limits on resource usage.

My workaround for this is to run a periodic Kubernetes job which terminates any pods that violate their quotas. This can be implemented by running nvidia-smi, which provides the PIDs for each GPU task. These PIDs can then be traced back to a Kubernetes pod by looking at the cgroup names in /proc/<pid>/cgroup (see also Heptio Labs’ pid2pod).
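The PID-to-pod step above can be sketched like this; the cgroup path is an illustrative example of the kubepods layout, and real paths vary by container runtime and cgroup driver, so the regex is an assumption rather than a general solution (pid2pod handles the variants properly):

```python
import re

# Example line as found in /proc/<pid>/cgroup for a pod-managed process:
# the pod's UID is embedded in the kubepods hierarchy path.
sample_cgroup = (
    "11:memory:/kubepods/burstable/"
    "pod8f2c1d1e-5b0a-4c6d-9e3f-0123456789ab/0123abcd"
)

def pod_uid_from_cgroup(line):
    # Extract the 36-character pod UID from a kubepods cgroup path.
    m = re.search(r"/kubepods/[^/]+/pod([0-9a-f-]{36})/", line)
    return m.group(1) if m else None

print(pod_uid_from_cgroup(sample_cgroup))
# → 8f2c1d1e-5b0a-4c6d-9e3f-0123456789ab
```

With the UID in hand, the periodic job can look the pod up via the Kubernetes API and delete it if it exceeds its quota.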

However, implementing this is probably not worth the time, since Alibaba is already working on MPS support.

There is also a nice Medium article from Alibaba about the implementation that gives some more background.


Dear stv0g,

Need help …

We are trying to use “gpushare-scheduler-extender” from Alibaba in our on-premise Z2JH cluster and are getting the following error:

Error: failed to start container "notebook": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: no-gpu-has-1MiB-to-run: unknown device\\n\""": unknown

configuration used in config.yaml: (snippet)

name: jupyter/base-notebook
tag: 2343e33dec46
- display_name: "Learning Data Science - with ONE GPU - shared 1GB"
  description: "Datascience Environment with Sample Notebooks - with one gpu - shared 1GB"
  image: jupyter/datascience-notebook:2343e33dec46
  # GiB 1

Please kindly assist us in configuring Z2JH to use “gpushare-scheduler-extender” from Alibaba in our on-premise cluster. Can you share config.yaml syntax for configuring the “ 1” resource limit?

Thanks in advance …