
Support for fabric-attached GPUs #2

Open
sshukun opened this issue Apr 26, 2023 · 2 comments
Labels: feature (issue/PR that proposes a new feature or functionality)


sshukun commented Apr 26, 2023

In the KEP, the use cases and goals for DRA include consuming devices that are dynamically attached from a fabric. Does NVIDIA have any plan to support fabric-attached GPUs in this DRA driver?

Allocating fabric-attached resources involves both infrastructure-provider work (configuring the fabric) and device-vendor-specific work (setting up the environment, configuring devices, and so on), so I am unsure who should implement the related driver, and how.
Should the infrastructure provider create a custom driver that does both, or should the device vendor create a driver that handles the device-specific work and let it talk to some remote fabric manager to request fabric-attached devices? For example, this DRA driver could add a gRPC client that asks the remote server to filter out nodes unsuitable for attaching GPUs, and to attach GPUs to or detach GPUs from a specific node, so that the driver can use both local and fabric-attached GPUs; a minimal sketch of such an interface follows below.
Maybe a common API between a DRA driver and a fabric controller component should be discussed.
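
To make the second option concrete, here is a rough sketch of the client-side interface such a driver might use to talk to a remote fabric manager. Everything here (the `Manager` and `GPUSpec` names and all three methods) is hypothetical and not part of any existing API; in practice the interface would presumably be generated from a gRPC/protobuf service definition.

```go
// Package fabric sketches a hypothetical client for a remote fabric manager.
package fabric

import "context"

// GPUSpec is a hypothetical description of the GPU being requested.
type GPUSpec struct {
	Model       string // e.g. "A100"
	MemoryBytes uint64
}

// Manager is a hypothetical client-side view of a remote fabric manager,
// for example backed by a generated gRPC client stub.
type Manager interface {
	// FilterNodes returns the subset of candidate nodes to which a GPU
	// matching spec could currently be attached over the fabric.
	FilterNodes(ctx context.Context, candidates []string, spec GPUSpec) ([]string, error)

	// Attach asks the fabric manager to attach a GPU matching spec to the
	// given node, returning an identifier (e.g. PCI address or UUID) that
	// the node-local part of the driver can use to find the device.
	Attach(ctx context.Context, node string, spec GPUSpec) (deviceID string, err error)

	// Detach releases a previously attached GPU so the fabric can
	// reassign it elsewhere.
	Detach(ctx context.Context, node, deviceID string) error
}
```

With such a split, the vendor's DRA driver would stay responsible for device setup on the node, while the infrastructure provider implements the server side of this interface.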

I would like to hear NVIDIA's thoughts on this.

sshukun commented May 19, 2023

I want to share some concerns about supporting fabric-attached GPUs, and I really hope that we can create a driver that supports them.

Before DRA, NVIDIA GPUs were used in Kubernetes through the device plugin and the related NVIDIA GPU Operator, so we tested whether they can support dynamically attaching/detaching GPUs. They cannot, because of the following problems:

  • When a new GPU is hot-added to the system, the OS recognizes it as a PCI device correctly, but the device plugin cannot detect it unless the device plugin is restarted.
  • A GPU that is not used by any application pod still cannot be removed from the system safely, because several pods created by the GPU Operator are always accessing it. Those pods need to be stopped before removing a GPU.
  • Every time the nvidia-driver-daemonset pod restarts, nvdrain/drain is executed as part of the driver-manager uninstall_driver process during pod init. This nvdrain/drain affects pods that are already running and using the other GPUs in the same server, so it is not practical to remove a GPU by stopping the nvidia-driver-daemonset pod temporarily and restarting it after the GPU is removed.

These problems stem from the limitations of the device plugin API. Since DRA can do much more than the device plugin, including using fabric-attached devices, I really hope we can find a feasible way to create a DRA driver that supports fabric-attached GPUs and resolves the problems above; a sketch of the kind of rediscovery loop that would help with the first problem follows.
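
To illustrate the first point: a DRA driver owns its own discovery loop, so it can re-enumerate GPUs periodically and publish changes instead of reading the device list once at startup, as the device plugin effectively does. Below is a minimal, hypothetical sketch of such a loop using the github.com/NVIDIA/go-nvml bindings. Note the assumption: whether a hot-plugged GPU actually becomes visible to NVML without reinitialization depends on the driver and platform, so treat this as a sketch of the loop structure, not a statement about current driver behavior.

```go
package main

import (
	"fmt"
	"time"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// currentUUIDs enumerates the GPUs NVML can see right now.
func currentUUIDs() (map[string]bool, error) {
	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		return nil, fmt.Errorf("DeviceGetCount: %s", nvml.ErrorString(ret))
	}
	seen := make(map[string]bool, count)
	for i := 0; i < count; i++ {
		dev, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			return nil, fmt.Errorf("DeviceGetHandleByIndex(%d): %s", i, nvml.ErrorString(ret))
		}
		uuid, ret := dev.GetUUID()
		if ret != nvml.SUCCESS {
			return nil, fmt.Errorf("GetUUID: %s", nvml.ErrorString(ret))
		}
		seen[uuid] = true
	}
	return seen, nil
}

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	known := map[string]bool{}
	for {
		seen, err := currentUUIDs()
		if err != nil {
			fmt.Println("enumeration failed:", err)
		} else {
			for uuid := range seen {
				if !known[uuid] {
					// A real driver would (re)publish its resources here.
					fmt.Println("GPU appeared:", uuid)
				}
			}
			for uuid := range known {
				if !seen[uuid] {
					fmt.Println("GPU disappeared:", uuid)
				}
			}
			known = seen
		}
		time.Sleep(10 * time.Second)
	}
}
```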

ArangoGutierrez added the feature label and removed the feature-request label on Feb 22, 2024
klueska commented Sep 12, 2024

Support for fabric-attached GPUs is coming soon. It will be enabled through a technology known as IMEX, which allows the GPU on one node to securely read/write GPU memory on other nodes within the same "IMEX domain":
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html#group__CUDART__MEMORY__POOLS_1g8158cc4b2c0d2c2c771f9d1af3cf386e

From an end user's perspective, a resource claim will be used to request access to a shared IMEX channel within a given IMEX domain, and all of the pods running on nodes that want to use this channel to read/write each other's GPU memory will need to reference this claim.
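
As a speculative illustration of what "referencing this claim" might look like for the end user (the comment above does not specify the API): a ResourceClaim would be created once per IMEX channel, and every participating pod would reference it by name. The sketch below builds such a pod spec against the core/v1 Go types as of Kubernetes v1.31; the claim name, pod name, and image are made up, and the field shapes differ in older releases (which nested the claim reference under a Source field).

```go
// Package example sketches a pod referencing a shared ResourceClaim.
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// imexPod builds a pod that references a shared, pre-created ResourceClaim.
// "imex-channel-0" is a hypothetical claim name, one per IMEX channel.
func imexPod(name string) *corev1.Pod {
	claimName := "imex-channel-0"
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PodSpec{
			// Every pod that wants to read/write peer GPU memory over the
			// shared IMEX channel references the same claim by name.
			ResourceClaims: []corev1.PodResourceClaim{{
				Name:              "imex-channel",
				ResourceClaimName: &claimName,
			}},
			Containers: []corev1.Container{{
				Name:  "worker",
				Image: "cuda-app:latest", // placeholder image
				Resources: corev1.ResourceRequirements{
					Claims: []corev1.ResourceClaim{{Name: "imex-channel"}},
				},
			}},
		},
	}
}
```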
