
Support for fabric-attached GPUs #2

Open
sshukun opened this issue Apr 26, 2023 · 2 comments
Labels: feature (issue/PR that proposes a new feature or functionality)


sshukun commented Apr 26, 2023

In the KEP, the use cases and goals for DRA include consuming devices that are dynamically attached from a fabric. Does NVIDIA have any plan to support fabric-attached GPUs in this DRA driver?

Allocating fabric-attached resources involves both infrastructure-provider work (configuring the fabric) and device-vendor-specific work (setting up the environment, configuring devices, and so on), so I am unsure who should implement the related driver, and how.
Should the infrastructure provider create a custom driver that does both, or should the device vendor create a driver that handles the device-specific work and let it talk to some remote fabric manager to request fabric-attached devices? For example, this DRA driver could add a gRPC client that asks the remote server to filter out nodes unsuitable for attaching GPUs, and to attach GPUs to or detach GPUs from a specific node, so that the driver can use both local and fabric-attached GPUs; a minimal sketch of such an interface follows below.
Maybe a common API between a DRA driver and a fabric controller component should be discussed.
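
To make the second option concrete, here is a rough sketch of the client-side interface such a driver might use to talk to a remote fabric manager. Everything here (the `Manager` and `GPUSpec` names and all three methods) is hypothetical and not part of any existing API; in practice the interface would presumably be generated from a gRPC/protobuf service definition.

```go
// Package fabric sketches a hypothetical client for a remote fabric manager.
package fabric

import "context"

// GPUSpec is a hypothetical description of the GPU being requested.
type GPUSpec struct {
	Model       string // e.g. "A100"
	MemoryBytes uint64
}

// Manager is a hypothetical client-side view of a remote fabric manager,
// for example backed by a generated gRPC client stub.
type Manager interface {
	// FilterNodes returns the subset of candidate nodes to which a GPU
	// matching spec could currently be attached over the fabric.
	FilterNodes(ctx context.Context, candidates []string, spec GPUSpec) ([]string, error)

	// Attach asks the fabric manager to attach a GPU matching spec to the
	// given node, returning an identifier (e.g. PCI address or UUID) that
	// the node-local part of the driver can use to find the device.
	Attach(ctx context.Context, node string, spec GPUSpec) (deviceID string, err error)

	// Detach releases a previously attached GPU so the fabric can
	// reassign it elsewhere.
	Detach(ctx context.Context, node, deviceID string) error
}
```

With such a split, the vendor's DRA driver would stay responsible for device setup on the node, while the infrastructure provider implements the server side of this interface.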

I would like to hear NVIDIA's thoughts on this.

sshukun commented May 19, 2023

I want to share some concerns about supporting fabric-attached GPUs, and I really hope that we can create a driver that supports them.

Before DRA, NVIDIA GPUs were used in Kubernetes through the device plugin and the related NVIDIA GPU Operator, so we tested whether they can support dynamically attaching/detaching GPUs. They cannot, because of the following problems:

  • When a new GPU is hot-added to the system, the OS recognizes it as a PCI device correctly, but the device plugin cannot detect it unless the device plugin is restarted.
  • A GPU that is not used by any application pod still cannot be removed from the system safely, because several pods created by the GPU Operator are always accessing it. Those pods need to be stopped before removing a GPU.
  • Every time the nvidia-driver-daemonset pod restarts, nvdrain/drain is executed as part of the driver-manager uninstall_driver process during pod init. This nvdrain/drain affects pods that are already running and using the other GPUs in the same server, so it is not practical to remove a GPU by stopping the nvidia-driver-daemonset pod temporarily and restarting it after the GPU is removed.

These problems stem from the limitations of the device plugin API. Since DRA can do much more than the device plugin, including using fabric-attached devices, I really hope we can find a feasible way to create a DRA driver that supports fabric-attached GPUs and resolves the problems above; a sketch of the kind of rediscovery loop that would help with the first problem follows.
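
To illustrate the first point: a DRA driver owns its own discovery loop, so it can re-enumerate GPUs periodically and publish changes instead of reading the device list once at startup, as the device plugin effectively does. Below is a minimal, hypothetical sketch of such a loop using the github.com/NVIDIA/go-nvml bindings. Note the assumption: whether a hot-plugged GPU actually becomes visible to NVML without reinitialization depends on the driver and platform, so treat this as a sketch of the loop structure, not a statement about current driver behavior.

```go
package main

import (
	"fmt"
	"time"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// currentUUIDs enumerates the GPUs NVML can see right now.
func currentUUIDs() (map[string]bool, error) {
	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		return nil, fmt.Errorf("DeviceGetCount: %s", nvml.ErrorString(ret))
	}
	seen := make(map[string]bool, count)
	for i := 0; i < count; i++ {
		dev, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			return nil, fmt.Errorf("DeviceGetHandleByIndex(%d): %s", i, nvml.ErrorString(ret))
		}
		uuid, ret := dev.GetUUID()
		if ret != nvml.SUCCESS {
			return nil, fmt.Errorf("GetUUID: %s", nvml.ErrorString(ret))
		}
		seen[uuid] = true
	}
	return seen, nil
}

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	known := map[string]bool{}
	for {
		seen, err := currentUUIDs()
		if err != nil {
			fmt.Println("enumeration failed:", err)
		} else {
			for uuid := range seen {
				if !known[uuid] {
					// A real driver would (re)publish its resources here.
					fmt.Println("GPU appeared:", uuid)
				}
			}
			for uuid := range known {
				if !seen[uuid] {
					fmt.Println("GPU disappeared:", uuid)
				}
			}
			known = seen
		}
		time.Sleep(10 * time.Second)
	}
}
```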

ArangoGutierrez added the feature label and removed the feature-request label on Feb 22, 2024
klueska commented Sep 12, 2024

Support for fabric-attached GPUs is coming soon. It will be enabled through a technology known as IMEX, which allows the GPU on one node to securely read/write GPU memory on other nodes within the same "IMEX domain":
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html#group__CUDART__MEMORY__POOLS_1g8158cc4b2c0d2c2c771f9d1af3cf386e

From an end user's perspective, a resource claim will be used to request access to a shared IMEX channel within a given IMEX domain, and all of the pods running on nodes that want to use this channel to read/write each other's GPU memory will need to reference this claim.
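
As a speculative illustration of what "referencing this claim" might look like for the end user (the comment above does not specify the API): a ResourceClaim would be created once per IMEX channel, and every participating pod would reference it by name. The sketch below builds such a pod spec against the core/v1 Go types as of Kubernetes v1.31; the claim name, pod name, and image are made up, and the field shapes differ in older releases (which nested the claim reference under a Source field).

```go
// Package example sketches a pod referencing a shared ResourceClaim.
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// imexPod builds a pod that references a shared, pre-created ResourceClaim.
// "imex-channel-0" is a hypothetical claim name, one per IMEX channel.
func imexPod(name string) *corev1.Pod {
	claimName := "imex-channel-0"
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PodSpec{
			// Every pod that wants to read/write peer GPU memory over the
			// shared IMEX channel references the same claim by name.
			ResourceClaims: []corev1.PodResourceClaim{{
				Name:              "imex-channel",
				ResourceClaimName: &claimName,
			}},
			Containers: []corev1.Container{{
				Name:  "worker",
				Image: "cuda-app:latest", // placeholder image
				Resources: corev1.ResourceRequirements{
					Claims: []corev1.ResourceClaim{{Name: "imex-channel"}},
				},
			}},
		},
	}
}
```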
