Hi,

After reading the UCX manuals and related presentations, I am still not sure how `cuda_copy` and `gdr_copy` operate at the hardware level. My current understanding is as follows:
`cuda_ipc` is straightforward (a minimal sketch of what I mean is below):
- Used for send/recv between two GPUs attached to different MPI processes
- Requires P2P capability, i.e. GPUs under the same PCIe root complex
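To make sure I am describing the same mechanism, here is how I imagine `cuda_ipc` boils down to at the API level (my own illustration using the CUDA IPC runtime API, not the actual UCX code; `export_buffer` and `import_and_copy` are just names I made up):

```c
#include <cuda_runtime.h>

/* Process A (exporter): create an IPC handle for its device buffer and send it
 * to process B out of band (e.g. via MPI). */
void export_buffer(void *d_buf, cudaIpcMemHandle_t *handle)
{
    cudaIpcGetMemHandle(handle, d_buf);
}

/* Process B (importer): map A's buffer into its own address space and copy
 * device-to-device over PCIe/NVLink, with no host staging. */
void import_and_copy(cudaIpcMemHandle_t handle, void *d_dst, size_t size)
{
    void *d_remote = NULL;
    cudaIpcOpenMemHandle(&d_remote, handle, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(d_dst, d_remote, size, cudaMemcpyDeviceToDevice);
    cudaIpcCloseMemHandle(d_remote);
}
```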
`cuda_copy` and `gdr_copy` will be used when GPUDirect P2P is not available. For instance:
`cuda_copy` will use `cudaMemcpyDeviceToDevice`, which stages through a host buffer, i.e. gpu buffer -> pinned mem (green) -> host buffer (red) -> QPI -> host buffer (red) -> pinned mem (green) -> gpu buffer. A rough sketch of the staged path I have in mind is below.
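This is roughly the staging I am picturing for `cuda_copy` (a sketch under my own assumptions, not the UCX implementation; the bounce buffer and the `staged_*` helpers are made up for illustration):

```c
#include <cuda_runtime.h>
#include <string.h>

/* Sender side: D2H copy into a pinned bounce buffer, then into the regular
 * host buffer that travels across the CPU interconnect (e.g. QPI). */
void staged_send(const void *d_src, void *host_dst, size_t size)
{
    void *pinned = NULL;
    cudaMallocHost(&pinned, size);                           /* pinned (green) bounce buffer */
    cudaMemcpy(pinned, d_src, size, cudaMemcpyDeviceToHost); /* gpu buffer -> pinned mem */
    memcpy(host_dst, pinned, size);                          /* pinned mem -> host buffer (red) */
    cudaFreeHost(pinned);
}

/* Receiver side: the mirror image, host buffer -> pinned mem -> gpu buffer. */
void staged_recv(void *d_dst, const void *host_src, size_t size)
{
    void *pinned = NULL;
    cudaMallocHost(&pinned, size);
    memcpy(pinned, host_src, size);                          /* host buffer (red) -> pinned mem */
    cudaMemcpy(d_dst, pinned, size, cudaMemcpyHostToDevice); /* pinned mem -> gpu buffer */
    cudaFreeHost(pinned);
}
```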
`gdr_copy` optimizes `cuda_copy` further by removing the internal copy between the pinned buffer and host memory, i.e. gpu buffer -> host mem (red) -> QPI -> host mem (red) -> gpu buffer. Perhaps this improves latency for small messages. My mental model of how this works is sketched below.
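Concretely, I picture `gdr_copy` letting the CPU write directly into device memory through the GPU's BAR mapping, based on the gdrcopy library API (a sketch only; the real UCX `gdr_copy` transport may set this up differently, and I am glossing over GPU page alignment and error handling):

```c
#include <cuda.h>
#include <gdrapi.h>

/* CPU writes host data straight into a cuMemAlloc'd device buffer via the
 * BAR1 mapping, instead of staging through a pinned host bounce buffer.
 * d_buf is assumed to be aligned to GPU_PAGE_SIZE. */
int write_to_gpu_via_bar(CUdeviceptr d_buf, const void *host_src, size_t size)
{
    gdr_t g = gdr_open();
    gdr_mh_t mh;
    void *bar_ptr = NULL;

    gdr_pin_buffer(g, (unsigned long)d_buf, size, 0, 0, &mh); /* pin GPU pages, expose over BAR1 */
    gdr_map(g, mh, &bar_ptr, size);                           /* map BAR region into CPU address space */
    gdr_copy_to_mapping(mh, bar_ptr, host_src, size);         /* plain CPU stores into GPU memory */

    gdr_unmap(g, mh, bar_ptr, size);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    return 0;
}
```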
From NVIDIA's own depiction of gdr_copy:
I am not sure how to interpret the arrow directions:
- `cudaMemcpy`: H2D is depicted with a U-turn arrow, whereas D2H is depicted with a straight arrow.
- `gdr_copy`: H2D is depicted with a straight arrow, whereas D2H is depicted with a U-turn arrow.

Does the U-turn arrow represent an extra copy?
I would much appreciate it if you could point out mistakes in my understanding or share your insights on the matter.

Regards.