Hi,

After reading the UCX manuals and related presentations, I am still not sure how `cuda_copy` and `gdr_copy` operate at the hardware level. My current understanding is as follows:
`cuda_ipc` is straightforward (a minimal sketch of what I mean is below):
- Used for send/recv between two GPUs attached to different MPI processes
- Requires P2P capability, i.e. GPUs under the same PCIe root complex
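To make sure I am describing the same mechanism, here is how I imagine `cuda_ipc` boils down to at the API level (my own illustration using the CUDA IPC runtime API, not the actual UCX code; `export_buffer` and `import_and_copy` are just names I made up):

```c
#include <cuda_runtime.h>

/* Process A (exporter): create an IPC handle for its device buffer and send it
 * to process B out of band (e.g. via MPI). */
void export_buffer(void *d_buf, cudaIpcMemHandle_t *handle)
{
    cudaIpcGetMemHandle(handle, d_buf);
}

/* Process B (importer): map A's buffer into its own address space and copy
 * device-to-device over PCIe/NVLink, with no host staging. */
void import_and_copy(cudaIpcMemHandle_t handle, void *d_dst, size_t size)
{
    void *d_remote = NULL;
    cudaIpcOpenMemHandle(&d_remote, handle, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(d_dst, d_remote, size, cudaMemcpyDeviceToDevice);
    cudaIpcCloseMemHandle(d_remote);
}
```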
`cuda_copy` and `gdr_copy` will be used when GPUDirect P2P is not available. For instance:
`cuda_copy` will use `cudaMemcpyDeviceToDevice`, which stages through a host buffer, i.e. gpu buffer -> pinned mem (green) -> host buffer (red) -> QPI -> host buffer (red) -> pinned mem (green) -> gpu buffer. A rough sketch of the staged path I have in mind is below.
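This is roughly the staging I am picturing for `cuda_copy` (a sketch under my own assumptions, not the UCX implementation; the bounce buffer and the `staged_*` helpers are made up for illustration):

```c
#include <cuda_runtime.h>
#include <string.h>

/* Sender side: D2H copy into a pinned bounce buffer, then into the regular
 * host buffer that travels across the CPU interconnect (e.g. QPI). */
void staged_send(const void *d_src, void *host_dst, size_t size)
{
    void *pinned = NULL;
    cudaMallocHost(&pinned, size);                           /* pinned (green) bounce buffer */
    cudaMemcpy(pinned, d_src, size, cudaMemcpyDeviceToHost); /* gpu buffer -> pinned mem */
    memcpy(host_dst, pinned, size);                          /* pinned mem -> host buffer (red) */
    cudaFreeHost(pinned);
}

/* Receiver side: the mirror image, host buffer -> pinned mem -> gpu buffer. */
void staged_recv(void *d_dst, const void *host_src, size_t size)
{
    void *pinned = NULL;
    cudaMallocHost(&pinned, size);
    memcpy(pinned, host_src, size);                          /* host buffer (red) -> pinned mem */
    cudaMemcpy(d_dst, pinned, size, cudaMemcpyHostToDevice); /* pinned mem -> gpu buffer */
    cudaFreeHost(pinned);
}
```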
`gdr_copy` optimizes `cuda_copy` further by removing the internal copy between the pinned buffer and host memory, i.e. gpu buffer -> host mem (red) -> QPI -> host mem (red) -> gpu buffer. Perhaps this improves latency for small messages. My mental model of how this works is sketched below.
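Concretely, I picture `gdr_copy` letting the CPU write directly into device memory through the GPU's BAR mapping, based on the gdrcopy library API (a sketch only; the real UCX `gdr_copy` transport may set this up differently, and I am glossing over GPU page alignment and error handling):

```c
#include <cuda.h>
#include <gdrapi.h>

/* CPU writes host data straight into a cuMemAlloc'd device buffer via the
 * BAR1 mapping, instead of staging through a pinned host bounce buffer.
 * d_buf is assumed to be aligned to GPU_PAGE_SIZE. */
int write_to_gpu_via_bar(CUdeviceptr d_buf, const void *host_src, size_t size)
{
    gdr_t g = gdr_open();
    gdr_mh_t mh;
    void *bar_ptr = NULL;

    gdr_pin_buffer(g, (unsigned long)d_buf, size, 0, 0, &mh); /* pin GPU pages, expose over BAR1 */
    gdr_map(g, mh, &bar_ptr, size);                           /* map BAR region into CPU address space */
    gdr_copy_to_mapping(mh, bar_ptr, host_src, size);         /* plain CPU stores into GPU memory */

    gdr_unmap(g, mh, bar_ptr, size);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    return 0;
}
```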
From NVIDIA's own depiction of gdr_copy:
I am not sure how to interpret the arrow directions:
- `cudaMemcpy`: H2D is depicted with a U-turn arrow, whereas D2H is depicted with a straight arrow.
- `gdr_copy`: H2D is depicted with a straight arrow, whereas D2H is depicted with a U-turn arrow.

Does the U-turn arrow represent an extra copy?
I would much appreciate it if you could point out mistakes in my understanding or share your insights on the matter.

Regards.