fix(gfx): copy subExecParam from device before reading timing data#323
fix(gfx): copy subExecParam from device before reading timing data#323fcui-amd wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a crash in the GPU GFX executor when collecting per-Transfer/per-iteration timing and CU data on systems where VRAM is not host-accessible (e.g., no large/resizable BAR) by copying SubExecParam back from device before reading fields on the host.
Changes:
- Copy
rss.subExecParamGpuPtrfrom device to host before collecting per-iteration CU IDs in multi-stream mode. - Copy
exeInfo.subExecParamGpufrom device to host before computing per-Transfer timing/CU IDs in single-stream mode.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
GFX executor read startCycle/stopCycle/xccId/hwId directly from subExecParamGpu (device memory allocated via hipMalloc). This assumed VRAM is host-accessible, which fails on GPUs without large/resizable BAR (e.g. PCIe cards), causing SIGSEGV. Now copy params back to host via hipMemcpy before reading in both single-launch and multistream paths. Signed-off-by: Flora Cui <flora.cui@amd.com>
674a070 to
45d6c97
Compare
|
ping? |
|
This was somewhat intentional, to try to avoid explicit SDMA traffic during the main loop. The solution for non-AMD hardware was actually to use [managed memory instead] for the GPU parameter buffer. TransferBench/src/header/TransferBench.hpp Line 4371 in 2bc42cd I think this PR needs to be more targeted, and only be applied to GPUs that require this functionality. |
|
It's amd gpu with large bar disabled. |
| // Copy subExecutor parameters back to host before reading them. | ||
| // subExecParamGpuPtr lives in device memory and is not guaranteed to be host-accessible | ||
| // (e.g. GPUs without large/resizable BAR), so dereferencing it directly can segfault. | ||
| std::vector<SubExecParam> subExecParamHost(numSubExecs); | ||
| ERR_CHECK(hipMemcpy(subExecParamHost.data(), rss.subExecParamGpuPtr, | ||
| numSubExecs * sizeof(SubExecParam), hipMemcpyDefault)); |
There was a problem hiding this comment.
This should be gated only for GPUs that do not have largeBar support, so that it doesn't change the behavior of GPUs that do have largeBar support.
| // Copy subExecutor parameters back to host before reading timing data. | ||
| // subExecParamGpu lives in device memory and is not guaranteed to be host-accessible | ||
| // (e.g. GPUs without large/resizable BAR), so dereferencing it directly can segfault. | ||
| std::vector<SubExecParam> subExecParamHost(exeInfo.totalSubExecs); | ||
| ERR_CHECK(hipMemcpy(subExecParamHost.data(), exeInfo.subExecParamGpu, | ||
| exeInfo.totalSubExecs * sizeof(SubExecParam), hipMemcpyDefault)); | ||
|
|
There was a problem hiding this comment.
Same as above. Gate this copy based on whether or not this GPU has largeBar support.
Motivation
The GFX executor crashes with a SIGSEGV when collecting per-Transfer timing
data on GPUs that do not expose VRAM to the host (e.g. PCIe cards without
large/resizable BAR).
Technical Details
Copy the subExecutor parameters back to host with
hipMemcpy(..., DeviceToHost)before reading timing/CU data, in both affected paths. No behavior change on
hardware where direct access previously worked.
Test Plan
Test Result
Submission Checklist