Problem Description
Environment
- GPU: AMD Radeon RX 9060 XT (gfx1200 / RDNA 4 / Navi 44)
- VRAM: 16 GB GDDR6
- OS (Windows): Windows 10 22H2 (Build 19045)
- AMD Driver: Adrenalin Edition 26.2.2
- WSL2: Ubuntu 24.04 LTS
- ROCm: 7.2.1
- librocdxg: built from main branch (v1.1.1)
- PyTorch: 2.9.1+rocm7.2.1
Problem Description
On a discrete RX 9060 XT GPU, GetSegmentId consistently fails for segment type 1
(D3DKMT_QUERYSTATISTICS_SEGMENT_TYPE_MEMORY), which is the local/dedicated VRAM segment.
This causes VramAvail() to return HSA_STATUS_ERROR. PyTorch then reports almost no
free memory, and all model loading fails with "loaded partially; 0.00 MB usable".
The GPU is correctly detected by rocminfo, HSA_ENABLE_DXG_DETECTION=1 is set,
/dev/dxg is present, and torch.cuda.is_available() returns True.
However, torch.cuda.mem_get_info() returns (4096, 16987488256): only 4 KB free
out of roughly 16 GB of total VRAM, which makes it impossible to load any model.
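The symptom can be flagged from Python by comparing the two values that torch.cuda.mem_get_info() returns. The following is a minimal sketch only: the helper name and the 1% threshold are assumptions made for illustration and are not part of PyTorch or ROCm.

```python
def vram_report_looks_broken(free_bytes: int, total_bytes: int,
                             min_free_fraction: float = 0.01) -> bool:
    """Return True when the reported free VRAM is implausibly small.

    Hypothetical helper: the 1% default threshold is an arbitrary
    assumption chosen for illustration.
    """
    if total_bytes <= 0:
        # A non-positive total is itself a sign of a broken report.
        return True
    return free_bytes / total_bytes < min_free_fraction


if __name__ == "__main__":
    # (free, total) in bytes, as reported in this issue.
    free, total = 4096, 16987488256
    print(vram_report_looks_broken(free, total))
```

On a live system the same check could be called as vram_report_looks_broken(*torch.cuda.mem_get_info()).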
Operating System
WSL2 Ubuntu 24.04.4
CPU
Ryzen 9 3900
GPU
RX 9060 XT
ROCm Version
7.2.1.70201-81~24.04
ROCm Component
No response
Steps to Reproduce
Reproduction Steps
- Install ROCm 7.2.1 on WSL2 Ubuntu 24.04 with amdgpu-install --usecase=rocm --no-dkms
- Build and install librocdxg from source with the Windows SDK path
- Set HSA_ENABLE_DXG_DETECTION=1
- Run rocminfo: the GPU is detected correctly (Agent 2, gfx1200)
- Run python -c "import torch; print(torch.cuda.mem_get_info())" → (4096, 16987488256)
- Try loading any model in PyTorch or ComfyUI → loaded partially; 0.00 MB usable, 0.00 MB loaded
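Before running the last two steps, it may help to confirm the two preconditions this report relies on: the HSA_ENABLE_DXG_DETECTION variable and the /dev/dxg device node. The sketch below is a hypothetical helper (its name and messages are not from any ROCm tool); the environment and the path check are passed in so it can be exercised without a GPU.

```python
import os
from typing import Callable, List, Mapping


def check_wsl_rocm_preconditions(
    env: Mapping[str, str],
    path_exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Return a list of problems found; an empty list means both checks pass."""
    problems = []
    # The variable named in the reproduction steps must be exactly "1".
    if env.get("HSA_ENABLE_DXG_DETECTION") != "1":
        problems.append("HSA_ENABLE_DXG_DETECTION is not set to 1")
    # /dev/dxg is the device node this report expects to be present in WSL2.
    if not path_exists("/dev/dxg"):
        problems.append("/dev/dxg is missing")
    return problems


if __name__ == "__main__":
    for problem in check_wsl_rocm_preconditions(os.environ):
        print(problem)
```

An empty result only means the setup matches what is described here; it says nothing about whether the VRAM query itself succeeds.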
Debug Output
Running with HSAKMT_DEBUG_LEVEL=7:
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
WSL environment detected.
HSA System Attributes
Runtime Version: 1.18
Runtime Ext Version: 1.15
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
XNACK enabled: NO
DMAbuf Support: YES
VMM Support: YES
==========
HSA Agents
Agent 1
Name: AMD Ryzen 9 3900 12-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 9 3900 12-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
BDFID: 0
Internal Node ID: 0
Compute Unit: 8
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 28739804(0x1b688dc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 28739804(0x1b688dc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 28739804(0x1b688dc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 28739804(0x1b688dc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
Agent 2
Name: gfx1200
Uuid: GPU-aa3d0bc8364af119
Marketing Name: AMD Radeon RX 9060 XT
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L3: 32768(0x8000) KB
Chip ID: 30096(0x7590)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2700
BDFID: 3072
Internal Node ID: 1
Compute Unit: 32
SIMDs per CU: 2
Shader Engines: 2
Shader Arrs. per Eng.: 2
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 108
SDMA engine uCode:: 0
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16589344(0xfd2220) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-am
Additional Information
[QuerySegmentInfo] Total Segments: 4
[GetSegmentId] Failed to get segment id for type 1
The GPU has 4 segments, but none of them matches
D3DKMT_QUERYSTATISTICS_SEGMENT_TYPE_MEMORY (type 1).
The GetSegmentId call in VramAvail() therefore always fails and returns
HSA_STATUS_ERROR.
rocminfo correctly reports 16 GB VRAM in Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16589344(0xfd2220) KB
Allocatable: TRUE
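As a cross-check, the pool size printed by rocminfo and the total returned by torch.cuda.mem_get_info() appear to describe the same amount of memory. The sketch below assumes that rocminfo's "KB" unit means 1024 bytes; the function and constant names are only illustrative.

```python
def kib_to_bytes(size_kib: int) -> int:
    """Convert a size in units of 1024 bytes to bytes."""
    return size_kib * 1024


# Agent 2, Pool 1 "Size" as printed by rocminfo (assumed to be in 1024-byte units).
ROCMINFO_POOL_SIZE_KB = 16589344
# Second element of torch.cuda.mem_get_info() from this report.
TORCH_TOTAL_BYTES = 16987488256

if __name__ == "__main__":
    print(kib_to_bytes(ROCMINFO_POOL_SIZE_KB) == TORCH_TOTAL_BYTES)
```

If the two values agree, the total-size path is consistent across both tools and only the free-memory query is affected.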
Root Cause Analysis
The issue is in WDDMDevice::VramAvail() in src/wddm/device.cpp:
if (!GetSegmentId(D3DKMT_QUERYSTATISTICS_SEGMENT_TYPE_MEMORY, segmentId))
return HSA_STATUS_ERROR;
GetSegmentId iterates over segment_infos_ (populated by QuerySegmentInfo)
and looks for a segment whose segment_type matches
D3DKMT_QUERYSTATISTICS_SEGMENT_TYPE_MEMORY. However, on the RX 9060 XT (gfx1200),
none of the 4 reported segments has this type — suggesting that ParseAdapterInfo
(in the closed-source libthunk_proxy.a) is not correctly classifying the dedicated
VRAM segment as type MEMORY for this discrete RDNA 4 GPU in WSL2.
As a result:
- VramAvail() always returns HSA_STATUS_ERROR
- PyTorch's mem_get_info() sees only 4 KB free (near-zero sentinel value)
- All model inference fails: GPU compute works fine but memory allocation does not
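The lookup described in the root cause analysis above can be modelled in a few lines. This is a Python sketch of the logic as described in this report, not the actual C++ code in librocdxg: the class, the function names, and the segment type values given to the four segments are hypothetical, since the real values returned on this GPU are not known.

```python
from dataclasses import dataclass
from typing import List, Optional

# Type value that the debug log associates with the VRAM segment.
SEGMENT_TYPE_MEMORY = 1


@dataclass
class SegmentInfo:
    segment_id: int
    segment_type: int


def get_segment_id(segments: List[SegmentInfo], wanted_type: int) -> Optional[int]:
    """Return the id of the first segment with the wanted type, or None."""
    for seg in segments:
        if seg.segment_type == wanted_type:
            return seg.segment_id
    return None


def vram_avail_status(segments: List[SegmentInfo]) -> str:
    """Mirror the early return described above: no MEMORY segment means error."""
    if get_segment_id(segments, SEGMENT_TYPE_MEMORY) is None:
        return "HSA_STATUS_ERROR"
    return "HSA_STATUS_SUCCESS"


if __name__ == "__main__":
    # Four segments, none of type 1. The type value 0 is a placeholder,
    # not the value actually reported on the RX 9060 XT.
    reported = [SegmentInfo(segment_id=i, segment_type=0) for i in range(4)]
    print(vram_avail_status(reported))
```

In this model the error depends only on how the segments are classified, which is why the report points at the classification step rather than at VramAvail() itself.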
Additional Notes
- Small allocations work: torch.zeros(100).cuda() succeeds
- Large allocations fail silently (process hangs with no output)
- IsDgpu() returns true correctly (the GPU is recognized as discrete)
- LocalHeapSize() returns the correct 16 GB value (used by rocminfo)
- The problem is isolated to the segment type classification in ParseAdapterInfo inside libthunk_proxy.a, which cannot be patched without access to its source
This issue appears specific to the discrete RX 9060 XT (gfx1200) in WSL2.
APU-based setups (Strix Halo, etc.) seem to have a different but related memory
mapping issue (#6022 on ROCm/ROCm).
Expected Behavior
torch.cuda.mem_get_info() should return approximately (16_000_000_000, 16_987_488_256),
and model loading should use GPU VRAM as on native Linux or Windows.