[Issue]: [Bug] GetSegmentId fails for type 1 (MEMORY) on discrete RX 9060 XT (gfx1200) in WSL2 — VramAvail returns HSA_STATUS_ERROR, PyTorch sees 0 bytes free #22

@Hyroshima

Problem Description

Environment

  • GPU: AMD Radeon RX 9060 XT (gfx1200 / RDNA 4 / Navi 44)
  • VRAM: 16 GB GDDR6
  • OS (Windows): Windows 10 22H2 (Build 19045)
  • AMD Driver: Adrenalin Edition 26.2.2
  • WSL2: Ubuntu 24.04 LTS
  • ROCm: 7.2.1
  • librocdxg: built from main branch (v1.1.1)
  • PyTorch: 2.9.1+rocm7.2.1

Problem Description

On a discrete RX 9060 XT GPU, GetSegmentId consistently fails for segment type 1
(D3DKMT_QUERYSTATISTICS_SEGMENT_TYPE_MEMORY), the local/dedicated VRAM segment.
As a result, VramAvail() returns HSA_STATUS_ERROR and PyTorch sees essentially no
free VRAM, so every model load fails with "loaded partially; 0.00 MB usable".

The GPU is correctly detected by rocminfo, HSA_ENABLE_DXG_DETECTION=1 is set,
/dev/dxg is present, and torch.cuda.is_available() returns True.
However, torch.cuda.mem_get_info() returns (4096, 16987488256) — only 4 KB free
despite 16 GB total VRAM, making it impossible to load any model.
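
To make the symptom easier to spot in logs, here is a minimal sketch that formats the (free, total) pair and flags an implausibly small free value. The helper names and the 1% threshold are assumptions chosen for illustration; they are not part of PyTorch or librocdxg.

```python
GIB = 1024 ** 3


def summarize_mem_info(free_bytes: int, total_bytes: int) -> str:
    """Format a (free, total) pair such as the one torch.cuda.mem_get_info() returns."""
    fraction = free_bytes / total_bytes if total_bytes else 0.0
    return (
        f"free={free_bytes} B ({free_bytes / GIB:.2f} GiB), "
        f"total={total_bytes} B ({total_bytes / GIB:.2f} GiB), "
        f"free fraction={fraction:.6%}"
    )


def free_memory_implausible(free_bytes: int, total_bytes: int,
                            min_fraction: float = 0.01) -> bool:
    """Hypothetical heuristic: flag a device that reports under 1% of its VRAM as free."""
    return total_bytes > 0 and free_bytes < total_bytes * min_fraction


if __name__ == "__main__":
    # Values reported in this issue. On a live system they would come from:
    #   import torch; free, total = torch.cuda.mem_get_info()
    free, total = 4096, 16987488256
    print(summarize_mem_info(free, total))
    print("implausible:", free_memory_implausible(free, total))
```

With the values from this issue the heuristic fires, while the expected pair of roughly (16_000_000_000, 16_987_488_256) would pass.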

Operating System

WSL2 Ubuntu 24.04.4

CPU

Ryzen 9 3900

GPU

RX 9060 XT

ROCm Version

7.2.1.70201-81~24.04

ROCm Component

No response

Steps to Reproduce

Reproduction Steps

  1. Install ROCm 7.2.1 on WSL2 Ubuntu 24.04 with amdgpu-install --usecase=rocm --no-dkms
  2. Build and install librocdxg from source with Windows SDK path
  3. Set HSA_ENABLE_DXG_DETECTION=1
  4. Run rocminfo — GPU is detected correctly (Agent 2, gfx1200)
  5. Run python -c "import torch; print(torch.cuda.mem_get_info())" → (4096, 16987488256)
  6. Try loading any model in PyTorch or ComfyUI → loaded partially; 0.00 MB usable, 0.00 MB loaded
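
Before blaming the segment lookup, it can help to confirm the two preconditions from the steps above (the environment variable and the /dev/dxg device node). The following is a small sketch; the function name and its injectable parameters are assumptions made so the check can be exercised without a WSL2 system.

```python
import os


def check_prerequisites(env=None, path_exists=os.path.exists):
    """Return a list of problems with the WSL2 ROCm setup described in this issue.

    `env` and `path_exists` are injectable so the check can run off-target;
    by default the real process environment and filesystem are used.
    """
    env = os.environ if env is None else env
    problems = []
    if env.get("HSA_ENABLE_DXG_DETECTION") != "1":
        problems.append("HSA_ENABLE_DXG_DETECTION is not set to 1")
    if not path_exists("/dev/dxg"):
        problems.append("/dev/dxg is missing")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    print("prerequisites OK" if not issues else "\n".join(issues))
```

In the reported setup both checks pass, which is what points the investigation at VramAvail() rather than at device detection.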

Debug Output

Running with HSAKMT_DEBUG_LEVEL=7 (the relevant lines are quoted under Additional Information below):

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

WSL environment detected.

HSA System Attributes

Runtime Version: 1.18
Runtime Ext Version: 1.15
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
XNACK enabled: NO
DMAbuf Support: YES
VMM Support: YES

==========
HSA Agents


Agent 1


Name: AMD Ryzen 9 3900 12-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 9 3900 12-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
BDFID: 0
Internal Node ID: 0
Compute Unit: 8
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 28739804(0x1b688dc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 28739804(0x1b688dc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 28739804(0x1b688dc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 28739804(0x1b688dc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: gfx1200
Uuid: GPU-aa3d0bc8364af119
Marketing Name: AMD Radeon RX 9060 XT
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L3: 32768(0x8000) KB
Chip ID: 30096(0x7590)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2700
BDFID: 3072
Internal Node ID: 1
Compute Unit: 32
SIMDs per CU: 2
Shader Engines: 2
Shader Arrs. per Eng.: 2
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 108
SDMA engine uCode:: 0
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16589344(0xfd2220) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-am

Additional Information

[QuerySegmentInfo] Total Segments: 4
[GetSegmentId] Failed to get segment id for type 1

The GPU has 4 segments, but none of them matches
D3DKMT_QUERYSTATISTICS_SEGMENT_TYPE_MEMORY (type 1).
The GetSegmentId call in VramAvail() therefore always fails and returns
HSA_STATUS_ERROR.

rocminfo correctly reports 16 GB VRAM in Pool Info:

Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16589344(0xfd2220) KB
Allocatable: TRUE
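
As a quick consistency check, the pool size above can be compared against the total that torch.cuda.mem_get_info() reports. The sketch assumes that rocminfo's "KB" means 1024 bytes; the helper name is made up for this example.

```python
KIB = 1024  # assumption: rocminfo's "KB" unit is 1024 bytes

ROCMINFO_POOL1_SIZE_KIB = 16589344   # "Size" of GPU Pool 1 in the rocminfo output
TORCH_TOTAL_BYTES = 16987488256      # second element of torch.cuda.mem_get_info()


def kib_to_bytes(size_kib: int) -> int:
    """Convert a size in KiB to bytes."""
    return size_kib * KIB


if __name__ == "__main__":
    pool_bytes = kib_to_bytes(ROCMINFO_POOL1_SIZE_KIB)
    print(pool_bytes, pool_bytes == TORCH_TOTAL_BYTES)
```

The two figures agree exactly, so the total size is propagated correctly and only the free-memory query is affected.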


Root Cause Analysis

The issue is in WDDMDevice::VramAvail() in src/wddm/device.cpp:

if (!GetSegmentId(D3DKMT_QUERYSTATISTICS_SEGMENT_TYPE_MEMORY, segmentId))
    return HSA_STATUS_ERROR;

GetSegmentId iterates over segment_infos_ (populated by QuerySegmentInfo)
and looks for a segment whose segment_type matches
D3DKMT_QUERYSTATISTICS_SEGMENT_TYPE_MEMORY. However, on the RX 9060 XT (gfx1200),
none of the 4 reported segments has this type — suggesting that ParseAdapterInfo
(in the closed-source libthunk_proxy.a) is not correctly classifying the dedicated
VRAM segment as type MEMORY for this discrete RDNA 4 GPU in WSL2.
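
The lookup described above can be modelled in a few lines. This is a Python sketch of the behaviour as described in this issue, not the actual librocdxg code: the SegmentInfo class, the return convention, and the segment types in the example list are hypothetical (the issue only establishes that there are 4 segments and none has type 1).

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Value stated in this issue for D3DKMT_QUERYSTATISTICS_SEGMENT_TYPE_MEMORY.
SEGMENT_TYPE_MEMORY = 1


@dataclass
class SegmentInfo:
    """Hypothetical stand-in for one entry of segment_infos_."""
    segment_id: int
    segment_type: int


def get_segment_id(segment_infos: Sequence[SegmentInfo],
                   wanted_type: int) -> Optional[int]:
    """Return the id of the first segment with the wanted type, or None if absent."""
    for info in segment_infos:
        if info.segment_type == wanted_type:
            return info.segment_id
    return None


if __name__ == "__main__":
    # Hypothetical classification: four segments, none typed as MEMORY.
    reported = [SegmentInfo(0, 0), SegmentInfo(1, 2),
                SegmentInfo(2, 2), SegmentInfo(3, 0)]
    print(get_segment_id(reported, SEGMENT_TYPE_MEMORY))
```

When no entry carries type 1 the lookup has nothing to return, which corresponds to the path where VramAvail() returns HSA_STATUS_ERROR.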

As a result:

  • VramAvail() always returns HSA_STATUS_ERROR
  • PyTorch's mem_get_info() reports only 4096 bytes (4 KB) free, which looks like a near-zero fallback value
  • Model loading fails: GPU compute works, but large memory allocations do not

Additional Notes

  • Small allocations work: torch.zeros(100).cuda() succeeds
  • Large allocations fail silently (process hangs with no output)
  • IsDgpu() returns true correctly (the GPU is recognized as discrete)
  • LocalHeapSize() returns the correct 16 GB value (used by rocminfo)
  • The problem is isolated to the segment type classification in ParseAdapterInfo
    inside libthunk_proxy.a, which cannot be patched without access to its source

This issue appears specific to the discrete RX 9060 XT (gfx1200) in WSL2.
APU-based setups (Strix Halo, etc.) seem to have a different but related memory
mapping issue (#6022 on ROCm/ROCm).


Expected Behavior

torch.cuda.mem_get_info() should return approximately (16_000_000_000, 16_987_488_256),
and model loading should use GPU VRAM as on native Linux or Windows.
