Conversation


@codeflash-ai codeflash-ai bot commented Nov 29, 2025

📄 62% (0.62x) speedup for mxfp_supported in python/sglang/srt/utils/common.py

⏱️ Runtime : 1.48 milliseconds → 911 microseconds (best of 34 runs)

📝 Explanation and details

The optimization replaces the any() generator expression with a direct string membership check, delivering a 62% speedup by eliminating unnecessary overhead.

Key optimization: Changed any(gfx in gcn_arch for gfx in ["gfx95"]) to "gfx95" in gcn_arch. Since the list contains only one element ("gfx95"), the generator expression and any() function call are redundant overhead.
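For reference, here is a minimal before/after sketch of the change, reconstructed from the behavior exercised by the generated tests below rather than copied from common.py, so the exact source may differ:

```python
import torch

def mxfp_supported_before() -> bool:
    # Reconstructed original: only ROCm/HIP builds whose device 0 reports
    # a gfx95* architecture support MXFP.
    if not torch.version.hip:
        return False
    gcn_arch = torch.cuda.get_device_properties(0).gcnArchName
    return any(gfx in gcn_arch for gfx in ["gfx95"])

def mxfp_supported_after() -> bool:
    if not torch.version.hip:
        return False
    gcn_arch = torch.cuda.get_device_properties(0).gcnArchName
    # The list has a single element, so the generator and any() add only
    # overhead; a direct substring test is equivalent.
    return "gfx95" in gcn_arch
```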

Why this is faster:

  • Eliminates generator overhead: The original creates a generator object and iterator protocol machinery
  • Removes function call overhead: Direct string membership (in) is a highly optimized C operation vs Python's any() function
  • Simplifies execution path: One operation instead of generator creation + iteration + function call
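One way to observe this overhead in isolation (not part of the PR, just an illustrative micro-benchmark) is to time the two expressions directly on a representative architecture string:

```python
import timeit

gcn_arch = "gfx950"  # sample value used throughout the generated tests

generator_form = timeit.timeit(
    'any(gfx in gcn_arch for gfx in ["gfx95"])',
    globals={"gcn_arch": gcn_arch},
    number=1_000_000,
)
direct_form = timeit.timeit(
    '"gfx95" in gcn_arch',
    globals={"gcn_arch": gcn_arch},
    number=1_000_000,
)
print(f"any() + generator: {generator_form:.3f}s  direct 'in': {direct_form:.3f}s")
```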

Performance impact by test case:

  • Small strings (like "gfx950"): 66-85% faster, because the eliminated call overhead dominates total execution time
  • Large strings (1KB+): 22-86% faster; the substring search itself starts to dominate, but the removed overhead still helps
  • Non-HIP platforms: ~18% faster in the benchmark, even though the early return False path is not touched by the change

Real-world impact: The function is called from mxfp4.py during quantization configuration setup. While not in a tight loop, this optimization reduces latency during model initialization, particularly beneficial when checking hardware capabilities multiple times across different model components.

The optimization maintains identical behavior - both implementations check if "gfx95" appears anywhere in the gcnArchName string, but the optimized version does it more efficiently.
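As a quick illustrative sanity check (mirroring the strings the regression tests use, not part of the PR itself), the two expressions agree on representative gcnArchName values:

```python
# Both forms should give the same answer for every arch string the tests exercise.
for arch in ["gfx950", "gfx900", "", "GFX95", "gfx9500", "a" * 995 + "gfx95"]:
    assert any(gfx in arch for gfx in ["gfx95"]) == ("gfx95" in arch)
```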

Correctness verification report:

Test                            Status
⚙️ Existing Unit Tests           🔘 None Found
🌀 Generated Regression Tests    2038 Passed
⏪ Replay Tests                  🔘 None Found
🔎 Concolic Coverage Tests       🔘 None Found
📊 Tests Coverage                100.0%
🌀 Generated Regression Tests and Runtime
from __future__ import annotations

import sys

# imports
import pytest
import torch
import torch._custom_op.impl
import torch.distributed
import torch.library
from sglang.srt.utils.common import mxfp_supported

# unit tests

# --- Basic Test Cases ---

def test_mxfp_supported_returns_bool():
    # Test that the function always returns a boolean value
    codeflash_output = mxfp_supported(); result = codeflash_output # 863ns -> 645ns (33.8% faster)

def test_mxfp_supported_on_non_hip_platform(monkeypatch):
    # Simulate torch.version.hip being None or False (non-HIP platform)
    monkeypatch.setattr(torch.version, "hip", None)
    codeflash_output = mxfp_supported(); result = codeflash_output # 574ns -> 486ns (18.1% faster)

def test_mxfp_supported_on_hip_platform_non_gfx95(monkeypatch):
    # Simulate HIP platform but gcnArchName not containing "gfx95"
    class FakeProps:
        gcnArchName = "gfx900"
    monkeypatch.setattr(torch.version, "hip", "4.5.302")
    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: FakeProps())
    codeflash_output = mxfp_supported(); result = codeflash_output # 2.43μs -> 1.46μs (66.1% faster)

def test_mxfp_supported_on_hip_platform_gfx95(monkeypatch):
    # Simulate HIP platform and gcnArchName containing "gfx95"
    class FakeProps:
        gcnArchName = "gfx950"
    monkeypatch.setattr(torch.version, "hip", "4.5.302")
    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: FakeProps())
    codeflash_output = mxfp_supported(); result = codeflash_output # 2.62μs -> 1.42μs (84.8% faster)

def test_mxfp_supported_on_hip_platform_gfx950_and_gfx951(monkeypatch):
    # Simulate HIP platform and gcnArchName containing multiple gfx95 variants
    class FakeProps:
        gcnArchName = "gfx950 gfx951"
    monkeypatch.setattr(torch.version, "hip", "4.5.302")
    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: FakeProps())
    codeflash_output = mxfp_supported(); result = codeflash_output # 2.54μs -> 1.37μs (85.5% faster)

# --- Edge Test Cases ---

def test_mxfp_supported_on_hip_platform_empty_gcnArchName(monkeypatch):
    # Simulate HIP platform and empty gcnArchName
    class FakeProps:
        gcnArchName = ""
    monkeypatch.setattr(torch.version, "hip", "4.5.302")
    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: FakeProps())
    codeflash_output = mxfp_supported(); result = codeflash_output # 2.23μs -> 1.30μs (71.3% faster)

def test_mxfp_supported_on_hip_platform_gcnArchName_unusual(monkeypatch):
    # Simulate HIP platform and gcnArchName with unusual value
    class FakeProps:
        gcnArchName = "abc123"
    monkeypatch.setattr(torch.version, "hip", "4.5.302")
    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: FakeProps())
    codeflash_output = mxfp_supported(); result = codeflash_output # 2.79μs -> 1.78μs (56.8% faster)

def test_mxfp_supported_on_hip_platform_gcnArchName_substring(monkeypatch):
    # Simulate HIP platform and gcnArchName containing "gfx9500" (substring match)
    class FakeProps:
        gcnArchName = "gfx9500"
    monkeypatch.setattr(torch.version, "hip", "4.5.302")
    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: FakeProps())
    codeflash_output = mxfp_supported(); result = codeflash_output # 2.66μs -> 1.48μs (79.4% faster)

def test_mxfp_supported_on_hip_platform_gcnArchName_case_sensitive(monkeypatch):
    # Simulate HIP platform and gcnArchName containing "GFX95" (case sensitivity)
    class FakeProps:
        gcnArchName = "GFX95"
    monkeypatch.setattr(torch.version, "hip", "4.5.302")
    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: FakeProps())
    codeflash_output = mxfp_supported(); result = codeflash_output # 2.26μs -> 1.48μs (52.9% faster)

def test_mxfp_supported_on_hip_platform_multiple_devices(monkeypatch):
    # Simulate HIP platform and multiple devices (only device 0 is checked)
    class FakeProps:
        gcnArchName = "gfx950"
    monkeypatch.setattr(torch.version, "hip", "4.5.302")
    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: FakeProps())
    codeflash_output = mxfp_supported(); result = codeflash_output # 2.96μs -> 1.89μs (56.4% faster)

def test_mxfp_supported_on_hip_platform_get_device_properties_raises(monkeypatch):
    # Simulate HIP platform and get_device_properties raises an exception
    monkeypatch.setattr(torch.version, "hip", "4.5.302")
    def raise_exc(idx):
        raise RuntimeError("No device found")
    monkeypatch.setattr(torch.cuda, "get_device_properties", raise_exc)
    # The exception from get_device_properties should propagate to the caller.
    with pytest.raises(RuntimeError):
        mxfp_supported()

# --- Large Scale Test Cases ---

def test_mxfp_supported_large_gcnArchName(monkeypatch):
    # Simulate HIP platform and very large gcnArchName string (but <100MB)
    large_str = "gfx95 " * 1000  # ~6KB
    class FakeProps:
        gcnArchName = large_str
    monkeypatch.setattr(torch.version, "hip", "4.5.302")
    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: FakeProps())
    codeflash_output = mxfp_supported(); result = codeflash_output # 2.80μs -> 1.50μs (86.2% faster)

def test_mxfp_supported_large_gcnArchName_no_match(monkeypatch):
    # Simulate HIP platform and very large gcnArchName string with no match
    large_str = "gfx900 " * 1000  # ~6KB
    class FakeProps:
        gcnArchName = large_str
    monkeypatch.setattr(torch.version, "hip", "4.5.302")
    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: FakeProps())
    codeflash_output = mxfp_supported(); result = codeflash_output # 4.57μs -> 3.71μs (22.9% faster)

def test_mxfp_supported_multiple_calls_consistency(monkeypatch):
    # Test that multiple calls with the same environment give the same result
    class FakeProps:
        gcnArchName = "gfx950"
    monkeypatch.setattr(torch.version, "hip", "4.5.302")
    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: FakeProps())
    results = [mxfp_supported() for _ in range(100)] # 2.58μs -> 1.46μs (76.6% faster)

def test_mxfp_supported_multiple_calls_different_environments(monkeypatch):
    # Test that multiple calls with different environments give correct results
    class FakeProps1:
        gcnArchName = "gfx950"
    class FakeProps2:
        gcnArchName = "gfx900"
    monkeypatch.setattr(torch.version, "hip", "4.5.302")
    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: FakeProps1())
    codeflash_output = mxfp_supported() # 2.52μs -> 1.41μs (79.0% faster)
    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: FakeProps2())
    codeflash_output = mxfp_supported() # 1.10μs -> 553ns (98.2% faster)

def test_mxfp_supported_performance_large_scale(monkeypatch):
    # Simulate HIP platform and call mxfp_supported 1000 times with large gcnArchName
    large_str = "gfx95 " * 1000
    class FakeProps:
        gcnArchName = large_str
    monkeypatch.setattr(torch.version, "hip", "4.5.302")
    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: FakeProps())
    for _ in range(1000):
        codeflash_output = mxfp_supported(); result = codeflash_output # 631μs -> 368μs (71.3% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

# ---- Second generated test module ----
from __future__ import annotations

# imports
import pytest  # used for our unit tests
import torch
import torch._custom_op.impl
import torch.distributed
import torch.library
from sglang.srt.utils.common import mxfp_supported

# unit tests

@pytest.mark.parametrize(
    "hip_version,gcn_arch,expected",
    [
        # Basic Test Cases
        # HIP not available, should return False
        (False, None, False),
        # HIP available, but gcnArchName does not contain gfx95, should return False
        (True, "gfx900", False),
        (True, "gfx803", False),
        # HIP available, gcnArchName contains gfx95, should return True
        (True, "gfx950", True),
        (True, "gfx951", True),
        (True, "gfx952", True),
        # HIP available, gcnArchName contains substring but not full, should return True if substring matches
        (True, "gfx950-something", True),
        (True, "something-gfx951", True),
        # HIP available, gcnArchName contains multiple gfx95 substrings
        (True, "gfx950gfx951", True),
        # HIP available, gcnArchName is exactly "gfx95"
        (True, "gfx95", True),
        # Edge Test Cases
        # HIP available, gcnArchName is empty string
        (True, "", False),
        # HIP available, gcnArchName is None (simulate error)
        (True, None, False),
        # HIP available, gcnArchName is not a string
        (True, 9595, False),
        # HIP available, gcnArchName is a string with similar but not matching substring
        (True, "gfx96", False),
        (True, "gfX950", False),  # Case sensitivity
        # Large Scale Test Cases
        # HIP available, gcnArchName is a long string with gfx95 somewhere
        (True, "a" * 500 + "gfx950" + "b" * 500, True),
        # HIP available, gcnArchName is a long string without gfx95
        (True, "a" * 1000, False),
    ]
)
def test_mxfp_supported(monkeypatch, hip_version, gcn_arch, expected):
    """
    Parametrized test for mxfp_supported covering basic, edge, and large scale cases.
    Uses monkeypatch to simulate torch.version.hip and torch.cuda.get_device_properties.
    """

    # Monkeypatch torch.version.hip
    class DummyVersion:
        def __init__(self, hip):
            self.hip = hip

    monkeypatch.setattr(torch, "version", DummyVersion(hip_version))

    # Monkeypatch torch.cuda.get_device_properties if hip_version is True
    if hip_version:
        class DummyDeviceProps:
            def __init__(self, gcnArchName):
                self.gcnArchName = gcnArchName

        def dummy_get_device_properties(idx):
            # idx is ignored, always returns DummyDeviceProps
            return DummyDeviceProps(gcn_arch)

        monkeypatch.setattr(torch.cuda, "get_device_properties", dummy_get_device_properties)
    else:
        # Remove get_device_properties if present, to simulate CUDA not available
        if hasattr(torch.cuda, "get_device_properties"):
            monkeypatch.delattr(torch.cuda, "get_device_properties")

    # Run the function and check the result
    codeflash_output = mxfp_supported(); result = codeflash_output # 40.8μs -> 25.6μs (59.2% faster)

def test_mxfp_supported_no_torch_version(monkeypatch):
    """
    Edge case: torch.version attribute missing.
    Should raise AttributeError.
    """
    monkeypatch.delattr(torch, "version", raising=False)
    with pytest.raises(AttributeError):
        mxfp_supported() # 2.19μs -> 2.22μs (1.31% slower)

def test_mxfp_supported_no_cuda(monkeypatch):
    """
    Edge case: torch.cuda.get_device_properties missing when hip is True.
    Should raise AttributeError.
    """
    class DummyVersion:
        hip = True
    monkeypatch.setattr(torch, "version", DummyVersion())
    # Remove get_device_properties
    if hasattr(torch.cuda, "get_device_properties"):
        monkeypatch.delattr(torch.cuda, "get_device_properties")
    with pytest.raises(AttributeError):
        mxfp_supported() # 2.27μs -> 2.20μs (3.32% faster)

def test_mxfp_supported_gcnArchName_attribute_missing(monkeypatch):
    """
    Edge case: torch.cuda.get_device_properties returns object without gcnArchName.
    Should raise AttributeError.
    """
    class DummyVersion:
        hip = True
    monkeypatch.setattr(torch, "version", DummyVersion())
    class DummyDeviceProps:
        pass  # No gcnArchName
    def dummy_get_device_properties(idx):
        return DummyDeviceProps()
    monkeypatch.setattr(torch.cuda, "get_device_properties", dummy_get_device_properties)
    with pytest.raises(AttributeError):
        mxfp_supported() # 2.08μs -> 2.06μs (1.22% faster)

def test_mxfp_supported_gcnArchName_is_none(monkeypatch):
    """
    Edge case: gcnArchName is None.
    Should return False.
    """
    class DummyVersion:
        hip = True
    monkeypatch.setattr(torch, "version", DummyVersion())
    class DummyDeviceProps:
        gcnArchName = None
    def dummy_get_device_properties(idx):
        return DummyDeviceProps()
    monkeypatch.setattr(torch.cuda, "get_device_properties", dummy_get_device_properties)
    codeflash_output = mxfp_supported()

def test_mxfp_supported_gcnArchName_is_not_str(monkeypatch):
    """
    Edge case: gcnArchName is not a string.
    Should return False.
    """
    class DummyVersion:
        hip = True
    monkeypatch.setattr(torch, "version", DummyVersion())
    class DummyDeviceProps:
        gcnArchName = 9595
    def dummy_get_device_properties(idx):
        return DummyDeviceProps()
    monkeypatch.setattr(torch.cuda, "get_device_properties", dummy_get_device_properties)
    codeflash_output = mxfp_supported()

def test_mxfp_supported_large_scale_many_calls(monkeypatch):
    """
    Large scale: Call mxfp_supported 1000 times with different gcnArchName values.
    Ensures efficiency and determinism.
    """
    class DummyVersion:
        hip = True
    monkeypatch.setattr(torch, "version", DummyVersion())
    class DummyDeviceProps:
        def __init__(self, gcnArchName):
            self.gcnArchName = gcnArchName
    def dummy_get_device_properties(idx):
        # idx used to vary gcnArchName
        if idx % 10 == 0:
            return DummyDeviceProps("gfx950")
        else:
            return DummyDeviceProps("gfx900")
    monkeypatch.setattr(torch.cuda, "get_device_properties", dummy_get_device_properties)
    # Call 1000 times
    for i in range(1000):
        expected = True if i % 10 == 0 else False
        codeflash_output = mxfp_supported(); result = codeflash_output # 756μs -> 483μs (56.6% faster)

def test_mxfp_supported_large_scale_long_gcnArchName(monkeypatch):
    """
    Large scale: gcnArchName is a very long string (1000 chars) containing 'gfx95' at the end.
    Should return True.
    """
    class DummyVersion:
        hip = True
    monkeypatch.setattr(torch, "version", DummyVersion())
    class DummyDeviceProps:
        gcnArchName = "a" * 995 + "gfx95"
    def dummy_get_device_properties(idx):
        return DummyDeviceProps()
    monkeypatch.setattr(torch.cuda, "get_device_properties", dummy_get_device_properties)
    codeflash_output = mxfp_supported() # 2.65μs -> 1.78μs (48.4% faster)

def test_mxfp_supported_large_scale_long_gcnArchName_no_match(monkeypatch):
    """
    Large scale: gcnArchName is a very long string (1000 chars) without 'gfx95'.
    Should return False.
    """
    class DummyVersion:
        hip = True
    monkeypatch.setattr(torch, "version", DummyVersion())
    class DummyDeviceProps:
        gcnArchName = "b" * 1000
    def dummy_get_device_properties(idx):
        return DummyDeviceProps()
    monkeypatch.setattr(torch.cuda, "get_device_properties", dummy_get_device_properties)
    codeflash_output = mxfp_supported() # 2.36μs -> 1.60μs (47.4% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-mxfp_supported-mijuln50` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 29, 2025 05:25
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 29, 2025