
Conversation


@codeflash-ai codeflash-ai bot commented Nov 29, 2025

📄 34% (0.34x) speedup for is_gfx95_supported in python/sglang/srt/utils/common.py

⏱️ Runtime: 963 microseconds → 718 microseconds (best of 59 runs)

📝 Explanation and details

The optimization replaces `any(gfx in gcn_arch for gfx in ["gfx95"])` with a direct substring check `"gfx95" in gcn_arch`, achieving a 34% speedup; a before/after sketch follows the list below.

Key optimization:

  • Eliminated unnecessary iteration: The original code creates a generator that iterates over a single-item list ["gfx95"], then uses any() to evaluate it. The optimized version performs a direct substring search.
  • Reduced function call overhead: Removes the any() function call and generator expression overhead.
  • More efficient string search: Python's in operator for substring checking is highly optimized in C and faster than iterating over a list with one element.
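A minimal before/after sketch of the change, assuming (as the generated tests imply) that the surrounding function gates on `torch.version.hip` and reads `gcnArchName` from device 0; this illustrates the shape of the edit, not the verbatim source:

```python
from functools import lru_cache

import torch


@lru_cache(maxsize=1)
def is_gfx95_supported_before() -> bool:
    # Original shape: iterate a one-element list inside a generator and feed it to any().
    if not torch.version.hip:
        return False
    gcn_arch = torch.cuda.get_device_properties(0).gcnArchName
    return any(gfx in gcn_arch for gfx in ["gfx95"])


@lru_cache(maxsize=1)
def is_gfx95_supported_after() -> bool:
    # Optimized shape: a single C-level substring search, no generator or any() call.
    if not torch.version.hip:
        return False
    gcn_arch = torch.cuda.get_device_properties(0).gcnArchName
    return "gfx95" in gcn_arch
```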

Performance characteristics:

  • Best case scenarios: Test results show 16-78% improvements when the HIP check passes and device properties are accessed, particularly for cases with longer gcnArchName strings where the substring search efficiency matters most.
  • Marginal overhead in some edge cases: A few tests show slight slowdowns (1-16%), likely due to test setup variance, but the overall pattern shows consistent improvement.
  • Cache effectiveness maintained: The `@lru_cache(maxsize=1)` decorator means the full check runs only on the first call, with subsequent calls returning the cached result near-instantaneously (see the small illustration below).
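A small self-contained illustration of that caching behavior; `cached_probe` is a hypothetical stand-in for `is_gfx95_supported`, and `cache_clear()` is the same hook the generated tests use to force re-evaluation between monkeypatched configurations:

```python
from functools import lru_cache

calls = 0


@lru_cache(maxsize=1)
def cached_probe() -> bool:
    # The body runs once; later calls return the cached result.
    global calls
    calls += 1
    return True


cached_probe()               # body executes, calls == 1
cached_probe()               # served from the cache, calls stays at 1
cached_probe.cache_clear()   # what the tests do between configurations
cached_probe()               # body executes again, calls == 2
assert calls == 2
```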

Impact on workloads:
Since this function checks GPU architecture support, it's likely called during initialization or capability detection phases. The 34% improvement reduces latency in GPU setup paths, particularly beneficial in scenarios where multiple architecture checks occur or in environments with frequent re-initialization.
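For context, a hedged sketch of how such a check might be consumed in a start-up path; `select_kernel_backend` and the returned labels are illustrative only, not actual sglang APIs:

```python
from sglang.srt.utils.common import is_gfx95_supported


def select_kernel_backend() -> str:
    # Hypothetical capability-detection step during engine initialization:
    # pick gfx95-specific kernels only when the architecture check passes.
    if is_gfx95_supported():
        return "gfx95-optimized"
    return "generic"
```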

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 2033 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import builtins
import sys
import types
# function to test
from functools import lru_cache

# imports
import pytest
from sglang.srt.utils.common import is_gfx95_supported

# We'll define a minimal mock torch module for testing, since we can't guarantee the presence of HIP or CUDA on the test machine.
class DummyDeviceProps:
    def __init__(self, gcnArchName):
        self.gcnArchName = gcnArchName

class DummyCuda:
    def __init__(self, gcnArchName=None, raise_on_call=False):
        self._gcnArchName = gcnArchName
        self._raise_on_call = raise_on_call

    def get_device_properties(self, idx):
        if self._raise_on_call:
            raise RuntimeError("No CUDA device")
        return DummyDeviceProps(self._gcnArchName)

class DummyTorch:
    def __init__(self, hip=False, gcnArchName=None, raise_on_call=False):
        self.version = type("ver", (), {"hip": hip})()
        self.cuda = DummyCuda(gcnArchName, raise_on_call)

# Helper to monkeypatch sys.modules['torch'] for the duration of a test
class TorchPatcher:
    def __init__(self, dummy_torch):
        self.dummy_torch = dummy_torch
        self.orig_torch = sys.modules.get("torch", None)

    def __enter__(self):
        sys.modules["torch"] = self.dummy_torch

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.orig_torch is not None:
            sys.modules["torch"] = self.orig_torch
        else:
            del sys.modules["torch"]

# Helper to clear lru_cache between tests
def clear_is_gfx95_supported_cache():
    is_gfx95_supported.cache_clear()

# =========================
# BASIC TEST CASES
# =========================

def test_returns_false_when_not_hip():
    """Should return False if torch.version.hip is False, regardless of gcnArchName."""
    clear_is_gfx95_supported_cache()
    dummy = DummyTorch(hip=False, gcnArchName="gfx950")
    with TorchPatcher(dummy):
        codeflash_output = is_gfx95_supported() # 1.15μs -> 1.17μs (1.63% slower)

def test_returns_true_when_hip_and_gfx95_in_arch():
    """Should return True if torch.version.hip is True and gcnArchName contains 'gfx95'."""
    clear_is_gfx95_supported_cache()
    dummy = DummyTorch(hip=True, gcnArchName="gfx950")
    with TorchPatcher(dummy):
        codeflash_output = is_gfx95_supported() # 912ns -> 1.07μs (14.4% slower)

def test_returns_false_when_hip_and_gfx95_not_in_arch():
    """Should return False if torch.version.hip is True and gcnArchName does not contain 'gfx95'."""
    clear_is_gfx95_supported_cache()
    dummy = DummyTorch(hip=True, gcnArchName="gfx90a")
    with TorchPatcher(dummy):
        codeflash_output = is_gfx95_supported() # 887ns -> 932ns (4.83% slower)

def test_returns_true_when_hip_and_gfx95_is_substring():
    """Should return True if 'gfx95' is a substring of gcnArchName (e.g., 'gfx950x')."""
    clear_is_gfx95_supported_cache()
    dummy = DummyTorch(hip=True, gcnArchName="gfx950x")
    with TorchPatcher(dummy):
        codeflash_output = is_gfx95_supported() # 836ns -> 943ns (11.3% slower)

def test_returns_false_when_hip_and_gcnArchName_is_empty():
    """Should return False if torch.version.hip is True and gcnArchName is empty string."""
    clear_is_gfx95_supported_cache()
    dummy = DummyTorch(hip=True, gcnArchName="")
    with TorchPatcher(dummy):
        codeflash_output = is_gfx95_supported() # 850ns -> 935ns (9.09% slower)

# =========================
# EDGE TEST CASES
# =========================

def test_gcnArchName_case_sensitivity():
    """Should be case sensitive: 'GFX95' does not match 'gfx95'."""
    clear_is_gfx95_supported_cache()
    dummy = DummyTorch(hip=True, gcnArchName="GFX950")
    with TorchPatcher(dummy):
        codeflash_output = is_gfx95_supported() # 799ns -> 890ns (10.2% slower)

def test_gcnArchName_is_none():
    """Should not raise error if gcnArchName is None, should return False."""
    clear_is_gfx95_supported_cache()
    dummy = DummyTorch(hip=True, gcnArchName=None)
    with TorchPatcher(dummy):
        try:
            codeflash_output = is_gfx95_supported(); result = codeflash_output
        except Exception as e:
            pass

def test_gcnArchName_is_non_string():
    """Should not raise error if gcnArchName is int, should return False."""
    clear_is_gfx95_supported_cache()
    dummy = DummyTorch(hip=True, gcnArchName=9595)
    with TorchPatcher(dummy):
        try:
            codeflash_output = is_gfx95_supported(); result = codeflash_output
        except Exception as e:
            pass

def test_multiple_gfx95_matches():
    """Should return True if gcnArchName contains multiple 'gfx95' substrings."""
    clear_is_gfx95_supported_cache()
    dummy = DummyTorch(hip=True, gcnArchName="gfx95gfx95")
    with TorchPatcher(dummy):
        codeflash_output = is_gfx95_supported() # 949ns -> 1.13μs (16.0% slower)

# =========================
# LARGE SCALE TEST CASES
# =========================

def test_large_gcnArchName_with_gfx95_at_start():
    """Should return True if large gcnArchName string starts with 'gfx95'."""
    clear_is_gfx95_supported_cache()
    arch = "gfx95" + "x" * 995  # 1000 chars
    dummy = DummyTorch(hip=True, gcnArchName=arch)
    with TorchPatcher(dummy):
        codeflash_output = is_gfx95_supported() # 904ns -> 1.01μs (10.8% slower)

def test_large_gcnArchName_with_gfx95_at_end():
    """Should return True if large gcnArchName string ends with 'gfx95'."""
    clear_is_gfx95_supported_cache()
    arch = "x" * 995 + "gfx95"
    dummy = DummyTorch(hip=True, gcnArchName=arch)
    with TorchPatcher(dummy):
        codeflash_output = is_gfx95_supported() # 887ns -> 877ns (1.14% faster)

def test_large_gcnArchName_without_gfx95():
    """Should return False if large gcnArchName string does not contain 'gfx95'."""
    clear_is_gfx95_supported_cache()
    arch = "x" * 1000
    dummy = DummyTorch(hip=True, gcnArchName=arch)
    with TorchPatcher(dummy):
        codeflash_output = is_gfx95_supported() # 883ns -> 971ns (9.06% slower)

def test_large_number_of_calls_cache():
    """Should always return the cached value after first call, even if torch is monkeypatched."""
    clear_is_gfx95_supported_cache()
    dummy1 = DummyTorch(hip=True, gcnArchName="gfx950")
    dummy2 = DummyTorch(hip=False, gcnArchName="gfx950")
    with TorchPatcher(dummy1):
        codeflash_output = is_gfx95_supported() # 868ns -> 889ns (2.36% slower)
    # Now patch to dummy2, but due to lru_cache, result should still be True
    with TorchPatcher(dummy2):
        codeflash_output = is_gfx95_supported() # 241ns -> 237ns (1.69% faster)
    # Clear cache, now should return False
    clear_is_gfx95_supported_cache()
    with TorchPatcher(dummy2):
        codeflash_output = is_gfx95_supported() # 365ns -> 414ns (11.8% slower)

# =========================
# ADDITIONAL EDGE CASES
# =========================

def test_gcnArchName_is_list():
    """Should not raise error if gcnArchName is a list, should return False."""
    clear_is_gfx95_supported_cache()
    dummy = DummyTorch(hip=True, gcnArchName=["gfx950"])
    with TorchPatcher(dummy):
        try:
            codeflash_output = is_gfx95_supported(); result = codeflash_output
        except Exception as e:
            pass

def test_gcnArchName_is_dict():
    """Should not raise error if gcnArchName is a dict, should return False."""
    clear_is_gfx95_supported_cache()
    dummy = DummyTorch(hip=True, gcnArchName={"gfx": 95})
    with TorchPatcher(dummy):
        try:
            codeflash_output = is_gfx95_supported(); result = codeflash_output
        except Exception as e:
            pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from functools import lru_cache

# imports
import pytest  # used for our unit tests
# We'll need to patch torch and its submodules for controlled testing.
import torch
from sglang.srt.utils.common import is_gfx95_supported

# unit tests
@pytest.mark.basic
def test_no_hip_returns_false(monkeypatch):
    # Basic test: torch.version.hip is False, should return False
    monkeypatch.setattr(torch.version, "hip", False)
    # Clear lru_cache so monkeypatch works
    is_gfx95_supported.cache_clear()
    codeflash_output = is_gfx95_supported() # 964ns -> 825ns (16.8% faster)

@pytest.mark.basic
def test_hip_true_gfx95_in_gcnarch(monkeypatch):
    # Basic test: torch.version.hip is True and gcnArchName contains 'gfx95'
    monkeypatch.setattr(torch.version, "hip", True)

    class DummyProps:
        gcnArchName = "gfx950"

    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: DummyProps())
    is_gfx95_supported.cache_clear()
    codeflash_output = is_gfx95_supported() # 3.20μs -> 1.98μs (62.0% faster)

@pytest.mark.basic
def test_hip_true_gfx95_not_in_gcnarch(monkeypatch):
    # Basic test: torch.version.hip is True and gcnArchName does NOT contain 'gfx95'
    monkeypatch.setattr(torch.version, "hip", True)

    class DummyProps:
        gcnArchName = "gfx906"

    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: DummyProps())
    is_gfx95_supported.cache_clear()
    codeflash_output = is_gfx95_supported() # 2.78μs -> 1.76μs (57.6% faster)

@pytest.mark.edge
def test_hip_true_gcnarch_empty(monkeypatch):
    # Edge case: gcnArchName is empty string
    monkeypatch.setattr(torch.version, "hip", True)

    class DummyProps:
        gcnArchName = ""

    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: DummyProps())
    is_gfx95_supported.cache_clear()
    codeflash_output = is_gfx95_supported() # 2.62μs -> 1.66μs (58.0% faster)

@pytest.mark.edge
def test_hip_true_gcnarch_multiple_gfx95(monkeypatch):
    # Edge case: gcnArchName contains multiple 'gfx95' substrings
    monkeypatch.setattr(torch.version, "hip", True)

    class DummyProps:
        gcnArchName = "gfx950_gfx951_gfx952"

    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: DummyProps())
    is_gfx95_supported.cache_clear()
    codeflash_output = is_gfx95_supported() # 2.83μs -> 1.74μs (62.2% faster)

@pytest.mark.edge
def test_hip_true_gcnarch_case_sensitive(monkeypatch):
    # Edge case: gcnArchName contains 'GFX95' (uppercase), should NOT match
    monkeypatch.setattr(torch.version, "hip", True)

    class DummyProps:
        gcnArchName = "GFX950"

    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: DummyProps())
    is_gfx95_supported.cache_clear()
    codeflash_output = is_gfx95_supported() # 2.67μs -> 1.74μs (53.7% faster)

@pytest.mark.edge
def test_hip_true_gcnarch_partial_match(monkeypatch):
    # Edge case: gcnArchName contains 'gfx9' but not 'gfx95'
    monkeypatch.setattr(torch.version, "hip", True)

    class DummyProps:
        gcnArchName = "gfx9"

    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: DummyProps())
    is_gfx95_supported.cache_clear()
    codeflash_output = is_gfx95_supported() # 2.58μs -> 1.58μs (63.8% faster)

@pytest.mark.edge
def test_hip_true_gcnarch_gfx95_at_end(monkeypatch):
    # Edge case: gcnArchName ends with 'gfx95'
    monkeypatch.setattr(torch.version, "hip", True)

    class DummyProps:
        gcnArchName = "arch_gfx95"

    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: DummyProps())
    is_gfx95_supported.cache_clear()
    codeflash_output = is_gfx95_supported() # 2.96μs -> 1.76μs (68.0% faster)

@pytest.mark.edge
def test_hip_true_gcnarch_gfx95_at_start(monkeypatch):
    # Edge case: gcnArchName starts with 'gfx95'
    monkeypatch.setattr(torch.version, "hip", True)

    class DummyProps:
        gcnArchName = "gfx95_arch"

    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: DummyProps())
    is_gfx95_supported.cache_clear()
    codeflash_output = is_gfx95_supported() # 2.99μs -> 1.67μs (78.7% faster)

@pytest.mark.large_scale
def test_large_scale_many_calls(monkeypatch):
    # Large scale: call the function many times, ensure lru_cache works and result is stable
    monkeypatch.setattr(torch.version, "hip", True)

    class DummyProps:
        gcnArchName = "gfx950"

    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: DummyProps())
    is_gfx95_supported.cache_clear()
    codeflash_output = is_gfx95_supported(); result = codeflash_output # 3.55μs -> 2.31μs (53.6% faster)
    for _ in range(1000):  # Should not exceed 1000 steps
        codeflash_output = is_gfx95_supported() # 117μs -> 115μs (1.49% faster)

@pytest.mark.large_scale
def test_large_scale_varied_gcnarch(monkeypatch):
    # Large scale: test with a list of 1000 different gcnArchName strings
    monkeypatch.setattr(torch.version, "hip", True)
    class DummyProps:
        def __init__(self, name):
            self.gcnArchName = name

    # Prepare 1000 names, only one contains 'gfx95'
    names = ["gfx900"] * 999 + ["gfx95"]
    is_gfx95_supported.cache_clear()
    for i, name in enumerate(names):
        monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx, name=name: DummyProps(name))
        # Clear cache each time to force re-evaluation
        is_gfx95_supported.cache_clear()
        if "gfx95" in name:
            codeflash_output = is_gfx95_supported()
        else:
            codeflash_output = is_gfx95_supported()

@pytest.mark.large_scale
def test_large_scale_long_gcnarch_string(monkeypatch):
    # Large scale: gcnArchName is a very long string (but <100MB)
    monkeypatch.setattr(torch.version, "hip", True)

    long_str = "gfx900_" * 1000 + "gfx95" + "_gfx900" * 1000
    class DummyProps:
        gcnArchName = long_str

    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: DummyProps())
    is_gfx95_supported.cache_clear()
    codeflash_output = is_gfx95_supported() # 5.44μs -> 4.26μs (27.9% faster)

@pytest.mark.edge
def test_torch_version_hip_missing(monkeypatch):
    # Edge case: torch.version.hip attribute does not exist
    monkeypatch.delattr(torch.version, "hip", raising=False)
    is_gfx95_supported.cache_clear()
    # Should raise AttributeError
    try:
        is_gfx95_supported()
    except AttributeError:
        pass

# Additional edge: torch.cuda.get_device_properties returns object without gcnArchName
@pytest.mark.edge
def test_gcnarchname_missing(monkeypatch):
    # Edge case: get_device_properties returns object without gcnArchName
    monkeypatch.setattr(torch.version, "hip", True)
    class DummyProps:
        pass
    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: DummyProps())
    is_gfx95_supported.cache_clear()
    try:
        is_gfx95_supported()
    except AttributeError:
        pass

# Additional edge: torch.cuda.get_device_properties returns None
@pytest.mark.edge
def test_get_device_properties_returns_none(monkeypatch):
    # Edge case: get_device_properties returns None
    monkeypatch.setattr(torch.version, "hip", True)
    monkeypatch.setattr(torch.cuda, "get_device_properties", lambda idx: None)
    is_gfx95_supported.cache_clear()
    try:
        is_gfx95_supported()
    except AttributeError:
        pass

# Additional edge: torch.version.hip is True but torch.cuda.get_device_properties not callable
@pytest.mark.edge
def test_get_device_properties_not_callable(monkeypatch):
    # Edge case: torch.cuda.get_device_properties is not callable
    monkeypatch.setattr(torch.version, "hip", True)
    monkeypatch.setattr(torch.cuda, "get_device_properties", None)
    is_gfx95_supported.cache_clear()
    try:
        is_gfx95_supported()
    except TypeError:
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-is_gfx95_supported-mijuthfe` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 29, 2025 05:32
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 29, 2025