⚡️ Speed up function _compute_moe_deepseek_blog_decode by 19%
#337
📄 19% (0.19x) speedup for `_compute_moe_deepseek_blog_decode` in `python/sglang/srt/operations_strategy.py`

⏱️ Runtime: 3.21 milliseconds → 2.70 milliseconds (best of 125 runs)

📝 Explanation and details
The optimization achieves an 18% speedup by eliminating repeated attribute lookups during list construction. The key changes are:
What was optimized:

- `layer.self_attn`, `layer.mlp`, and `operations.YieldOperation` are bound to local variables (`self_attn`, `mlp`, `y_op`) before the operations list is constructed.
- Instead of accessing `layer.mlp.op_gate` (and its sibling attributes) multiple times, the code now accesses `mlp.op_gate` after binding `mlp = layer.mlp`.

Why this is faster:

In Python, attribute access involves dictionary lookups and method resolution. By binding `layer.self_attn` and `layer.mlp` to local variables, the code reduces:

- repeated `layer.self_attn.*` lookups (now just `self_attn.*`)
- repeated `layer.mlp.*` lookups (now just `mlp.*`)
- repeated `operations.YieldOperation()` lookups (now just `y_op()`)

Local variable access is significantly faster than attribute access in Python, as it uses direct array indexing rather than dictionary lookups.
Impact on workloads:

Based on the function reference, this optimization is called during both DECODE and TARGET_VERIFY forward modes for MoE DeepSeek layers. Since this function constructs the operations strategy for neural network layer execution, the 18% improvement will benefit both of these forward-mode code paths.
Test case performance:
The optimization shows consistent 12-28% improvements across various test scenarios, with particularly strong gains (45-58%) in error cases where early attribute access fails, suggesting the local binding overhead is minimal compared to the lookup savings.
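The effect of local binding can be demonstrated with a standalone `timeit` microbenchmark (this is an illustrative sketch, not code from this PR; the class and attribute names are invented for the demo):

```python
import timeit

class MLP:
    def __init__(self):
        self.op_gate, self.op_experts = object(), object()

class Layer:
    def __init__(self):
        self.mlp = MLP()

layer = Layer()

def chained():
    # Two attribute hops per element: layer -> mlp, then mlp -> op_*.
    return [layer.mlp.op_gate, layer.mlp.op_experts, layer.mlp.op_gate]

def bound():
    mlp = layer.mlp  # one attribute hop up front, then fast local reads
    return [mlp.op_gate, mlp.op_experts, mlp.op_gate]

# Both build identical lists; only lookup cost differs.
assert chained() == bound()
t_chained = timeit.timeit(chained, number=100_000)
t_bound = timeit.timeit(bound, number=100_000)
print(f"chained: {t_chained:.3f}s, bound: {t_bound:.3f}s")
```

On typical CPython builds the bound version is measurably faster; the exact ratio depends on interpreter version and list length.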
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
```python
import pytest
from sglang.srt.operations_strategy import _compute_moe_deepseek_blog_decode

# Function and dependencies to test
class OperationsStrategy:
    """Minimal stub to simulate the expected behavior and attributes."""
    def __init__(self, deep_gemm_num_sms, tbo_delta_stages, operations):
        self.deep_gemm_num_sms = deep_gemm_num_sms
        self.tbo_delta_stages = tbo_delta_stages
        self.operations = operations

class YieldOperation:
    """Dummy operation to simulate a yield point in the operations list."""
    def __repr__(self):
        return "YieldOperation()"

class DummyOp:
    """Dummy operation to simulate layer operations."""
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return f"DummyOp({self.name})"
    def __eq__(self, other):
        return isinstance(other, DummyOp) and self.name == other.name

# Simulate the operations module
class operations:
    YieldOperation = YieldOperation

from sglang.srt.operations_strategy import _compute_moe_deepseek_blog_decode

# Dummy layer and submodules to simulate input for tests
class DummySelfAttn:
    def __init__(self):
        self.op_prepare = DummyOp("self_attn.op_prepare")
        self.op_core = DummyOp("self_attn.op_core")

class DummyMLP:
    def __init__(self):
        self.op_gate = DummyOp("mlp.op_gate")
        self.op_select_experts = DummyOp("mlp.op_select_experts")
        self.op_dispatch_a = DummyOp("mlp.op_dispatch_a")
        self.op_shared_experts = DummyOp("mlp.op_shared_experts")
        self.op_dispatch_b = DummyOp("mlp.op_dispatch_b")
        self.op_experts = DummyOp("mlp.op_experts")
        self.op_combine_a = DummyOp("mlp.op_combine_a")
        self.op_combine_b = DummyOp("mlp.op_combine_b")
        self.op_output = DummyOp("mlp.op_output")

class DummyLayer:
    def __init__(self):
        self.op_comm_prepare_attn = DummyOp("op_comm_prepare_attn")
        self.self_attn = DummySelfAttn()
        self.op_comm_prepare_mlp = DummyOp("op_comm_prepare_mlp")
        self.mlp = DummyMLP()
        self.op_comm_postprocess_layer = DummyOp("op_comm_postprocess_layer")

# ---- Unit Tests ----

# 1. Basic Test Cases

def test_basic_structure_and_order():
    """Test that the function returns the correct structure and operation order."""
    layer = DummyLayer()
    codeflash_output = _compute_moe_deepseek_blog_decode(layer); strategy = codeflash_output  # 2.50μs -> 2.08μs (20.2% faster)
    # Check correct order of operations
    expected_ops = [
        layer.op_comm_prepare_attn,
        layer.self_attn.op_prepare,
        YieldOperation(),
        layer.self_attn.op_core,
        layer.op_comm_prepare_mlp,
        layer.mlp.op_gate,
        layer.mlp.op_select_experts,
        YieldOperation(),
        layer.mlp.op_dispatch_a,
        layer.mlp.op_shared_experts,
        YieldOperation(),
        layer.mlp.op_dispatch_b,
        layer.mlp.op_experts,
        layer.mlp.op_combine_a,
        YieldOperation(),
        layer.mlp.op_combine_b,
        YieldOperation(),
        layer.mlp.op_output,
        layer.op_comm_postprocess_layer,
    ]
    # Compare reprs to avoid object identity issues for YieldOperation
    actual_repr = [repr(op) for op in strategy.operations]
    expected_repr = [repr(op) for op in expected_ops]

def test_basic_with_different_ops():
    """Test that the function adapts to different layer operation objects."""
    class AltDummySelfAttn:
        def __init__(self):
            self.op_prepare = DummyOp("alt_self_attn.op_prepare")
            self.op_core = DummyOp("alt_self_attn.op_core")
    class AltDummyMLP:
        def __init__(self):
            self.op_gate = DummyOp("alt_mlp.op_gate")
            self.op_select_experts = DummyOp("alt_mlp.op_select_experts")
            self.op_dispatch_a = DummyOp("alt_mlp.op_dispatch_a")
            self.op_shared_experts = DummyOp("alt_mlp.op_shared_experts")
            self.op_dispatch_b = DummyOp("alt_mlp.op_dispatch_b")
            self.op_experts = DummyOp("alt_mlp.op_experts")
            self.op_combine_a = DummyOp("alt_mlp.op_combine_a")
            self.op_combine_b = DummyOp("alt_mlp.op_combine_b")
            self.op_output = DummyOp("alt_mlp.op_output")
    class AltDummyLayer:
        def __init__(self):
            self.op_comm_prepare_attn = DummyOp("alt_op_comm_prepare_attn")
            self.self_attn = AltDummySelfAttn()
            self.op_comm_prepare_mlp = DummyOp("alt_op_comm_prepare_mlp")
            self.mlp = AltDummyMLP()
            self.op_comm_postprocess_layer = DummyOp("alt_op_comm_postprocess_layer")
    layer = AltDummyLayer()
    codeflash_output = _compute_moe_deepseek_blog_decode(layer); strategy = codeflash_output  # 2.57μs -> 2.17μs (18.6% faster)

# 2. Edge Test Cases

def test_missing_self_attn_raises():
    """Test that missing self_attn attribute raises an AttributeError."""
    class BadLayer:
        def __init__(self):
            self.op_comm_prepare_attn = DummyOp("op_comm_prepare_attn")
            self.op_comm_prepare_mlp = DummyOp("op_comm_prepare_mlp")
            self.mlp = DummyMLP()
            self.op_comm_postprocess_layer = DummyOp("op_comm_postprocess_layer")
    layer = BadLayer()
    with pytest.raises(AttributeError):
        _compute_moe_deepseek_blog_decode(layer)  # 1.48μs -> 1.47μs (0.611% faster)

def test_missing_mlp_raises():
    """Test that missing mlp attribute raises an AttributeError."""
    class BadLayer:
        def __init__(self):
            self.op_comm_prepare_attn = DummyOp("op_comm_prepare_attn")
            self.self_attn = DummySelfAttn()
            self.op_comm_prepare_mlp = DummyOp("op_comm_prepare_mlp")
            self.op_comm_postprocess_layer = DummyOp("op_comm_postprocess_layer")
    layer = BadLayer()
    with pytest.raises(AttributeError):
        _compute_moe_deepseek_blog_decode(layer)  # 2.06μs -> 1.41μs (45.9% faster)

def test_missing_op_comm_postprocess_layer_raises():
    """Test that missing op_comm_postprocess_layer attribute raises an AttributeError."""
    class BadLayer:
        def __init__(self):
            self.op_comm_prepare_attn = DummyOp("op_comm_prepare_attn")
            self.self_attn = DummySelfAttn()
            self.op_comm_prepare_mlp = DummyOp("op_comm_prepare_mlp")
            self.mlp = DummyMLP()
    layer = BadLayer()
    with pytest.raises(AttributeError):
        _compute_moe_deepseek_blog_decode(layer)  # 2.96μs -> 2.45μs (20.6% faster)

def test_operations_are_unique_objects():
    """Test that each YieldOperation in the list is a unique object (not the same instance)."""
    layer = DummyLayer()
    codeflash_output = _compute_moe_deepseek_blog_decode(layer); strategy = codeflash_output  # 2.66μs -> 2.37μs (12.1% faster)
    yield_ops = [op for op in strategy.operations if isinstance(op, YieldOperation)]
    ids = [id(op) for op in yield_ops]

def test_layer_with_extra_attributes_is_ignored():
    """Test that extra attributes on the layer do not affect the function."""
    class ExtendedLayer(DummyLayer):
        def __init__(self):
            super().__init__()
            self.extra = "should be ignored"
    layer = ExtendedLayer()
    codeflash_output = _compute_moe_deepseek_blog_decode(layer); strategy = codeflash_output  # 2.59μs -> 2.27μs (14.3% faster)
    # The output should be the same as DummyLayer
    expected_ops = [repr(op) for op in _compute_moe_deepseek_blog_decode(DummyLayer()).operations]  # 1.32μs -> 1.22μs (8.09% faster)
    actual_ops = [repr(op) for op in strategy.operations]

def test_layer_with_none_operations():
    """Test that if any required operation is None, it is included as None in the operations list."""
    class PartialSelfAttn:
        def __init__(self):
            self.op_prepare = None
            self.op_core = DummyOp("self_attn.op_core")
    class PartialMLP:
        def __init__(self):
            self.op_gate = None
            self.op_select_experts = None
            self.op_dispatch_a = DummyOp("mlp.op_dispatch_a")
            self.op_shared_experts = DummyOp("mlp.op_shared_experts")
            self.op_dispatch_b = None
            self.op_experts = DummyOp("mlp.op_experts")
            self.op_combine_a = None
            self.op_combine_b = DummyOp("mlp.op_combine_b")
            self.op_output = None
    class PartialLayer:
        def __init__(self):
            self.op_comm_prepare_attn = None
            self.self_attn = PartialSelfAttn()
            self.op_comm_prepare_mlp = None
            self.mlp = PartialMLP()
            self.op_comm_postprocess_layer = None
    layer = PartialLayer()
    codeflash_output = _compute_moe_deepseek_blog_decode(layer); strategy = codeflash_output  # 2.43μs -> 2.13μs (14.1% faster)

# 3. Large Scale Test Cases

def test_large_number_of_layers():
    """Test that the function can handle 1000 layers in sequence (simulate batch processing)."""
    layers = [DummyLayer() for _ in range(1000)]
    # Collect all strategies and check their first and last operation
    for i, layer in enumerate(layers):
        codeflash_output = _compute_moe_deepseek_blog_decode(layer); strategy = codeflash_output  # 1.03ms -> 871μs (18.6% faster)

def test_operations_list_length_consistency_large():
    """Test that the operations list always has length 19, even for large batches."""
    layers = [DummyLayer() for _ in range(1000)]
    for layer in layers:
        codeflash_output = _compute_moe_deepseek_blog_decode(layer); strategy = codeflash_output  # 1.01ms -> 851μs (18.5% faster)

def test_performance_large_scale(monkeypatch):
    """Test function performance does not degrade unreasonably for 1000 calls."""
    import time
    layers = [DummyLayer() for _ in range(1000)]
    start = time.time()
    for layer in layers:
        _compute_moe_deepseek_blog_decode(layer)  # 1.01ms -> 843μs (19.7% faster)
    elapsed = time.time() - start
```
codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
```python
import pytest
from sglang.srt.operations_strategy import _compute_moe_deepseek_blog_decode

# Dummy classes to simulate the expected interface for testing
class DummyOperation:
    """A dummy operation class for testing identity and ordering."""
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return f"DummyOperation({self.name})"
    def __eq__(self, other):
        return isinstance(other, DummyOperation) and self.name == other.name

class DummyYieldOperation(DummyOperation):
    """A subclass to represent yield operations."""
    def __init__(self):
        super().__init__("YieldOperation")

class DummySelfAttn:
    """Simulates the self_attn attribute with expected operations."""
    def __init__(self):
        self.op_prepare = DummyOperation("self_attn.op_prepare")
        self.op_core = DummyOperation("self_attn.op_core")

class DummyMLP:
    """Simulates the mlp attribute with expected operations."""
    def __init__(self):
        self.op_gate = DummyOperation("mlp.op_gate")
        self.op_select_experts = DummyOperation("mlp.op_select_experts")
        self.op_dispatch_a = DummyOperation("mlp.op_dispatch_a")
        self.op_shared_experts = DummyOperation("mlp.op_shared_experts")
        self.op_dispatch_b = DummyOperation("mlp.op_dispatch_b")
        self.op_experts = DummyOperation("mlp.op_experts")
        self.op_combine_a = DummyOperation("mlp.op_combine_a")
        self.op_combine_b = DummyOperation("mlp.op_combine_b")
        self.op_output = DummyOperation("mlp.op_output")

class DummyLayer:
    """Simulates the layer argument with all required attributes."""
    def __init__(self):
        self.op_comm_prepare_attn = DummyOperation("op_comm_prepare_attn")
        self.self_attn = DummySelfAttn()
        self.op_comm_prepare_mlp = DummyOperation("op_comm_prepare_mlp")
        self.mlp = DummyMLP()
        self.op_comm_postprocess_layer = DummyOperation("op_comm_postprocess_layer")

# Dummy OperationsStrategy for testing
class OperationsStrategy:
    def __init__(self, deep_gemm_num_sms, tbo_delta_stages, operations):
        self.deep_gemm_num_sms = deep_gemm_num_sms
        self.tbo_delta_stages = tbo_delta_stages
        self.operations = operations

# Dummy operations module for testing
class DummyOperationsModule:
    @staticmethod
    def YieldOperation():
        return DummyYieldOperation()

operations = DummyOperationsModule()

from sglang.srt.operations_strategy import _compute_moe_deepseek_blog_decode

# unit tests

# ----------- BASIC TEST CASES -----------

def test_basic_correct_operations_order_and_types():
    """Test that the function returns the expected operation sequence and fields for a normal layer."""
    layer = DummyLayer()
    codeflash_output = _compute_moe_deepseek_blog_decode(layer); result = codeflash_output  # 3.12μs -> 2.44μs (28.2% faster)
    # Check the operations list length and order
    expected_ops = [
        layer.op_comm_prepare_attn,
        layer.self_attn.op_prepare,
        DummyYieldOperation(),
        layer.self_attn.op_core,
        layer.op_comm_prepare_mlp,
        layer.mlp.op_gate,
        layer.mlp.op_select_experts,
        DummyYieldOperation(),
        layer.mlp.op_dispatch_a,
        layer.mlp.op_shared_experts,
        DummyYieldOperation(),
        layer.mlp.op_dispatch_b,
        layer.mlp.op_experts,
        layer.mlp.op_combine_a,
        DummyYieldOperation(),
        layer.mlp.op_combine_b,
        DummyYieldOperation(),
        layer.mlp.op_output,
        layer.op_comm_postprocess_layer,
    ]
    for actual, expected in zip(result.operations, expected_ops):
        pass

def test_basic_yield_operations_are_present():
    """Test that the correct number of yield operations are present in the output."""
    layer = DummyLayer()
    codeflash_output = _compute_moe_deepseek_blog_decode(layer); result = codeflash_output  # 2.12μs -> 1.85μs (14.2% faster)
    yield_count = sum(isinstance(op, DummyYieldOperation) for op in result.operations)

# ----------- EDGE TEST CASES -----------

def test_edge_missing_self_attn_raises():
    """Test that missing self_attn attribute raises an AttributeError."""
    layer = DummyLayer()
    delattr(layer, 'self_attn')
    with pytest.raises(AttributeError):
        _compute_moe_deepseek_blog_decode(layer)  # 1.50μs -> 1.35μs (10.5% faster)

def test_edge_missing_mlp_raises():
    """Test that missing mlp attribute raises an AttributeError."""
    layer = DummyLayer()
    delattr(layer, 'mlp')
    with pytest.raises(AttributeError):
        _compute_moe_deepseek_blog_decode(layer)  # 2.21μs -> 1.40μs (58.3% faster)

def test_edge_missing_op_comm_prepare_attn_raises():
    """Test that missing op_comm_prepare_attn attribute raises an AttributeError."""
    layer = DummyLayer()
    delattr(layer, 'op_comm_prepare_attn')
    with pytest.raises(AttributeError):
        _compute_moe_deepseek_blog_decode(layer)  # 1.46μs -> 1.65μs (11.6% slower)

def test_edge_missing_op_comm_postprocess_layer_raises():
    """Test that missing op_comm_postprocess_layer attribute raises an AttributeError."""
    layer = DummyLayer()
    delattr(layer, 'op_comm_postprocess_layer')
    with pytest.raises(AttributeError):
        _compute_moe_deepseek_blog_decode(layer)  # 2.86μs -> 2.52μs (13.6% faster)

def test_edge_self_attn_missing_op_prepare_raises():
    """Test that missing op_prepare in self_attn raises AttributeError."""
    layer = DummyLayer()
    delattr(layer.self_attn, 'op_prepare')
    with pytest.raises(AttributeError):
        _compute_moe_deepseek_blog_decode(layer)  # 1.45μs -> 1.51μs (3.78% slower)

def test_edge_mlp_missing_op_gate_raises():
    """Test that missing op_gate in mlp raises AttributeError."""
    layer = DummyLayer()
    delattr(layer.mlp, 'op_gate')
    with pytest.raises(AttributeError):
        _compute_moe_deepseek_blog_decode(layer)  # 2.06μs -> 1.98μs (3.89% faster)

def test_edge_operations_are_unique_objects():
    """Test that each operation in the output is a unique object (no accidental aliasing)."""
    layer = DummyLayer()
    codeflash_output = _compute_moe_deepseek_blog_decode(layer); result = codeflash_output  # 2.87μs -> 2.60μs (10.5% faster)
    # All operations should be unique by identity, except yield operations
    non_yield_ops = [op for op in result.operations if not isinstance(op, DummyYieldOperation)]

def test_edge_layer_is_none_raises():
    """Test that passing None as the layer raises an AttributeError."""
    with pytest.raises(AttributeError):
        _compute_moe_deepseek_blog_decode(None)  # 1.40μs -> 1.28μs (9.44% faster)

# ----------- LARGE SCALE TEST CASES -----------

def test_large_scale_many_layers():
    """Test the function's scalability by creating many layers and ensuring correct output."""
    # Create 100 layers and check each result
    for i in range(100):
        layer = DummyLayer()
        codeflash_output = _compute_moe_deepseek_blog_decode(layer); result = codeflash_output  # 103μs -> 87.2μs (18.5% faster)

def test_large_scale_operation_identity():
    """Test that yield operations are different instances (not shared across calls)."""
    layer1 = DummyLayer()
    layer2 = DummyLayer()
    codeflash_output = _compute_moe_deepseek_blog_decode(layer1); result1 = codeflash_output  # 2.24μs -> 1.85μs (20.9% faster)
    codeflash_output = _compute_moe_deepseek_blog_decode(layer2); result2 = codeflash_output  # 1.23μs -> 1.06μs (16.3% faster)
    # For each yield operation, check that they are not the same object
    yield_ops1 = [op for op in result1.operations if isinstance(op, DummyYieldOperation)]
    yield_ops2 = [op for op in result2.operations if isinstance(op, DummyYieldOperation)]
    for op1, op2 in zip(yield_ops1, yield_ops2):
        pass

def test_large_scale_custom_operations_in_layer():
    """Test that custom operation objects in the layer are preserved in the output."""
    layer = DummyLayer()
    # Replace one operation with a custom one
    custom_op = DummyOperation("custom_op_comm_prepare_attn")
    layer.op_comm_prepare_attn = custom_op
    codeflash_output = _compute_moe_deepseek_blog_decode(layer); result = codeflash_output  # 2.10μs -> 1.83μs (14.2% faster)

def test_large_scale_operations_content_integrity():
    """Test that the output operations contain only expected operation types."""
    layer = DummyLayer()
    codeflash_output = _compute_moe_deepseek_blog_decode(layer); result = codeflash_output  # 2.10μs -> 1.77μs (18.4% faster)
    # All items should be DummyOperation or DummyYieldOperation
    for op in result.operations:
        pass
```
codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
To edit these changes, run `git checkout codeflash/optimize-_compute_moe_deepseek_blog_decode-mhtwsuw2` and push.