⚡️ Speed up function require_mlp_sync by 9%
#466
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
📄 9% (0.09x) speedup for
require_mlp_syncinpython/sglang/srt/utils/common.py⏱️ Runtime :
40.2 microseconds→36.9 microseconds(best of250runs)📝 Explanation and details
The optimization applies short-circuit evaluation to both functions by restructuring boolean expressions as conditional statements, providing a 9% speedup.
Key Changes:
require_gathered_buffer: Changed fromreturn require_mlp_tp_gather(server_args) or require_attn_tp_gather(server_args)to evaluaterequire_attn_tp_gatherfirst and return early if True, only callingrequire_mlp_tp_gatherwhen necessary.require_mlp_sync: Changed fromreturn server_args.enable_dp_attention or require_gathered_buffer(server_args)to check the simple field accessserver_args.enable_dp_attentionfirst and return early if True, avoiding the more expensive nested function calls.Why This is Faster:
server_args.enable_dp_attentionis a simple attribute lookup, whilerequire_gathered_bufferinvolves multiple function calls with assertions and complex logic.Performance Impact in Hot Path:
The function references show
require_mlp_syncis called in critical scheduler event loops (event_loop_normal_disagg_decodeandevent_loop_overlap_disagg_decode) that run continuously during model serving. The optimization is particularly effective for workloads whereenable_dp_attention=True(common in distributed attention scenarios), providing immediate returns and avoiding deeper computational branches.Test Case Analysis:
The optimization shows strongest gains (15-25% faster) when
enable_dp_attention=Falseandrequire_gathered_bufferwould normally be evaluated, and modest improvements whenenable_dp_attention=Truedue to early returns.✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes
git checkout codeflash/optimize-require_mlp_sync-mijrdxzband push.