
Perf weight loading#23

Open
visz11 wants to merge 5 commits into main from perf-weight-loading

Conversation


@visz11 visz11 commented Apr 13, 2026

CodeAnt-AI Description

Skip loading expert weights that belong to other expert-parallel ranks

What Changed

  • During model load, non-local expert weights are now skipped before they are read from disk (a rough sketch of the skip check follows this list)
  • Dense weights, shared expert weights, and fused expert tensors still load normally
  • The loader now supports both contiguous and round-robin expert placement when deciding which experts belong to each rank
  • Added coverage for expert ID parsing, expert sharding, and filtered safetensor loading
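
To make the skip check concrete, here is a small standalone sketch of the idea. The function names mirror the new ep_weight_filter.py helpers, but the regex and details here are illustrative rather than copied from the PR:

    import re

    # Illustrative pattern: matches the ".experts.<id>." segment in routed expert
    # weight names such as "model.layers.0.mlp.experts.42.gate_proj.weight".
    _EXPERT_ID_RE = re.compile(r"\.experts\.(\d+)\.")

    def parse_expert_id(weight_name: str) -> int | None:
        m = _EXPERT_ID_RE.search(weight_name)
        return int(m.group(1)) if m else None

    def should_skip_weight(weight_name: str, local_expert_ids: set[int] | None) -> bool:
        # Dense, shared-expert, and fused tensors carry no per-expert id and are
        # never skipped; only routed experts owned by another rank are filtered.
        if local_expert_ids is None:
            return False
        expert_id = parse_expert_id(weight_name)
        return expert_id is not None and expert_id not in local_expert_ids

    assert should_skip_weight("model.layers.0.mlp.experts.42.gate_proj.weight", {0, 1})
    assert not should_skip_weight("model.layers.0.self_attn.q_proj.weight", {0, 1})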

Impact

✅ Lower model load time for MoE checkpoints
✅ Less disk and network I/O during expert-parallel startup
✅ Fewer wasted reads on non-local expert weights

💡 Usage Guide

Checking Your Pull Request

Every time you open a pull request, our system automatically reviews it. We check for security issues, infrastructure misconfiguration, and common code problems to make sure your changes are solid and won't cause trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

Summary by CodeRabbit

  • New Features

    • Added Expert Parallelism weight filtering for Mixture-of-Experts models, enabling efficient distributed loading by automatically skipping non-local expert weights during model initialization.
  • Tests

    • Added comprehensive test suite for Expert Parallelism weight filtering, covering expert identification, distributed expert assignment, and weight selection logic.

esmeetu added 5 commits March 16, 2026 10:16
Signed-off-by: esmeetu <jasonailu87@gmail.com> (all 5 commits)

codeant-ai Bot commented Apr 13, 2026

CodeAnt AI is reviewing your PR.


Thanks for using CodeAnt! 🎉

We're free for open-source projects. If you're enjoying it, help us grow by sharing.



coderabbitai Bot commented Apr 13, 2026

📝 Walkthrough

Walkthrough

This pull request adds expert parallelism (EP) weight filtering functionality to vLLM's model loader. It introduces a new utility module that filters non-local expert tensors during model loading, integrates the filtering into the default model loader's weight iteration path, and provides comprehensive tests for the filtering logic across various expert configurations and weight naming patterns.

Changes

Cohort / File(s) — Summary

  • Expert Parallelism Weight Filter Utilities (vllm/model_executor/model_loader/ep_weight_filter.py): New module providing three core functions: parse_expert_id extracts numeric expert identifiers from weight names, compute_local_expert_ids partitions global experts across ranks using linear or round-robin strategies, and should_skip_weight determines whether to filter tensors based on local expert ownership. (A sketch of the partitioning follows below.)
  • Weight Iterator Integration (vllm/model_executor/model_loader/weight_utils.py): Updated safetensors_weights_iterator to accept an optional local_expert_ids parameter; filters expert tensors during iteration across eager, torchao, and default load modes by consulting should_skip_weight before yielding weights.
  • Model Loader Integration (vllm/model_executor/model_loader/default_loader.py): Added a _init_ep_weight_filter method that computes local expert IDs when MoE and expert parallelism are enabled; integrated the initialization into the load_weights flow, passes the computed IDs to the safetensors iterator, and logs expert distribution counts.
  • Comprehensive Test Coverage (tests/model_executor/model_loader/test_ep_weight_filter.py): New test module with unit tests for expert ID parsing (routed/quantized tensors, boundary cases, non-expert patterns), local expert ID computation (various partition strategies and expert counts), weight skipping logic (dense vs. expert tensors, shared experts), and integration tests for safetensors filtering with synthetic and real-world models.
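
A minimal sketch of the partitioning described above, assuming an even expert split; the real compute_local_expert_ids also handles uneven counts and validation, so treat this as an illustration only:

    def compute_local_expert_ids_sketch(
        num_experts: int, ep_size: int, ep_rank: int, placement: str = "linear"
    ) -> set[int] | None:
        if ep_size <= 1:
            # No expert parallelism: nothing to filter, load everything.
            return None
        if placement == "linear":
            # Contiguous block per rank, e.g. rank 0 owns experts 0..n-1.
            per_rank = num_experts // ep_size
            start = ep_rank * per_rank
            return set(range(start, start + per_rank))
        if placement == "round_robin":
            # Interleaved assignment, e.g. rank 0 owns 0, ep_size, 2*ep_size, ...
            return set(range(ep_rank, num_experts, ep_size))
        raise ValueError(f"Unknown placement strategy: {placement!r}")

    assert compute_local_expert_ids_sketch(8, ep_size=2, ep_rank=0) == {0, 1, 2, 3}
    assert compute_local_expert_ids_sketch(8, ep_size=2, ep_rank=1, placement="round_robin") == {1, 3, 5, 7}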

Sequence Diagram

sequenceDiagram
    participant DML as DefaultModelLoader
    participant EPWF as EP Weight Filter
    participant SWI as Safetensors<br/>Weights Iterator
    participant Dist as Distributed<br/>System
    
    DML->>Dist: Query rank & group info
    Dist-->>DML: ep_rank, ep_size
    DML->>EPWF: compute_local_expert_ids(num_experts,<br/>ep_size, ep_rank)
    EPWF-->>DML: set[local_expert_ids]
    DML->>SWI: safetensors_weights_iterator(...,<br/>local_expert_ids=...)
    loop For each weight in safetensors
        SWI->>EPWF: should_skip_weight(weight_name,<br/>local_expert_ids)
        EPWF->>EPWF: parse_expert_id(weight_name)
        alt Expert tensor & not local
            EPWF-->>SWI: skip=True
            SWI->>SWI: omit from iteration
        else Local expert or dense weight
            EPWF-->>SWI: skip=False
            SWI->>DML: yield (weight_name, tensor)
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Expert weights now filter with care,
Each rank loads only its local share,
Linear or round-robin, partitioned just right,
Mixture of Experts shines bright!

🚥 Pre-merge checks | ✅ 1 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 23.91%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check — ❓ Inconclusive: The PR title 'Perf weight loading' is vague and does not convey the specific technical changes. While the PR implements expert parallelism (EP) weight filtering for optimized model loading, the title uses non-descriptive terms that don't clearly communicate the main contribution. Resolution: replace it with a more specific title, such as 'Add expert parallelism weight filtering for optimized model loading' or 'Implement EP weight filter to skip non-local expert tensors'.

✅ Passed checks (1 passed)

  • Description Check — ✅ Passed: Check skipped - CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

codeant-ai Bot added the size:XL label (This PR changes 500-999 lines, ignoring generated files) on Apr 13, 2026

refacto-visz Bot commented Apr 13, 2026

Refacto PR Summary

Introduces EP (Expert Parallelism) weight filtering during model loading to eliminate redundant I/O for MoE models. Each rank now skips non-local expert tensors before reading from disk, reducing storage I/O by ~85-90% for large MoE models under EP deployments. The implementation adds a dedicated ep_weight_filter.py module with regex-based expert ID parsing, expert assignment computation (linear/round-robin), and integrates filtering into safetensors_weights_iterator across all three load strategies (eager, torchao, lazy).

Key Changes:

  • New ep_weight_filter.py provides parse_expert_id, compute_local_expert_ids, and should_skip_weight — pure utility functions with no side effects
  • safetensors_weights_iterator gains optional local_expert_ids: set[int] | None parameter; filtering applied pre-read across eager/torchao/lazy strategies
  • DefaultModelLoader._init_ep_weight_filter() computes EP rank/size from parallel_config (dp × pcp × tp) and populates self.local_expert_ids before weight loading begins (a worked example follows this list)
  • Supports both linear (contiguous) and round_robin (interleaved) expert placement strategies matching FusedMoEParallelConfig.make() logic
  • 3D fused-expert tensors (no numeric ID in name) are intentionally passed through unfiltered for later slicing by weight_loader
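
For example (illustrative numbers only): with data_parallel_size=2, prefill_context_parallel_size=1, and tensor_parallel_size=4, this yields ep_size = 2 × 1 × 4 = 8, and the worker with dp_rank=1, pcp_rank=0, tp_rank=2 gets ep_rank = 1 × (1 × 4) + 0 × 4 + 2 = 6, per the formula used in _init_ep_weight_filter.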

Change Highlights

Click to expand
  • vllm/model_executor/model_loader/ep_weight_filter.py: New module — regex expert ID extraction, local expert set computation, skip predicate
  • vllm/model_executor/model_loader/weight_utils.py: safetensors_weights_iterator extended with local_expert_ids param; filtering injected into all 3 load paths
  • vllm/model_executor/model_loader/default_loader.py: _init_ep_weight_filter() added; EP rank/size derived from parallel config; local_expert_ids threaded into weight iterator call
  • tests/model_executor/model_loader/test_ep_weight_filter.py: Full unit + integration test suite covering parse, compute, skip logic, and synthetic MoE safetensors filtering

Sequence Diagram

sequenceDiagram
    participant DL as DefaultModelLoader
    participant EF as ep_weight_filter
    participant PC as ParallelConfig
    participant WU as safetensors_weights_iterator
    participant DISK as Safetensors Files

    DL->>DL: load_weights(model, model_config)
    DL->>DL: _init_ep_weight_filter(model_config)
    DL->>PC: get ep_size, ep_rank (dp×pcp×tp)
    DL->>EF: compute_local_expert_ids(num_experts, ep_size, ep_rank, placement)
    EF-->>DL: set[int] local_expert_ids (or None if ep_size≤1)
    DL->>WU: get_all_weights(..., local_expert_ids=self.local_expert_ids)
    loop Each safetensors shard
        WU->>DISK: open shard, iterate tensor names
        WU->>EF: should_skip_weight(name, local_expert_ids)
        EF->>EF: parse_expert_id(name) via regex
        alt expert id not in local_expert_ids
            EF-->>WU: True → skip (no disk read)
        else dense / shared / local expert
            EF-->>WU: False → read tensor
            WU-->>DL: yield (name, tensor)
        end
    end

Testing Guide

Click to expand
  1. Unit — parse_expert_id: Verify parse_expert_id("model.layers.0.mlp.experts.42.gate_proj.weight") == 42; confirm None for attention, layernorm, shared experts, and 3D fused tensors like experts.gate_proj.weight
  2. Unit — compute_local_expert_ids: With ep_size=1 confirm None returned; with 384 experts / EP=8 confirm each rank gets exactly 48 experts, union covers range(384), sets are disjoint; repeat for round_robin placement
  3. Unit — should_skip_weight: With local_ids = {0..47}, confirm expert 10 not skipped, expert 200 skipped, expert 47 not skipped, expert 48 skipped; confirm dense/shared/3D-fused weights never skipped regardless of local_ids (a pytest sketch of checks 2-3 follows this list)
  4. Integration — synthetic MoE safetensors: Create 8-expert safetensors file; load with EP=2 rank=0 (local_ids={0,1,2,3}); assert exactly 12 expert tensors loaded, all dense/shared weights present, tensor values match unfiltered load
  5. Edge case — no MoE model: Load GPT-2 safetensors with local_expert_ids=set(); assert all weights returned (no expert names match regex, nothing filtered)
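
A pytest-style sketch of checks 2 and 3 above, assuming the module path and function names introduced by this PR (the real tests may structure this differently):

    from vllm.model_executor.model_loader.ep_weight_filter import (
        compute_local_expert_ids,
        should_skip_weight,
    )

    def test_linear_partition_is_disjoint_and_complete():
        per_rank = [compute_local_expert_ids(384, ep_size=8, ep_rank=r) for r in range(8)]
        assert all(len(ids) == 48 for ids in per_rank)
        assert set().union(*per_rank) == set(range(384))

    def test_boundary_experts_respected():
        local_ids = set(range(48))  # rank 0 under EP=8 with 384 experts
        assert not should_skip_weight("model.layers.0.mlp.experts.47.up_proj.weight", local_ids)
        assert should_skip_weight("model.layers.0.mlp.experts.48.up_proj.weight", local_ids)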


refacto-visz Bot commented Apr 13, 2026

Refacto is reviewing this PR. Please wait for the review comments to be posted.

Comment on lines +341 to +346
        self.local_expert_ids = compute_local_expert_ids(
            num_experts,
            ep_size,
            ep_rank,
            placement=parallel_config.expert_placement_strategy,
        )

🟠 Architect Review — HIGH

EP weight filtering derives local expert IDs from parallel_config.expert_placement_strategy, but FusedMoE may downgrade "round_robin" to "linear" at runtime via determine_expert_placement_strategy. In configurations where this fallback occurs, the loader computes local_expert_ids assuming round_robin while the MoE layer actually uses linear placement, so required local expert weights for the true placement can be skipped during loading.

Suggestion: Compute local_expert_ids using the same resolved placement logic as FusedMoE (e.g., by applying determine_expert_placement_strategy to the parallel config before calling compute_local_expert_ids, or by sharing a single source of truth for the final expert placement strategy) so the loader's expert set always matches the runtime expert map.
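
A sketch of that suggestion inside _init_ep_weight_filter, assuming determine_expert_placement_strategy is importable there and accepts the parallel config (its exact location and signature are not verified here):

        # Resolve the placement the same way FusedMoE does at runtime, so the
        # loader's expert set matches the expert map actually used by the layer.
        resolved_placement = determine_expert_placement_strategy(parallel_config)
        self.local_expert_ids = compute_local_expert_ids(
            num_experts,
            ep_size,
            ep_rank,
            placement=resolved_placement,
        )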

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements expert parallelism (EP) weight filtering to optimize model loading by skipping non-local expert weights. This change aims to significantly reduce storage I/O and memory overhead for Mixture-of-Experts (MoE) models. The implementation includes a new filtering utility, integration into the default model loader, and comprehensive tests. Review feedback highlighted that the 'eager' loading strategy still reads entire files into memory before filtering, which limits the I/O reduction benefits for that specific loading path.

Comment on lines 746 to +751
         if safetensors_load_strategy == "eager":
             with open(st_file, "rb") as f:
                 state_dict = load(f.read())
-                yield from state_dict.items()
+                for name, param in state_dict.items():
+                    if not should_skip_weight(name, local_expert_ids):
+                        yield name, param

high

In the eager loading strategy, the entire safetensors file is read into memory via load(f.read()) before filtering occurs. While this correctly filters the yielded tensors, it does not provide the storage I/O reduction benefit described in the PR (skipping non-local tensors before they are read from disk).

For the eager strategy to benefit from I/O reduction, you would need to use safe_open to inspect keys and only get_tensor for the ones that aren't skipped. However, since eager is often used when the user specifically wants to load the whole file into memory at once, this might be acceptable, but it's worth noting that the performance gain here is only in memory/CPU, not disk I/O.

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
vllm/model_executor/model_loader/default_loader.py (1)

226-244: ⚠️ Potential issue | 🟡 Minor

EP weight filtering is not applied to alternative weight loaders.

The local_expert_ids parameter is only passed to safetensors_weights_iterator (line 250), but not to fastsafetensors_weights_iterator, instanttensor_weights_iterator, or multi_thread_safetensors_weights_iterator.

When using load_format=fastsafetensors, instanttensor, or enable_multithread_load=True, MoE models with EP enabled will still load all expert weights to every rank, defeating the I/O optimization.

Consider either:

  1. Extending these iterators to support EP filtering, or
  2. Adding a warning when EP is enabled but the load format doesn't support filtering (see the sketch after this list)
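
A sketch of option 2, placed where the load format is selected in default_loader.py; the condition and attribute names are illustrative, not the actual code:

        # Hypothetical guard: EP filtering is only wired into the plain
        # safetensors iterator, so warn when another format is selected.
        if self.local_expert_ids is not None and load_config.load_format != "safetensors":
            logger.warning(
                "Expert-parallel weight filtering is not supported for "
                "load_format=%s; every rank will read all expert weights.",
                load_config.load_format,
            )
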
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vllm/model_executor/model_loader/default_loader.py` around lines 226 - 244,
EP filtering (local_expert_ids) is only applied to safetensors_weights_iterator
and missing for fastsafetensors_weights_iterator,
instanttensor_weights_iterator, and multi_thread_safetensors_weights_iterator;
either extend those iterator functions to accept and honor a local_expert_ids
parameter and pass it from where load_format is checked (refer to
load_config.load_format, local_expert_ids, fastsafetensors_weights_iterator,
instanttensor_weights_iterator, multi_thread_safetensors_weights_iterator), or
if extending is not feasible, add a clear warning when EP is enabled and the
chosen load_format does not support expert filtering so users know I/O
optimization will be disabled; implement the change by updating the iterator
signatures and call sites here to forward local_expert_ids, or by emitting the
warning in this conditional branch.
🧹 Nitpick comments (2)
vllm/model_executor/model_loader/weight_utils.py (1)

746-751: In eager mode, EP filtering happens after the full shard is loaded into memory.

The load(f.read()) call at line 748 reads the entire safetensors file into memory before the filtering loop at lines 749-751. While the filtering still prevents non-local expert tensors from being yielded downstream, the disk I/O savings mentioned in the docstring ("skipped before reading from disk") don't fully apply to this path.

This is acceptable since eager mode is not the default, but consider adding a comment noting this limitation for clarity.
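
A sketch of the kind of inline note the nitpick asks for, based on the eager branch shown in the diff above (the comment wording is illustrative):

        if safetensors_load_strategy == "eager":
            # NOTE: eager mode reads the whole shard into memory before EP
            # filtering, so skipping here avoids yielding non-local experts
            # but does not avoid the disk I/O of reading them.
            with open(st_file, "rb") as f:
                state_dict = load(f.read())
                for name, param in state_dict.items():
                    if not should_skip_weight(name, local_expert_ids):
                        yield name, param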

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vllm/model_executor/model_loader/weight_utils.py` around lines 746 - 751, The
eager safetensors path (when safetensors_load_strategy == "eager") currently
calls load(f.read()) which reads the entire st_file into memory before applying
EP filtering in the should_skip_weight loop; add a clear inline comment above
that block referencing safetensors_load_strategy, st_file, load, and
should_skip_weight that documents this limitation (i.e., that eager mode does
not avoid disk I/O for skipped experts because the full shard is read into
memory) so readers understand why the docstring’s “skipped before reading from
disk” claim does not apply to eager mode.
tests/model_executor/model_loader/test_ep_weight_filter.py (1)

293-332: Rename unused expected variables to suppress linter warnings.

The expected variable from the fixture unpacking is unused in these test methods. Prefix with underscore to indicate intentional non-use.

✏️ Proposed fix
     def test_no_filter_returns_all(self, synthetic_moe_files):
-        files, expected = synthetic_moe_files
+        files, _ = synthetic_moe_files
         loaded = dict(safetensors_weights_iterator(files, False))
         assert set(loaded.keys()) == set(expected.keys())

     def test_ep2_rank0_gets_half_experts(self, synthetic_moe_files):
-        files, expected = synthetic_moe_files
+        files, _ = synthetic_moe_files
         # EP=2, rank=0 → experts 0-3
         local_ids = compute_local_expert_ids(8, ep_size=2, ep_rank=0)

Note: Line 296 actually uses expected.keys(), so only line 323 should be changed:

     def test_ep2_rank1_gets_other_half(self, synthetic_moe_files):
-        files, expected = synthetic_moe_files
+        files, _ = synthetic_moe_files
         local_ids = compute_local_expert_ids(8, ep_size=2, ep_rank=1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_executor/model_loader/test_ep_weight_filter.py` around lines 293
- 332, Rename the unused fixture variable "expected" to "_expected" in the test
functions test_ep2_rank0_gets_half_experts and test_ep2_rank1_gets_other_half so
linters know the value is intentionally unused; leave the first
test_no_filter_returns_all unchanged (it uses expected.keys()). Update the
function signatures where the fixture is unpacked (the tuple assignment in those
two tests) to use _expected instead of expected.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c1fdf492-2fe0-45c5-b531-4b0e0264d106

📥 Commits

Reviewing files that changed from the base of the PR and between 7362b44 and 61c8936.

📒 Files selected for processing (4)
  • tests/model_executor/model_loader/test_ep_weight_filter.py
  • vllm/model_executor/model_loader/default_loader.py
  • vllm/model_executor/model_loader/ep_weight_filter.py
  • vllm/model_executor/model_loader/weight_utils.py

Comment on lines +311 to +316
        from vllm.config import get_current_vllm_config

        vllm_config = get_current_vllm_config()
        parallel_config = vllm_config.parallel_config

        if not (model_config.is_moe and parallel_config.enable_expert_parallel):

Suggestion: The new EP filter initializer unconditionally calls get_current_vllm_config(), which raises an assertion when load_weights is used outside a config context; this breaks the standalone loader API. Guard this by first checking whether MoE filtering is needed and using the optional config getter so non-EP/non-context paths still work. [logic error]

Severity Level: Critical 🚨
- ❌ Standalone DefaultModelLoader.load_weights crashes without config context.
- ⚠️ Breaks documented inplace weight-loading API for users.
Suggested change
-        from vllm.config import get_current_vllm_config
-        vllm_config = get_current_vllm_config()
-        parallel_config = vllm_config.parallel_config
-        if not (model_config.is_moe and parallel_config.enable_expert_parallel):
+        from vllm.config import get_current_vllm_config_or_none
+
+        if not model_config.is_moe:
+            return
+        vllm_config = get_current_vllm_config_or_none()
+        if vllm_config is None:
+            return
+        parallel_config = vllm_config.parallel_config
+        if not parallel_config.enable_expert_parallel:
Steps of Reproduction ✅
1. In user code, construct a loader via the public factory `get_model_loader` from
`vllm/model_executor/model_loader/__init__.py:120-125` (e.g., `load_config =
LoadConfig(load_format="safetensors"); loader = get_model_loader(load_config)`), or
instantiate `DefaultModelLoader` directly (`default_loader.py:43-76`).

2. Create a `ModelConfig` for any model (MoE or non-MoE) and an initialized `nn.Module`
instance, but do **not** wrap the code in a `set_current_vllm_config(...)` context (the
context manager is used in the engine at `vllm/v1/worker/gpu_worker.py:333-335`, but is
absent in this standalone usage).

3. Call the documented standalone API `loader.load_weights(model, model_config)`
implemented in `DefaultModelLoader.load_weights` at
`vllm/model_executor/model_loader/default_loader.py:356-371`, as suggested by the abstract
base method comment in `base_loader.py:5-8` ("This standalone API allows inplace weights
loading for an already-initialized model").

4. `load_weights` unconditionally calls `self._init_ep_weight_filter(model_config)`
(`default_loader.py:367`), which in turn imports and calls `get_current_vllm_config()` at
`default_loader.py:311-313`; since no vLLM config has been set,
`get_current_vllm_config()` in `vllm/config/vllm.py:19-28` sees `_current_vllm_config is
None` and raises `AssertionError("Current vLLM config is not set...")`, causing
`DefaultModelLoader.load_weights` to crash before any weights are loaded in standalone
usage. (A minimal standalone repro sketch follows these steps.)
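
A minimal standalone sketch of the repro, following the steps above; building the model and the ModelConfig is elided, and those objects are assumed to come from elsewhere:

    from vllm.config import LoadConfig
    from vllm.model_executor.model_loader import get_model_loader

    load_config = LoadConfig(load_format="safetensors")
    loader = get_model_loader(load_config)

    # No set_current_vllm_config(...) context is active here, so with this PR
    # the call below is expected to hit the assertion in get_current_vllm_config().
    loader.load_weights(model, model_config)  # model / model_config built elsewhere
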
Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** vllm/model_executor/model_loader/default_loader.py
**Line:** 311:316
**Comment:**
	*Logic Error: The new EP filter initializer unconditionally calls `get_current_vllm_config()`, which raises an assertion when `load_weights` is used outside a config context; this breaks the standalone loader API. Guard this by first checking whether MoE filtering is needed and using the optional config getter so non-EP/non-context paths still work.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.


codeant-ai Bot commented Apr 13, 2026

CodeAnt AI finished reviewing your PR.


refacto-visz Bot commented Apr 13, 2026

Code Review: Expert Parallelism Weight Filter

PR Confidence Score: 🟥 0 / 5

👍 Well Done
Pre-Disk Expert Filtering

Skipping non-local expert weights before disk read eliminates 85-90% I/O overhead for EP configurations, a significant architectural win.

Clean Module Separation

EP filtering logic is cleanly isolated in a dedicated module with clear single responsibility, improving testability and maintainability.

Expert ID Regex Precision

Regex correctly excludes 3D fused-expert tensors without a numeric ID, preventing false positives in the filtering logic.

Comprehensive Test Coverage

Tests cover boundary experts, round-robin, uneven splits, fused tensors, and tensor value integrity across all EP ranks.

📁 Selected files for review (4)
  • tests/model_executor/model_loader/test_ep_weight_filter.py
  • vllm/model_executor/model_loader/default_loader.py
  • vllm/model_executor/model_loader/ep_weight_filter.py
  • vllm/model_executor/model_loader/weight_utils.py
🎯 Custom Instructions
❌ Unapplied Instructions
Organization Guidelines

Reason: the configured path patterns [src/, config/] and extension patterns [.java] don't match any of the files selected for review.

📝 Additional Comments
vllm/model_executor/model_loader/ep_weight_filter.py (2)
Regex Search on Every Weight Name Without Substring Pre-Check

parse_expert_id is called for every tensor key across all shard files in the innermost loop of safetensors_weights_iterator; for a 384-expert model across hundreds of shards this means tens of thousands of regex .search() calls, the vast majority on dense/shared-expert weight names that will never match. Adding a fast '.experts.' not in weight_name substring pre-check before the regex invocation would short-circuit the dominant non-expert case with a single cheap string operation, reducing regex calls to only the small fraction of names containing the expert segment (typically a 10:1 or higher ratio).

    if '.experts.' not in weight_name:
        return None
    m = _EXPERT_ID_RE.search(weight_name)
    return int(m.group(1)) if m else None
Context References
  • vllm/model_executor/model_loader/weight_utils.py -> safetensors_weights_iterator — calls should_skip_weight inside per-key loop for all three load strategies
  • vllm/model_executor/model_loader/ep_weight_filter.py -> _EXPERT_ID_RE — module-level compiled regex used inside parse_expert_id
Placement Strategy String Dispatch Violates OCP

String-based dispatch with if/elif for placement strategies violates the Open/Closed Principle; adding a new strategy requires modifying this function directly. The placement parameter also accepts arbitrary strings with no type constraint, making invalid values a runtime-only error. Using a Literal["linear", "round_robin"] annotation or an Enum would surface misuse at static analysis time, while a registry dict would allow extension without modification and make valid values self-documenting.

Standards:

  • SOLID: Open/Closed Principle
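
A sketch of the suggested type-level constraint (a Literal alias here; an Enum or a registry dict would work similarly):

    from typing import Literal

    ExpertPlacement = Literal["linear", "round_robin"]

    def compute_local_expert_ids(
        num_experts: int,
        ep_size: int,
        ep_rank: int,
        placement: ExpertPlacement = "linear",
    ) -> set[int] | None:
        ...
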
tests/model_executor/model_loader/test_ep_weight_filter.py (1)
Missing Negative Test for Invalid Placement Strategy

TestComputeLocalExpertIds validates linear and round_robin placements but has no test asserting that an invalid placement string raises ValueError. Without this test, a future refactor that silently swallows unknown strategies would go undetected, removing a safety net that the implementation currently provides.

Context References
  • vllm/model_executor/model_loader/ep_weight_filter.py -> compute_local_expert_ids — raises ValueError for unknown placement at line 61
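
A sketch of the missing negative test, assuming the ValueError behavior noted in the context reference:

    import pytest

    from vllm.model_executor.model_loader.ep_weight_filter import compute_local_expert_ids

    def test_invalid_placement_raises():
        with pytest.raises(ValueError):
            compute_local_expert_ids(8, ep_size=2, ep_rank=0, placement="bogus")
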
🧰 Additional context used

tests/model_executor/model_loader/test_ep_weight_filter.py (2)
  • vllm/model_executor/model_loader/weight_utils.py (2): download_weights_from_hf (473-577), safetensors_weights_iterator (723-788)
vllm/model_executor/model_loader/default_loader.py (1)
  • vllm/model_executor/model_loader/default_loader.py (1): logger (40)
vllm/model_executor/model_loader/weight_utils.py (2)
  • vllm/model_executor/model_loader/weight_utils.py (2): download_weights_from_hf (473-577), safetensors_weights_iterator (723-788)

Comment on lines +31 to +36
def compute_local_expert_ids(
    num_experts: int,
    ep_size: int,
    ep_rank: int,
    placement: str = "linear",
) -> set[int] | None:

Unchecked EP Rank Bounds in compute_local_expert_ids

No validation is performed to ensure ep_rank is within [0, ep_size) or that num_experts is positive before computing the expert range. A caller passing ep_rank >= ep_size produces a start offset beyond num_experts, silently returning an empty or out-of-bounds set that causes the rank to load zero expert weights — resulting in silent model corruption or inference failures with no error signal. This also means a misconfigured parallel_config propagating from _init_ep_weight_filter has no safety net at the point of use.

def compute_local_expert_ids(
    num_experts: int,
    ep_size: int,
    ep_rank: int,
    placement: str = "linear",
) -> set[int] | None:
    if ep_size <= 1:
        return None
    if num_experts <= 0:
        raise ValueError(f"num_experts must be positive, got {num_experts}.")
    if not (0 <= ep_rank < ep_size):
        raise ValueError(
            f"ep_rank {ep_rank} is out of range for ep_size {ep_size}. "
            "ep_rank must satisfy 0 <= ep_rank < ep_size."
        )
    if placement == "linear":
        base = num_experts // ep_size
        remainder = num_experts % ep_size
        start = ep_rank * base + min(ep_rank, remainder)
        local_count = base + (1 if ep_rank < remainder else 0)
        return set(range(start, start + local_count))
Standards
  • CWE-20: Improper Input Validation
Context References
  • vllm/model_executor/model_loader/ep_weight_filter.py - compute_local_expert_ids - no bounds check on ep_rank parameter
  • vllm/model_executor/model_loader/default_loader.py - _init_ep_weight_filter lines 341-346 — passes ep_rank without prior validation

            allow_patterns_overrides=None,
        )

    def _init_ep_weight_filter(self, model_config: ModelConfig) -> None:

Silent EP Filter Failure on Config or Rank Retrieval Error

The method has no exception handling around get_current_vllm_config(), get_num_experts(), or the distributed group accessors; any runtime error (e.g., distributed group not yet initialized) propagates uncaught into load_weights, corrupting the loading pipeline. Additionally, when any parallelism dimension equals 1, the corresponding rank is hardcoded to 0 instead of being read from the process group — if a process group is misconfigured or returns an unexpected rank, ep_rank is silently wrong, causing a rank to load another rank's expert weights with no error surfaced. The computed ep_rank is also never validated against ep_size before being passed to compute_local_expert_ids.

    def _init_ep_weight_filter(self, model_config: ModelConfig) -> None:
        try:
            from vllm.config import get_current_vllm_config
            vllm_config = get_current_vllm_config()
        except Exception as e:
            logger.warning(
                "EP weight filter disabled: failed to retrieve vllm config: %s", e
            )
            return
        parallel_config = vllm_config.parallel_config

        if not (model_config.is_moe and parallel_config.enable_expert_parallel):
            return

        try:
            num_experts = model_config.get_num_experts()
        except Exception as e:
            logger.warning(
                "EP weight filter disabled: failed to get num_experts: %s", e
            )
            return
        if num_experts <= 0:
            return

        from vllm.distributed import (
            get_dp_group,
            get_pcp_group,
            get_tensor_model_parallel_rank,
        )
        dp_size = parallel_config.data_parallel_size
        tp_size = parallel_config.tensor_parallel_size
        pcp_size = parallel_config.prefill_context_parallel_size
        dp_rank = get_dp_group().rank_in_group
        tp_rank = get_tensor_model_parallel_rank()
        pcp_rank = get_pcp_group().rank_in_group
        ep_size = dp_size * pcp_size * tp_size
        ep_rank = dp_rank * pcp_size * tp_size + pcp_rank * tp_size + tp_rank

        if not (0 <= ep_rank < ep_size):
            raise RuntimeError(
                f"Computed ep_rank {ep_rank} is outside valid range [0, {ep_size}). "
                "Expert weight isolation cannot be guaranteed; aborting load."
            )

        self.local_expert_ids = compute_local_expert_ids(
            num_experts,
            ep_size,
            ep_rank,
            placement=parallel_config.expert_placement_strategy,
        )
Standards
  • CWE-390: Detection of Error Condition Without Action
  • CWE-20: Improper Input Validation
Context References
  • vllm/model_executor/model_loader/default_loader.py - DefaultModelLoader._init_ep_weight_filter — no try/except wrapping get_current_vllm_config or distributed group accessors; conditional rank fetch only when size > 1

Comment on lines 764 to 771
            with safe_open(st_file, framework="pt") as f:
                state_dict = {}
                for name in f.keys():  # noqa: SIM118
                    if should_skip_weight(name, local_expert_ids):
                        continue
                    state_dict[name] = f.get_tensor(name)

                # update with leftover tensor data from previous iteration, if any

Torchao Leftover Dict Bypasses EP Filter

In the torchao path, leftover_state_dict from a previous shard is merged into state_dict after EP filtering has already been applied; if leftover_state_dict contains tensors belonging to non-local experts from the prior shard, they bypass should_skip_weight entirely and are injected into the current shard's state dict. This silently loads non-local expert weights despite EP filtering being active, increasing peak memory usage and potentially overwriting correctly filtered state — defeating the core purpose of the EP filter for multi-shard torchao checkpoints.

            with safe_open(st_file, framework="pt") as f:
                state_dict = {}
                for name in f.keys():  # noqa: SIM118
                    if should_skip_weight(name, local_expert_ids):
                        continue
                    state_dict[name] = f.get_tensor(name)

                # update with leftover tensor data from previous iteration, if any
                # re-apply EP filter to exclude any non-local expert leftovers
                filtered_leftover = {
                    k: v for k, v in leftover_state_dict.items()
                    if not should_skip_weight(k, local_expert_ids)
                }
                state_dict.update(filtered_leftover)
Standards
  • CWE-669: Incorrect Resource Transfer Between Spheres
Context References
  • vllm/model_executor/model_loader/weight_utils.py - safetensors_weights_iterator torchao branch — leftover_state_dict.update after filter loop at line 772

Comment on lines 746 to +751
         if safetensors_load_strategy == "eager":
             with open(st_file, "rb") as f:
                 state_dict = load(f.read())
-                yield from state_dict.items()
+                for name, param in state_dict.items():
+                    if not should_skip_weight(name, local_expert_ids):
+                        yield name, param

Eager Load Strategy Reads All Tensors Before EP Filter

The eager strategy reads the entire shard file into memory and deserialises all tensors before the EP filter is applied, meaning non-local expert tensors are fully allocated in RAM and then discarded. For a 384-expert model this wastes the majority of peak memory and completely negates the I/O benefit the EP filter provides for the lazy and torchao paths, which correctly call should_skip_weight before f.get_tensor(name). This inconsistency means EP memory savings are silently absent for users on the eager strategy.

        if safetensors_load_strategy == "eager":
            with safe_open(st_file, framework="pt") as f:
                for name in f.keys():  # noqa: SIM118
                    if should_skip_weight(name, local_expert_ids):
                        continue
                    yield name, f.get_tensor(name)
Context References
  • vllm/model_executor/model_loader/weight_utils.py - safetensors_weights_iterator — lazy path uses safe_open and filters before get_tensor; eager path loads entire file first

Comment on lines +332 to +339
        dp_size = parallel_config.data_parallel_size
        tp_size = parallel_config.tensor_parallel_size
        pcp_size = parallel_config.prefill_context_parallel_size
        dp_rank = get_dp_group().rank_in_group if dp_size > 1 else 0
        tp_rank = get_tensor_model_parallel_rank() if tp_size > 1 else 0
        pcp_rank = get_pcp_group().rank_in_group if pcp_size > 1 else 0
        ep_size = dp_size * pcp_size * tp_size
        ep_rank = dp_rank * pcp_size * tp_size + pcp_rank * tp_size + tp_rank

EP Rank Computation Duplicated from FusedMoEParallelConfig

The docstring on line 323 explicitly states this mirrors FusedMoEParallelConfig.make(), meaning the same ep_size/ep_rank derivation logic exists in at least two places. Any future change to the EP rank formula must be synchronized across both sites, creating a DRY violation and a latent correctness risk — a divergence between the two implementations would cause the loader to filter for a different expert set than the model actually uses, producing silent weight corruption. This computation should be delegated to FusedMoEParallelConfig directly.

        from vllm.model_executor.layers.fused_moe.layer import FusedMoEParallelConfig
        ep_parallel_config = FusedMoEParallelConfig.make(parallel_config)
        ep_size = ep_parallel_config.ep_size
        ep_rank = ep_parallel_config.ep_rank
Standards
  • DRY Principle
Context References
  • vllm/model_executor/model_loader/default_loader.py - _init_ep_weight_filter rank/size computation lines 332-339 — docstring notes this mirrors FusedMoEParallelConfig.make()

            allow_patterns_overrides=None,
        )

    def _init_ep_weight_filter(self, model_config: ModelConfig) -> None:

SRP Violation: Loader Owns Distributed Rank Resolution

DefaultModelLoader now owns distributed rank resolution, MoE topology detection, and expert-set computation in addition to its primary weight-loading responsibility. This broadens the class's change surface: changes to EP rank derivation, placement strategy enumeration, or distributed group APIs all require modifying the loader. The mutable self.local_expert_ids initialized to None in __init__ and set later by this method also creates temporal coupling where callers of get_all_weights silently receive different behavior depending on call order. Extracting this logic into a dedicated factory function in ep_weight_filter.py would restore single responsibility and make the loader testable without a distributed context.

    def get_all_weights(
        self,
        model_config: ModelConfig,
        model: nn.Module,
        local_expert_ids: set[int] | None = None,
    ):
Standards
  • SOLID: Single Responsibility Principle
Context References
  • vllm/model_executor/model_loader/default_loader.py - safetensors_weights_iterator — local_expert_ids passed from self.local_expert_ids; __init__ sets local_expert_ids to None at line 76


Labels

size:XL This PR changes 500-999 lines, ignoring generated files
