Conversation
Signed-off-by: esmeetu <jasonailu87@gmail.com>
📝 Walkthrough
This pull request adds expert parallelism (EP) weight filtering functionality to vLLM's model loader. It introduces a new utility module that filters non-local expert tensors during model loading, integrates the filtering into the default model loader's weight iteration path, and provides comprehensive tests for the filtering logic across various expert configurations and weight naming patterns.
Sequence Diagram
sequenceDiagram
participant DML as DefaultModelLoader
participant EPWF as EP Weight Filter
participant SWI as Safetensors<br/>Weights Iterator
participant Dist as Distributed<br/>System
DML->>Dist: Query rank & group info
Dist-->>DML: ep_rank, ep_size
DML->>EPWF: compute_local_expert_ids(num_experts,<br/>ep_size, ep_rank)
EPWF-->>DML: set[local_expert_ids]
DML->>SWI: safetensors_weights_iterator(...,<br/>local_expert_ids=...)
loop For each weight in safetensors
SWI->>EPWF: should_skip_weight(weight_name,<br/>local_expert_ids)
EPWF->>EPWF: parse_expert_id(weight_name)
alt Expert tensor & not local
EPWF-->>SWI: skip=True
SWI->>SWI: omit from iteration
else Local expert or dense weight
EPWF-->>SWI: skip=False
SWI->>DML: yield (weight_name, tensor)
end
end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Refacto PR Summary
Introduces EP (Expert Parallelism) weight filtering during model loading to eliminate redundant I/O for MoE models. Each rank now skips non-local expert tensors before reading from disk, reducing storage I/O by ~85-90% for large MoE models under EP deployments. The implementation adds a dedicated ep_weight_filter module that is wired into the default model loader.
Sequence Diagram
sequenceDiagram
participant DL as DefaultModelLoader
participant EF as ep_weight_filter
participant PC as ParallelConfig
participant WU as safetensors_weights_iterator
participant DISK as Safetensors Files
DL->>DL: load_weights(model, model_config)
DL->>DL: _init_ep_weight_filter(model_config)
DL->>PC: get ep_size, ep_rank (dp×pcp×tp)
DL->>EF: compute_local_expert_ids(num_experts, ep_size, ep_rank, placement)
EF-->>DL: set[int] local_expert_ids (or None if ep_size≤1)
DL->>WU: get_all_weights(..., local_expert_ids=self.local_expert_ids)
loop Each safetensors shard
WU->>DISK: open shard, iterate tensor names
WU->>EF: should_skip_weight(name, local_expert_ids)
EF->>EF: parse_expert_id(name) via regex
alt expert id not in local_expert_ids
EF-->>WU: True → skip (no disk read)
else dense / shared / local expert
EF-->>WU: False → read tensor
WU-->>DL: yield (name, tensor)
end
end
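Both diagrams route every weight name through `parse_expert_id` and `should_skip_weight`. As a rough illustration of that flow, a minimal sketch of the two helpers (the regex pattern and exact signatures are assumptions; the real implementations live in `vllm/model_executor/model_loader/ep_weight_filter.py`):

```python
import re

# Assumed naming convention: expert tensors contain ".experts.<id>." in their name.
_EXPERT_ID_RE = re.compile(r"\.experts\.(\d+)\.")


def parse_expert_id(weight_name: str) -> int | None:
    """Return the expert index embedded in a weight name, or None for dense weights."""
    match = _EXPERT_ID_RE.search(weight_name)
    return int(match.group(1)) if match else None


def should_skip_weight(weight_name: str, local_expert_ids: set[int] | None) -> bool:
    """Skip expert tensors whose expert id is not owned by this EP rank."""
    if local_expert_ids is None:  # filtering disabled (ep_size <= 1)
        return False
    expert_id = parse_expert_id(weight_name)
    return expert_id is not None and expert_id not in local_expert_ids
```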
self.local_expert_ids = compute_local_expert_ids(
    num_experts,
    ep_size,
    ep_rank,
    placement=parallel_config.expert_placement_strategy,
)
🟠 Architect Review — HIGH
EP weight filtering derives local expert IDs from parallel_config.expert_placement_strategy, but FusedMoE may downgrade "round_robin" to "linear" at runtime via determine_expert_placement_strategy. In configurations where this fallback occurs, the loader computes local_expert_ids assuming round_robin while the MoE layer actually uses linear placement, so required local expert weights for the true placement can be skipped during loading.
Suggestion: Compute local_expert_ids using the same resolved placement logic as FusedMoE (e.g., by applying determine_expert_placement_strategy to the parallel config before calling compute_local_expert_ids, or by sharing a single source of truth for the final expert placement strategy) so the loader's expert set always matches the runtime expert map.
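A minimal sketch of that suggestion, assuming `determine_expert_placement_strategy` can be invoked with the parallel config and expert count (its real signature may differ):

```python
# Sketch only: resolve the placement the same way FusedMoE does at runtime,
# then derive the local expert set from the resolved strategy so the loader
# and the MoE layer always agree.
resolved_placement = determine_expert_placement_strategy(  # assumed signature
    parallel_config, num_experts
)
self.local_expert_ids = compute_local_expert_ids(
    num_experts,
    ep_size,
    ep_rank,
    placement=resolved_placement,
)
```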
Code Review
This pull request implements expert parallelism (EP) weight filtering to optimize model loading by skipping non-local expert weights. This change aims to significantly reduce storage I/O and memory overhead for Mixture-of-Experts (MoE) models. The implementation includes a new filtering utility, integration into the default model loader, and comprehensive tests. Review feedback highlighted that the 'eager' loading strategy still reads entire files into memory before filtering, which limits the I/O reduction benefits for that specific loading path.
  if safetensors_load_strategy == "eager":
      with open(st_file, "rb") as f:
          state_dict = load(f.read())
-         yield from state_dict.items()
+         for name, param in state_dict.items():
+             if not should_skip_weight(name, local_expert_ids):
+                 yield name, param
In the eager loading strategy, the entire safetensors file is read into memory via load(f.read()) before filtering occurs. While this correctly filters the yielded tensors, it does not provide the storage I/O reduction benefit described in the PR (skipping non-local tensors before they are read from disk).
For the eager strategy to benefit from I/O reduction, you would need to use safe_open to inspect keys and only get_tensor for the ones that aren't skipped. However, since eager is often used when the user specifically wants to load the whole file into memory at once, this might be acceptable, but it's worth noting that the performance gain here is only in memory/CPU, not disk I/O.
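A rough sketch of that `safe_open` variant for the eager branch (the same shape as the change CodeAnt proposes further down; the surrounding variables are taken from the existing iterator):

```python
if safetensors_load_strategy == "eager":
    with safe_open(st_file, framework="pt") as f:
        for name in f.keys():
            if should_skip_weight(name, local_expert_ids):
                continue  # non-local expert tensor: never materialized from disk
            yield name, f.get_tensor(name)
```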
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
vllm/model_executor/model_loader/default_loader.py (1)
Lines 226-244: ⚠️ Potential issue | 🟡 Minor: EP weight filtering is not applied to alternative weight loaders.
The `local_expert_ids` parameter is only passed to `safetensors_weights_iterator` (line 250), but not to `fastsafetensors_weights_iterator`, `instanttensor_weights_iterator`, or `multi_thread_safetensors_weights_iterator`. When using `load_format=fastsafetensors`, `instanttensor`, or `enable_multithread_load=True`, MoE models with EP enabled will still load all expert weights on every rank, defeating the I/O optimization. Consider either:
- Extending these iterators to support EP filtering, or
- Adding a warning when EP is enabled but the load format doesn't support filtering (see the sketch after this list)
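For the warning option, a rough sketch of how it could look in the loader (hedged: `uses_plain_safetensors_iterator` is an assumed local flag, and the exact call site and logger are not taken from the vLLM code):

```python
# Hypothetical guard near the load_format dispatch in DefaultModelLoader.
# `uses_plain_safetensors_iterator` is an assumed flag, not a real vLLM name.
if self.local_expert_ids is not None and not uses_plain_safetensors_iterator:
    logger.warning(
        "Expert-parallel weight filtering is only applied by "
        "safetensors_weights_iterator; load_format=%s will load all expert "
        "weights on every rank.",
        load_config.load_format,
    )
```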
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@vllm/model_executor/model_loader/default_loader.py` around lines 226 - 244, EP filtering (local_expert_ids) is only applied to safetensors_weights_iterator and missing for fastsafetensors_weights_iterator, instanttensor_weights_iterator, and multi_thread_safetensors_weights_iterator; either extend those iterator functions to accept and honor a local_expert_ids parameter and pass it from where load_format is checked (refer to load_config.load_format, local_expert_ids, fastsafetensors_weights_iterator, instanttensor_weights_iterator, multi_thread_safetensors_weights_iterator), or if extending is not feasible, add a clear warning when EP is enabled and the chosen load_format does not support expert filtering so users know I/O optimization will be disabled; implement the change by updating the iterator signatures and call sites here to forward local_expert_ids, or by emitting the warning in this conditional branch.
🧹 Nitpick comments (2)
vllm/model_executor/model_loader/weight_utils.py (1)
Lines 746-751: In eager mode, EP filtering happens after the full shard is loaded into memory.
The `load(f.read())` call at line 748 reads the entire safetensors file into memory before the filtering loop at lines 749-751. While the filtering still prevents non-local expert tensors from being yielded downstream, the disk I/O savings mentioned in the docstring ("skipped before reading from disk") don't fully apply to this path. This is acceptable since eager mode is not the default, but consider adding a comment noting this limitation for clarity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@vllm/model_executor/model_loader/weight_utils.py` around lines 746-751: the eager safetensors path (when safetensors_load_strategy == "eager") currently calls load(f.read()), which reads the entire st_file into memory before applying EP filtering in the should_skip_weight loop; add a clear inline comment above that block referencing safetensors_load_strategy, st_file, load, and should_skip_weight that documents this limitation (i.e., that eager mode does not avoid disk I/O for skipped experts because the full shard is read into memory) so readers understand why the docstring's "skipped before reading from disk" claim does not apply to eager mode.

tests/model_executor/model_loader/test_ep_weight_filter.py (1)
Lines 293-332: Rename unused `expected` variables to suppress linter warnings.
The `expected` variable from the fixture unpacking is unused in these test methods. Prefix with underscore to indicate intentional non-use.
✏️ Proposed fix

 def test_no_filter_returns_all(self, synthetic_moe_files):
-    files, expected = synthetic_moe_files
+    files, _ = synthetic_moe_files
     loaded = dict(safetensors_weights_iterator(files, False))
     assert set(loaded.keys()) == set(expected.keys())

 def test_ep2_rank0_gets_half_experts(self, synthetic_moe_files):
-    files, expected = synthetic_moe_files
+    files, _ = synthetic_moe_files
     # EP=2, rank=0 → experts 0-3
     local_ids = compute_local_expert_ids(8, ep_size=2, ep_rank=0)

Note: Line 296 actually uses expected.keys(), so only line 323 should be changed:

 def test_ep2_rank1_gets_other_half(self, synthetic_moe_files):
-    files, expected = synthetic_moe_files
+    files, _ = synthetic_moe_files
     local_ids = compute_local_expert_ids(8, ep_size=2, ep_rank=1)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/model_executor/model_loader/test_ep_weight_filter.py` around lines 293 - 332, Rename the unused fixture variable "expected" to "_expected" in the test functions test_ep2_rank0_gets_half_experts and test_ep2_rank1_gets_other_half so linters know the value is intentionally unused; leave the first test_no_filter_returns_all unchanged (it uses expected.keys()). Update the function signatures where the fixture is unpacked (the tuple assignment in those two tests) to use _expected instead of expected.
📒 Files selected for processing (4)
- tests/model_executor/model_loader/test_ep_weight_filter.py
- vllm/model_executor/model_loader/default_loader.py
- vllm/model_executor/model_loader/ep_weight_filter.py
- vllm/model_executor/model_loader/weight_utils.py
from vllm.config import get_current_vllm_config

vllm_config = get_current_vllm_config()
parallel_config = vllm_config.parallel_config

if not (model_config.is_moe and parallel_config.enable_expert_parallel):
Suggestion: The new EP filter initializer unconditionally calls get_current_vllm_config(), which raises an assertion when load_weights is used outside a config context; this breaks the standalone loader API. Guard this by first checking whether MoE filtering is needed and using the optional config getter so non-EP/non-context paths still work. [logic error]
Severity Level: Critical 🚨
- ❌ Standalone DefaultModelLoader.load_weights crashes without config context.
- ⚠️ Breaks documented inplace weight-loading API for users.

Current:
    from vllm.config import get_current_vllm_config
    vllm_config = get_current_vllm_config()
    parallel_config = vllm_config.parallel_config
    if not (model_config.is_moe and parallel_config.enable_expert_parallel):

Suggested:
    from vllm.config import get_current_vllm_config_or_none
    if not model_config.is_moe:
        return
    vllm_config = get_current_vllm_config_or_none()
    if vllm_config is None:
        return
    parallel_config = vllm_config.parallel_config
    if not parallel_config.enable_expert_parallel:
Steps of Reproduction ✅
1. In user code, construct a loader via the public factory `get_model_loader` from
`vllm/model_executor/model_loader/__init__.py:120-125` (e.g., `load_config =
LoadConfig(load_format="safetensors"); loader = get_model_loader(load_config)`), or
instantiate `DefaultModelLoader` directly (`default_loader.py:43-76`).
2. Create a `ModelConfig` for any model (MoE or non-MoE) and an initialized `nn.Module`
instance, but do **not** wrap the code in a `set_current_vllm_config(...)` context (the
context manager is used in the engine at `vllm/v1/worker/gpu_worker.py:333-335`, but is
absent in this standalone usage).
3. Call the documented standalone API `loader.load_weights(model, model_config)`
implemented in `DefaultModelLoader.load_weights` at
`vllm/model_executor/model_loader/default_loader.py:356-371`, as suggested by the abstract
base method comment in `base_loader.py:5-8` ("This standalone API allows inplace weights
loading for an already-initialized model").
4. `load_weights` unconditionally calls `self._init_ep_weight_filter(model_config)`
(`default_loader.py:367`), which in turn imports and calls `get_current_vllm_config()` at
`default_loader.py:311-313`; since no vLLM config has been set,
`get_current_vllm_config()` in `vllm/config/vllm.py:19-28` sees `_current_vllm_config is
None` and raises `AssertionError("Current vLLM config is not set...")`, causing
`DefaultModelLoader.load_weights` to crash before any weights are loaded in standalone
usage.

Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** vllm/model_executor/model_loader/default_loader.py
**Line:** 311:316
**Comment:**
*Logic Error: The new EP filter initializer unconditionally calls `get_current_vllm_config()`, which raises an assertion when `load_weights` is used outside a config context; this breaks the standalone loader API. Guard this by first checking whether MoE filtering is needed and using the optional config getter so non-EP/non-context paths still work.
Validate the correctness of the flagged issue. If correct, how can I resolve it? If you propose a fix, implement it and please make it concise.
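Condensed into a minimal script, the reproduction above might look like this (a sketch: building `model` and `model_config` is elided, and the `LoadConfig` arguments are taken from step 1):

```python
from vllm.config import LoadConfig
from vllm.model_executor.model_loader import get_model_loader

load_config = LoadConfig(load_format="safetensors")
loader = get_model_loader(load_config)

# model_config: any ModelConfig; model: an already-initialized nn.Module.
# Note: no set_current_vllm_config(...) context is entered here.
loader.load_weights(model, model_config)
# -> AssertionError("Current vLLM config is not set..."), raised from
#    _init_ep_weight_filter before any weights are loaded.
```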
Code Review: Expert Parallelism Weight Filter
PR Confidence Score: 🟥 0 / 5
👍 Well Done
def compute_local_expert_ids(
    num_experts: int,
    ep_size: int,
    ep_rank: int,
    placement: str = "linear",
) -> set[int] | None:
Unchecked EP Rank Bounds in compute_local_expert_ids
No validation is performed to ensure ep_rank is within [0, ep_size) or that num_experts is positive before computing the expert range. A caller passing ep_rank >= ep_size produces a start offset beyond num_experts, silently returning an empty or out-of-bounds set that causes the rank to load zero expert weights — resulting in silent model corruption or inference failures with no error signal. This also means a misconfigured parallel_config propagating from _init_ep_weight_filter has no safety net at the point of use.
def compute_local_expert_ids(
    num_experts: int,
    ep_size: int,
    ep_rank: int,
    placement: str = "linear",
) -> set[int] | None:
    if ep_size <= 1:
        return None
    if num_experts <= 0:
        raise ValueError(f"num_experts must be positive, got {num_experts}.")
    if not (0 <= ep_rank < ep_size):
        raise ValueError(
            f"ep_rank {ep_rank} is out of range for ep_size {ep_size}. "
            "ep_rank must satisfy 0 <= ep_rank < ep_size."
        )
    if placement == "linear":
        base = num_experts // ep_size
        remainder = num_experts % ep_size
        start = ep_rank * base + min(ep_rank, remainder)
        local_count = base + (1 if ep_rank < remainder else 0)
        return set(range(start, start + local_count))
Standards
- CWE-20: Improper Input Validation
Context References
- vllm/model_executor/model_loader/ep_weight_filter.py, compute_local_expert_ids: no bounds check on ep_rank parameter
- vllm/model_executor/model_loader/default_loader.py, _init_ep_weight_filter (lines 341-346): passes ep_rank without prior validation
    allow_patterns_overrides=None,
)

def _init_ep_weight_filter(self, model_config: ModelConfig) -> None:
Silent EP Filter Failure on Config or Rank Retrieval Error
The method has no exception handling around get_current_vllm_config(), get_num_experts(), or the distributed group accessors; any runtime error (e.g., distributed group not yet initialized) propagates uncaught into load_weights, corrupting the loading pipeline. Additionally, when any parallelism dimension equals 1, the corresponding rank is hardcoded to 0 instead of being read from the process group — if a process group is misconfigured or returns an unexpected rank, ep_rank is silently wrong, causing a rank to load another rank's expert weights with no error surfaced. The computed ep_rank is also never validated against ep_size before being passed to compute_local_expert_ids.
def _init_ep_weight_filter(self, model_config: ModelConfig) -> None:
    try:
        from vllm.config import get_current_vllm_config

        vllm_config = get_current_vllm_config()
    except Exception as e:
        logger.warning(
            "EP weight filter disabled: failed to retrieve vllm config: %s", e
        )
        return
    parallel_config = vllm_config.parallel_config
    if not (model_config.is_moe and parallel_config.enable_expert_parallel):
        return
    try:
        num_experts = model_config.get_num_experts()
    except Exception as e:
        logger.warning(
            "EP weight filter disabled: failed to get num_experts: %s", e
        )
        return
    if num_experts <= 0:
        return
    from vllm.distributed import (
        get_dp_group,
        get_pcp_group,
        get_tensor_model_parallel_rank,
    )

    dp_size = parallel_config.data_parallel_size
    tp_size = parallel_config.tensor_parallel_size
    pcp_size = parallel_config.prefill_context_parallel_size
    dp_rank = get_dp_group().rank_in_group
    tp_rank = get_tensor_model_parallel_rank()
    pcp_rank = get_pcp_group().rank_in_group
    ep_size = dp_size * pcp_size * tp_size
    ep_rank = dp_rank * pcp_size * tp_size + pcp_rank * tp_size + tp_rank
    if not (0 <= ep_rank < ep_size):
        raise RuntimeError(
            f"Computed ep_rank {ep_rank} is outside valid range [0, {ep_size}). "
            "Expert weight isolation cannot be guaranteed; aborting load."
        )
    self.local_expert_ids = compute_local_expert_ids(
        num_experts,
        ep_size,
        ep_rank,
        placement=parallel_config.expert_placement_strategy,
    )
Commitable Suggestion
| def _init_ep_weight_filter(self, model_config: ModelConfig) -> None: | |
| def _init_ep_weight_filter(self, model_config: ModelConfig) -> None: | |
| try: | |
| from vllm.config import get_current_vllm_config | |
| vllm_config = get_current_vllm_config() | |
| except Exception as e: | |
| logger.warning( | |
| "EP weight filter disabled: failed to retrieve vllm config: %s", e | |
| ) | |
| return | |
| parallel_config = vllm_config.parallel_config | |
| if not (model_config.is_moe and parallel_config.enable_expert_parallel): | |
| return | |
| try: | |
| num_experts = model_config.get_num_experts() | |
| except Exception as e: | |
| logger.warning( | |
| "EP weight filter disabled: failed to get num_experts: %s", e | |
| ) | |
| return | |
| if num_experts <= 0: | |
| return | |
| from vllm.distributed import ( | |
| get_dp_group, | |
| get_pcp_group, | |
| get_tensor_model_parallel_rank, | |
| ) | |
| dp_size = parallel_config.data_parallel_size | |
| tp_size = parallel_config.tensor_parallel_size | |
| pcp_size = parallel_config.prefill_context_parallel_size | |
| dp_rank = get_dp_group().rank_in_group | |
| tp_rank = get_tensor_model_parallel_rank() | |
| pcp_rank = get_pcp_group().rank_in_group | |
| ep_size = dp_size * pcp_size * tp_size | |
| ep_rank = dp_rank * pcp_size * tp_size + pcp_rank * tp_size + tp_rank | |
| if not (0 <= ep_rank < ep_size): | |
| raise RuntimeError( | |
| f"Computed ep_rank {ep_rank} is outside valid range [0, {ep_size}). " | |
| "Expert weight isolation cannot be guaranteed; aborting load." | |
| ) | |
| self.local_expert_ids = compute_local_expert_ids( | |
| num_experts, | |
| ep_size, | |
| ep_rank, | |
| placement=parallel_config.expert_placement_strategy, | |
| ) |
Standards
- CWE-390: Detection of Error Condition Without Action
- CWE-20: Improper Input Validation
Context References
vllm/model_executor/model_loader/default_loader.py-DefaultModelLoader._init_ep_weight_filter — no try/except wrapping get_current_vllm_config or distributed group accessors; conditional rank fetch only when size > 1
with safe_open(st_file, framework="pt") as f:
    state_dict = {}
    for name in f.keys():  # noqa: SIM118
        if should_skip_weight(name, local_expert_ids):
            continue
        state_dict[name] = f.get_tensor(name)

# update with leftover tensor data from previous iteration, if any
Torchao Leftover Dict Bypasses EP Filter
In the torchao path, leftover_state_dict from a previous shard is merged into state_dict after EP filtering has already been applied; if leftover_state_dict contains tensors belonging to non-local experts from the prior shard, they bypass should_skip_weight entirely and are injected into the current shard's state dict. This silently loads non-local expert weights despite EP filtering being active, increasing peak memory usage and potentially overwriting correctly filtered state — defeating the core purpose of the EP filter for multi-shard torchao checkpoints.
with safe_open(st_file, framework="pt") as f:
    state_dict = {}
    for name in f.keys():  # noqa: SIM118
        if should_skip_weight(name, local_expert_ids):
            continue
        state_dict[name] = f.get_tensor(name)

# update with leftover tensor data from previous iteration, if any
# re-apply EP filter to exclude any non-local expert leftovers
filtered_leftover = {
    k: v
    for k, v in leftover_state_dict.items()
    if not should_skip_weight(k, local_expert_ids)
}
state_dict.update(filtered_leftover)
Standards
- CWE-669: Incorrect Resource Transfer Between Spheres
Context References
vllm/model_executor/model_loader/weight_utils.py-safetensors_weights_iterator torchao branch — leftover_state_dict.update after filter loop at line 772
  if safetensors_load_strategy == "eager":
      with open(st_file, "rb") as f:
          state_dict = load(f.read())
-         yield from state_dict.items()
+         for name, param in state_dict.items():
+             if not should_skip_weight(name, local_expert_ids):
+                 yield name, param
Eager Load Strategy Reads All Tensors Before EP Filter
The eager strategy reads the entire shard file into memory and deserialises all tensors before the EP filter is applied, meaning non-local expert tensors are fully allocated in RAM and then discarded. For a 384-expert model this wastes the majority of peak memory and completely negates the I/O benefit the EP filter provides for the lazy and torchao paths, which correctly call should_skip_weight before f.get_tensor(name). This inconsistency means EP memory savings are silently absent for users on the eager strategy.
if safetensors_load_strategy == "eager":
    with safe_open(st_file, framework="pt") as f:
        for name in f.keys():  # noqa: SIM118
            if should_skip_weight(name, local_expert_ids):
                continue
            yield name, f.get_tensor(name)
Context References
vllm/model_executor/model_loader/weight_utils.py-safetensors_weights_iterator — lazy path uses safe_open and filters before get_tensor; eager path loads entire file first
dp_size = parallel_config.data_parallel_size
tp_size = parallel_config.tensor_parallel_size
pcp_size = parallel_config.prefill_context_parallel_size
dp_rank = get_dp_group().rank_in_group if dp_size > 1 else 0
tp_rank = get_tensor_model_parallel_rank() if tp_size > 1 else 0
pcp_rank = get_pcp_group().rank_in_group if pcp_size > 1 else 0
ep_size = dp_size * pcp_size * tp_size
ep_rank = dp_rank * pcp_size * tp_size + pcp_rank * tp_size + tp_rank
EP Rank Computation Duplicated from FusedMoEParallelConfig
The docstring on line 323 explicitly states this mirrors FusedMoEParallelConfig.make(), meaning the same ep_size/ep_rank derivation logic exists in at least two places. Any future change to the EP rank formula must be synchronized across both sites, creating a DRY violation and a latent correctness risk — a divergence between the two implementations would cause the loader to filter for a different expert set than the model actually uses, producing silent weight corruption. This computation should be delegated to FusedMoEParallelConfig directly.
from vllm.model_executor.layers.fused_moe.layer import FusedMoEParallelConfig
ep_parallel_config = FusedMoEParallelConfig.make(parallel_config)
ep_size = ep_parallel_config.ep_size
ep_rank = ep_parallel_config.ep_rank
Commitable Suggestion
| dp_size = parallel_config.data_parallel_size | |
| tp_size = parallel_config.tensor_parallel_size | |
| pcp_size = parallel_config.prefill_context_parallel_size | |
| dp_rank = get_dp_group().rank_in_group if dp_size > 1 else 0 | |
| tp_rank = get_tensor_model_parallel_rank() if tp_size > 1 else 0 | |
| pcp_rank = get_pcp_group().rank_in_group if pcp_size > 1 else 0 | |
| ep_size = dp_size * pcp_size * tp_size | |
| ep_rank = dp_rank * pcp_size * tp_size + pcp_rank * tp_size + tp_rank | |
| from vllm.model_executor.layers.fused_moe.layer import FusedMoEParallelConfig | |
| ep_parallel_config = FusedMoEParallelConfig.make(parallel_config) | |
| ep_size = ep_parallel_config.ep_size | |
| ep_rank = ep_parallel_config.ep_rank |
Standards
- DRY Principle
Context References
vllm/model_executor/model_loader/default_loader.py-_init_ep_weight_filter rank/size computation lines 332-339 — docstring notes this mirrors FusedMoEParallelConfig.make()
    allow_patterns_overrides=None,
)

def _init_ep_weight_filter(self, model_config: ModelConfig) -> None:
SRP Violation: Loader Owns Distributed Rank Resolution
DefaultModelLoader now owns distributed rank resolution, MoE topology detection, and expert-set computation in addition to its primary weight-loading responsibility. This broadens the class's change surface: changes to EP rank derivation, placement strategy enumeration, or distributed group APIs all require modifying the loader. The mutable self.local_expert_ids initialized to None in __init__ and set later by this method also creates temporal coupling where callers of get_all_weights silently receive different behavior depending on call order. Extracting this logic into a dedicated factory function in ep_weight_filter.py would restore single responsibility and make the loader testable without a distributed context.
def get_all_weights(
    self,
    model_config: ModelConfig,
    model: nn.Module,
    local_expert_ids: set[int] | None = None,
):
Commitable Suggestion
| def _init_ep_weight_filter(self, model_config: ModelConfig) -> None: | |
| def get_all_weights( | |
| self, | |
| model_config: ModelConfig, | |
| model: nn.Module, | |
| local_expert_ids: set[int] | None = None, | |
| ): |
Standards
- SOLID: Single Responsibility Principle
Context References
vllm/model_executor/model_loader/default_loader.py-safetensors_weights_iterator — local_expert_ids passed from self.local_expert_ids; __init__ sets local_expert_ids to None at line 76
CodeAnt-AI Description
Skip loading expert weights that belong to other expert-parallel ranks
What Changed
Impact
✅ Lower model load time for MoE checkpoints
✅ Less disk and network I/O during expert-parallel startup
✅ Fewer wasted reads on non-local expert weights
Summary by CodeRabbit
New Features
Tests