Conversation
Signed-off-by: esmeetu <jasonailu87@gmail.com>
📝 Walkthrough
This pull request adds expert parallelism (EP) weight filtering functionality to vLLM's model loader. It introduces a new utility module that filters non-local expert tensors during model loading, integrates the filtering into the default model loader's weight iteration path, and provides comprehensive tests for the filtering logic across various expert configurations and weight naming patterns.
Sequence Diagram
sequenceDiagram
participant DML as DefaultModelLoader
participant EPWF as EP Weight Filter
participant SWI as Safetensors<br/>Weights Iterator
participant Dist as Distributed<br/>System
DML->>Dist: Query rank & group info
Dist-->>DML: ep_rank, ep_size
DML->>EPWF: compute_local_expert_ids(num_experts,<br/>ep_size, ep_rank)
EPWF-->>DML: set[local_expert_ids]
DML->>SWI: safetensors_weights_iterator(...,<br/>local_expert_ids=...)
loop For each weight in safetensors
SWI->>EPWF: should_skip_weight(weight_name,<br/>local_expert_ids)
EPWF->>EPWF: parse_expert_id(weight_name)
alt Expert tensor & not local
EPWF-->>SWI: skip=True
SWI->>SWI: omit from iteration
else Local expert or dense weight
EPWF-->>SWI: skip=False
SWI->>DML: yield (weight_name, tensor)
end
end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Refacto PR Summary
Introduces EP (Expert Parallelism) weight filtering during model loading to eliminate redundant I/O for MoE models. Each rank now skips non-local expert tensors before reading from disk, reducing storage I/O by ~85-90% for large MoE models under EP deployments. The implementation adds a dedicated ep_weight_filter module that is wired into the default model loader.
Sequence Diagram
sequenceDiagram
participant DL as DefaultModelLoader
participant EF as ep_weight_filter
participant PC as ParallelConfig
participant WU as safetensors_weights_iterator
participant DISK as Safetensors Files
DL->>DL: load_weights(model, model_config)
DL->>DL: _init_ep_weight_filter(model_config)
DL->>PC: get ep_size, ep_rank (dp×pcp×tp)
DL->>EF: compute_local_expert_ids(num_experts, ep_size, ep_rank, placement)
EF-->>DL: set[int] local_expert_ids (or None if ep_size≤1)
DL->>WU: get_all_weights(..., local_expert_ids=self.local_expert_ids)
loop Each safetensors shard
WU->>DISK: open shard, iterate tensor names
WU->>EF: should_skip_weight(name, local_expert_ids)
EF->>EF: parse_expert_id(name) via regex
alt expert id not in local_expert_ids
EF-->>WU: True → skip (no disk read)
else dense / shared / local expert
EF-->>WU: False → read tensor
WU-->>DL: yield (name, tensor)
end
end
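Both diagrams route every weight name through `parse_expert_id` and `should_skip_weight`. As a rough illustration of that flow, a minimal sketch of the two helpers (the regex pattern and exact signatures are assumptions; the real implementations live in `vllm/model_executor/model_loader/ep_weight_filter.py`):

```python
import re

# Assumed naming convention: expert tensors contain ".experts.<id>." in their name.
_EXPERT_ID_RE = re.compile(r"\.experts\.(\d+)\.")


def parse_expert_id(weight_name: str) -> int | None:
    """Return the expert index embedded in a weight name, or None for dense weights."""
    match = _EXPERT_ID_RE.search(weight_name)
    return int(match.group(1)) if match else None


def should_skip_weight(weight_name: str, local_expert_ids: set[int] | None) -> bool:
    """Skip expert tensors whose expert id is not owned by this EP rank."""
    if local_expert_ids is None:  # filtering disabled (ep_size <= 1)
        return False
    expert_id = parse_expert_id(weight_name)
    return expert_id is not None and expert_id not in local_expert_ids
```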
self.local_expert_ids = compute_local_expert_ids(
    num_experts,
    ep_size,
    ep_rank,
    placement=parallel_config.expert_placement_strategy,
)
🟠 Architect Review — HIGH
EP weight filtering derives local expert IDs from parallel_config.expert_placement_strategy, but FusedMoE may downgrade "round_robin" to "linear" at runtime via determine_expert_placement_strategy. In configurations where this fallback occurs, the loader computes local_expert_ids assuming round_robin while the MoE layer actually uses linear placement, so required local expert weights for the true placement can be skipped during loading.
Suggestion: Compute local_expert_ids using the same resolved placement logic as FusedMoE (e.g., by applying determine_expert_placement_strategy to the parallel config before calling compute_local_expert_ids, or by sharing a single source of truth for the final expert placement strategy) so the loader's expert set always matches the runtime expert map.
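A minimal sketch of that suggestion, assuming `determine_expert_placement_strategy` can be invoked with the parallel config and expert count (its real signature may differ):

```python
# Sketch only: resolve the placement the same way FusedMoE does at runtime,
# then derive the local expert set from the resolved strategy so the loader
# and the MoE layer always agree.
resolved_placement = determine_expert_placement_strategy(  # assumed signature
    parallel_config, num_experts
)
self.local_expert_ids = compute_local_expert_ids(
    num_experts,
    ep_size,
    ep_rank,
    placement=resolved_placement,
)
```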
Code Review
This pull request implements expert parallelism (EP) weight filtering to optimize model loading by skipping non-local expert weights. This change aims to significantly reduce storage I/O and memory overhead for Mixture-of-Experts (MoE) models. The implementation includes a new filtering utility, integration into the default model loader, and comprehensive tests. Review feedback highlighted that the 'eager' loading strategy still reads entire files into memory before filtering, which limits the I/O reduction benefits for that specific loading path.
  if safetensors_load_strategy == "eager":
      with open(st_file, "rb") as f:
          state_dict = load(f.read())
-         yield from state_dict.items()
+         for name, param in state_dict.items():
+             if not should_skip_weight(name, local_expert_ids):
+                 yield name, param
In the eager loading strategy, the entire safetensors file is read into memory via load(f.read()) before filtering occurs. While this correctly filters the yielded tensors, it does not provide the storage I/O reduction benefit described in the PR (skipping non-local tensors before they are read from disk).
For the eager strategy to benefit from I/O reduction, you would need to use safe_open to inspect keys and only get_tensor for the ones that aren't skipped. However, since eager is often used when the user specifically wants to load the whole file into memory at once, this might be acceptable, but it's worth noting that the performance gain here is only in memory/CPU, not disk I/O.
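A rough sketch of that `safe_open` variant for the eager branch (the same shape as the change CodeAnt proposes further down; the surrounding variables are taken from the existing iterator):

```python
if safetensors_load_strategy == "eager":
    with safe_open(st_file, framework="pt") as f:
        for name in f.keys():
            if should_skip_weight(name, local_expert_ids):
                continue  # non-local expert tensor: never materialized from disk
            yield name, f.get_tensor(name)
```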
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
vllm/model_executor/model_loader/default_loader.py (1)
Lines 226-244: ⚠️ Potential issue | 🟡 Minor: EP weight filtering is not applied to alternative weight loaders.
The `local_expert_ids` parameter is only passed to `safetensors_weights_iterator` (line 250), but not to `fastsafetensors_weights_iterator`, `instanttensor_weights_iterator`, or `multi_thread_safetensors_weights_iterator`. When using `load_format=fastsafetensors`, `instanttensor`, or `enable_multithread_load=True`, MoE models with EP enabled will still load all expert weights on every rank, defeating the I/O optimization. Consider either:
- Extending these iterators to support EP filtering, or
- Adding a warning when EP is enabled but the load format doesn't support filtering (see the sketch after this list)
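For the warning option, a rough sketch of how it could look in the loader (hedged: `uses_plain_safetensors_iterator` is an assumed local flag, and the exact call site and logger are not taken from the vLLM code):

```python
# Hypothetical guard near the load_format dispatch in DefaultModelLoader.
# `uses_plain_safetensors_iterator` is an assumed flag, not a real vLLM name.
if self.local_expert_ids is not None and not uses_plain_safetensors_iterator:
    logger.warning(
        "Expert-parallel weight filtering is only applied by "
        "safetensors_weights_iterator; load_format=%s will load all expert "
        "weights on every rank.",
        load_config.load_format,
    )
```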
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@vllm/model_executor/model_loader/default_loader.py` around lines 226 - 244, EP filtering (local_expert_ids) is only applied to safetensors_weights_iterator and missing for fastsafetensors_weights_iterator, instanttensor_weights_iterator, and multi_thread_safetensors_weights_iterator; either extend those iterator functions to accept and honor a local_expert_ids parameter and pass it from where load_format is checked (refer to load_config.load_format, local_expert_ids, fastsafetensors_weights_iterator, instanttensor_weights_iterator, multi_thread_safetensors_weights_iterator), or if extending is not feasible, add a clear warning when EP is enabled and the chosen load_format does not support expert filtering so users know I/O optimization will be disabled; implement the change by updating the iterator signatures and call sites here to forward local_expert_ids, or by emitting the warning in this conditional branch.
🧹 Nitpick comments (2)
vllm/model_executor/model_loader/weight_utils.py (1)
Lines 746-751: In eager mode, EP filtering happens after the full shard is loaded into memory.
The `load(f.read())` call at line 748 reads the entire safetensors file into memory before the filtering loop at lines 749-751. While the filtering still prevents non-local expert tensors from being yielded downstream, the disk I/O savings mentioned in the docstring ("skipped before reading from disk") don't fully apply to this path. This is acceptable since eager mode is not the default, but consider adding a comment noting this limitation for clarity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@vllm/model_executor/model_loader/weight_utils.py` around lines 746-751: the eager safetensors path (when safetensors_load_strategy == "eager") currently calls load(f.read()), which reads the entire st_file into memory before applying EP filtering in the should_skip_weight loop; add a clear inline comment above that block referencing safetensors_load_strategy, st_file, load, and should_skip_weight that documents this limitation (i.e., that eager mode does not avoid disk I/O for skipped experts because the full shard is read into memory) so readers understand why the docstring's "skipped before reading from disk" claim does not apply to eager mode.

tests/model_executor/model_loader/test_ep_weight_filter.py (1)
Lines 293-332: Rename unused `expected` variables to suppress linter warnings.
The `expected` variable from the fixture unpacking is unused in these test methods. Prefix with underscore to indicate intentional non-use.
✏️ Proposed fix

 def test_no_filter_returns_all(self, synthetic_moe_files):
-    files, expected = synthetic_moe_files
+    files, _ = synthetic_moe_files
     loaded = dict(safetensors_weights_iterator(files, False))
     assert set(loaded.keys()) == set(expected.keys())

 def test_ep2_rank0_gets_half_experts(self, synthetic_moe_files):
-    files, expected = synthetic_moe_files
+    files, _ = synthetic_moe_files
     # EP=2, rank=0 → experts 0-3
     local_ids = compute_local_expert_ids(8, ep_size=2, ep_rank=0)

Note: Line 296 actually uses expected.keys(), so only line 323 should be changed:

 def test_ep2_rank1_gets_other_half(self, synthetic_moe_files):
-    files, expected = synthetic_moe_files
+    files, _ = synthetic_moe_files
     local_ids = compute_local_expert_ids(8, ep_size=2, ep_rank=1)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/model_executor/model_loader/test_ep_weight_filter.py` around lines 293 - 332, Rename the unused fixture variable "expected" to "_expected" in the test functions test_ep2_rank0_gets_half_experts and test_ep2_rank1_gets_other_half so linters know the value is intentionally unused; leave the first test_no_filter_returns_all unchanged (it uses expected.keys()). Update the function signatures where the fixture is unpacked (the tuple assignment in those two tests) to use _expected instead of expected.
📒 Files selected for processing (4)
- tests/model_executor/model_loader/test_ep_weight_filter.py
- vllm/model_executor/model_loader/default_loader.py
- vllm/model_executor/model_loader/ep_weight_filter.py
- vllm/model_executor/model_loader/weight_utils.py
from vllm.config import get_current_vllm_config

vllm_config = get_current_vllm_config()
parallel_config = vllm_config.parallel_config

if not (model_config.is_moe and parallel_config.enable_expert_parallel):
Suggestion: The new EP filter initializer unconditionally calls get_current_vllm_config(), which raises an assertion when load_weights is used outside a config context; this breaks the standalone loader API. Guard this by first checking whether MoE filtering is needed and using the optional config getter so non-EP/non-context paths still work. [logic error]
Severity Level: Critical 🚨
- ❌ Standalone DefaultModelLoader.load_weights crashes without config context.
- ⚠️ Breaks documented inplace weight-loading API for users.

Current:
    from vllm.config import get_current_vllm_config
    vllm_config = get_current_vllm_config()
    parallel_config = vllm_config.parallel_config
    if not (model_config.is_moe and parallel_config.enable_expert_parallel):

Suggested:
    from vllm.config import get_current_vllm_config_or_none
    if not model_config.is_moe:
        return
    vllm_config = get_current_vllm_config_or_none()
    if vllm_config is None:
        return
    parallel_config = vllm_config.parallel_config
    if not parallel_config.enable_expert_parallel:
Steps of Reproduction ✅
1. In user code, construct a loader via the public factory `get_model_loader` from
`vllm/model_executor/model_loader/__init__.py:120-125` (e.g., `load_config =
LoadConfig(load_format="safetensors"); loader = get_model_loader(load_config)`), or
instantiate `DefaultModelLoader` directly (`default_loader.py:43-76`).
2. Create a `ModelConfig` for any model (MoE or non-MoE) and an initialized `nn.Module`
instance, but do **not** wrap the code in a `set_current_vllm_config(...)` context (the
context manager is used in the engine at `vllm/v1/worker/gpu_worker.py:333-335`, but is
absent in this standalone usage).
3. Call the documented standalone API `loader.load_weights(model, model_config)`
implemented in `DefaultModelLoader.load_weights` at
`vllm/model_executor/model_loader/default_loader.py:356-371`, as suggested by the abstract
base method comment in `base_loader.py:5-8` ("This standalone API allows inplace weights
loading for an already-initialized model").
4. `load_weights` unconditionally calls `self._init_ep_weight_filter(model_config)`
(`default_loader.py:367`), which in turn imports and calls `get_current_vllm_config()` at
`default_loader.py:311-313`; since no vLLM config has been set,
`get_current_vllm_config()` in `vllm/config/vllm.py:19-28` sees `_current_vllm_config is
None` and raises `AssertionError("Current vLLM config is not set...")`, causing
`DefaultModelLoader.load_weights` to crash before any weights are loaded in standalone
usage.

Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** vllm/model_executor/model_loader/default_loader.py
**Line:** 311:316
**Comment:**
*Logic Error: The new EP filter initializer unconditionally calls `get_current_vllm_config()`, which raises an assertion when `load_weights` is used outside a config context; this breaks the standalone loader API. Guard this by first checking whether MoE filtering is needed and using the optional config getter so non-EP/non-context paths still work.
Validate the correctness of the flagged issue. If correct, how can I resolve it? If you propose a fix, implement it and please make it concise.
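Condensed into a minimal script, the reproduction above might look like this (a sketch: building `model` and `model_config` is elided, and the `LoadConfig` arguments are taken from step 1):

```python
from vllm.config import LoadConfig
from vllm.model_executor.model_loader import get_model_loader

load_config = LoadConfig(load_format="safetensors")
loader = get_model_loader(load_config)

# model_config: any ModelConfig; model: an already-initialized nn.Module.
# Note: no set_current_vllm_config(...) context is entered here.
loader.load_weights(model, model_config)
# -> AssertionError("Current vLLM config is not set..."), raised from
#    _init_ep_weight_filter before any weights are loaded.
```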
Code Review: Expert Parallelism Weight Filter
PR Confidence Score: 🟥 0 / 5
👍 Well Done
def compute_local_expert_ids(
    num_experts: int,
    ep_size: int,
    ep_rank: int,
    placement: str = "linear",
) -> set[int] | None:
Unchecked EP Rank Bounds in compute_local_expert_ids
No validation is performed to ensure ep_rank is within [0, ep_size) or that num_experts is positive before computing the expert range. A caller passing ep_rank >= ep_size produces a start offset beyond num_experts, silently returning an empty or out-of-bounds set that causes the rank to load zero expert weights — resulting in silent model corruption or inference failures with no error signal. This also means a misconfigured parallel_config propagating from _init_ep_weight_filter has no safety net at the point of use.
def compute_local_expert_ids(
    num_experts: int,
    ep_size: int,
    ep_rank: int,
    placement: str = "linear",
) -> set[int] | None:
    if ep_size <= 1:
        return None
    if num_experts <= 0:
        raise ValueError(f"num_experts must be positive, got {num_experts}.")
    if not (0 <= ep_rank < ep_size):
        raise ValueError(
            f"ep_rank {ep_rank} is out of range for ep_size {ep_size}. "
            "ep_rank must satisfy 0 <= ep_rank < ep_size."
        )
    if placement == "linear":
        base = num_experts // ep_size
        remainder = num_experts % ep_size
        start = ep_rank * base + min(ep_rank, remainder)
        local_count = base + (1 if ep_rank < remainder else 0)
        return set(range(start, start + local_count))
Standards
- CWE-20: Improper Input Validation
Context References
- vllm/model_executor/model_loader/ep_weight_filter.py, compute_local_expert_ids: no bounds check on ep_rank parameter
- vllm/model_executor/model_loader/default_loader.py, _init_ep_weight_filter (lines 341-346): passes ep_rank without prior validation
    allow_patterns_overrides=None,
)

def _init_ep_weight_filter(self, model_config: ModelConfig) -> None:
Silent EP Filter Failure on Config or Rank Retrieval Error
The method has no exception handling around get_current_vllm_config(), get_num_experts(), or the distributed group accessors; any runtime error (e.g., distributed group not yet initialized) propagates uncaught into load_weights, corrupting the loading pipeline. Additionally, when any parallelism dimension equals 1, the corresponding rank is hardcoded to 0 instead of being read from the process group — if a process group is misconfigured or returns an unexpected rank, ep_rank is silently wrong, causing a rank to load another rank's expert weights with no error surfaced. The computed ep_rank is also never validated against ep_size before being passed to compute_local_expert_ids.
def _init_ep_weight_filter(self, model_config: ModelConfig) -> None:
    try:
        from vllm.config import get_current_vllm_config

        vllm_config = get_current_vllm_config()
    except Exception as e:
        logger.warning(
            "EP weight filter disabled: failed to retrieve vllm config: %s", e
        )
        return
    parallel_config = vllm_config.parallel_config
    if not (model_config.is_moe and parallel_config.enable_expert_parallel):
        return
    try:
        num_experts = model_config.get_num_experts()
    except Exception as e:
        logger.warning(
            "EP weight filter disabled: failed to get num_experts: %s", e
        )
        return
    if num_experts <= 0:
        return
    from vllm.distributed import (
        get_dp_group,
        get_pcp_group,
        get_tensor_model_parallel_rank,
    )

    dp_size = parallel_config.data_parallel_size
    tp_size = parallel_config.tensor_parallel_size
    pcp_size = parallel_config.prefill_context_parallel_size
    dp_rank = get_dp_group().rank_in_group
    tp_rank = get_tensor_model_parallel_rank()
    pcp_rank = get_pcp_group().rank_in_group
    ep_size = dp_size * pcp_size * tp_size
    ep_rank = dp_rank * pcp_size * tp_size + pcp_rank * tp_size + tp_rank
    if not (0 <= ep_rank < ep_size):
        raise RuntimeError(
            f"Computed ep_rank {ep_rank} is outside valid range [0, {ep_size}). "
            "Expert weight isolation cannot be guaranteed; aborting load."
        )
    self.local_expert_ids = compute_local_expert_ids(
        num_experts,
        ep_size,
        ep_rank,
        placement=parallel_config.expert_placement_strategy,
    )
Commitable Suggestion
| def _init_ep_weight_filter(self, model_config: ModelConfig) -> None: | |
| def _init_ep_weight_filter(self, model_config: ModelConfig) -> None: | |
| try: | |
| from vllm.config import get_current_vllm_config | |
| vllm_config = get_current_vllm_config() | |
| except Exception as e: | |
| logger.warning( | |
| "EP weight filter disabled: failed to retrieve vllm config: %s", e | |
| ) | |
| return | |
| parallel_config = vllm_config.parallel_config | |
| if not (model_config.is_moe and parallel_config.enable_expert_parallel): | |
| return | |
| try: | |
| num_experts = model_config.get_num_experts() | |
| except Exception as e: | |
| logger.warning( | |
| "EP weight filter disabled: failed to get num_experts: %s", e | |
| ) | |
| return | |
| if num_experts <= 0: | |
| return | |
| from vllm.distributed import ( | |
| get_dp_group, | |
| get_pcp_group, | |
| get_tensor_model_parallel_rank, | |
| ) | |
| dp_size = parallel_config.data_parallel_size | |
| tp_size = parallel_config.tensor_parallel_size | |
| pcp_size = parallel_config.prefill_context_parallel_size | |
| dp_rank = get_dp_group().rank_in_group | |
| tp_rank = get_tensor_model_parallel_rank() | |
| pcp_rank = get_pcp_group().rank_in_group | |
| ep_size = dp_size * pcp_size * tp_size | |
| ep_rank = dp_rank * pcp_size * tp_size + pcp_rank * tp_size + tp_rank | |
| if not (0 <= ep_rank < ep_size): | |
| raise RuntimeError( | |
| f"Computed ep_rank {ep_rank} is outside valid range [0, {ep_size}). " | |
| "Expert weight isolation cannot be guaranteed; aborting load." | |
| ) | |
| self.local_expert_ids = compute_local_expert_ids( | |
| num_experts, | |
| ep_size, | |
| ep_rank, | |
| placement=parallel_config.expert_placement_strategy, | |
| ) |
Standards
- CWE-390: Detection of Error Condition Without Action
- CWE-20: Improper Input Validation
Context References
vllm/model_executor/model_loader/default_loader.py-DefaultModelLoader._init_ep_weight_filter — no try/except wrapping get_current_vllm_config or distributed group accessors; conditional rank fetch only when size > 1
with safe_open(st_file, framework="pt") as f:
    state_dict = {}
    for name in f.keys():  # noqa: SIM118
        if should_skip_weight(name, local_expert_ids):
            continue
        state_dict[name] = f.get_tensor(name)

# update with leftover tensor data from previous iteration, if any
Torchao Leftover Dict Bypasses EP Filter
In the torchao path, leftover_state_dict from a previous shard is merged into state_dict after EP filtering has already been applied; if leftover_state_dict contains tensors belonging to non-local experts from the prior shard, they bypass should_skip_weight entirely and are injected into the current shard's state dict. This silently loads non-local expert weights despite EP filtering being active, increasing peak memory usage and potentially overwriting correctly filtered state — defeating the core purpose of the EP filter for multi-shard torchao checkpoints.
with safe_open(st_file, framework="pt") as f:
    state_dict = {}
    for name in f.keys():  # noqa: SIM118
        if should_skip_weight(name, local_expert_ids):
            continue
        state_dict[name] = f.get_tensor(name)

# update with leftover tensor data from previous iteration, if any
# re-apply EP filter to exclude any non-local expert leftovers
filtered_leftover = {
    k: v
    for k, v in leftover_state_dict.items()
    if not should_skip_weight(k, local_expert_ids)
}
state_dict.update(filtered_leftover)
Standards
- CWE-669: Incorrect Resource Transfer Between Spheres
Context References
vllm/model_executor/model_loader/weight_utils.py-safetensors_weights_iterator torchao branch — leftover_state_dict.update after filter loop at line 772
  if safetensors_load_strategy == "eager":
      with open(st_file, "rb") as f:
          state_dict = load(f.read())
-         yield from state_dict.items()
+         for name, param in state_dict.items():
+             if not should_skip_weight(name, local_expert_ids):
+                 yield name, param
Eager Load Strategy Reads All Tensors Before EP Filter
The eager strategy reads the entire shard file into memory and deserialises all tensors before the EP filter is applied, meaning non-local expert tensors are fully allocated in RAM and then discarded. For a 384-expert model this wastes the majority of peak memory and completely negates the I/O benefit the EP filter provides for the lazy and torchao paths, which correctly call should_skip_weight before f.get_tensor(name). This inconsistency means EP memory savings are silently absent for users on the eager strategy.
if safetensors_load_strategy == "eager":
    with safe_open(st_file, framework="pt") as f:
        for name in f.keys():  # noqa: SIM118
            if should_skip_weight(name, local_expert_ids):
                continue
            yield name, f.get_tensor(name)
Context References
vllm/model_executor/model_loader/weight_utils.py-safetensors_weights_iterator — lazy path uses safe_open and filters before get_tensor; eager path loads entire file first
dp_size = parallel_config.data_parallel_size
tp_size = parallel_config.tensor_parallel_size
pcp_size = parallel_config.prefill_context_parallel_size
dp_rank = get_dp_group().rank_in_group if dp_size > 1 else 0
tp_rank = get_tensor_model_parallel_rank() if tp_size > 1 else 0
pcp_rank = get_pcp_group().rank_in_group if pcp_size > 1 else 0
ep_size = dp_size * pcp_size * tp_size
ep_rank = dp_rank * pcp_size * tp_size + pcp_rank * tp_size + tp_rank
EP Rank Computation Duplicated from FusedMoEParallelConfig
The docstring on line 323 explicitly states this mirrors FusedMoEParallelConfig.make(), meaning the same ep_size/ep_rank derivation logic exists in at least two places. Any future change to the EP rank formula must be synchronized across both sites, creating a DRY violation and a latent correctness risk — a divergence between the two implementations would cause the loader to filter for a different expert set than the model actually uses, producing silent weight corruption. This computation should be delegated to FusedMoEParallelConfig directly.
from vllm.model_executor.layers.fused_moe.layer import FusedMoEParallelConfig
ep_parallel_config = FusedMoEParallelConfig.make(parallel_config)
ep_size = ep_parallel_config.ep_size
ep_rank = ep_parallel_config.ep_rank
Commitable Suggestion
| dp_size = parallel_config.data_parallel_size | |
| tp_size = parallel_config.tensor_parallel_size | |
| pcp_size = parallel_config.prefill_context_parallel_size | |
| dp_rank = get_dp_group().rank_in_group if dp_size > 1 else 0 | |
| tp_rank = get_tensor_model_parallel_rank() if tp_size > 1 else 0 | |
| pcp_rank = get_pcp_group().rank_in_group if pcp_size > 1 else 0 | |
| ep_size = dp_size * pcp_size * tp_size | |
| ep_rank = dp_rank * pcp_size * tp_size + pcp_rank * tp_size + tp_rank | |
| from vllm.model_executor.layers.fused_moe.layer import FusedMoEParallelConfig | |
| ep_parallel_config = FusedMoEParallelConfig.make(parallel_config) | |
| ep_size = ep_parallel_config.ep_size | |
| ep_rank = ep_parallel_config.ep_rank |
Standards
- DRY Principle
Context References
vllm/model_executor/model_loader/default_loader.py-_init_ep_weight_filter rank/size computation lines 332-339 — docstring notes this mirrors FusedMoEParallelConfig.make()
    allow_patterns_overrides=None,
)

def _init_ep_weight_filter(self, model_config: ModelConfig) -> None:
SRP Violation: Loader Owns Distributed Rank Resolution
DefaultModelLoader now owns distributed rank resolution, MoE topology detection, and expert-set computation in addition to its primary weight-loading responsibility. This broadens the class's change surface: changes to EP rank derivation, placement strategy enumeration, or distributed group APIs all require modifying the loader. The mutable self.local_expert_ids initialized to None in __init__ and set later by this method also creates temporal coupling where callers of get_all_weights silently receive different behavior depending on call order. Extracting this logic into a dedicated factory function in ep_weight_filter.py would restore single responsibility and make the loader testable without a distributed context.
def get_all_weights(
    self,
    model_config: ModelConfig,
    model: nn.Module,
    local_expert_ids: set[int] | None = None,
):
Commitable Suggestion
| def _init_ep_weight_filter(self, model_config: ModelConfig) -> None: | |
| def get_all_weights( | |
| self, | |
| model_config: ModelConfig, | |
| model: nn.Module, | |
| local_expert_ids: set[int] | None = None, | |
| ): |
Standards
- SOLID: Single Responsibility Principle
Context References
vllm/model_executor/model_loader/default_loader.py-safetensors_weights_iterator — local_expert_ids passed from self.local_expert_ids; __init__ sets local_expert_ids to None at line 76
CodeAnt-AI Description
Skip loading expert weights that belong to other expert-parallel ranks
What Changed
Impact
✅ Lower model load time for MoE checkpoints
✅ Less disk and network I/O during expert-parallel startup
✅ Fewer wasted reads on non-local expert weights
Summary by CodeRabbit
New Features
Tests