Skip to content

Conversation

@wuhang2014
Copy link
Collaborator

@wuhang2014 wuhang2014 commented Nov 28, 2025

Purpose

Initial support EC Connector stats

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: wuhang <[email protected]>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds initial support for EC (Encoder Cache) Connector statistics collection and reporting. The implementation introduces a metrics infrastructure similar to the existing KV Connector stats, enabling monitoring of encoder cache load/save operations and their performance characteristics.

  • Adds ECConnectorStats base class and MooncakeECConnectorStats implementation for tracking load/save metrics
  • Integrates stats collection into the EC connector flow from worker to scheduler to stats reporting
  • Updates the scheduler to include EC connector stats alongside existing KV connector stats

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
vllm/distributed/ec_transfer/ec_connector/metrics.py New file defining ECConnectorStats base class with abstract methods for reset, aggregate, reduce, and is_empty
vllm/distributed/ec_transfer/ec_connector/base.py Adds abstract get_stats() method to ECConnectorBase
vllm/distributed/ec_transfer/ec_connector/mooncake_storage_connector.py Implements MooncakeECConnectorStats with load timing tracking and stats aggregation; wraps load operations with timing context manager
vllm/v1/outputs.py Adds ec_connector_stats field to ECConnectorOutput; reformats EMPTY_MODEL_RUNNER_OUTPUT for readability
vllm/v1/worker/ec_connector_model_runner_mixin.py Collects stats from EC connector after model execution
vllm/v1/core/sched/scheduler.py Threads EC connector stats through scheduler's update_from_output() and make_stats() methods
vllm/v1/metrics/stats.py Adds ec_connector_stats field to SchedulerStats

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +192 to +200
@abstractmethod
def get_stats(self) -> Any:
"""
Get the statistics of the connector.
Returns:
Statistics object.
"""
pass
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get_stats method is marked as @abstractmethod but ECSharedStorageConnector (in shared_storage_connector.py) does not implement this method. This will cause instantiation failures for ECSharedStorageConnector. Either remove the @abstractmethod decorator to make it optional, or ensure all subclasses implement this method.

Copilot uses AI. Check for mistakes.
@dataclass
class ECConnectorStats:
"""
Base class for EC Connector Stats, a container for transfer performance
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace at the end of the line. Consider removing it for consistent code style.

Suggested change
Base class for EC Connector Stats, a container for transfer performance
Base class for EC Connector Stats, a container for transfer performance

Copilot uses AI. Check for mistakes.
class ECConnectorStats:
"""
Base class for EC Connector Stats, a container for transfer performance
metrics or otherwise important telemetry from the connector.
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace at the end of the line. Consider removing it for consistent code style.

Suggested change
metrics or otherwise important telemetry from the connector.
metrics or otherwise important telemetry from the connector.

Copilot uses AI. Check for mistakes.

def reduce(self) -> dict[str, Union[int, float]]:
"""
Reduce the observations collected during a time interval to one or
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace at the end of the line. Consider removing it for consistent code style.

Suggested change
Reduce the observations collected during a time interval to one or
Reduce the observations collected during a time interval to one or

Copilot uses AI. Check for mistakes.
def reduce(self) -> dict[str, Union[int, float]]:
"""
Reduce the observations collected during a time interval to one or
more representative values (eg avg/median/sum of the series).
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace at the end of the line. Consider removing it for consistent code style.

Suggested change
more representative values (eg avg/median/sum of the series).
more representative values (eg avg/median/sum of the series).

Copilot uses AI. Check for mistakes.
Comment on lines +231 to +237
def aggregate(self, other: ECConnectorStats) -> ECConnectorStats:
if not other.is_empty():
self.data["load_time_ms"] += other.data["load_time_ms"]
self.data["save_time_ms"] += other.data["save_time_ms"]
self.data["num_loads"] += other.data["num_loads"]
self.data["num_saves"] += other.data["num_saves"]
return self
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The aggregate method accepts ECConnectorStats but directly accesses keys specific to MooncakeECConnectorStats (e.g., other.data["load_time_ms"]). If a different ECConnectorStats subclass is passed, this will raise a KeyError. Consider either:

  1. Type-checking with isinstance(other, MooncakeECConnectorStats) before accessing the keys, or
  2. Changing the parameter type to "MooncakeECConnectorStats" to make the expectation explicit.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant