feature: support aggregate metrics and win rate for evaluation #5458

soksamnanglim · 2026-01-06T16:54:23Z

Issue #, if available:
No issue number, internal request.
For LLMAsJudgeEvaluator, we want to display:

- aggregate metrics for built-in metric and custom metric evaluations.
- win rates when both base and custom model evaluations are available.

Description of changes:
Add support for aggregate metrics and win rate calculation in evaluations run with SageMaker training jobs.
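
For readers unfamiliar with the two quantities, here is a rough sketch of what "aggregate metrics" and "win rate" could mean. This is not the code in this PR. The function names, the input shapes, and the choice to report ties separately are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def aggregate_metrics(per_sample_scores):
    """Average each metric's per-sample scores; metrics with no scores are skipped.

    `per_sample_scores` is a hypothetical dict of metric name -> list of floats.
    """
    return {name: mean(scores) for name, scores in per_sample_scores.items() if scores}


def win_rates(preferences):
    """Return the fraction of comparisons won by each side, with ties reported separately.

    `preferences` is a hypothetical list of judge verdicts: "base", "custom", or "tie".
    """
    counts = Counter(preferences)
    total = len(preferences)
    if total == 0:
        # No comparisons available, so every rate is zero.
        return {"base": 0.0, "custom": 0.0, "tie": 0.0}
    return {label: counts[label] / total for label in ("base", "custom", "tie")}


# Made-up data, for illustration only.
scores = {"helpfulness": [0.8, 0.6, 1.0], "conciseness": [0.5, 0.7]}
prefs = ["custom", "base", "custom", "tie"]

print(aggregate_metrics(scores))
print(win_rates(prefs))
```

Reporting ties as their own bucket keeps the three rates summing to 1; an implementation could just as reasonably drop ties from the denominator.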

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Here are screenshots from manual testing of win rates and metrics aggregation.

Thank you for reviewing!
Single custom model aggregate metrics:

Win rate calculations:

Base and Custom model comparison:

For testing:
The unit tests all passed.
I also ran the integration tests in tests/integ/train/test_llm_as_judge_evaluator.py and they all passed.

The pytest logs are lengthy; the excerpt below includes the test execution summary:

[01/07/26 23:34:36] INFO     Final Resource Status: Succeeded                         execution.py:1053
                    INFO     Built-in metrics only evaluation        test_llm_as_judge_evaluator.py:303
                             completed successfully                                                    
PASSED
tests/integ/train/test_llm_as_judge_evaluator.py::TestLLMAsJudgeEvaluatorIntegration::test_llm_as_judge_custom_metrics_only

tests/integ/train/test_llm_as_judge_evaluator.py: 207 warnings
sagemaker-python-sdk/.venv-local-py311/lib/python3.11/site-packages/sagemaker/train/evaluate/execution.py:1167: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
    execution_dict = _clean_unassigned_from_dict(self.dict())

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================ 5 passed, 223 warnings in 5741.12s (1:35:41) =============================

feature: support aggregate metrics and win rate for evaluation

077c7a7

soksamnanglim temporarily deployed to auto-approve January 6, 2026 16:54 — with GitHub Actions Inactive

soksamnanglim marked this pull request as ready for review January 6, 2026 21:58

nargokul approved these changes Jan 7, 2026

huiqunli approved these changes Jan 7, 2026

jam-jee approved these changes Jan 7, 2026

rsareddy0329 approved these changes Jan 8, 2026

papriwal merged commit efa8d7d into aws:master Jan 8, 2026
7 of 12 checks passed
