@soksamnanglim (Collaborator) commented Jan 6, 2026

Issue #, if available:
No issue number, internal request.
For LLMAsJudgeEvaluator, we want to display

  • aggregate metrics for builtin metric and custom metric evaluations.
  • win rates when both base and custom model evaluations are available.

Description of changes:
Add support for aggregate metrics and win rate calculation in evaluation with SageMaker training jobs.
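
For readers who want the idea without opening the diff, here is a minimal sketch of what the two calculations could look like. It is not the SDK's implementation: the function names (`aggregate_metrics`, `win_rates`), the record shapes, the metric names, and the choice to report ties separately are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_metrics(records):
    """Average each metric over every record that reports it.

    `records` is a hypothetical shape: one dict per evaluated example,
    mapping metric name -> numeric score.
    """
    scores = defaultdict(list)
    for record in records:
        for name, value in record.items():
            scores[name].append(value)
    return {name: mean(values) for name, values in scores.items()}


def win_rates(base_scores, custom_scores):
    """Compare paired per-example scores from the base and custom models.

    Assumption: a higher score wins, and ties are reported separately
    rather than being split between the two models.
    """
    if len(base_scores) != len(custom_scores):
        raise ValueError("base and custom score lists must be the same length")
    total = len(base_scores)
    if total == 0:
        return {"base": 0.0, "custom": 0.0, "tie": 0.0}
    pairs = list(zip(base_scores, custom_scores))
    base_wins = sum(b > c for b, c in pairs)
    custom_wins = sum(c > b for b, c in pairs)
    ties = total - base_wins - custom_wins
    return {
        "base": base_wins / total,
        "custom": custom_wins / total,
        "tie": ties / total,
    }


# Illustrative inputs only; the metric names are made up.
records = [
    {"helpfulness": 1.0, "correctness": 1.0},
    {"helpfulness": 0.5, "correctness": 0.0},
]
summary = aggregate_metrics(records)
rates = win_rates([0.5, 0.5, 1.0, 0.0], [1.0, 0.5, 0.0, 1.0])
```

Win rates need paired scores, which is why they are only shown when both base and custom model evaluations are available; the aggregate metrics work with a single model's results.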

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Here are screenshots from manual testing of win rates and metrics aggregation.

Thank you for reviewing!
Single custom model aggregate metrics:
Screenshot 2026-01-05 at 6 05 59 PM
Win rate calculations:
Screenshot 2026-01-05 at 6 06 17 PM

Base and Custom model comparison:
image

For testing:
The unit tests all passed.
I also ran the integration tests in tests/integ/train/test_llm_as_judge_evaluator.py and they all passed.

The full pytest logs are long, so here is the tail with the test execution summary:

[01/07/26 23:34:36] INFO     Final Resource Status: Succeeded                         execution.py:1053
                    INFO     Built-in metrics only evaluation        test_llm_as_judge_evaluator.py:303
                             completed successfully                                                    
PASSED
tests/integ/train/test_llm_as_judge_evaluator.py::TestLLMAsJudgeEvaluatorIntegration::test_llm_as_judge_custom_metrics_only

tests/integ/train/test_llm_as_judge_evaluator.py: 207 warnings
sagemaker-python-sdk/.venv-local-py311/lib/python3.11/site-packages/sagemaker/train/evaluate/execution.py:1167: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
    execution_dict = _clean_unassigned_from_dict(self.dict())

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================ 5 passed, 223 warnings in 5741.12s (1:35:41) =============================

@soksamnanglim soksamnanglim marked this pull request as ready for review January 6, 2026 21:58
@papriwal papriwal merged commit efa8d7d into aws:master Jan 8, 2026
7 of 12 checks passed

6 participants