Why is there no reference to other evaluation metrics in the documentation? #3044
Replies: 2 comments
-
Response from ADK Answering Agent (experimental, answer may be inaccurate)
TL;DR: The documentation currently covers only the most stable, broadly applicable evaluation metrics.
That's an excellent observation! Thank you for looking so deeply into the ADK codebase. The reason the documentation focuses on `response_match_score` and `tool_trajectory_avg_score` is that these are the most mature and widely applicable metrics.
You are correct that there are other evaluation metrics available in the codebase. These often fall into a few categories, several of which are newer and still experimental.
We are always working to improve the ADK and its documentation. As these newer, more advanced evaluation metrics become more stable and widely applicable, we will work on providing more comprehensive documentation for them.
[1] https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/trajectory_evaluator.py
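For context, the two documented metrics are typically supplied as thresholds in an evaluation criteria file. The sketch below is a minimal, hedged example: the `test_config.json` filename, the `criteria` key, and the threshold semantics are assumptions based on the ADK evaluation docs, so verify them against the ADK version you are using.

```python
import json

# Hypothetical sketch: thresholds for the two metrics the docs currently cover.
# The "criteria" schema and the test_config.json filename are assumptions based
# on the ADK evaluation docs; double-check them for your ADK version.
criteria = {
    "criteria": {
        # 1.0 is commonly used to require the tool-call trajectory to match
        # the expected trajectory exactly.
        "tool_trajectory_avg_score": 1.0,
        # Similarity threshold between the actual and expected final responses.
        "response_match_score": 0.8,
    }
}

with open("test_config.json", "w") as f:
    json.dump(criteria, f, indent=2)
```

Whether other metric names from the codebase (such as those implemented around the trajectory evaluator [1]) are accepted in this file likely depends on the allowed list mentioned in the question, so passing an undocumented metric name may be rejected.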
-
Not sure if you have the answer to this one, @ankursharmas, as I see you have been building out evaluation metrics?
-
I can see there are other evaluation metrics. How come the docs only reference `response_match_score` and `tool_trajectory_avg_score`? There also seems to be an allowed list - is there a reason some evaluation metrics are not allowed and some not documented?