AgentCore evaluations let you measure agent quality using LLM-as-a-Judge. Define custom evaluators with scoring rubrics, run them on-demand against historical traces, or deploy online eval configs that automatically sample and score live traffic.
| Concept | Description |
|---|---|
| Evaluator | A custom LLM judge defined in your project with instructions, model, and rating scale |
| On-demand eval | One-off evaluation run against historical agent traces within a lookback window |
| Online eval | Continuous evaluation that samples a percentage of live agent requests |
| Builtin evaluator | Pre-built evaluators provided by AgentCore (e.g. Builtin.Faithfulness) |
| Evaluation level | Granularity of evaluation: SESSION, TRACE, or TOOL_CALL |

| Level | Description |
|---|---|
| SESSION | Overall quality across an entire conversation |
| TRACE | Per-turn accuracy of individual agent responses |
| TOOL_CALL | Correctness of individual tool selections and usage |

Scores range from 0 (worst) to 1 (best), normalized from the rating scale you define. For example, a score of 3 on
a 1–5 numerical scale produces a normalized score of 0.60.
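
The arithmetic can be sketched in shell. This assumes the normalized score is the raw score divided by the scale's maximum value, which is consistent with the 3 → 0.60 example above but is not confirmed for every scale type. The `normalize_score` helper is hypothetical and not part of the CLI.

```shell
# Hypothetical helper (not part of the agentcore CLI).
# Assumption: normalized score = raw score / scale maximum.
normalize_score() {
  local score=$1 max=$2
  awk -v s="$score" -v m="$max" 'BEGIN { printf "%.2f\n", s / m }'
}

normalize_score 3 5   # prints 0.60
```
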

    # Interactive (TUI wizard)
    agentcore add evaluator

    # Non-interactive
    agentcore add evaluator \
      --name ResponseQuality \
      --level SESSION \
      --model us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
      --instructions "Evaluate the agent response quality. Context: {context}" \
      --rating-scale 1-5-quality

| Flag | Description |
|---|---|
| `--name <name>` | Evaluator name (alphanumeric + underscore, max 48 chars) |
| `--level <level>` | Evaluation level: SESSION, TRACE, TOOL_CALL |
| `--model <model>` | Bedrock model ID for the LLM judge |
| `--instructions <text>` | Evaluation prompt (must include level-appropriate placeholders; see below) |
| `--rating-scale <preset>` | Rating scale preset or custom format (default: 1-5-quality) |
| `--config <path>` | Path to evaluator config JSON (overrides `--model`, `--instructions`, `--rating-scale`) |
| `--json` | JSON output |

Note: `--instructions` is required in non-interactive mode unless `--config` is provided.

Instructions must include at least one placeholder appropriate for the evaluation level. Placeholders are replaced with actual data at evaluation time.

| Placeholder | Available At | Description |
|---|---|---|
| `{context}` | SESSION, TRACE, TOOL_CALL | Full conversation history (user + assistant messages) |
| `{assistant_turn}` | TRACE | The specific assistant response being evaluated |
| `{available_tools}` | SESSION, TOOL_CALL | List of tools the agent can call |
| `{tool_turn}` | TOOL_CALL | The specific tool call and its result |

Example instructions by level:

    # SESSION
    Evaluate whether the agent fulfilled the user's request. Context: {context}

    # TRACE
    Rate the accuracy of this response. Context: {context}. Assistant turn: {assistant_turn}

    # TOOL_CALL
    Evaluate whether the correct tool was selected. Context: {context}. Tool turn: {tool_turn}

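
The placeholder rule can be checked locally before an evaluator is added. The following is a minimal sketch that encodes the placeholder table above; `has_valid_placeholder` is a hypothetical helper, and whether the CLI applies exactly this validation is an assumption.

```shell
# Hypothetical pre-flight check (not part of the agentcore CLI).
# Succeeds if INSTRUCTIONS contains at least one placeholder that the
# placeholder table lists as available at LEVEL.
has_valid_placeholder() {
  local level=$1 instructions=$2 allowed p
  case "$level" in
    SESSION)   allowed="{context} {available_tools}" ;;
    TRACE)     allowed="{context} {assistant_turn}" ;;
    TOOL_CALL) allowed="{context} {available_tools} {tool_turn}" ;;
    *) echo "unknown level: $level" >&2; return 2 ;;
  esac
  for p in $allowed; do
    case "$instructions" in
      *"$p"*) return 0 ;;   # found a placeholder valid at this level
    esac
  done
  return 1
}

# Usage: warn when a TRACE prompt has no TRACE-level placeholder.
has_valid_placeholder TRACE "Rate this tool call: {tool_turn}" \
  || echo "instructions have no placeholder valid for TRACE"
```
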

| Preset ID | Type | Values |
|---|---|---|
| `1-5-quality` | Numerical | Poor(1), Fair(2), Good(3), Very Good(4), Excellent(5) |
| `1-3-simple` | Numerical | Low(1), Medium(2), High(3) |
| `pass-fail` | Categorical | Pass, Fail |
| `good-neutral-bad` | Categorical | Good, Neutral, Bad |

You can also provide a custom scale inline:

    # Custom numerical
    --rating-scale "1:Bad:Fails criteria, 2:OK:Meets criteria, 3:Great:Exceeds criteria"

    # Custom categorical
    --rating-scale "Relevant:On topic and useful, Irrelevant:Off topic or unhelpful"

Evaluators are stored in the `evaluators` array of `agentcore.json`:

    {
      "evaluators": [
        {
          "type": "CustomEvaluator",
          "name": "ResponseQuality",
          "level": "SESSION",
          "config": {
            "llmAsAJudge": {
              "model": "us.anthropic.claude-sonnet-4-5-20250929-v1:0",
              "instructions": "Evaluate the agent response quality. Context: {context}",
              "ratingScale": {
                "numerical": [
                  { "value": 1, "label": "Poor", "definition": "Fails to meet expectations" },
                  { "value": 2, "label": "Fair", "definition": "Partially meets expectations" },
                  { "value": 3, "label": "Good", "definition": "Meets expectations" },
                  { "value": 4, "label": "Very Good", "definition": "Exceeds expectations" },
                  { "value": 5, "label": "Excellent", "definition": "Far exceeds expectations" }
                ]
              }
            }
          }
        }
      ]
    }

Model availability varies by AWS region. Recommended models:

| Model | Description |
|---|---|
| Claude Sonnet 4.5 | Recommended — balanced speed and accuracy |
| Claude Opus 4.5 | Most capable — best for complex evaluations |
| Claude Haiku 4.5 | Fastest — good for high-volume evaluations |
| Amazon Nova Pro | Strong reasoning |
| Amazon Nova Lite | Fast and cost-effective |
Run evaluators against historical agent traces.

    # Project mode: evaluate a project agent
    agentcore run evals \
      --agent MyAgent \
      --evaluator ResponseQuality \
      --days 7

    # Standalone mode: evaluate any agent by ARN
    agentcore run evals \
      --agent-arn arn:aws:bedrock-agentcore:us-east-1:123456789012:runtime/abc123 \
      --evaluator-arn arn:aws:bedrock-agentcore:us-east-1:123456789012:evaluator/eval123 \
      --region us-east-1

    # Multiple evaluators
    agentcore run evals \
      --agent MyAgent \
      --evaluator ResponseQuality Builtin.Faithfulness \
      --days 14

    # Target specific session or trace
    agentcore run evals \
      --agent MyAgent \
      --evaluator ResponseQuality \
      --session-id abc123 \
      --days 7

| Flag | Description |
|---|---|
| `-a, --agent <name>` | Agent name from project config |
| `--agent-arn <arn>` | Agent runtime ARN (standalone mode, no project required) |
| `-e, --evaluator <names...>` | Evaluator name(s) from project or `Builtin.*` IDs |
| `--evaluator-arn <arns...>` | Evaluator ARN(s) (use with `--agent-arn`) |
| `--region <region>` | AWS region (required with `--agent-arn`) |
| `-s, --session-id <id>` | Evaluate a specific session only |
| `-t, --trace-id <id>` | Evaluate a specific trace only |
| `--days <days>` | Lookback window in days (default: 7) |
| `--output <path>` | Custom output file path |
| `--json` | JSON output |

Note: Traces may take 5–10 minutes to appear after agent invocations. If a run returns no sessions, try increasing `--days` or waiting for traces to propagate.
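
The "increase `--days`" advice can be automated with a small wrapper. This is a sketch only: `run_with_widening_lookback` is a hypothetical helper, and it assumes the wrapped command exits non-zero when it finds no sessions, which this document does not confirm. Check the exit behavior of your CLI version before relying on it.

```shell
# Hypothetical wrapper (not part of the agentcore CLI).
# Runs the given command with --days 1, then 7, then 14, and stops at the
# first success. Assumption: the command exits non-zero when no sessions
# are found.
run_with_widening_lookback() {
  local days
  for days in 1 7 14; do
    if "$@" --days "$days"; then
      echo "succeeded with --days $days" >&2
      return 0
    fi
  done
  echo "no successful run within a 14-day lookback" >&2
  return 1
}

# Usage (requires the agentcore CLI):
#   run_with_widening_lookback agentcore run evals --agent MyAgent --evaluator ResponseQuality
```
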
In the TUI (agentcore → Evals → Run Evaluation), the wizard walks you through:
- Select agent (or enter ARN)
- Choose evaluator(s)
- Set lookback window
- Select sessions to evaluate
- Confirm and run
Results are saved locally and can be viewed in the TUI or CLI:

    # CLI table view
    agentcore evals history

    # Filter by agent
    agentcore evals history --agent MyAgent

    # JSON output
    agentcore evals history --json --limit 10

| Flag | Description |
|---|---|
| `-a, --agent <name>` | Filter by agent name |
| `-n, --limit <count>` | Max number of runs to display |
| `--json` | JSON output |

Results are stored in agentcore/.cli/eval-runs/ within your project directory.
Online eval configs automatically sample and evaluate a percentage of live agent requests after deployment.

    # Interactive
    agentcore add online-eval

    # Non-interactive
    agentcore add online-eval \
      --name QualityMonitor \
      --agent MyAgent \
      --evaluator ResponseQuality Builtin.Faithfulness \
      --sampling-rate 10

| Flag | Description |
|---|---|
| `--name <name>` | Config name (alphanumeric + underscore, max 48 chars) |
| `-a, --agent <name>` | Agent to monitor |
| `-e, --evaluator <names...>` | Evaluator name(s), `Builtin.*` IDs, or ARNs |
| `--evaluator-arn <arns...>` | Evaluator ARN(s) |
| `--sampling-rate <rate>` | Percentage of requests to evaluate (0.01–100) |
| `--enable-on-create` | Enable immediately after deploy |
| `--json` | JSON output |

The sampling rate controls what percentage of agent requests are evaluated. Higher rates give better coverage but increase LLM costs from evaluator invocations.
| Rate | Use Case |
|---|---|
| 1–5% | Production monitoring, cost-sensitive |
| 10–25% | Development and staging |
| 100% | Full coverage during testing |
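
To get a rough sense of the cost trade-off, the number of judge invocations can be estimated from traffic volume. The sketch below assumes each sampled request triggers one judge call per configured evaluator; actual call counts may differ by evaluation level. `estimate_judge_calls` is a hypothetical helper, not a CLI command.

```shell
# Hypothetical estimator (not part of the agentcore CLI).
# Assumption: each sampled request triggers one judge call per evaluator.
estimate_judge_calls() {
  local requests=$1 rate_pct=$2 evaluators=$3
  awk -v r="$requests" -v p="$rate_pct" -v e="$evaluators" 'BEGIN {
    # Reject rates outside the documented 0.01-100 range.
    if (p < 0.01 || p > 100) {
      print "sampling rate must be between 0.01 and 100" > "/dev/stderr"
      exit 1
    }
    printf "%.0f\n", r * p / 100 * e
  }'
}

# 10,000 requests per day, 10% sampling, 2 evaluators
estimate_judge_calls 10000 10 2   # prints 2000
```
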
Online eval configs are stored in the onlineEvalConfigs array of agentcore.json:

    {
      "onlineEvalConfigs": [
        {
          "type": "OnlineEvaluationConfig",
          "name": "QualityMonitor",
          "agent": "MyAgent",
          "evaluators": ["ResponseQuality", "Builtin.Faithfulness"],
          "samplingRate": 10,
          "enableOnCreate": true
        }
      ]
    }

Run `agentcore deploy` to create or update the online eval config in AWS.

    # Pause by name (requires project)
    agentcore pause online-eval QualityMonitor

    # Resume by name
    agentcore resume online-eval QualityMonitor

    # Pause by ARN (no project required)
    agentcore pause online-eval --arn arn:aws:bedrock-agentcore:us-east-1:123456789012:online-eval-config/abc123

    # Resume by ARN
    agentcore resume online-eval --arn arn:aws:bedrock-agentcore:us-east-1:123456789012:online-eval-config/abc123

| Flag | Description |
|---|---|
| `[name]` | Config name from project (not needed with `--arn`) |
| `--arn <arn>` | Online eval config ARN (standalone mode) |
| `--region <region>` | AWS region override |
| `--json` | JSON output |

The TUI provides a dashboard for monitoring online eval results (agentcore → Evals → Online Eval Dashboard).
Note: Evaluation results may take 5–10 minutes to appear after agent invocations.

    # Stream logs in real-time
    agentcore logs evals

    # Search historical logs
    agentcore logs evals --agent MyAgent --since 1h

    # JSON output
    agentcore logs evals --json --lines 100

| Flag | Description |
|---|---|
| `-a, --agent <name>` | Filter by agent |
| `--since <time>` | Start time (e.g. 1h, 30m, 2d, ISO 8601) |
| `--until <time>` | End time (e.g. now, ISO 8601) |
| `-n, --lines <count>` | Maximum number of log lines |
| `-f, --follow` | Stream logs in real-time |
| `--json` | JSON Lines output |

    # Remove an evaluator
    agentcore remove evaluator --name ResponseQuality

    # Remove an online eval config
    agentcore remove online-eval --name QualityMonitor

Note: You cannot remove an evaluator that is referenced by an online eval config. Remove the online eval config reference first.
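
Before removing an evaluator, you can check whether an online eval config still references it. The sketch below is a line-based scan. It assumes `agentcore.json` is laid out as in the examples in this document, with each online eval config's `"evaluators"` array on a single line. `evaluator_in_use` is a hypothetical helper, and a JSON-aware tool would be more robust.

```shell
# Hypothetical helper (not part of the agentcore CLI).
# Succeeds if evaluator NAME appears in a single-line "evaluators" string
# array in FILE. Assumes the pretty-printed layout used in this document.
evaluator_in_use() {
  local name=$1 file=${2:-agentcore.json}
  awk -v name="\"$name\"" '
    /"evaluators"/ && index($0, name) { found = 1 }
    END { exit found ? 0 : 1 }
  ' "$file"
}

# Usage (run from the directory that contains agentcore.json):
#   if evaluator_in_use ResponseQuality; then
#     echo "ResponseQuality is still referenced; update the online eval config first"
#   fi
```
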
AgentCore provides pre-built evaluators that can be used without creating custom evaluator definitions. Reference them
by their Builtin.* ID in --evaluator flags or in online eval config evaluators arrays.

    agentcore run evals --agent MyAgent --evaluator Builtin.Faithfulness

    # Run evals and fail pipeline if score < threshold
    result=$(agentcore run evals --agent MyAgent --evaluator ResponseQuality --days 1 --json)
    score=$(echo "$result" | jq '.run.results[0].aggregateScore')
    if (( $(echo "$score < 0.7" | bc -l) )); then
      echo "Quality gate failed: score $score < 0.7"
      exit 1
    fi

    # 1. Create evaluator
    agentcore add evaluator \
      --name ResponseQuality \
      --level SESSION \
      --model us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
      --instructions "Evaluate the agent response quality. Context: {context}"

    # 2. Run on-demand eval to verify
    agentcore run evals --agent MyAgent --evaluator ResponseQuality --days 7

    # 3. Set up continuous monitoring
    agentcore add online-eval \
      --name QualityMonitor \
      --agent MyAgent \
      --evaluator ResponseQuality \
      --sampling-rate 10

    # 4. Deploy
    agentcore deploy

Evaluate agents and use evaluators outside of a project directory using ARNs:

    agentcore run evals \
      --agent-arn arn:aws:bedrock-agentcore:us-east-1:123456789012:runtime/my-agent \
      --evaluator-arn arn:aws:bedrock-agentcore:us-east-1:123456789012:evaluator/my-eval \
      --region us-east-1 \
      --days 7

    agentcore pause online-eval \
      --arn arn:aws:bedrock-agentcore:us-east-1:123456789012:online-eval-config/my-config