Add agent evaluations #42
Open
Overview
Add agent evaluation capabilities so builders can measure and improve the quality of their agent experiences. Evaluations use a judge model to score agent responses against configurable criteria. Builders can run on-demand evaluations with test cases, view detailed result reports with improvement guidance, and optionally enable runtime evaluations that score live invocations to track quality over time.
Context
Current State
- The `AgentDetailPage` shows sessions, an invoke panel, latency summary, streamed response, and deployment configuration — but has no evaluation or quality measurement features
- The `Invocation` model stores `prompt_text`, `thinking_text`, and `response_text` for each invocation, which provides the input/output pairs needed for evaluation
- The `ConfigEntry` model stores agent-level key-value configuration and could be used to persist evaluation settings
- The agent handler (`agents/strands_agent/src/handler.py`) processes invocations via `agent.stream_async()` and yields streaming text — runtime evaluation hooks would need to run post-completion
- `SUPPORTED_MODELS` in `agents.py` lists available Bedrock models — a judge model can be selected from this list
- The `InvokePanel` component handles prompt submission and streaming — evaluation test cases would use a similar invocation flow
- No evaluation framework, scoring, or judge model integration exists in the codebase
Key Files
| File | Role |
|---|---|
| `frontend/src/pages/AgentDetailPage.tsx` | Agent detail with sessions, invoke, and deployment |
| `frontend/src/components/InvokePanel.tsx` | Invoke form and streaming |
| `frontend/src/hooks/useInvoke.ts` | Invocation state management |
| `backend/app/routers/invocations.py` | SSE invocation endpoint, session management |
| `backend/app/routers/agents.py` | Agent CRUD, `SUPPORTED_MODELS` |
| `backend/app/models/invocation.py` | Invocation ORM (prompt, response, timing, tokens) |
| `backend/app/models/config_entry.py` | Agent config key-value pairs |
| `backend/app/models/agent.py` | Agent ORM model |
| `agents/strands_agent/src/handler.py` | Runtime invocation handler |
| `frontend/src/api/types.ts` | Shared TypeScript types |
Technology Stack
- Backend: Python, FastAPI, SQLAlchemy, SQLite
- Frontend: TypeScript, React, Vite, shadcn/ui, Tailwind CSS
- Agent Runtime: Strands SDK on Bedrock AgentCore Runtime
- Models: Amazon Bedrock (Claude, Nova families)
Requirements
R1: On-Demand Evaluation Configuration and Execution
Builder should be able to configure on-demand evaluations, run one, and see the results.
- Create new ORM models in `backend/app/models/evaluation.py`:
  - `EvaluationSuite` — a named collection of test cases for an agent:
    - `id` (integer, primary key)
    - `agent_id` (integer, FK to agents, not null)
    - `name` (string, not null) — e.g. "Customer Support Quality"
    - `judge_model_id` (string, not null) — Bedrock model ID used as the judge (from `SUPPORTED_MODELS`)
    - `criteria` (text, JSON array) — list of scoring criteria, each with `name`, `description`, and `weight` (e.g. `[{"name": "relevance", "description": "Is the response relevant to the prompt?", "weight": 1.0}, {"name": "helpfulness", "description": "Does the response help the user accomplish their goal?", "weight": 1.0}]`)
    - `created_at`, `updated_at` (datetime)
  - `EvaluationTestCase` — an individual test case within a suite:
    - `id` (integer, primary key)
    - `suite_id` (integer, FK to evaluation_suites, not null)
    - `name` (string, nullable) — optional label
    - `prompt` (text, not null) — the input prompt to send to the agent
    - `expected_response` (text, nullable) — optional reference answer for comparison
    - `context` (text, nullable) — optional additional context the judge should consider
    - `created_at` (datetime)
  - `EvaluationRun` — a single execution of a suite:
    - `id` (integer, primary key)
    - `suite_id` (integer, FK to evaluation_suites, not null)
    - `qualifier` (string, default "DEFAULT") — agent endpoint qualifier used
    - `status` (string, not null) — `pending`, `running`, `complete`, `error`
    - `started_at`, `completed_at` (datetime)
    - `summary_scores` (text, JSON) — aggregated scores per criterion after completion
  - `EvaluationResult` — per-test-case result within a run:
    - `id` (integer, primary key)
    - `run_id` (integer, FK to evaluation_runs, not null)
    - `test_case_id` (integer, FK to evaluation_test_cases, not null)
    - `invocation_id` (string, nullable) — FK to the invocation created for this test case
    - `agent_response` (text, nullable) — the agent's actual response
    - `scores` (text, JSON) — per-criterion scores, e.g. `{"relevance": 4, "helpfulness": 5}`
    - `judge_reasoning` (text, nullable) — the judge model's explanation for its scores
    - `status` (string) — `pending`, `complete`, `error`
    - `error_message` (text, nullable)
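A minimal sketch of the first two models, assuming a standalone SQLAlchemy declarative `Base` for illustration (the real models would use the project's existing base and a proper FK to the agents table):

```python
# Sketch of EvaluationSuite / EvaluationTestCase as described above.
# The standalone Base is for illustration only.
from datetime import datetime

from sqlalchemy import (Column, DateTime, ForeignKey, Integer, String, Text,
                        create_engine)
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()


class EvaluationSuite(Base):
    __tablename__ = "evaluation_suites"

    id = Column(Integer, primary_key=True)
    agent_id = Column(Integer, nullable=False)  # FK to agents in the real schema
    name = Column(String, nullable=False)
    judge_model_id = Column(String, nullable=False)
    criteria = Column(Text, nullable=False)  # JSON array of {name, description, weight}
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

    # Deleting a suite should cascade to its test cases (see Testing section).
    test_cases = relationship(
        "EvaluationTestCase", cascade="all, delete-orphan", back_populates="suite"
    )


class EvaluationTestCase(Base):
    __tablename__ = "evaluation_test_cases"

    id = Column(Integer, primary_key=True)
    suite_id = Column(Integer, ForeignKey("evaluation_suites.id"), nullable=False)
    name = Column(String, nullable=True)
    prompt = Column(Text, nullable=False)
    expected_response = Column(Text, nullable=True)
    context = Column(Text, nullable=True)
    created_at = Column(DateTime, default=datetime.utcnow)

    suite = relationship("EvaluationSuite", back_populates="test_cases")
```

The `cascade="all, delete-orphan"` setting is what makes the suite-deletion cascade in the Testing section work at the ORM level; `EvaluationRun` and `EvaluationResult` would follow the same pattern.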
- Create backend endpoints in `backend/app/routers/evaluations.py`:
  - `POST /api/agents/{agent_id}/evaluations/suites` — create an evaluation suite with criteria and test cases
  - `GET /api/agents/{agent_id}/evaluations/suites` — list suites for an agent
  - `GET /api/agents/{agent_id}/evaluations/suites/{suite_id}` — get suite detail including test cases
  - `PUT /api/agents/{agent_id}/evaluations/suites/{suite_id}` — update suite criteria or test cases
  - `DELETE /api/agents/{agent_id}/evaluations/suites/{suite_id}` — delete a suite
  - `POST /api/agents/{agent_id}/evaluations/suites/{suite_id}/run` — trigger an evaluation run
  - `GET /api/agents/{agent_id}/evaluations/runs/{run_id}` — get run status and results
- Create a backend evaluation service (`backend/app/services/evaluation.py`) that orchestrates a run:
  - For each test case in the suite, invoke the agent (reuse the existing invocation pipeline) and capture the response
  - Send each prompt/response pair to the judge model via Bedrock with a scoring prompt that includes the criteria definitions, expected response (if provided), and scoring instructions (rate each criterion 1-5)
  - Parse the judge model's response to extract per-criterion numeric scores and reasoning
  - Store results in `EvaluationResult` and compute aggregated `summary_scores` on the `EvaluationRun`
  - Run evaluations asynchronously (background task) so the API returns immediately with the run ID
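The judge-scoring step could be sketched as below; `build_judge_prompt` and `parse_judge_scores` are hypothetical helper names, the JSON-reply convention is an assumption rather than a defined contract, and the actual Bedrock call is omitted:

```python
import json


def build_judge_prompt(criteria, prompt, response, expected=None):
    """Assemble a scoring prompt asking the judge to rate each criterion 1-5.

    Hypothetical helper; the real service would send this text to the judge
    model via Bedrock and collect the reply.
    """
    lines = ["Score the agent response against each criterion on a 1-5 scale."]
    for c in criteria:
        lines.append(f"- {c['name']}: {c['description']}")
    lines.append(f"\nPrompt:\n{prompt}\n\nAgent response:\n{response}")
    if expected:
        lines.append(f"\nReference answer:\n{expected}")
    lines.append(
        '\nReply with JSON only: {"scores": {"<criterion>": <1-5>, ...}, '
        '"reasoning": "<explanation>"}'
    )
    return "\n".join(lines)


def parse_judge_scores(judge_reply, criteria):
    """Extract per-criterion integer scores and reasoning from the judge reply.

    Tolerates surrounding prose by slicing out the outermost {...} block.
    """
    start, end = judge_reply.find("{"), judge_reply.rfind("}")
    payload = json.loads(judge_reply[start : end + 1])
    scores = {c["name"]: int(payload["scores"][c["name"]]) for c in criteria}
    return scores, payload.get("reasoning", "")
```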
- Create frontend components:
  - Add an Evaluations section on the `AgentDetailPage` (below sessions, above deployment), visible only for deployed agents
  - `EvaluationSuiteManager` — form to create/edit suites: name, judge model selector (from `SUPPORTED_MODELS`), criteria editor (add/remove criteria with name, description, weight), and test case editor (add/remove test cases with prompt, optional expected response)
  - `EvaluationRunButton` — triggers a run and shows progress (pending/running/complete)
  - `EvaluationResultsView` — displays results after a run completes (see R2)
R2: Evaluation Report with Improvement Guidance
Builder should get a report of the results and guidance on how to improve various scores.
- The `EvaluationResultsView` component should display:
  - Summary section at the top:
    - Overall score (weighted average across all criteria and test cases, displayed as a percentage or out of 5)
    - Per-criterion average scores displayed as a horizontal bar chart or score cards (e.g. "Relevance: 4.2/5", "Helpfulness: 3.8/5")
    - Run metadata: suite name, judge model, qualifier, start/end time, test case count
  - Per-test-case detail table:
    - Columns: test case name/prompt (truncated), per-criterion scores, overall score, status
    - Expandable rows showing: full prompt, expected response, agent response, judge reasoning
    - Color-coding: green (4-5), yellow (3), red (1-2) for individual scores
  - Improvement guidance panel:
    - For each criterion that scores below a configurable threshold (default: 3.5/5), generate actionable improvement suggestions
    - Create a backend endpoint `GET /api/agents/{agent_id}/evaluations/runs/{run_id}/guidance` that sends the aggregated results (low-scoring criteria, sample low-scoring prompt/response pairs) to the judge model with a meta-prompt asking for specific improvement recommendations
    - Guidance should include concrete suggestions such as:
      - System prompt modifications (e.g. "Add instructions to always cite sources")
      - Missing tool integrations (e.g. "The agent lacks a search tool for factual queries")
      - Model selection (e.g. "Consider using a more capable model for complex reasoning tasks")
      - Test case design (e.g. "Expected response for test case 3 may be too specific")
    - Display guidance as a bulleted list grouped by criterion, with a "Regenerate Guidance" button
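The weighted overall score is a straightforward weighted average of per-criterion means; `summarize` is a hypothetical helper mirroring the `summary_scores` aggregation:

```python
def summarize(results, criteria):
    """Aggregate per-test-case scores into per-criterion averages and a
    weighted overall score (hypothetical helper; mirrors summary_scores)."""
    weights = {c["name"]: c["weight"] for c in criteria}
    per_criterion = {}
    for name in weights:
        vals = [r["scores"][name] for r in results if name in r["scores"]]
        per_criterion[name] = sum(vals) / len(vals) if vals else None
    # Only criteria that actually received scores contribute to the overall.
    scored = {k: v for k, v in per_criterion.items() if v is not None}
    total_weight = sum(weights[k] for k in scored)
    overall = (
        sum(v * weights[k] for k, v in scored.items()) / total_weight
        if total_weight
        else None
    )
    return {"per_criterion": per_criterion, "overall": overall}
```

The "out of 5" overall maps to a percentage as `overall / 5 * 100` if that display form is preferred.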
- Add an `EvaluationHistory` component that lists past runs for a suite:
  - Table showing run date, overall score, status, and a link to view full results
  - Allow comparison between two runs (side-by-side score diff) to show improvement or regression
R3: Runtime Evaluations with Dashboard
Builder should be able to optionally enable evaluations for runtime and see dashboards of evaluation scores over time.
- Add a `runtime_eval_enabled` boolean field and a `runtime_eval_config` (JSON text) field to the `EvaluationSuite` model:
  - `runtime_eval_config` contains: `sample_rate` (float, 0.0-1.0, default 0.1 — fraction of invocations to evaluate), `judge_model_id` (string), `criteria` (reuses suite criteria)
- When runtime evaluation is enabled for a suite:
  - After each agent invocation completes (in the SSE streaming endpoint in `invocations.py`, after `session_end`), check if the agent has any suites with `runtime_eval_enabled=True`
  - Based on `sample_rate`, probabilistically decide whether to evaluate this invocation
  - If selected, queue a background evaluation task that sends the invocation's prompt/response to the judge model and stores the result as an `EvaluationResult` linked to both the run (a synthetic "runtime" run per day or per batch) and the original invocation
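The sampling decision is a simple probabilistic check; `should_evaluate` is a hypothetical helper, with an injectable RNG so the behavior stays deterministic under test:

```python
import random


def should_evaluate(sample_rate, rng=random):
    """Decide whether to evaluate one invocation at runtime.

    Returns True for roughly sample_rate of calls; sample_rate=1.0 always
    evaluates and 0.0 never does, matching the Testing section expectations.
    """
    return rng.random() < sample_rate
```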
- Create a `RuntimeEvalRun` model, or extend `EvaluationRun` with a `run_type` field (`on_demand` vs `runtime`), to distinguish manually triggered runs from automated runtime evaluations
- Create backend endpoints:
  - `PUT /api/agents/{agent_id}/evaluations/suites/{suite_id}/runtime` — enable/disable runtime evaluation and configure sample rate
  - `GET /api/agents/{agent_id}/evaluations/runtime/scores` — return time-series evaluation scores:
    - Accepts `start_date`, `end_date`, `granularity` (hourly, daily, weekly) query parameters
    - Returns per-criterion average scores bucketed by time period
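The time bucketing could be sketched in Python as below, though the real endpoint would more likely aggregate in SQL; `bucket_scores` and its `(timestamp, criterion, score)` row shape are assumptions:

```python
from collections import defaultdict
from datetime import datetime


def bucket_scores(rows, granularity="daily"):
    """Bucket (timestamp, criterion, score) rows into per-period averages.

    Hypothetical helper for the time-series endpoint. Supported
    granularities: hourly, daily, weekly (ISO week keys for weekly).
    """
    fmt = {
        "hourly": "%Y-%m-%d %H:00",
        "daily": "%Y-%m-%d",
        "weekly": "%G-W%V",
    }[granularity]
    sums = defaultdict(lambda: [0.0, 0])  # (period, criterion) -> [total, count]
    for ts, criterion, score in rows:
        key = (ts.strftime(fmt), criterion)
        sums[key][0] += score
        sums[key][1] += 1
    return {k: total / count for k, (total, count) in sums.items()}
```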
- Create frontend components:
  - Runtime evaluation toggle on the suite configuration form: switch to enable, sample rate slider (1%-100%), judge model selector
  - Evaluation Dashboard section on the `AgentDetailPage` (shown when runtime eval is enabled):
    - Time-series line chart showing per-criterion scores over time (x-axis: date, y-axis: score 1-5, one line per criterion)
    - Time range selector (last 7 days, 30 days, 90 days)
    - Summary statistics: current average vs. previous period, trend indicator (improving/stable/regressing)
    - Drill-down: clicking a data point shows the individual evaluated invocations for that time bucket with their scores
  - Alert indicators: if any criterion's rolling average drops below the threshold (default 3.5), show a warning badge on the Evaluations section header and in the dashboard
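The trend indicator could be classified with a small helper; `trend` and its dead-band `tolerance` are assumptions, not part of the spec:

```python
def trend(current_avg, previous_avg, tolerance=0.1):
    """Classify period-over-period movement for the dashboard trend indicator.

    Hypothetical helper; tolerance is an assumed dead band so tiny
    fluctuations read as 'stable' rather than flip-flopping.
    """
    if previous_avg is None or abs(current_avg - previous_avg) <= tolerance:
        return "stable"
    return "improving" if current_avg > previous_avg else "regressing"
```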
Testing
- Run backend tests: `cd backend && make test`
- Run frontend typecheck: `cd frontend && npx tsc --noEmit`
- Verify on-demand evaluation:
  - Create an evaluation suite with 2-3 criteria and 3-5 test cases
  - Trigger a run and verify it progresses through pending -> running -> complete
  - Each test case produces an invocation, agent response, and judge scores
  - Summary scores aggregate correctly across test cases
- Verify evaluation report:
  - Results display per-criterion scores with correct color coding
  - Expanding a test case row shows full prompt, response, and judge reasoning
  - Improvement guidance generates actionable suggestions for low-scoring criteria
  - Evaluation history shows past runs with scores
- Verify runtime evaluation:
  - Enable runtime eval on a suite with a sample rate of 1.0 (100%) for testing
  - Invoke the agent several times and verify evaluation results are created for each invocation
  - The time-series endpoint returns correct score buckets
  - Dashboard chart renders scores over time
  - Reduce sample rate to 0.5 and verify approximately half of invocations are evaluated
  - Disable runtime eval and verify no new evaluations are created
- Database:
  - New tables are created without affecting existing tables
  - Deleting a suite cascades to test cases, runs, and results
  - Deleting an agent cascades to evaluation suites
Out of Scope
- Custom judge prompts (the scoring prompt is system-defined based on criteria)
- Multi-turn conversation evaluation (each test case is a single prompt/response)
- Evaluation of tool use quality (only the final text response is scored)
- Automated remediation (changing agent config based on eval results)
- Cost tracking for judge model invocations
- Exporting evaluation results (CSV, JSON)
- Comparison across different agents (evaluations are scoped to a single agent)
- Integration with external evaluation frameworks (e.g. RAGAS, DeepEval)
Labels: enhancement (New feature or request)