
Add agent evaluations #42

@heeki

Description

Overview

Add agent evaluation capabilities so builders can measure and improve the quality of their agent experiences. Evaluations use a judge model to score agent responses against configurable criteria. Builders can run on-demand evaluations with test cases, view detailed result reports with improvement guidance, and optionally enable runtime evaluations that score live invocations to track quality over time.

Context

Current State

  • The AgentDetailPage shows sessions, an invoke panel, latency summary, streamed response, and deployment configuration — but has no evaluation or quality measurement features
  • The Invocation model stores prompt_text, thinking_text, and response_text for each invocation, which provides the input/output pairs needed for evaluation
  • The ConfigEntry model stores agent-level key-value configuration and could be used to persist evaluation settings
  • The agent handler (agents/strands_agent/src/handler.py) processes invocations via agent.stream_async() and yields streaming text — runtime evaluation hooks would need to run post-completion
  • SUPPORTED_MODELS in agents.py lists available Bedrock models — a judge model can be selected from this list
  • The InvokePanel component handles prompt submission and streaming — evaluation test cases would use a similar invocation flow
  • No evaluation framework, scoring, or judge model integration exists in the codebase

Key Files

  • frontend/src/pages/AgentDetailPage.tsx — Agent detail with sessions, invoke, and deployment
  • frontend/src/components/InvokePanel.tsx — Invoke form and streaming
  • frontend/src/hooks/useInvoke.ts — Invocation state management
  • backend/app/routers/invocations.py — SSE invocation endpoint, session management
  • backend/app/routers/agents.py — Agent CRUD, SUPPORTED_MODELS
  • backend/app/models/invocation.py — Invocation ORM (prompt, response, timing, tokens)
  • backend/app/models/config_entry.py — Agent config key-value pairs
  • backend/app/models/agent.py — Agent ORM model
  • agents/strands_agent/src/handler.py — Runtime invocation handler
  • frontend/src/api/types.ts — Shared TypeScript types

Technology Stack

  • Backend: Python, FastAPI, SQLAlchemy, SQLite
  • Frontend: TypeScript, React, Vite, shadcn/ui, Tailwind CSS
  • Agent Runtime: Strands SDK on Bedrock AgentCore Runtime
  • Models: Amazon Bedrock (Claude, Nova families)

Requirements

R1: On-Demand Evaluation Configuration and Execution

Builders should be able to configure an on-demand evaluation suite, run it, and see the results.

  • Create new ORM models in backend/app/models/evaluation.py:
    • EvaluationSuite — a named collection of test cases for an agent:
      • id (integer, primary key)
      • agent_id (integer, FK to agents, not null)
      • name (string, not null) — e.g. "Customer Support Quality"
      • judge_model_id (string, not null) — Bedrock model ID used as the judge (from SUPPORTED_MODELS)
      • criteria (text, JSON array) — list of scoring criteria, each with name, description, and weight (e.g. [{"name": "relevance", "description": "Is the response relevant to the prompt?", "weight": 1.0}, {"name": "helpfulness", "description": "Does the response help the user accomplish their goal?", "weight": 1.0}])
      • created_at, updated_at (datetime)
    • EvaluationTestCase — an individual test case within a suite:
      • id (integer, primary key)
      • suite_id (integer, FK to evaluation_suites, not null)
      • name (string, nullable) — optional label
      • prompt (text, not null) — the input prompt to send to the agent
      • expected_response (text, nullable) — optional reference answer for comparison
      • context (text, nullable) — optional additional context the judge should consider
      • created_at (datetime)
    • EvaluationRun — a single execution of a suite:
      • id (integer, primary key)
      • suite_id (integer, FK to evaluation_suites, not null)
      • qualifier (string, default "DEFAULT") — agent endpoint qualifier used
      • status (string, not null) — pending, running, complete, error
      • started_at, completed_at (datetime)
      • summary_scores (text, JSON) — aggregated scores per criterion after completion
    • EvaluationResult — per-test-case result within a run:
      • id (integer, primary key)
      • run_id (integer, FK to evaluation_runs, not null)
      • test_case_id (integer, FK to evaluation_test_cases, not null)
      • invocation_id (string, nullable) — FK to the invocation created for this test case
      • agent_response (text, nullable) — the agent's actual response
      • scores (text, JSON) — per-criterion scores, e.g. {"relevance": 4, "helpfulness": 5}
      • judge_reasoning (text, nullable) — the judge model's explanation for its scores
      • status (string) — pending, complete, error
      • error_message (text, nullable)
  • Create backend endpoints in backend/app/routers/evaluations.py:
    • POST /api/agents/{agent_id}/evaluations/suites — create an evaluation suite with criteria and test cases
    • GET /api/agents/{agent_id}/evaluations/suites — list suites for an agent
    • GET /api/agents/{agent_id}/evaluations/suites/{suite_id} — get suite detail including test cases
    • PUT /api/agents/{agent_id}/evaluations/suites/{suite_id} — update suite criteria or test cases
    • DELETE /api/agents/{agent_id}/evaluations/suites/{suite_id} — delete a suite
    • POST /api/agents/{agent_id}/evaluations/suites/{suite_id}/run — trigger an evaluation run
    • GET /api/agents/{agent_id}/evaluations/runs/{run_id} — get run status and results
  • Create a backend evaluation service (backend/app/services/evaluation.py) that orchestrates a run:
    • For each test case in the suite, invoke the agent (reuse the existing invocation pipeline) and capture the response
    • Send each prompt/response pair to the judge model via Bedrock with a scoring prompt that includes the criteria definitions, expected response (if provided), and scoring instructions (rate each criterion 1-5)
    • Parse the judge model's response to extract per-criterion numeric scores and reasoning
    • Store results in EvaluationResult and compute aggregated summary_scores on the EvaluationRun
    • Run evaluations asynchronously (background task) so the API returns immediately with the run ID
  • Create frontend components:
    • Add an Evaluations section on the AgentDetailPage (below sessions, above deployment), visible only for deployed agents
    • EvaluationSuiteManager — form to create/edit suites: name, judge model selector (from SUPPORTED_MODELS), criteria editor (add/remove criteria with name, description, weight), and test case editor (add/remove test cases with prompt, optional expected response)
    • EvaluationRunButton — triggers a run and shows progress (pending/running/complete)
    • EvaluationResultsView — displays results after a run completes (see R2)
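The judge-response parsing and score aggregation inside the evaluation service could be sketched as follows. This is a minimal sketch: `parse_judge_scores` and `aggregate_summary` are hypothetical helper names, and the JSON shape of the judge's reply is an assumption that the system-defined scoring prompt would enforce.

```python
import json
import re

def parse_judge_scores(judge_text: str, criteria: list[dict]) -> dict:
    # Assumes the scoring prompt instructs the judge to reply with a JSON
    # object like {"scores": {"relevance": 4, ...}, "reasoning": "..."}.
    match = re.search(r"\{.*\}", judge_text, re.DOTALL)
    if not match:
        raise ValueError("judge reply contained no JSON object")
    payload = json.loads(match.group(0))
    known = {c["name"] for c in criteria}
    # Keep only known criteria and clamp scores to the 1-5 range.
    return {name: max(1, min(5, int(score)))
            for name, score in payload.get("scores", {}).items()
            if name in known}

def aggregate_summary(per_case_scores: list[dict], criteria: list[dict]) -> dict:
    # Average each criterion across test cases, then combine the averages
    # into a weighted overall score for EvaluationRun.summary_scores.
    averages = {}
    for c in criteria:
        values = [s[c["name"]] for s in per_case_scores if c["name"] in s]
        averages[c["name"]] = sum(values) / len(values) if values else None
    weighted = [(averages[c["name"]], c["weight"])
                for c in criteria if averages[c["name"]] is not None]
    total_weight = sum(w for _, w in weighted)
    overall = (sum(v * w for v, w in weighted) / total_weight
               if total_weight else None)
    return {"criteria": averages, "overall": overall}
```

The Bedrock call itself is omitted so the parsing and aggregation logic can run (and be unit-tested) without AWS credentials.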

R2: Evaluation Report with Improvement Guidance

Builders should get a report of the results and guidance on how to improve low scores.

  • The EvaluationResultsView component should display:
    • Summary section at the top:
      • Overall score (weighted average across all criteria and test cases, displayed as a percentage or out of 5)
      • Per-criterion average scores displayed as a horizontal bar chart or score cards (e.g. "Relevance: 4.2/5", "Helpfulness: 3.8/5")
      • Run metadata: suite name, judge model, qualifier, start/end time, test case count
    • Per-test-case detail table:
      • Columns: test case name/prompt (truncated), per-criterion scores, overall score, status
      • Expandable rows showing: full prompt, expected response, agent response, judge reasoning
      • Color-coding: green (4-5), yellow (3), red (1-2) for individual scores
    • Improvement guidance panel:
      • For each criterion that scores below a configurable threshold (default: 3.5/5), generate actionable improvement suggestions
      • Create a backend endpoint GET /api/agents/{agent_id}/evaluations/runs/{run_id}/guidance that sends the aggregated results (low-scoring criteria, sample low-scoring prompt/response pairs) to the judge model with a meta-prompt asking for specific improvement recommendations
      • Guidance should include concrete suggestions such as:
        • System prompt modifications (e.g. "Add instructions to always cite sources")
        • Missing tool integrations (e.g. "The agent lacks a search tool for factual queries")
        • Model selection (e.g. "Consider using a more capable model for complex reasoning tasks")
        • Test case design (e.g. "Expected response for test case 3 may be too specific")
      • Display guidance as a bulleted list grouped by criterion, with a "Regenerate Guidance" button
  • Add an EvaluationHistory component that lists past runs for a suite:
    • Table showing run date, overall score, status, and a link to view full results
    • Allow comparison between two runs (side-by-side score diff) to show improvement or regression
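The run-to-run comparison could reduce to a small pure function over the per-criterion averages stored in summary_scores. The function name and the stability epsilon below are illustrative, not part of the spec:

```python
def compare_runs(baseline: dict, candidate: dict, epsilon: float = 0.05) -> dict:
    # Side-by-side diff of two runs' per-criterion averages (the "criteria"
    # map inside summary_scores). A criterion missing from either run is
    # marked "n/a" rather than guessed at.
    diff = {}
    for name in sorted(set(baseline) | set(candidate)):
        before, after = baseline.get(name), candidate.get(name)
        if before is None or after is None:
            trend = "n/a"
        elif after - before > epsilon:
            trend = "improved"
        elif before - after > epsilon:
            trend = "regressed"
        else:
            trend = "stable"
        diff[name] = {"before": before, "after": after, "trend": trend}
    return diff
```

Keeping this as a pure function means the frontend can render the side-by-side diff from two already-fetched run payloads without a dedicated comparison endpoint.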

R3: Runtime Evaluations with Dashboard

Builders should be able to optionally enable runtime evaluations and see dashboards of evaluation scores over time.

  • Add runtime_eval_enabled (boolean) and runtime_eval_config (JSON text) fields to the EvaluationSuite model:
    • runtime_eval_config contains: sample_rate (float, 0.0-1.0, default 0.1 — fraction of invocations to evaluate), judge_model_id (string), criteria (reuses suite criteria)
  • When runtime evaluation is enabled for a suite:
    • After each agent invocation completes (in the SSE streaming endpoint in invocations.py, after session_end), check if the agent has any suites with runtime_eval_enabled=True
    • Based on sample_rate, probabilistically decide whether to evaluate this invocation
    • If selected, queue a background evaluation task that sends the invocation's prompt/response to the judge model and stores the result as an EvaluationResult linked to both the run (a synthetic "runtime" run per day or per batch) and the original invocation
  • Either create a separate RuntimeEvalRun model or extend EvaluationRun with a run_type field (on_demand vs. runtime) to distinguish manually triggered runs from automated runtime evaluations
  • Create backend endpoints:
    • PUT /api/agents/{agent_id}/evaluations/suites/{suite_id}/runtime — enable/disable runtime evaluation and configure sample rate
    • GET /api/agents/{agent_id}/evaluations/runtime/scores — return time-series evaluation scores:
      • Accepts start_date, end_date, granularity (hourly, daily, weekly) query parameters
      • Returns per-criterion average scores bucketed by time period
  • Create frontend components:
    • Runtime evaluation toggle on the suite configuration form: switch to enable, sample rate slider (1%-100%), judge model selector
    • Evaluation Dashboard section on the AgentDetailPage (shown when runtime eval is enabled):
      • Time-series line chart showing per-criterion scores over time (x-axis: date, y-axis: score 1-5, one line per criterion)
      • Time range selector (last 7 days, 30 days, 90 days)
      • Summary statistics: current average vs. previous period, trend indicator (improving/stable/regressing)
      • Drill-down: clicking a data point shows the individual evaluated invocations for that time bucket with their scores
    • Alert indicators: if any criterion's rolling average drops below the threshold (default 3.5), show a warning badge on the Evaluations section header and in the dashboard
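The sampling decision and the time-series bucketing behind the dashboard could be sketched like this. The function names, the bucket-key formats, and the in-memory input shape are assumptions for illustration; a real implementation would likely aggregate in SQL:

```python
import random
from datetime import datetime

# strftime patterns per granularity; "%G-W%V" is the ISO year/week.
GRANULARITY_FORMATS = {
    "hourly": "%Y-%m-%d %H:00",
    "daily": "%Y-%m-%d",
    "weekly": "%G-W%V",
}

def should_evaluate(sample_rate: float) -> bool:
    # Probabilistic sampling decision made once per completed invocation.
    return random.random() < sample_rate

def bucket_scores(results: list[tuple[datetime, dict]],
                  granularity: str = "daily") -> dict:
    # Group (timestamp, per-criterion scores) pairs into time buckets and
    # average each criterion within a bucket -- the shape the
    # runtime/scores endpoint would return for the chart.
    fmt = GRANULARITY_FORMATS[granularity]
    buckets: dict[str, dict[str, list[int]]] = {}
    for ts, scores in results:
        key = ts.strftime(fmt)
        for name, value in scores.items():
            buckets.setdefault(key, {}).setdefault(name, []).append(value)
    return {key: {name: sum(vals) / len(vals) for name, vals in crit.items()}
            for key, crit in sorted(buckets.items())}
```

Because `random.random()` returns values in [0.0, 1.0), a sample_rate of 1.0 evaluates every invocation and 0.0 evaluates none, which matches the testing steps below.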

Testing

  • Run backend tests: cd backend && make test
  • Run frontend typecheck: cd frontend && npx tsc --noEmit
  • Verify on-demand evaluation:
    • Create an evaluation suite with 2-3 criteria and 3-5 test cases
    • Trigger a run and verify it progresses through pending -> running -> complete
    • Each test case produces an invocation, agent response, and judge scores
    • Summary scores aggregate correctly across test cases
  • Verify evaluation report:
    • Results display per-criterion scores with correct color coding
    • Expanding a test case row shows full prompt, response, and judge reasoning
    • Improvement guidance generates actionable suggestions for low-scoring criteria
    • Evaluation history shows past runs with scores
  • Verify runtime evaluation:
    • Enable runtime eval on a suite with a sample rate of 1.0 (100%) for testing
    • Invoke the agent several times and verify evaluation results are created for each invocation
    • The time-series endpoint returns correct score buckets
    • Dashboard chart renders scores over time
    • Reduce sample rate to 0.5 and verify approximately half of invocations are evaluated
    • Disable runtime eval and verify no new evaluations are created
  • Database:
    • New tables are created without affecting existing tables
    • Deleting a suite cascades to test cases, runs, and results
    • Deleting an agent cascades to evaluation suites

Out of Scope

  • Custom judge prompts (the scoring prompt is system-defined based on criteria)
  • Multi-turn conversation evaluation (each test case is a single prompt/response)
  • Evaluation of tool use quality (only the final text response is scored)
  • Automated remediation (changing agent config based on eval results)
  • Cost tracking for judge model invocations
  • Exporting evaluation results (CSV, JSON)
  • Comparison across different agents (evaluations are scoped to a single agent)
  • Integration with external evaluation frameworks (e.g. RAGAS, DeepEval)

Metadata

Labels

enhancement (New feature or request)
