
Add agent evaluations #42

@heeki

Description

Overview

Add agent evaluation capabilities so builders can measure and improve the quality of their agent experiences. Evaluations use a judge model to score agent responses against configurable criteria. Builders can run on-demand evaluations with test cases, view detailed result reports with improvement guidance, and optionally enable runtime evaluations that score live invocations to track quality over time.

Context

Current State

  • The AgentDetailPage shows sessions, an invoke panel, latency summary, streamed response, and deployment configuration — but has no evaluation or quality measurement features
  • The Invocation model stores prompt_text, thinking_text, and response_text for each invocation, which provides the input/output pairs needed for evaluation
  • The ConfigEntry model stores agent-level key-value configuration and could be used to persist evaluation settings
  • The agent handler (agents/strands_agent/src/handler.py) processes invocations via agent.stream_async() and yields streaming text — runtime evaluation hooks would need to run post-completion
  • SUPPORTED_MODELS in agents.py lists available Bedrock models — a judge model can be selected from this list
  • The InvokePanel component handles prompt submission and streaming — evaluation test cases would use a similar invocation flow
  • No evaluation framework, scoring, or judge model integration exists in the codebase

Key Files

  • frontend/src/pages/AgentDetailPage.tsx — Agent detail with sessions, invoke, and deployment
  • frontend/src/components/InvokePanel.tsx — Invoke form and streaming
  • frontend/src/hooks/useInvoke.ts — Invocation state management
  • backend/app/routers/invocations.py — SSE invocation endpoint, session management
  • backend/app/routers/agents.py — Agent CRUD, SUPPORTED_MODELS
  • backend/app/models/invocation.py — Invocation ORM (prompt, response, timing, tokens)
  • backend/app/models/config_entry.py — Agent config key-value pairs
  • backend/app/models/agent.py — Agent ORM model
  • agents/strands_agent/src/handler.py — Runtime invocation handler
  • frontend/src/api/types.ts — Shared TypeScript types

Technology Stack

  • Backend: Python, FastAPI, SQLAlchemy, SQLite
  • Frontend: TypeScript, React, Vite, shadcn/ui, Tailwind CSS
  • Agent Runtime: Strands SDK on Bedrock AgentCore Runtime
  • Models: Amazon Bedrock (Claude, Nova families)

Requirements

R1: On-Demand Evaluation Configuration and Execution

Builders should be able to configure an on-demand evaluation suite, run it, and see the results.

  • Create new ORM models in backend/app/models/evaluation.py:
    • EvaluationSuite — a named collection of test cases for an agent:
      • id (integer, primary key)
      • agent_id (integer, FK to agents, not null)
      • name (string, not null) — e.g. "Customer Support Quality"
      • judge_model_id (string, not null) — Bedrock model ID used as the judge (from SUPPORTED_MODELS)
      • criteria (text, JSON array) — list of scoring criteria, each with name, description, and weight (e.g. [{"name": "relevance", "description": "Is the response relevant to the prompt?", "weight": 1.0}, {"name": "helpfulness", "description": "Does the response help the user accomplish their goal?", "weight": 1.0}])
      • created_at, updated_at (datetime)
    • EvaluationTestCase — an individual test case within a suite:
      • id (integer, primary key)
      • suite_id (integer, FK to evaluation_suites, not null)
      • name (string, nullable) — optional label
      • prompt (text, not null) — the input prompt to send to the agent
      • expected_response (text, nullable) — optional reference answer for comparison
      • context (text, nullable) — optional additional context the judge should consider
      • created_at (datetime)
    • EvaluationRun — a single execution of a suite:
      • id (integer, primary key)
      • suite_id (integer, FK to evaluation_suites, not null)
      • qualifier (string, default "DEFAULT") — agent endpoint qualifier used
      • status (string, not null) — pending, running, complete, error
      • started_at, completed_at (datetime)
      • summary_scores (text, JSON) — aggregated scores per criterion after completion
    • EvaluationResult — per-test-case result within a run:
      • id (integer, primary key)
      • run_id (integer, FK to evaluation_runs, not null)
      • test_case_id (integer, FK to evaluation_test_cases, not null)
      • invocation_id (string, nullable) — FK to the invocation created for this test case
      • agent_response (text, nullable) — the agent's actual response
      • scores (text, JSON) — per-criterion scores, e.g. {"relevance": 4, "helpfulness": 5}
      • judge_reasoning (text, nullable) — the judge model's explanation for its scores
      • status (string) — pending, complete, error
      • error_message (text, nullable)
  • Create backend endpoints in backend/app/routers/evaluations.py:
    • POST /api/agents/{agent_id}/evaluations/suites — create an evaluation suite with criteria and test cases
    • GET /api/agents/{agent_id}/evaluations/suites — list suites for an agent
    • GET /api/agents/{agent_id}/evaluations/suites/{suite_id} — get suite detail including test cases
    • PUT /api/agents/{agent_id}/evaluations/suites/{suite_id} — update suite criteria or test cases
    • DELETE /api/agents/{agent_id}/evaluations/suites/{suite_id} — delete a suite
    • POST /api/agents/{agent_id}/evaluations/suites/{suite_id}/run — trigger an evaluation run
    • GET /api/agents/{agent_id}/evaluations/runs/{run_id} — get run status and results
  • Create a backend evaluation service (backend/app/services/evaluation.py) that orchestrates a run:
    • For each test case in the suite, invoke the agent (reuse the existing invocation pipeline) and capture the response
    • Send each prompt/response pair to the judge model via Bedrock with a scoring prompt that includes the criteria definitions, expected response (if provided), and scoring instructions (rate each criterion 1-5)
    • Parse the judge model's response to extract per-criterion numeric scores and reasoning
    • Store results in EvaluationResult and compute aggregated summary_scores on the EvaluationRun
    • Run evaluations asynchronously (background task) so the API returns immediately with the run ID
  • Create frontend components:
    • Add an Evaluations section on the AgentDetailPage (below sessions, above deployment), visible only for deployed agents
    • EvaluationSuiteManager — form to create/edit suites: name, judge model selector (from SUPPORTED_MODELS), criteria editor (add/remove criteria with name, description, weight), and test case editor (add/remove test cases with prompt, optional expected response)
    • EvaluationRunButton — triggers a run and shows progress (pending/running/complete)
    • EvaluationResultsView — displays results after a run completes (see R2)
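The judge-response parsing and score aggregation inside the evaluation service could be sketched as follows. This is a minimal sketch: `parse_judge_scores` and `aggregate_summary` are hypothetical helper names, and the JSON shape of the judge's reply is an assumption that the system-defined scoring prompt would enforce.

```python
import json
import re

def parse_judge_scores(judge_text: str, criteria: list[dict]) -> dict:
    # Assumes the scoring prompt instructs the judge to reply with a JSON
    # object like {"scores": {"relevance": 4, ...}, "reasoning": "..."}.
    match = re.search(r"\{.*\}", judge_text, re.DOTALL)
    if not match:
        raise ValueError("judge reply contained no JSON object")
    payload = json.loads(match.group(0))
    known = {c["name"] for c in criteria}
    # Keep only known criteria and clamp scores to the 1-5 range.
    return {name: max(1, min(5, int(score)))
            for name, score in payload.get("scores", {}).items()
            if name in known}

def aggregate_summary(per_case_scores: list[dict], criteria: list[dict]) -> dict:
    # Average each criterion across test cases, then combine the averages
    # into a weighted overall score for EvaluationRun.summary_scores.
    averages = {}
    for c in criteria:
        values = [s[c["name"]] for s in per_case_scores if c["name"] in s]
        averages[c["name"]] = sum(values) / len(values) if values else None
    weighted = [(averages[c["name"]], c["weight"])
                for c in criteria if averages[c["name"]] is not None]
    total_weight = sum(w for _, w in weighted)
    overall = (sum(v * w for v, w in weighted) / total_weight
               if total_weight else None)
    return {"criteria": averages, "overall": overall}
```

The Bedrock call itself is omitted so the parsing and aggregation logic can run (and be unit-tested) without AWS credentials.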

R2: Evaluation Report with Improvement Guidance

Builders should get a report of the results and guidance on how to improve low scores.

  • The EvaluationResultsView component should display:
    • Summary section at the top:
      • Overall score (weighted average across all criteria and test cases, displayed as a percentage or out of 5)
      • Per-criterion average scores displayed as a horizontal bar chart or score cards (e.g. "Relevance: 4.2/5", "Helpfulness: 3.8/5")
      • Run metadata: suite name, judge model, qualifier, start/end time, test case count
    • Per-test-case detail table:
      • Columns: test case name/prompt (truncated), per-criterion scores, overall score, status
      • Expandable rows showing: full prompt, expected response, agent response, judge reasoning
      • Color-coding: green (4-5), yellow (3), red (1-2) for individual scores
    • Improvement guidance panel:
      • For each criterion that scores below a configurable threshold (default: 3.5/5), generate actionable improvement suggestions
      • Create a backend endpoint GET /api/agents/{agent_id}/evaluations/runs/{run_id}/guidance that sends the aggregated results (low-scoring criteria, sample low-scoring prompt/response pairs) to the judge model with a meta-prompt asking for specific improvement recommendations
      • Guidance should include concrete suggestions such as:
        • System prompt modifications (e.g. "Add instructions to always cite sources")
        • Missing tool integrations (e.g. "The agent lacks a search tool for factual queries")
        • Model selection (e.g. "Consider using a more capable model for complex reasoning tasks")
        • Test case design (e.g. "Expected response for test case 3 may be too specific")
      • Display guidance as a bulleted list grouped by criterion, with a "Regenerate Guidance" button
  • Add an EvaluationHistory component that lists past runs for a suite:
    • Table showing run date, overall score, status, and a link to view full results
    • Allow comparison between two runs (side-by-side score diff) to show improvement or regression
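The run-to-run comparison could reduce to a small pure function over the per-criterion averages stored in summary_scores. The function name and the stability epsilon below are illustrative, not part of the spec:

```python
def compare_runs(baseline: dict, candidate: dict, epsilon: float = 0.05) -> dict:
    # Side-by-side diff of two runs' per-criterion averages (the "criteria"
    # map inside summary_scores). A criterion missing from either run is
    # marked "n/a" rather than guessed at.
    diff = {}
    for name in sorted(set(baseline) | set(candidate)):
        before, after = baseline.get(name), candidate.get(name)
        if before is None or after is None:
            trend = "n/a"
        elif after - before > epsilon:
            trend = "improved"
        elif before - after > epsilon:
            trend = "regressed"
        else:
            trend = "stable"
        diff[name] = {"before": before, "after": after, "trend": trend}
    return diff
```

Keeping this as a pure function means the frontend can render the side-by-side diff from two already-fetched run payloads without a dedicated comparison endpoint.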

R3: Runtime Evaluations with Dashboard

Builders should be able to optionally enable runtime evaluations and see dashboards of evaluation scores over time.

  • Add runtime_eval_enabled (boolean) and runtime_eval_config (JSON text) fields to the EvaluationSuite model:
    • runtime_eval_config contains: sample_rate (float, 0.0-1.0, default 0.1 — fraction of invocations to evaluate), judge_model_id (string), criteria (reuses suite criteria)
  • When runtime evaluation is enabled for a suite:
    • After each agent invocation completes (in the SSE streaming endpoint in invocations.py, after session_end), check if the agent has any suites with runtime_eval_enabled=True
    • Based on sample_rate, probabilistically decide whether to evaluate this invocation
    • If selected, queue a background evaluation task that sends the invocation's prompt/response to the judge model and stores the result as an EvaluationResult linked to both the run (a synthetic "runtime" run per day or per batch) and the original invocation
  • Either create a separate RuntimeEvalRun model or extend EvaluationRun with a run_type field (on_demand vs. runtime) to distinguish manually triggered runs from automated runtime evaluations
  • Create backend endpoints:
    • PUT /api/agents/{agent_id}/evaluations/suites/{suite_id}/runtime — enable/disable runtime evaluation and configure sample rate
    • GET /api/agents/{agent_id}/evaluations/runtime/scores — return time-series evaluation scores:
      • Accepts start_date, end_date, granularity (hourly, daily, weekly) query parameters
      • Returns per-criterion average scores bucketed by time period
  • Create frontend components:
    • Runtime evaluation toggle on the suite configuration form: switch to enable, sample rate slider (1%-100%), judge model selector
    • Evaluation Dashboard section on the AgentDetailPage (shown when runtime eval is enabled):
      • Time-series line chart showing per-criterion scores over time (x-axis: date, y-axis: score 1-5, one line per criterion)
      • Time range selector (last 7 days, 30 days, 90 days)
      • Summary statistics: current average vs. previous period, trend indicator (improving/stable/regressing)
      • Drill-down: clicking a data point shows the individual evaluated invocations for that time bucket with their scores
    • Alert indicators: if any criterion's rolling average drops below the threshold (default 3.5), show a warning badge on the Evaluations section header and in the dashboard
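The sampling decision and the time-series bucketing behind the dashboard could be sketched like this. The function names, the bucket-key formats, and the in-memory input shape are assumptions for illustration; a real implementation would likely aggregate in SQL:

```python
import random
from datetime import datetime

# strftime patterns per granularity; "%G-W%V" is the ISO year/week.
GRANULARITY_FORMATS = {
    "hourly": "%Y-%m-%d %H:00",
    "daily": "%Y-%m-%d",
    "weekly": "%G-W%V",
}

def should_evaluate(sample_rate: float) -> bool:
    # Probabilistic sampling decision made once per completed invocation.
    return random.random() < sample_rate

def bucket_scores(results: list[tuple[datetime, dict]],
                  granularity: str = "daily") -> dict:
    # Group (timestamp, per-criterion scores) pairs into time buckets and
    # average each criterion within a bucket -- the shape the
    # runtime/scores endpoint would return for the chart.
    fmt = GRANULARITY_FORMATS[granularity]
    buckets: dict[str, dict[str, list[int]]] = {}
    for ts, scores in results:
        key = ts.strftime(fmt)
        for name, value in scores.items():
            buckets.setdefault(key, {}).setdefault(name, []).append(value)
    return {key: {name: sum(vals) / len(vals) for name, vals in crit.items()}
            for key, crit in sorted(buckets.items())}
```

Because `random.random()` returns values in [0.0, 1.0), a sample_rate of 1.0 evaluates every invocation and 0.0 evaluates none, which matches the testing steps below.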

Testing

  • Run backend tests: cd backend && make test
  • Run frontend typecheck: cd frontend && npx tsc --noEmit
  • Verify on-demand evaluation:
    • Create an evaluation suite with 2-3 criteria and 3-5 test cases
    • Trigger a run and verify it progresses through pending -> running -> complete
    • Each test case produces an invocation, agent response, and judge scores
    • Summary scores aggregate correctly across test cases
  • Verify evaluation report:
    • Results display per-criterion scores with correct color coding
    • Expanding a test case row shows full prompt, response, and judge reasoning
    • Improvement guidance generates actionable suggestions for low-scoring criteria
    • Evaluation history shows past runs with scores
  • Verify runtime evaluation:
    • Enable runtime eval on a suite with a sample rate of 1.0 (100%) for testing
    • Invoke the agent several times and verify evaluation results are created for each invocation
    • The time-series endpoint returns correct score buckets
    • Dashboard chart renders scores over time
    • Reduce sample rate to 0.5 and verify approximately half of invocations are evaluated
    • Disable runtime eval and verify no new evaluations are created
  • Database:
    • New tables are created without affecting existing tables
    • Deleting a suite cascades to test cases, runs, and results
    • Deleting an agent cascades to evaluation suites

Out of Scope

  • Custom judge prompts (the scoring prompt is system-defined based on criteria)
  • Multi-turn conversation evaluation (each test case is a single prompt/response)
  • Evaluation of tool use quality (only the final text response is scored)
  • Automated remediation (changing agent config based on eval results)
  • Cost tracking for judge model invocations
  • Exporting evaluation results (CSV, JSON)
  • Comparison across different agents (evaluations are scoped to a single agent)
  • Integration with external evaluation frameworks (e.g. RAGAS, DeepEval)

Metadata

Labels

enhancement (New feature or request)
