@@ -54,7 +54,7 @@ Create experiments with multiple runs using different agents or models. The UI p

## Generating reports

Generate downloadable reports from experiment results:
Generate downloadable reports from experiment results for sharing or tracking progress over time.

```bash
# HTML report (default)
@@ -67,4 +67,6 @@ npx @opensearch-project/agent-health report -b "My Benchmark" -f pdf -o report.p
npx @opensearch-project/agent-health report -b "My Benchmark" -f json --stdout
```

Reports include judge reasoning, accuracy scores, and improvement suggestions for each test case.
Reports include a summary of each run (agent, model, pass rate, average accuracy), a per-test-case comparison table across runs, the judge's reasoning and improvement suggestions for each evaluation, and full trajectory steps showing what the agent did.

Use `--runs` to include specific runs, or omit it to include all runs in the experiment.
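
As a rough illustration of the per-run summary described above, the sketch below shows one way pass rate and average accuracy could be derived from individual test-case results. The `CaseResult` shape, its field names, and the 0-100 accuracy scale are assumptions made for this example; they are not taken from the Agent Health report schema.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    # Hypothetical record shape; the real report schema may differ.
    test_case_id: str
    passed: bool
    accuracy: float  # judge-assigned score, assumed to be on a 0-100 scale


def summarize_run(results: list[CaseResult]) -> dict[str, float]:
    """Return the pass rate (percent) and mean accuracy for one run."""
    if not results:
        # An empty run has no meaningful rates; report zeros instead of dividing by zero.
        return {"pass_rate": 0.0, "avg_accuracy": 0.0}
    passed = sum(1 for r in results if r.passed)
    return {
        "pass_rate": 100.0 * passed / len(results),
        "avg_accuracy": sum(r.accuracy for r in results) / len(results),
    }


results = [
    CaseResult("tc-1", True, 90.0),
    CaseResult("tc-2", False, 40.0),
    CaseResult("tc-3", True, 80.0),
    CaseResult("tc-4", True, 70.0),
]
summary = summarize_run(results)
print(summary)  # {'pass_rate': 75.0, 'avg_accuracy': 70.0}
```

Computing both figures from the same list keeps the summary consistent with the per-test-case rows it is derived from.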
@@ -76,10 +76,11 @@ Sample data IDs start with `demo-` prefix and are read-only.

![Agent Health Dashboard](/docs/images/agent-health/dashboard.png)

The main dashboard displays:
- Active experiments and their status
- Recent evaluation runs
- Quick statistics on pass/fail rates
Agent Health opens to the Leaderboard Overview — an at-a-glance view of agent performance across all experiments, pre-loaded with sample data.

The top section shows a performance trend chart tracking metrics over time. Use the dropdowns to switch between pass rate, cost, tokens, or latency, and adjust the time range (7 days, 30 days, or all time).

The bottom section is a sortable metrics table showing every experiment and agent combination, with columns for run count, pass rate, latency, and cost. Click any column header to sort. Click an experiment or agent name to filter the trend chart to that selection; active filters are shown and can be cleared at any time. Each row links to the experiment's detailed runs view.
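
The sorting and filtering behavior of the metrics table can be pictured with a small sketch. Everything here is hypothetical: the `MetricsRow` fields only mirror the columns named above, and the functions are an illustration of the idea, not the UI's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricsRow:
    # Hypothetical row shape mirroring the table columns described above.
    experiment: str
    agent: str
    run_count: int
    pass_rate: float  # percent
    latency_ms: float
    cost_usd: float


def sort_rows(rows: list[MetricsRow], column: str, descending: bool = True) -> list[MetricsRow]:
    """Sort rows by a column name, as clicking a column header might."""
    return sorted(rows, key=lambda r: getattr(r, column), reverse=descending)


def filter_rows(
    rows: list[MetricsRow],
    experiment: Optional[str] = None,
    agent: Optional[str] = None,
) -> list[MetricsRow]:
    """Keep rows matching the selected experiment and/or agent; None means no filter."""
    return [
        r
        for r in rows
        if (experiment is None or r.experiment == experiment)
        and (agent is None or r.agent == agent)
    ]


rows = [
    MetricsRow("exp-a", "agent-1", 3, 80.0, 1200.0, 0.42),
    MetricsRow("exp-a", "agent-2", 2, 95.0, 1500.0, 0.61),
    MetricsRow("exp-b", "agent-1", 5, 60.0, 900.0, 0.30),
]
by_pass_rate = sort_rows(rows, "pass_rate")
only_agent_1 = filter_rows(rows, agent="agent-1")
```

Calling `filter_rows(rows)` with no arguments returns every row, which corresponds to clearing all active filters.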

## Run your first evaluation

10 changes: 10 additions & 0 deletions docs/starlight-docs/src/content/docs/agent-health/index.md
@@ -16,6 +16,10 @@ npx @opensearch-project/agent-health@latest

Opens http://localhost:4001 with pre-loaded sample data for exploration.

## See it in action

<video src="https://github.com/opensearch-project/observability-stack/releases/download/v3.6.0-alpha.1/agent-health-demo.mp4" controls></video>

## Who uses Agent Health

- **AI teams** building autonomous agents (RCA, customer support, data analysis)
@@ -36,6 +40,12 @@ Opens http://localhost:4001 with pre-loaded sample data for exploration.

Agent Health uses a client-server architecture where all clients (UI, CLI) access storage through a unified HTTP API. The server handles agent communication via pluggable connectors and proxies LLM judge calls to AWS Bedrock.

## Agent Health and the Observability Stack

Agent Health is a UI- and CLI-based evaluation tool for scoring agent quality through LLM judge comparison, running experiments, and generating reports. By default, it stores test cases, experiments, runs, and evaluation results as local files.

When pointed at an OpenSearch cluster, including the one running in the [Observability Stack](/docs/get-started/overview/), Agent Health stores test cases, experiments, runs, and evaluation results in OpenSearch indices instead of local files. If the same cluster is receiving OpenTelemetry traces through the stack pipeline, Agent Health can also read those traces and display them alongside evaluation results, connecting what the agent did with how well it performed.

## Supported connectors

| Connector | Protocol | Description |