diff --git a/docs/starlight-docs/src/content/docs/agent-health/evaluations/experiments.md b/docs/starlight-docs/src/content/docs/agent-health/evaluations/experiments.md
index 5c37ceb5..09ec3cf5 100644
--- a/docs/starlight-docs/src/content/docs/agent-health/evaluations/experiments.md
+++ b/docs/starlight-docs/src/content/docs/agent-health/evaluations/experiments.md
@@ -54,7 +54,7 @@ Create experiments with multiple runs using different agents or models. The UI p
 
 ## Generating reports
 
-Generate downloadable reports from experiment results:
+Generate downloadable reports from experiment results for sharing or tracking progress over time.
 
 ```bash
 # HTML report (default)
@@ -67,4 +67,6 @@ npx @opensearch-project/agent-health report -b "My Benchmark" -f pdf -o report.p
 npx @opensearch-project/agent-health report -b "My Benchmark" -f json --stdout
 ```
 
-Reports include judge reasoning, accuracy scores, and improvement suggestions for each test case.
+Reports include a summary of each run (agent, model, pass rate, average accuracy), a per-test-case comparison table across runs, the judge's reasoning and improvement suggestions for each evaluation, and full trajectory steps showing what the agent did.
+
+Use `--runs` to include specific runs, or omit it to include all runs in the experiment.
diff --git a/docs/starlight-docs/src/content/docs/agent-health/getting-started.md b/docs/starlight-docs/src/content/docs/agent-health/getting-started.md
index 05999ad7..80e7ba6f 100644
--- a/docs/starlight-docs/src/content/docs/agent-health/getting-started.md
+++ b/docs/starlight-docs/src/content/docs/agent-health/getting-started.md
@@ -76,10 +76,11 @@ Sample data IDs start with `demo-` prefix and are read-only.
 ![Agent Health Dashboard](/docs/images/agent-health/dashboard.png)
 
-The main dashboard displays:
-- Active experiments and their status
-- Recent evaluation runs
-- Quick statistics on pass/fail rates
+Agent Health opens to the Leaderboard Overview, an at-a-glance view of agent performance across all experiments, pre-loaded with sample data.
+
+The top section shows a performance trend chart tracking metrics over time. Use the dropdowns to switch between pass rate, cost, tokens, and latency, and to adjust the time range (7 days, 30 days, or all time).
+
+The bottom section is a sortable metrics table showing every experiment and agent combination, with columns for run count, pass rate, latency, and cost. Click any column header to sort. Click an experiment or agent name to filter the trend chart to just that selection; active filters are displayed and can be cleared as needed. Each row links to the experiment’s detailed runs view.
 
 ## Run your first evaluation
 
diff --git a/docs/starlight-docs/src/content/docs/agent-health/index.md b/docs/starlight-docs/src/content/docs/agent-health/index.md
index 9fb2bbf4..3d53ea09 100644
--- a/docs/starlight-docs/src/content/docs/agent-health/index.md
+++ b/docs/starlight-docs/src/content/docs/agent-health/index.md
@@ -16,6 +16,10 @@ npx @opensearch-project/agent-health@latest
 
 Opens http://localhost:4001 with pre-loaded sample data for exploration.
 
+## See it in action
+
+
+
 ## Who uses Agent Health
 
 - **AI teams** building autonomous agents (RCA, customer support, data analysis)
@@ -36,6 +40,12 @@ Opens http://localhost:4001 with pre-loaded sample data for exploration.
 
 Agent Health uses a client-server architecture where all clients (UI, CLI) access storage through a unified HTTP API. The server handles agent communication via pluggable connectors and proxies LLM judge calls to AWS Bedrock.
+## Agent Health and the Observability Stack
+
+Agent Health is a UI and CLI-based evaluation tool for scoring agent quality through LLM judge comparison, running experiments, and generating reports. By default it stores test cases, experiments, runs, and evaluation results as local files.
+
+When pointed at an OpenSearch cluster, including the one running in the [Observability Stack](/docs/get-started/overview/), Agent Health stores test cases, experiments, runs, and evaluation results in OpenSearch indices instead of local files. If the same cluster is receiving OpenTelemetry traces through the stack pipeline, Agent Health can also read those traces and display them alongside evaluation results, connecting what the agent did with how well it performed.
+
 ## Supported connectors
 
 | Connector | Protocol | Description |