
[FEATURE] Claude Code plugin to the observability-stack #119

@anirudha

Summary

We propose adding a Claude Code plugin to the observability-stack repository that teaches Claude Code how to query traces, logs, and metrics from the running stack using PPL, PromQL, and curl commands. The plugin is a set of markdown skill files with no runtime code and no build step. Claude Code loads them as context to gain OpenSearch-native observability capabilities.

No existing public Claude Code skill covers OpenSearch observability or PPL. This fills that gap.

Motivation

Developers using AI coding assistants with the observability stack currently have to:

  1. Manually look up PPL syntax for every trace or log query
  2. Remember the correct curl flags, auth credentials, and API endpoints for OpenSearch and Prometheus
  3. Know which index patterns store traces vs. logs vs. service maps
  4. Construct cross-signal correlation queries (trace-to-log joins) from scratch
  5. Debug stack health issues without structured guidance
  6. Build RED metrics dashboards and SLO/SLI monitoring from scratch
  7. Figure out how to connect to AWS managed services (Amazon OpenSearch Service, Amazon Managed Prometheus) with SigV4 auth

A Claude Code plugin eliminates this friction. When a developer asks "show me the slowest agent invocations in the last hour", "what's the error budget burn rate for the payment service?", or "why is the payment service erroring?", Claude Code can immediately construct and execute the right PPL or PromQL query against the right endpoint with the right auth.

Glossary

| Term | Definition |
| --- | --- |
| Plugin | A collection of CLAUDE.md-compatible markdown skill files placed in a project directory that Claude Code loads as context to gain domain-specific capabilities. |
| Skill File | A single markdown file with frontmatter (`name`, `description`, `allowed-tools`) and instructional content that teaches Claude Code a specific capability. |
| PPL | Piped Processing Language, the query language used by OpenSearch for log and trace analytics. Queries are piped commands starting with `source=<index>`. |
| PromQL | Prometheus Query Language, used for querying time-series metrics from Prometheus. |
| OpenSearch | The search and analytics engine that stores traces and logs in this stack, accessible at port 9200 with HTTPS and basic authentication. |
| Prometheus | The time-series database that stores metrics in this stack, accessible at port 9090. |
| OTel Collector | The OpenTelemetry Collector that receives telemetry via OTLP on ports 4317 (gRPC) and 4318 (HTTP) and routes data to Data Prepper and Prometheus. |
| Data Prepper | The pipeline processor that transforms and enriches logs and traces before writing them to OpenSearch. |
| Trace Index | The OpenSearch index pattern `otel-v1-apm-span-*` storing trace span data. |
| Log Index | The OpenSearch index pattern `otel-v1-apm-log-*` storing log data. |
| Service Map Index | The OpenSearch index `otel-v2-apm-service-map` storing service dependency topology. |
| Gen AI Attributes | OpenTelemetry semantic convention attributes for generative AI operations, prefixed with `gen_ai.*` (e.g., `gen_ai.operation.name`, `gen_ai.agent.name`, `gen_ai.usage.input_tokens`). |
| Stack | The complete observability infrastructure: OTel Collector, Data Prepper, OpenSearch, Prometheus, and OpenSearch Dashboards. |
| Cross-Signal Correlation | The practice of linking telemetry signals (traces, logs, metrics) using shared identifiers such as `traceId` and `spanId` to enable end-to-end investigation. |
| Exemplar | A Prometheus data structure that links an individual metric sample to a specific trace by carrying `trace_id` and `span_id` alongside the measurement value. Enables metric-to-trace correlation. |
| Test Fixture | A YAML file defining a single integration test case with command, expected status code, expected response fields, and tags. |
| PPL Grammar Source | The official OpenSearch PPL grammar documentation located in the opensearch-project/sql repository under `docs/user/ppl/`. |
| RED Metrics | Rate, Errors, Duration: the three golden signals for service-level APM monitoring. Rate measures throughput, Errors measures failure ratio, Duration measures latency distribution. |
| SLI | Service Level Indicator: a quantitative measurement of a service's behavior, such as the ratio of successful requests to total requests. |
| SLO | Service Level Objective: a target value or range for an SLI, such as "99.9% availability over 30 days." |
| Error Budget | The allowed amount of unreliability derived from an SLO. For a 99.9% SLO, the error budget is 0.1%. |
| Burn Rate | The speed at which the error budget is being consumed. A burn rate of 1x means the budget will be exhausted exactly at the end of the SLO window. |
| Recording Rule | A Prometheus configuration that pre-computes and stores the result of a PromQL expression as a new time series, enabling efficient querying of SLI metrics at multiple time windows. |
| AWS SigV4 | AWS Signature Version 4, the authentication protocol used to sign HTTP requests to AWS services, including Amazon OpenSearch Service and Amazon Managed Prometheus. |

Architecture

System Context

```mermaid
graph TB
    subgraph "Claude Code Plugin"
        CM[CLAUDE.md<br/>Entry Point]
        subgraph "skills/"
            TS[traces.md]
            LS[logs.md]
            MS[metrics.md]
            SH[stack-health.md]
            PR[ppl-reference.md]
            CR[correlation.md]
            AR[apm-red.md]
            SL[slo-sli.md]
        end
        subgraph "tests/"
            CF[conftest.py]
            TF[test_fixtures.py]
            TR[test_runner.py]
            FX[fixtures/*.yaml]
        end
    end

    subgraph "Observability Stack"
        OS[OpenSearch :9200<br/>HTTPS + Basic Auth]
        PM[Prometheus :9090<br/>HTTP]
        OC[OTel Collector :4317/:4318]
        DP[Data Prepper :21890]
    end

    CM -->|references| TS
    CM -->|references| LS
    CM -->|references| MS
    CM -->|references| SH
    CM -->|references| PR
    CM -->|references| CR
    CM -->|references| AR
    CM -->|references| SL

    TS -->|PPL queries via curl| OS
    LS -->|PPL queries via curl| OS
    CR -->|PPL queries via curl| OS
    CR -->|PromQL + exemplars via curl| PM
    AR -->|PromQL RED queries via curl| PM
    AR -->|PPL RED queries via curl| OS
    SL -->|PromQL SLO queries via curl| PM
    SH -->|health checks via curl| OS
    SH -->|health checks via curl| PM
    SH -->|health checks via curl| OC
    MS -->|PromQL queries via curl| PM
    PR -->|PPL reference for| OS

    TR -->|validates commands from| FX
    CF -->|checks health of| OS
    CF -->|checks health of| PM
```

Data Flow

```mermaid
flowchart LR
    A[User asks Claude Code<br/>an observability question] --> B[Claude Code reads CLAUDE.md]
    B --> C{Route by intent}
    C -->|trace investigation| D[Load traces.md]
    C -->|log search| E[Load logs.md]
    C -->|metrics query| F[Load metrics.md]
    C -->|stack issues| G[Load stack-health.md]
    C -->|PPL syntax help| H[Load ppl-reference.md]
    C -->|cross-signal correlation| X[Load correlation.md]
    C -->|RED metrics / APM| Y[Load apm-red.md]
    C -->|SLO/SLI / error budget| Z[Load slo-sli.md]
    D --> I[Execute curl command<br/>against OpenSearch PPL API]
    E --> I
    F --> J[Execute curl command<br/>against Prometheus API]
    G --> K[Execute curl/docker commands<br/>against stack endpoints]
    H --> L[Reference for constructing<br/>novel PPL queries]
    X --> I
    X --> J
    Y --> I
    Y --> J
    Z --> J
```

What's Included

Eight Skill Files

The plugin ships as a CLAUDE.md entry point plus eight skill files in a skills/ directory:

| Skill | What it does | Query language | Target |
| --- | --- | --- | --- |
| traces.md | Query trace spans: agent invocations, tool executions, slow spans, errors, token usage, trace tree reconstruction, cross-signal correlation | PPL | OpenSearch :9200 |
| logs.md | Query logs: severity filtering, trace correlation, error patterns, log volume, body search | PPL | OpenSearch :9200 |
| metrics.md | Query metrics: HTTP rates, latency percentiles, error rates, GenAI token usage, operation duration | PromQL | Prometheus :9090 |
| stack-health.md | Health checks for all stack components, troubleshooting guide, port reference | curl + docker | All services |
| ppl-reference.md | Comprehensive PPL language reference: 50+ commands, 14 function categories, 3 API endpoints | n/a | Reference |
| correlation.md | Cross-signal correlation: trace-log joins via PPL, metric-to-trace via Prometheus exemplars, resource-level correlation, investigation workflows | PPL + PromQL | OpenSearch + Prometheus |
| apm-red.md | APM RED metrics: per-service request rate, error ratio, latency percentiles (p50/p95/p99), GenAI RED, OTel HTTP semantic conventions | PromQL + PPL | Prometheus + OpenSearch |
| slo-sli.md | SLO/SLI monitoring: SLI definitions, Prometheus recording rules, error budgets, multi-window burn rate alerts, compliance reporting | PromQL | Prometheus :9090 |

Plugin Directory Structure

```
claude-code-observability-plugin/
├── CLAUDE.md                    # Entry point, routing table for skills
├── skills/
│   ├── traces.md                # Trace querying with PPL
│   ├── logs.md                  # Log querying with PPL
│   ├── metrics.md               # Metrics querying with PromQL
│   ├── stack-health.md          # Health checks and troubleshooting
│   ├── ppl-reference.md         # Comprehensive PPL language reference
│   ├── correlation.md           # Cross-signal correlation workflows
│   ├── apm-red.md               # APM RED metrics (Rate, Errors, Duration)
│   └── slo-sli.md               # SLO/SLI definitions, error budgets, burn rates
└── tests/
    ├── README.md                # Test documentation
    ├── conftest.py              # Session fixtures, stack health gate
    ├── test_runner.py           # YAML-driven test execution
    ├── models.py                # Pydantic test fixture model
    ├── requirements.txt         # pytest, pyyaml, pydantic, requests
    └── fixtures/
        ├── traces.yaml          # Trace skill test cases
        ├── logs.yaml            # Log skill test cases
        ├── metrics.yaml         # Metrics skill test cases
        ├── stack-health.yaml    # Stack health test cases
        ├── ppl.yaml             # PPL reference test cases
        ├── correlation.yaml     # Correlation skill test cases
        ├── apm-red.yaml         # APM RED skill test cases
        └── slo-sli.yaml         # SLO/SLI skill test cases
```

Skill File Format

Each skill file follows the Claude Code CLAUDE.md convention:

```yaml
---
name: <skill-name>
description: <one-line summary>
allowed-tools:
  - Bash
  - curl
---
```

Every query template is a complete, copy-paste-ready curl command with:

  • Correct protocol (HTTPS for OpenSearch, HTTP for Prometheus)
  • Authentication (-u admin:'My_password_123!@#' for OpenSearch, none for Prometheus)
  • Certificate skip (-k for development)
  • Proper JSON body with PPL/PromQL query
  • Backtick escaping for dotted field names in PPL
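As a sketch of how these pieces fit together, the helper below assembles such a curl command from a PPL query; `build_ppl_curl` is a hypothetical illustration (not part of the plugin), using the local-stack endpoint and default credentials described in this document.

```python
# Sketch: assemble a copy-paste-ready curl command for a PPL query.
# build_ppl_curl is a hypothetical helper; endpoint and credentials are the
# local-stack defaults described above.
import json

def build_ppl_curl(query: str,
                   endpoint: str = "https://localhost:9200",
                   user: str = "admin",
                   password: str = "My_password_123!@#") -> str:
    body = json.dumps({"query": query})          # proper JSON body with the PPL query
    return (
        f"curl -sk -u {user}:'{password}' "      # -k skips cert verification (dev only)
        f"-X POST {endpoint}/_plugins/_ppl "
        "-H 'Content-Type: application/json' "
        f"-d '{body}'"
    )

print(build_ppl_curl("source=otel-v1-apm-span-* | head 5"))
```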

Requirements

Requirement 1: Plugin Directory Structure

As a developer, I want the plugin organized as a directory of skill files with a top-level CLAUDE.md entry point, so that Claude Code automatically loads the observability capabilities when I work in the project.

  • The plugin contains a top-level CLAUDE.md that references all skill files
  • Skill files live in a single skills/ directory
  • Eight skill files: traces, logs, metrics, stack-health, ppl-reference, correlation, apm-red, and slo-sli
  • Each skill file includes frontmatter with name, description, and allowed-tools

Requirement 2: Traces Skill

As a developer, I want to query trace data from OpenSearch using PPL, so that I can investigate agent invocations, tool executions, slow spans, error spans, and token usage.

  • PPL query templates for agent invocation spans (attributes.gen_ai.operation.name = invoke_agent)
  • PPL query templates for tool execution spans (attributes.gen_ai.operation.name = execute_tool)
  • Slow span detection where durationInNanos exceeds a configurable threshold
  • Error span identification where status.code = 2
  • Token usage aggregation by model and by agent name
  • Service operation listing with GenAI operation type breakdown
  • Service map queries for dependency exploration
  • All GenAI attributes documented with descriptions and example values
  • Every PPL query includes the complete curl command with endpoint, auth, and escaping
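To make the slow-span requirement concrete, here is a minimal sketch of how such a template might be parameterized. Field names follow the Trace Index schema in this document; the threshold default and function name are illustrative.

```python
# Sketch: parameterized slow-span PPL template for invoke_agent spans.
# Field names (durationInNanos, attributes.gen_ai.operation.name) follow the
# Trace Index schema; the threshold and limit defaults are illustrative.
def slow_agent_spans(threshold_ms: int = 1000, limit: int = 10) -> str:
    threshold_ns = threshold_ms * 1_000_000      # durationInNanos is in nanoseconds
    return (
        "source=otel-v1-apm-span-* "
        "| WHERE `attributes.gen_ai.operation.name` = 'invoke_agent' "
        f"AND durationInNanos > {threshold_ns} "
        "| sort - durationInNanos "
        f"| head {limit}"
    )

print(slow_agent_spans(500))
```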

Requirement 3: Logs Skill

As a developer, I want to query log data from OpenSearch using PPL, so that I can search logs by severity, correlate logs with traces, identify error patterns, and analyze log volume.

  • Severity-based filtering (ERROR, WARN, INFO)
  • Trace-to-log correlation via traceId
  • Error pattern identification with stats count() by aggregations
  • Log volume trending over time with span(time, <interval>)
  • Full-text body search with string matching or relevance functions
  • Log Index field reference: severityText, severityNumber, traceId, spanId, serviceName, body, @timestamp
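A minimal sketch of the log-volume template, assuming the `span(time, <interval>)` bucketing described above; the interval and severity defaults are illustrative.

```python
# Sketch: log-volume trend per service at a given severity, using span()
# time bucketing as described above. Index pattern and field names follow
# the Log Index schema; defaults are illustrative.
def log_volume_query(interval: str = "5m", severity: str = "ERROR") -> str:
    return (
        "source=otel-v1-apm-log-* "
        f"| WHERE severityText = '{severity}' "
        f"| stats count() by span(`@timestamp`, {interval}), serviceName"
    )

print(log_volume_query("1h", "WARN"))
```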

Requirement 4: Metrics Skill

As a developer, I want to query metrics from Prometheus using PromQL, so that I can monitor HTTP request rates, latency percentiles, error rates, and active connections.

  • HTTP request rate per second grouped by service
  • HTTP latency at p95 and p99 by service
  • HTTP error rate (5xx) as a ratio
  • Active HTTP connections by service
  • Database operation latency at p95
  • Every PromQL query includes the complete curl command targeting localhost:9090/api/v1/query
  • Note on PPL as alternative for OpenSearch-ingested metrics
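As an illustration of the latency-percentile requirement, this sketch builds the curl command for a p95 query; the `_bucket` suffix follows the standard Prometheus histogram convention, and the rate window is illustrative.

```python
# Sketch: curl command for a p95 latency PromQL query against the local
# Prometheus instance. The _bucket suffix is the standard Prometheus
# histogram convention; the 5m window is illustrative.
import urllib.parse

def p95_latency_curl(metric: str = "http_server_duration_seconds_bucket",
                     window: str = "5m") -> str:
    promql = (f"histogram_quantile(0.95, "
              f"sum(rate({metric}[{window}])) by (le, service_name))")
    # URL-encode the query so parentheses and brackets survive the shell
    return ("curl -s 'http://localhost:9090/api/v1/query?query="
            + urllib.parse.quote(promql) + "'")

print(p95_latency_curl())
```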

Requirement 5: Stack Health Skill

As a developer, I want to check the health of all observability stack components and troubleshoot common issues, so that I can verify the stack is operational and diagnose data flow problems.

  • Health check curl commands for OpenSearch, Prometheus, OTel Collector
  • Index listing and document count verification
  • Docker compose commands for container status and logs
  • Troubleshooting section for common failures: OpenSearch unreachable, no data in indices, Data Prepper pipeline errors, OTel Collector export failures
  • Port reference: OpenSearch (9200), OTel Collector gRPC (4317), OTel Collector HTTP (4318), Data Prepper (21890), Prometheus (9090), OpenSearch Dashboards (5601)
  • PPL describe for index mapping inspection
  • PPL _explain endpoint for query plan debugging

Requirement 6: PPL Reference Skill

As a developer, I want a comprehensive PPL language reference available to Claude Code, so that Claude Code can understand PPL syntax and construct correct queries for any observability question.

Commands (50+):

  • Core query: search, source, where, fields, stats, sort, head, eval, dedup, rename, top, rare, table
  • Time-series: timechart, chart, bin, trendline, streamstats, eventstats
  • Parse/extract: parse, grok, rex, regex, patterns, spath
  • Join/lookup: join, lookup, graphlookup, subquery, append, appendcol, appendpipe
  • Transform: fillnull, flatten, expand, transpose, convert, replace, reverse
  • Multi-value: mvexpand, mvcombine, nomv
  • Aggregation/totals: addcoltotals, addtotals
  • ML: ad (anomaly detection), kmeans, ml
  • System: describe, explain, showdatasources, multisearch
  • Display: fieldformat

Functions (14 categories):

  • Aggregation: COUNT, SUM, AVG, MAX, MIN, VAR_SAMP, VAR_POP, STDDEV_SAMP, STDDEV_POP, DISTINCT_COUNT, PERCENTILE, EARLIEST, LATEST, LIST, VALUES, FIRST, LAST
  • Collection: ARRAY, SPLIT, MVJOIN, MVCOUNT, MVINDEX, MVFIRST, MVLAST, MVAPPEND, MVDEDUP, MVSORT, MVZIP, MVRANGE, MVFILTER
  • Condition: ISNULL, ISNOTNULL, IF, IFNULL, NULLIF, CASE, COALESCE, LIKE, IN, BETWEEN
  • Conversion: CAST, TOSTRING, TONUMBER, TOINT, TOLONG, TOFLOAT, TODOUBLE, TOBOOLEAN
  • Cryptographic: MD5, SHA1, SHA2
  • Datetime: NOW, CURDATE, CURTIME, DATE_FORMAT, DATE_ADD, DATE_SUB, DATEDIFF, DAY, MONTH, YEAR, HOUR, MINUTE, SECOND, DAYOFWEEK, DAYOFYEAR, WEEK, UNIX_TIMESTAMP, FROM_UNIXTIME, and more
  • Expressions: arithmetic (+, -, *, /), comparison (=, !=, <, >, <=, >=), logical (AND, OR, NOT, XOR)
  • IP: CIDRMATCH, GEOIP
  • JSON: JSON_EXTRACT, JSON_KEYS, JSON_VALID, JSON_ARRAY, JSON_OBJECT, JSON_ARRAY_LENGTH, JSON_EXTRACT_PATH_TEXT, TO_JSON_STRING
  • Math: ABS, CEIL, FLOOR, ROUND, SQRT, POW, MOD, LOG, LOG2, LOG10, LN, EXP, and more
  • Relevance: MATCH, MATCH_PHRASE, MULTI_MATCH, QUERY_STRING, SIMPLE_QUERY_STRING, HIGHLIGHT, SCORE, WILDCARD_QUERY
  • Statistical: CORR, COVAR_POP, COVAR_SAMP
  • String: CONCAT, LENGTH, LOWER, UPPER, TRIM, SUBSTRING, REPLACE, REGEXP, REGEXP_EXTRACT, REGEXP_REPLACE, and more
  • System: TYPEOF

API Endpoints:

  • Query execution: POST /_plugins/_ppl with JSON body {"query": "<ppl_query>"}
  • Query explain: POST /_plugins/_ppl/_explain
  • Grammar metadata: GET /_plugins/_ppl/_grammar

Source: the grammar reference is derived from the opensearch-project/sql repository's docs/user/ppl/ directory.

Requirement 7: Skill File Format Compliance

  • Each skill file is valid markdown with YAML frontmatter delimited by ---
  • Frontmatter contains name, description, and allowed-tools fields
  • Top-level CLAUDE.md references each skill file path with a one-line summary
  • Credentials sourced from .env file (admin / My_password_123!@#), noted as configurable

Requirement 8: Authentication and Connection Details

| Service | Protocol | Port | Auth |
| --- | --- | --- | --- |
| OpenSearch (local) | HTTPS | 9200 | Basic auth (`admin` / `My_password_123!@#`), `-k` flag for cert skip |
| OpenSearch (AWS managed) | HTTPS | 443 | AWS SigV4 (`--aws-sigv4 "aws:amz:REGION:es"`) |
| Prometheus (local) | HTTP | 9090 | None |
| Prometheus (AWS managed) | HTTPS | 443 | AWS SigV4 (`--aws-sigv4 "aws:amz:REGION:aps"`) |
| OTel Collector | gRPC / HTTP | 4317 (gRPC), 4318 (HTTP) | None |
| Data Prepper | HTTP | 21890 | None |
| OpenSearch Dashboards | HTTP | 5601 | Same as OpenSearch |

All credentials are sourced from the repository .env file. The test harness reads .env with fallback to these defaults.

Skill files provide curl command variants for both local and AWS managed endpoints. The CLAUDE.md entry point includes a configuration section where users set $OPENSEARCH_ENDPOINT and $PROMETHEUS_ENDPOINT environment variables to switch between local and managed services. PPL and PromQL query syntax is identical across both profiles; only the endpoint URL and authentication method differ.
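The profile switch might look like the following sketch, which derives the curl flags for each profile. The SigV4 service names follow the table above; the function itself and the `--user` key/secret convention for curl's `--aws-sigv4` are assumptions for illustration.

```python
# Sketch: derive curl flags per connection profile. SigV4 service names
# follow the table above; the function and the --user key:secret pairing
# for curl's --aws-sigv4 are illustrative assumptions.
def opensearch_curl_flags(profile: str, region: str = "us-east-1") -> str:
    if profile == "local":
        return "-sk -u admin:'My_password_123!@#' https://localhost:9200"
    if profile == "aws":
        return (f"-s --aws-sigv4 'aws:amz:{region}:es' "
                "--user \"$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY\" "
                "\"$OPENSEARCH_ENDPOINT\"")
    raise ValueError(f"unknown profile: {profile}")

print(opensearch_curl_flags("local"))
print(opensearch_curl_flags("aws", "eu-west-1"))
```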

Requirement 9: PPL Grammar Source Documentation

  • Grammar reference sourced from opensearch-project/sql repository's docs/user/ppl/ directory
  • Repository URL included: https://github.com/opensearch-project/sql
  • Commands organized into logical categories
  • Functions organized into categories matching the source repository

Requirement 10: Cross-Signal Correlation and GenAI Debugging

As a developer, I want the plugin skills to support cross-signal correlation between traces, logs, and metrics, and provide GenAI-specific debugging capabilities, so that I can perform end-to-end observability investigations across all telemetry signals.

Cross-signal correlation:

  • Trace-to-log joins by matching traceId across Trace Index and Log Index
  • Log-to-span correlation by spanId
  • Full trace tree reconstruction by traceId with parentSpanId hierarchy
  • Latency gap analysis between parent and child spans
  • Root span identification where parentSpanId is empty or null

GenAI operation types (beyond invoke_agent and execute_tool):

  • chat, embeddings, retrieval, create_agent, text_completion, generate_content

Exception and error querying:

  • Span events with exception.type, exception.message, exception.stacktrace
  • Spans with error.type for error categorization
  • Exception-to-log correlation via shared traceId and spanId

Extended GenAI attributes:

  • gen_ai.agent.id, gen_ai.agent.description, gen_ai.agent.version
  • gen_ai.conversation.id for multi-turn conversation tracking
  • gen_ai.tool.call.id, gen_ai.tool.type, gen_ai.tool.call.arguments, gen_ai.tool.call.result

GenAI-specific metrics:

  • gen_ai_client_token_usage histogram grouped by operation and model
  • gen_ai_client_operation_duration histogram grouped by operation and model

Requirement 11: Integration Test Harness

As a developer, I want an integration test suite that validates all skill file commands against a running observability stack, so that I can verify the plugin's queries and health checks produce correct results.

Test infrastructure:

  • pytest test suite in a tests/ directory within the plugin
  • YAML fixture files defining test cases with command, expected_status_code, expected_fields, and tags
  • Pydantic model for strict schema validation (extra="forbid")
  • Session-scoped fixture that checks stack health before tests run
  • All tests skipped with clear message if stack is not running

Test categories:

  • traces: PPL queries against Trace Index, validate schema and datarows in response
  • logs: PPL queries against Log Index, validate response structure
  • metrics: PromQL queries against Prometheus, validate status: "success" and data field
  • stack-health: Health check commands, validate HTTP 200 status codes
  • ppl: PPL system commands (describe, _explain), validate response structure
  • correlation: Cross-signal correlation queries, validate join results and exemplar responses
  • apm_red: RED metric queries against Prometheus and OpenSearch, validate rate/error/duration responses
  • slo_sli: SLO/SLI queries against Prometheus, validate recording rule outputs and burn rate calculations

Test execution:

  • Commands executed via subprocess.run with configurable timeout (default 30s)
  • JSON response parsing with recursive field lookup for expected_fields
  • pytest markers for tag-based filtering (pytest -m traces)
  • before_test and after_test hooks in YAML for setup/teardown scripts
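The recursive field lookup mentioned above can be sketched as follows; this is an illustration of the idea, not the harness's actual implementation.

```python
# Sketch of the recursive expected_fields lookup: return True if a key
# appears anywhere in a nested JSON response. Illustrative only.
def has_field(obj, field: str) -> bool:
    if isinstance(obj, dict):
        if field in obj:
            return True
        return any(has_field(v, field) for v in obj.values())
    if isinstance(obj, list):
        return any(has_field(item, field) for item in obj)
    return False  # scalars cannot contain a field

# A Prometheus-style response contains "result" but no "datarows"
resp = {"status": "success", "data": {"result": [{"metric": {}, "value": [0, "1"]}]}}
assert has_field(resp, "result") and not has_field(resp, "datarows")
```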

Configuration:

  • Connection details read from .env with fallback defaults
  • Dependencies: pytest, pyyaml, pydantic, requests, hypothesis
  • README documenting how to run tests, prerequisites, and how to add new test cases

Requirement 12: Correlation Skill

As a developer, I want a dedicated correlation skill that teaches Claude Code how to join traces, logs, and metrics across all three telemetry signals using OTel semantic convention correlation fields, so that I can perform end-to-end investigations starting from any signal.

OTel correlation fields (sourced from opentelemetry.io):

The OTel specification defines three correlation mechanisms across signals:

| Mechanism | Fields | Signals connected | How it works |
| --- | --- | --- | --- |
| Trace context | `traceId`, `spanId`, `traceFlags` | Traces + Logs | Both span records and log records carry the same `traceId`/`spanId`, enabling direct joins |
| Exemplars | `trace_id`, `span_id`, `filtered_attributes` | Metrics + Traces | Prometheus exemplars attach trace context to individual metric samples |
| Resource attributes | `service.name`, `service.namespace`, `service.version`, `service.instance.id` | All three signals | Every span, metric data point, and log record from the same service carries identical resource attributes |

GenAI resource attributes promoted to Prometheus labels in this stack:

  • gen_ai.agent.id, gen_ai.agent.name, gen_ai.provider.name, gen_ai.request.model, gen_ai.response.model
  • These are configured in docker-compose/prometheus/prometheus.yml under otlp.promote_resource_attributes
  • This enables PromQL queries filtered by agent or model that can then be correlated to traces via exemplars

Trace-to-log correlation (PPL):

  • Find all logs for a trace: source=otel-v1-apm-log-* | WHERE traceId = '<id>'
  • Find logs for a specific span: source=otel-v1-apm-log-* | WHERE spanId = '<id>'
  • Join spans with logs: PPL join across Trace Index and Log Index on traceId
  • Full timeline reconstruction: all spans + all logs for a traceId, sorted by timestamp

Log-to-trace correlation (PPL):

  • From an error log, extract traceId and query the Trace Index for the full trace tree
  • From a log entry, extract spanId and find the exact span that produced it

Metric-to-trace correlation (PromQL + exemplars):

  • Query Prometheus exemplars API: GET /api/v1/query_exemplars?query=<metric>&start=<start>&end=<end>
  • Extract trace_id from exemplar, then query Trace Index via PPL
  • Filter metrics by GenAI labels (gen_ai_agent_name, gen_ai_request_model), then correlate to traces
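The two-step hop from metric to trace can be sketched as below: build the exemplars request, then the follow-up PPL lookup for the extracted `trace_id`. The URL shape follows the endpoint named above; the timestamps and helper names are illustrative.

```python
# Sketch: metric-to-trace correlation in two steps. URL shape follows the
# Prometheus query_exemplars endpoint named above; helper names and
# timestamps are illustrative.
import urllib.parse

def exemplars_url(metric: str, start: int, end: int) -> str:
    qs = urllib.parse.urlencode({"query": metric, "start": start, "end": end})
    return f"http://localhost:9090/api/v1/query_exemplars?{qs}"

def trace_lookup_ppl(trace_id: str) -> str:
    # Step 2: query the Trace Index for the trace_id carried by the exemplar
    return f"source=otel-v1-apm-span-* | WHERE traceId = '{trace_id}'"

print(exemplars_url("gen_ai_client_operation_duration_bucket", 1700000000, 1700003600))
print(trace_lookup_ppl("abc123"))
```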

Resource-level correlation:

  • serviceName in traces/logs maps to service_name label in Prometheus metrics
  • Query all signals for a specific service to get the complete picture

Investigation workflows:

  • Metric spike investigation: PromQL anomaly detection, exemplars, trace tree, correlated logs
  • Error log investigation: find error logs, extract traceId, reconstruct trace, identify root cause span
  • Slow agent investigation: find slow invoke_agent spans, get child spans, correlated logs, token usage metrics

Requirement 13: APM/RED Metrics Skill

As a developer, I want a dedicated APM skill that teaches Claude Code how to construct RED (Rate, Errors, Duration) metrics queries for any service, so that I can quickly assess service health using the standard APM methodology.

  • Rate queries: per-service request rate via PromQL (rate(http_server_duration_seconds_count[5m])), per-endpoint rate, and PPL alternative from trace spans
  • Error queries: error rate as a ratio (5xx / total) via PromQL, error count from trace spans via PPL (status.code = 2)
  • Duration queries: latency percentiles (p50, p95, p99) via PromQL histogram_quantile and PPL percentile() from trace spans
  • Combined RED dashboard query set for all services in a single investigation workflow
  • GenAI-specific RED metrics using gen_ai_client_operation_duration histogram
  • OTel HTTP semantic convention metrics reference: http.server.request.duration (histogram), http.server.active_requests (gauge), and their Prometheus-exported equivalents
  • OTel Collector spanmetrics connector documentation for auto-generating RED metrics from traces
  • Every query template includes the complete curl command with the appropriate endpoint and authentication
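A minimal sketch of the combined RED query set for one service, following the metric and label names used in this document; the `_bucket`/`_count` suffixes are the standard Prometheus histogram convention, and the window default is illustrative.

```python
# Sketch: the three RED PromQL expressions for one service. Metric and label
# names follow this document; suffixes are the standard Prometheus histogram
# convention; the 5m window is illustrative.
def red_queries(service: str, window: str = "5m") -> dict:
    base = f'http_server_duration_seconds_count{{service_name="{service}"}}'
    errs = (f'http_server_duration_seconds_count'
            f'{{service_name="{service}",http_response_status_code=~"5.."}}')
    return {
        "rate": f"sum(rate({base}[{window}]))",
        "errors": f"sum(rate({errs}[{window}])) / sum(rate({base}[{window}]))",
        "duration_p95": (
            "histogram_quantile(0.95, sum(rate("
            f'http_server_duration_seconds_bucket{{service_name="{service}"}}'
            f"[{window}])) by (le))"),
    }

for name, q in red_queries("payment").items():
    print(name, "=>", q)
```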

Requirement 14: SLO/SLI Skill

As a developer, I want a dedicated SLO/SLI skill that teaches Claude Code how to define SLIs, calculate error budgets, and construct burn rate queries using Prometheus recording rules, so that I can implement and monitor service level objectives for my services.

  • SLI definition templates: availability SLI (successful/total ratio), latency SLI (within-threshold/total ratio), GenAI-specific SLI
  • Prometheus recording rule YAML templates for pre-computing SLIs at multiple time windows (5m, 30m, 1h, 6h, 1d, 3d, 30d)
  • Recording rule naming conventions: sli:http_availability:ratio_rate<window>, sli:http_latency:ratio_rate<window>
  • Error budget calculation: remaining budget given an SLO target, consumption rate, common SLO targets (99.9%, 99.5%, 99.0%) with allowed downtime
  • Burn rate queries: single-window and multi-window (Google SRE book pattern: 14.4x fast burn 1h/6h, 1x slow burn 3d/30d)
  • Prometheus alerting rule YAML templates for burn rate alerts
  • SLO compliance reporting: current SLI value, SLO target, error budget remaining, burn rate per service
  • Step-by-step SLO setup workflow: define SLIs, add recording rules, set targets, add burn rate alerts, query compliance
  • Every query template includes the complete curl command with the appropriate Prometheus endpoint and authentication
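The error-budget arithmetic above reduces to two small formulas, sketched here with illustrative function names; the numbers follow the Glossary definitions.

```python
# Sketch of the error-budget arithmetic: budget from an SLO target, and the
# burn rate implied by an observed error ratio. Function names are
# illustrative; the math follows the Glossary definitions.
def error_budget(slo_target: float) -> float:
    """A 99.9% SLO leaves a 0.1% error budget."""
    return 1.0 - slo_target

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """1x means the budget is exhausted exactly at the end of the SLO window."""
    return error_ratio / error_budget(slo_target)

assert abs(error_budget(0.999) - 0.001) < 1e-12
# Fast-burn example: a 1.44% error ratio against a 99.9% SLO burns at 14.4x.
assert round(burn_rate(0.0144, 0.999), 6) == 14.4
```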

Data Models

OpenSearch Trace Index Schema (otel-v1-apm-span-*)

| Field | Type | Description |
| --- | --- | --- |
| `traceId` | keyword | Unique trace identifier |
| `spanId` | keyword | Unique span identifier |
| `parentSpanId` | keyword | Parent span ID (empty for root) |
| `serviceName` | keyword | Service that produced the span |
| `name` | text | Span operation name |
| `kind` | keyword | Span kind (SERVER, CLIENT, INTERNAL, etc.) |
| `startTime` | date | Span start timestamp |
| `endTime` | date | Span end timestamp |
| `durationInNanos` | long | Span duration in nanoseconds |
| `status.code` | integer | Status code (0=Unset, 1=Ok, 2=Error) |
| `attributes.gen_ai.operation.name` | keyword | GenAI operation type |
| `attributes.gen_ai.agent.name` | keyword | Agent name |
| `attributes.gen_ai.agent.id` | keyword | Agent identifier |
| `attributes.gen_ai.request.model` | keyword | Requested model |
| `attributes.gen_ai.usage.input_tokens` | long | Input token count |
| `attributes.gen_ai.usage.output_tokens` | long | Output token count |
| `attributes.gen_ai.tool.name` | keyword | Tool name |
| `attributes.gen_ai.tool.call.id` | keyword | Tool call identifier |
| `attributes.gen_ai.tool.call.arguments` | text | Tool call arguments |
| `attributes.gen_ai.tool.call.result` | text | Tool call result |
| `attributes.gen_ai.conversation.id` | keyword | Conversation identifier |
| `events.attributes.exception.type` | keyword | Exception type |
| `events.attributes.exception.message` | text | Exception message |
| `events.attributes.exception.stacktrace` | text | Exception stacktrace |

OpenSearch Log Index Schema (otel-v1-apm-log-*)

| Field | Type | Description |
| --- | --- | --- |
| `traceId` | keyword | Correlated trace identifier |
| `spanId` | keyword | Correlated span identifier |
| `severityText` | keyword | Log level (ERROR, WARN, INFO, DEBUG) |
| `severityNumber` | integer | Numeric severity |
| `serviceName` | keyword | Service that produced the log |
| `body` | text | Log message body |
| `@timestamp` | date | Log timestamp |

OpenSearch Service Map Index (otel-v2-apm-service-map)

| Field | Type | Description |
| --- | --- | --- |
| `serviceName` | keyword | Source service |
| `destination.domain` | keyword | Destination service |
| `destination.resource` | keyword | Destination resource |
| `traceGroupName` | keyword | Trace group |

Prometheus Metrics

| Metric | Type | Labels |
| --- | --- | --- |
| `http_server_duration_seconds` | histogram | `service_name`, `http_response_status_code` |
| `http_server_active_requests` | gauge | `service_name` |
| `db_client_operation_duration_seconds` | histogram | `service_name` |
| `gen_ai_client_token_usage` | histogram | `gen_ai.operation.name`, `gen_ai.request.model` |
| `gen_ai_client_operation_duration` | histogram | `gen_ai.operation.name`, `gen_ai.request.model` |

Connection Profiles

| Profile | OpenSearch Endpoint | OpenSearch Auth | Prometheus Endpoint | Prometheus Auth |
| --- | --- | --- | --- | --- |
| Local | `https://localhost:9200` | Basic auth (`-u admin:'My_password_123!@#' -k`) | `http://localhost:9090` | None |
| AWS Managed | `https://DOMAIN-ID.REGION.es.amazonaws.com` | AWS SigV4 (`--aws-sigv4 "aws:amz:REGION:es"`) | `https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID` | AWS SigV4 (`--aws-sigv4 "aws:amz:REGION:aps"`) |

Test Fixture YAML Schema

```yaml
- name: "agent_invocations"
  description: "Query all agent invocation spans"
  command: |
    curl -sk -u admin:'My_password_123!@#' \
      -X POST https://localhost:9200/_plugins/_ppl \
      -H 'Content-Type: application/json' \
      -d '{"query": "source=otel-v1-apm-span-* | WHERE `attributes.gen_ai.operation.name` = '\''invoke_agent'\'' | head 10"}'
  expected_status_code: 200
  expected_fields: ["schema", "datarows"]
  tags: ["traces"]
  before_test: null
  after_test: null
```
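As a simplified stand-in for the Pydantic model that validates this schema (the real harness uses pydantic with `extra="forbid"`), the sketch below shows the same behavior with the stdlib only; field names match the YAML above.

```python
# Simplified stand-in for the Pydantic fixture model (the real harness uses
# pydantic with extra="forbid"); stdlib only. Field names match the YAML
# schema above.
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class TestFixture:
    name: str
    description: str
    command: str
    expected_status_code: int
    expected_fields: List[str]
    tags: List[str]
    before_test: Optional[str] = None
    after_test: Optional[str] = None

def load_fixture(raw: dict) -> TestFixture:
    allowed = {f.name for f in fields(TestFixture)}
    extra = set(raw) - allowed
    if extra:  # mimic pydantic's extra="forbid"
        raise ValueError(f"unknown fixture keys: {sorted(extra)}")
    return TestFixture(**raw)

fx = load_fixture({"name": "agent_invocations", "description": "Query agent spans",
                   "command": "curl ...", "expected_status_code": 200,
                   "expected_fields": ["schema", "datarows"], "tags": ["traces"]})
print(fx.name, fx.expected_status_code)
```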

Design Decisions

Why a flat skills/ directory?
Eight files don't need subdirectories. Flat is simpler to reference from CLAUDE.md and easier for contributors to navigate.

Why complete curl commands instead of just query bodies?
Claude Code can execute curl directly via its Bash tool. Including the full command (endpoint, auth, headers, body) means zero assembly required. The skill file is the executable documentation.

Why a dedicated PPL reference file?
The PPL grammar is large (50+ commands, 14 function categories). Inlining it into traces.md or logs.md would bloat those files. As a separate skill, Claude Code loads it on demand when it needs to construct a novel query.

Why YAML test fixtures instead of inline pytest?
Declarative YAML fixtures are easier for contributors to add (no Python knowledge needed to add a test case). The Pydantic schema catches malformed fixtures at load time. This pattern is proven at scale in HolmesGPT's test suite.

Why read credentials from .env?
The observability stack already centralizes configuration in .env. The plugin and test harness reuse the same source of truth rather than duplicating credentials.

Error Handling

Skill File Errors

| Scenario | Handling |
| --- | --- |
| OpenSearch unreachable | Stack health skill provides diagnostic steps: check `docker compose ps`, verify port 9200, check health endpoint |
| Prometheus unreachable | Stack health skill suggests checking container status and port 9090 |
| PPL query syntax error | PPL reference skill provides syntax guidance; `_explain` endpoint helps debug query plans |
| Authentication failure | Skill files document correct credentials from `.env`; stack health skill suggests verifying credentials |
| No data in indices | Stack health skill provides index listing commands and document count verification |
| Data Prepper pipeline errors | Stack health skill suggests checking Data Prepper logs via `docker compose logs data-prepper` |
| OTel Collector export failures | Stack health skill suggests checking collector metrics at port 8888 and logs |

Test Harness Errors

| Scenario | Handling |
| --- | --- |
| Stack not running | Session-scoped fixture detects this and skips all tests with a clear message |
| Curl command timeout | Configurable timeout (default 30s); test fails with timeout error |
| Invalid YAML fixture | Pydantic model with `extra="forbid"` raises validation error at load time |
| Unexpected JSON response | Test reports which `expected_fields` were missing from the response |
| Hook failure | Test reports `before_test`/`after_test` hook failure separately from the main command result |
| Missing `.env` file | Config loader falls back to hardcoded defaults |

Running the Tests

Prerequisites: the observability stack must be running (docker compose up -d).

```shell
cd claude-code-observability-plugin/tests

# Install dependencies
pip install -r requirements.txt

# Run all tests
pytest

# Run by category
pytest -m traces
pytest -m logs
pytest -m metrics
pytest -m stack_health
pytest -m ppl

# Verbose output
pytest -v --tb=short
```

If the stack is not running, all tests are skipped with a clear message.

Open Questions

  1. Plugin location: Should the plugin live at the repo root (claude-code-observability-plugin/) or under a new plugins/ directory?

  2. Versioning: Should the plugin version track the observability stack version, or have its own independent version?

  3. Additional AI assistants: The skill file format is Claude Code-specific (CLAUDE.md convention). Should we also provide equivalent configurations for other AI coding assistants (e.g., Cursor rules, Kiro steering)?

  4. Metrics in OpenSearch: The metrics skill currently targets Prometheus. Should we also include PPL queries for metrics stored in OpenSearch (when metrics are ingested via Data Prepper)?

  5. Example telemetry data: Should the test harness include a script that sends sample telemetry data to the stack, so tests can validate queries return actual results rather than just valid empty responses?

How to Contribute

Adding a new query template to a skill file:

  1. Add the curl command to the appropriate skills/*.md file
  2. Add a corresponding test fixture in tests/fixtures/*.yaml
  3. Run pytest to verify the command works against a running stack

Adding a new test case:

  1. Create a YAML entry in the appropriate tests/fixtures/*.yaml file
  2. Follow the schema: name, description, command, expected_status_code, expected_fields, tags
  3. Run pytest -m <tag> to verify

Feedback Requested

We'd like feedback on:

  • The skill file organization and routing approach
  • Which query templates are most valuable for your workflow
  • The open questions above
  • Any missing capabilities or query patterns you'd want included
  • The integration test approach and fixture format

Please comment on this RFC or open an issue with your thoughts.

Labels: enhancement (New feature or request)