
[FEATURE] Claude Code plugin to the observability-stack #119

@anirudha

Summary

We propose adding a Claude Code plugin to the observability-stack repository that teaches Claude Code how to query traces, logs, and metrics from the running stack using PPL, PromQL, and curl commands. The plugin is a set of markdown skill files with no runtime code and no build step. Claude Code loads them as context to gain OpenSearch-native observability capabilities.

No existing public Claude Code skill covers OpenSearch observability or PPL. This fills that gap.

Motivation

Developers using AI coding assistants with the observability stack currently have to:

  1. Manually look up PPL syntax for every trace or log query
  2. Remember the correct curl flags, auth credentials, and API endpoints for OpenSearch and Prometheus
  3. Know which index patterns store traces vs. logs vs. service maps
  4. Construct cross-signal correlation queries (trace-to-log joins) from scratch
  5. Debug stack health issues without structured guidance
  6. Build RED metrics dashboards and SLO/SLI monitoring from scratch
  7. Figure out how to connect to AWS managed services (Amazon OpenSearch Service, Amazon Managed Prometheus) with SigV4 auth

A Claude Code plugin eliminates this friction. When a developer asks "show me the slowest agent invocations in the last hour", "what's the error budget burn rate for the payment service?", or "why is the payment service erroring?", Claude Code can immediately construct and execute the right PPL or PromQL query against the right endpoint with the right auth.

Glossary

| Term | Definition |
| --- | --- |
| Plugin | A collection of CLAUDE.md-compatible markdown skill files placed in a project directory that Claude Code loads as context to gain domain-specific capabilities. |
| Skill File | A single markdown file with frontmatter (`name`, `description`, `allowed-tools`) and instructional content that teaches Claude Code a specific capability. |
| PPL | Piped Processing Language, the query language used by OpenSearch for log and trace analytics. Queries are piped commands starting with `source=<index>`. |
| PromQL | Prometheus Query Language, used for querying time-series metrics from Prometheus. |
| OpenSearch | The search and analytics engine that stores traces and logs in this stack, accessible at port 9200 with HTTPS and basic authentication. |
| Prometheus | The time-series database that stores metrics in this stack, accessible at port 9090. |
| OTel Collector | The OpenTelemetry Collector that receives telemetry via OTLP on ports 4317 (gRPC) and 4318 (HTTP) and routes data to Data Prepper and Prometheus. |
| Data Prepper | The pipeline processor that transforms and enriches logs and traces before writing them to OpenSearch. |
| Trace Index | The OpenSearch index pattern `otel-v1-apm-span-*` storing trace span data. |
| Log Index | The OpenSearch index pattern `otel-v1-apm-log-*` storing log data. |
| Service Map Index | The OpenSearch index `otel-v2-apm-service-map` storing service dependency topology. |
| Gen AI Attributes | OpenTelemetry semantic convention attributes for generative AI operations, prefixed with `gen_ai.*` (e.g., `gen_ai.operation.name`, `gen_ai.agent.name`, `gen_ai.usage.input_tokens`). |
| Stack | The complete observability infrastructure: OTel Collector, Data Prepper, OpenSearch, Prometheus, and OpenSearch Dashboards. |
| Cross-Signal Correlation | The practice of linking telemetry signals (traces, logs, metrics) using shared identifiers such as `traceId` and `spanId` to enable end-to-end investigation. |
| Exemplar | A Prometheus data structure that links an individual metric sample to a specific trace by carrying `trace_id` and `span_id` alongside the measurement value. Enables metric-to-trace correlation. |
| Test Fixture | A YAML file defining a single integration test case with command, expected status code, expected response fields, and tags. |
| PPL Grammar Source | The official OpenSearch PPL grammar documentation located in the opensearch-project/sql repository under `docs/user/ppl/`. |
| RED Metrics | Rate, Errors, Duration: the three golden signals for service-level APM monitoring. Rate measures throughput, Errors measures failure ratio, Duration measures latency distribution. |
| SLI | Service Level Indicator: a quantitative measurement of a service's behavior, such as the ratio of successful requests to total requests. |
| SLO | Service Level Objective: a target value or range for an SLI, such as "99.9% availability over 30 days." |
| Error Budget | The allowed amount of unreliability derived from an SLO. For a 99.9% SLO, the error budget is 0.1%. |
| Burn Rate | The speed at which the error budget is being consumed. A burn rate of 1x means the budget will be exhausted exactly at the end of the SLO window. |
| Recording Rule | A Prometheus configuration that pre-computes and stores the result of a PromQL expression as a new time series, enabling efficient querying of SLI metrics at multiple time windows. |
| AWS SigV4 | AWS Signature Version 4, the authentication protocol used to sign HTTP requests to AWS services, including Amazon OpenSearch Service and Amazon Managed Prometheus. |

Architecture

System Context

```mermaid
graph TB
    subgraph "Claude Code Plugin"
        CM[CLAUDE.md<br/>Entry Point]
        subgraph "skills/"
            TS[traces.md]
            LS[logs.md]
            MS[metrics.md]
            SH[stack-health.md]
            PR[ppl-reference.md]
            CR[correlation.md]
            AR[apm-red.md]
            SL[slo-sli.md]
        end
        subgraph "tests/"
            CF[conftest.py]
            TF[test_fixtures.py]
            TR[test_runner.py]
            FX[fixtures/*.yaml]
        end
    end

    subgraph "Observability Stack"
        OS[OpenSearch :9200<br/>HTTPS + Basic Auth]
        PM[Prometheus :9090<br/>HTTP]
        OC[OTel Collector :4317/:4318]
        DP[Data Prepper :21890]
    end

    CM -->|references| TS
    CM -->|references| LS
    CM -->|references| MS
    CM -->|references| SH
    CM -->|references| PR
    CM -->|references| CR
    CM -->|references| AR
    CM -->|references| SL

    TS -->|PPL queries via curl| OS
    LS -->|PPL queries via curl| OS
    CR -->|PPL queries via curl| OS
    CR -->|PromQL + exemplars via curl| PM
    AR -->|PromQL RED queries via curl| PM
    AR -->|PPL RED queries via curl| OS
    SL -->|PromQL SLO queries via curl| PM
    SH -->|health checks via curl| OS
    SH -->|health checks via curl| PM
    SH -->|health checks via curl| OC
    MS -->|PromQL queries via curl| PM
    PR -->|PPL reference for| OS

    TR -->|validates commands from| FX
    CF -->|checks health of| OS
    CF -->|checks health of| PM
```

Data Flow

```mermaid
flowchart LR
    A[User asks Claude Code<br/>an observability question] --> B[Claude Code reads CLAUDE.md]
    B --> C{Route by intent}
    C -->|trace investigation| D[Load traces.md]
    C -->|log search| E[Load logs.md]
    C -->|metrics query| F[Load metrics.md]
    C -->|stack issues| G[Load stack-health.md]
    C -->|PPL syntax help| H[Load ppl-reference.md]
    C -->|cross-signal correlation| X[Load correlation.md]
    C -->|RED metrics / APM| Y[Load apm-red.md]
    C -->|SLO/SLI / error budget| Z[Load slo-sli.md]
    D --> I[Execute curl command<br/>against OpenSearch PPL API]
    E --> I
    F --> J[Execute curl command<br/>against Prometheus API]
    G --> K[Execute curl/docker commands<br/>against stack endpoints]
    H --> L[Reference for constructing<br/>novel PPL queries]
    X --> I
    X --> J
    Y --> I
    Y --> J
    Z --> J
```

What's Included

Eight Skill Files

The plugin ships as a CLAUDE.md entry point plus eight skill files in a skills/ directory:

| Skill | What it does | Query language | Target |
| --- | --- | --- | --- |
| traces.md | Query trace spans: agent invocations, tool executions, slow spans, errors, token usage, trace tree reconstruction, cross-signal correlation | PPL | OpenSearch :9200 |
| logs.md | Query logs: severity filtering, trace correlation, error patterns, log volume, body search | PPL | OpenSearch :9200 |
| metrics.md | Query metrics: HTTP rates, latency percentiles, error rates, GenAI token usage, operation duration | PromQL | Prometheus :9090 |
| stack-health.md | Health checks for all stack components, troubleshooting guide, port reference | curl + docker | All services |
| ppl-reference.md | Comprehensive PPL language reference: 50+ commands, 14 function categories, 3 API endpoints | n/a | Reference |
| correlation.md | Cross-signal correlation: trace-log joins via PPL, metric-to-trace via Prometheus exemplars, resource-level correlation, investigation workflows | PPL + PromQL | OpenSearch + Prometheus |
| apm-red.md | APM RED metrics: per-service request rate, error ratio, latency percentiles (p50/p95/p99), GenAI RED, OTel HTTP semantic conventions | PromQL + PPL | Prometheus + OpenSearch |
| slo-sli.md | SLO/SLI monitoring: SLI definitions, Prometheus recording rules, error budgets, multi-window burn rate alerts, compliance reporting | PromQL | Prometheus :9090 |

Plugin Directory Structure

```
claude-code-observability-plugin/
├── CLAUDE.md                    # Entry point, routing table for skills
├── skills/
│   ├── traces.md                # Trace querying with PPL
│   ├── logs.md                  # Log querying with PPL
│   ├── metrics.md               # Metrics querying with PromQL
│   ├── stack-health.md          # Health checks and troubleshooting
│   ├── ppl-reference.md         # Comprehensive PPL language reference
│   ├── correlation.md           # Cross-signal correlation workflows
│   ├── apm-red.md               # APM RED metrics (Rate, Errors, Duration)
│   └── slo-sli.md               # SLO/SLI definitions, error budgets, burn rates
└── tests/
    ├── README.md                # Test documentation
    ├── conftest.py              # Session fixtures, stack health gate
    ├── test_runner.py           # YAML-driven test execution
    ├── models.py                # Pydantic test fixture model
    ├── requirements.txt         # pytest, pyyaml, pydantic, requests
    └── fixtures/
        ├── traces.yaml          # Trace skill test cases
        ├── logs.yaml            # Log skill test cases
        ├── metrics.yaml         # Metrics skill test cases
        ├── stack-health.yaml    # Stack health test cases
        ├── ppl.yaml             # PPL reference test cases
        ├── correlation.yaml     # Correlation skill test cases
        ├── apm-red.yaml         # APM RED skill test cases
        └── slo-sli.yaml         # SLO/SLI skill test cases
```

Skill File Format

Each skill file follows the Claude Code CLAUDE.md convention:

```yaml
---
name: <skill-name>
description: <one-line summary>
allowed-tools:
  - Bash
  - curl
---
```

Every query template is a complete, copy-paste-ready curl command with:

  • Correct protocol (HTTPS for OpenSearch, HTTP for Prometheus)
  • Authentication (-u admin:'My_password_123!@#' for OpenSearch, none for Prometheus)
  • Certificate skip (-k for development)
  • Proper JSON body with PPL/PromQL query
  • Backtick escaping for dotted field names in PPL
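As a sketch of how these pieces fit together, the helper below assembles such a curl command from a PPL query; `build_ppl_curl` is a hypothetical illustration (not part of the plugin), using the local-stack endpoint and default credentials described in this document.

```python
# Sketch: assemble a copy-paste-ready curl command for a PPL query.
# build_ppl_curl is a hypothetical helper; endpoint and credentials are the
# local-stack defaults described above.
import json

def build_ppl_curl(query: str,
                   endpoint: str = "https://localhost:9200",
                   user: str = "admin",
                   password: str = "My_password_123!@#") -> str:
    body = json.dumps({"query": query})          # proper JSON body with the PPL query
    return (
        f"curl -sk -u {user}:'{password}' "      # -k skips cert verification (dev only)
        f"-X POST {endpoint}/_plugins/_ppl "
        "-H 'Content-Type: application/json' "
        f"-d '{body}'"
    )

print(build_ppl_curl("source=otel-v1-apm-span-* | head 5"))
```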

Requirements

Requirement 1: Plugin Directory Structure

As a developer, I want the plugin organized as a directory of skill files with a top-level CLAUDE.md entry point, so that Claude Code automatically loads the observability capabilities when I work in the project.

  • The plugin contains a top-level CLAUDE.md that references all skill files
  • Skill files live in a single skills/ directory
  • Eight skill files: traces, logs, metrics, stack-health, ppl-reference, correlation, apm-red, and slo-sli
  • Each skill file includes frontmatter with name, description, and allowed-tools

Requirement 2: Traces Skill

As a developer, I want to query trace data from OpenSearch using PPL, so that I can investigate agent invocations, tool executions, slow spans, error spans, and token usage.

  • PPL query templates for agent invocation spans (attributes.gen_ai.operation.name = invoke_agent)
  • PPL query templates for tool execution spans (attributes.gen_ai.operation.name = execute_tool)
  • Slow span detection where durationInNanos exceeds a configurable threshold
  • Error span identification where status.code = 2
  • Token usage aggregation by model and by agent name
  • Service operation listing with GenAI operation type breakdown
  • Service map queries for dependency exploration
  • All GenAI attributes documented with descriptions and example values
  • Every PPL query includes the complete curl command with endpoint, auth, and escaping
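To make the slow-span requirement concrete, here is a minimal sketch of how such a template might be parameterized. Field names follow the Trace Index schema in this document; the threshold default and function name are illustrative.

```python
# Sketch: parameterized slow-span PPL template for invoke_agent spans.
# Field names (durationInNanos, attributes.gen_ai.operation.name) follow the
# Trace Index schema; the threshold and limit defaults are illustrative.
def slow_agent_spans(threshold_ms: int = 1000, limit: int = 10) -> str:
    threshold_ns = threshold_ms * 1_000_000      # durationInNanos is in nanoseconds
    return (
        "source=otel-v1-apm-span-* "
        "| WHERE `attributes.gen_ai.operation.name` = 'invoke_agent' "
        f"AND durationInNanos > {threshold_ns} "
        "| sort - durationInNanos "
        f"| head {limit}"
    )

print(slow_agent_spans(500))
```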

Requirement 3: Logs Skill

As a developer, I want to query log data from OpenSearch using PPL, so that I can search logs by severity, correlate logs with traces, identify error patterns, and analyze log volume.

  • Severity-based filtering (ERROR, WARN, INFO)
  • Trace-to-log correlation via traceId
  • Error pattern identification with stats count() by aggregations
  • Log volume trending over time with span(time, <interval>)
  • Full-text body search with string matching or relevance functions
  • Log Index field reference: severityText, severityNumber, traceId, spanId, serviceName, body, @timestamp
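A minimal sketch of the log-volume template, assuming the `span(time, <interval>)` bucketing described above; the interval and severity defaults are illustrative.

```python
# Sketch: log-volume trend per service at a given severity, using span()
# time bucketing as described above. Index pattern and field names follow
# the Log Index schema; defaults are illustrative.
def log_volume_query(interval: str = "5m", severity: str = "ERROR") -> str:
    return (
        "source=otel-v1-apm-log-* "
        f"| WHERE severityText = '{severity}' "
        f"| stats count() by span(`@timestamp`, {interval}), serviceName"
    )

print(log_volume_query("1h", "WARN"))
```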

Requirement 4: Metrics Skill

As a developer, I want to query metrics from Prometheus using PromQL, so that I can monitor HTTP request rates, latency percentiles, error rates, and active connections.

  • HTTP request rate per second grouped by service
  • HTTP latency at p95 and p99 by service
  • HTTP error rate (5xx) as a ratio
  • Active HTTP connections by service
  • Database operation latency at p95
  • Every PromQL query includes the complete curl command targeting localhost:9090/api/v1/query
  • Note on PPL as alternative for OpenSearch-ingested metrics
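As an illustration of the latency-percentile requirement, this sketch builds the curl command for a p95 query; the `_bucket` suffix follows the standard Prometheus histogram convention, and the rate window is illustrative.

```python
# Sketch: curl command for a p95 latency PromQL query against the local
# Prometheus instance. The _bucket suffix is the standard Prometheus
# histogram convention; the 5m window is illustrative.
import urllib.parse

def p95_latency_curl(metric: str = "http_server_duration_seconds_bucket",
                     window: str = "5m") -> str:
    promql = (f"histogram_quantile(0.95, "
              f"sum(rate({metric}[{window}])) by (le, service_name))")
    # URL-encode the query so parentheses and brackets survive the shell
    return ("curl -s 'http://localhost:9090/api/v1/query?query="
            + urllib.parse.quote(promql) + "'")

print(p95_latency_curl())
```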

Requirement 5: Stack Health Skill

As a developer, I want to check the health of all observability stack components and troubleshoot common issues, so that I can verify the stack is operational and diagnose data flow problems.

  • Health check curl commands for OpenSearch, Prometheus, OTel Collector
  • Index listing and document count verification
  • Docker compose commands for container status and logs
  • Troubleshooting section for common failures: OpenSearch unreachable, no data in indices, Data Prepper pipeline errors, OTel Collector export failures
  • Port reference: OpenSearch (9200), OTel Collector gRPC (4317), OTel Collector HTTP (4318), Data Prepper (21890), Prometheus (9090), OpenSearch Dashboards (5601)
  • PPL describe for index mapping inspection
  • PPL _explain endpoint for query plan debugging

Requirement 6: PPL Reference Skill

As a developer, I want a comprehensive PPL language reference available to Claude Code, so that Claude Code can understand PPL syntax and construct correct queries for any observability question.

Commands (50+):

  • Core query: search, source, where, fields, stats, sort, head, eval, dedup, rename, top, rare, table
  • Time-series: timechart, chart, bin, trendline, streamstats, eventstats
  • Parse/extract: parse, grok, rex, regex, patterns, spath
  • Join/lookup: join, lookup, graphlookup, subquery, append, appendcol, appendpipe
  • Transform: fillnull, flatten, expand, transpose, convert, replace, reverse
  • Multi-value: mvexpand, mvcombine, nomv
  • Aggregation/totals: addcoltotals, addtotals
  • ML: ad (anomaly detection), kmeans, ml
  • System: describe, explain, showdatasources, multisearch
  • Display: fieldformat

Functions (14 categories):

  • Aggregation: COUNT, SUM, AVG, MAX, MIN, VAR_SAMP, VAR_POP, STDDEV_SAMP, STDDEV_POP, DISTINCT_COUNT, PERCENTILE, EARLIEST, LATEST, LIST, VALUES, FIRST, LAST
  • Collection: ARRAY, SPLIT, MVJOIN, MVCOUNT, MVINDEX, MVFIRST, MVLAST, MVAPPEND, MVDEDUP, MVSORT, MVZIP, MVRANGE, MVFILTER
  • Condition: ISNULL, ISNOTNULL, IF, IFNULL, NULLIF, CASE, COALESCE, LIKE, IN, BETWEEN
  • Conversion: CAST, TOSTRING, TONUMBER, TOINT, TOLONG, TOFLOAT, TODOUBLE, TOBOOLEAN
  • Cryptographic: MD5, SHA1, SHA2
  • Datetime: NOW, CURDATE, CURTIME, DATE_FORMAT, DATE_ADD, DATE_SUB, DATEDIFF, DAY, MONTH, YEAR, HOUR, MINUTE, SECOND, DAYOFWEEK, DAYOFYEAR, WEEK, UNIX_TIMESTAMP, FROM_UNIXTIME, and more
  • Expressions: arithmetic (+, -, *, /), comparison (=, !=, <, >, <=, >=), logical (AND, OR, NOT, XOR)
  • IP: CIDRMATCH, GEOIP
  • JSON: JSON_EXTRACT, JSON_KEYS, JSON_VALID, JSON_ARRAY, JSON_OBJECT, JSON_ARRAY_LENGTH, JSON_EXTRACT_PATH_TEXT, TO_JSON_STRING
  • Math: ABS, CEIL, FLOOR, ROUND, SQRT, POW, MOD, LOG, LOG2, LOG10, LN, EXP, and more
  • Relevance: MATCH, MATCH_PHRASE, MULTI_MATCH, QUERY_STRING, SIMPLE_QUERY_STRING, HIGHLIGHT, SCORE, WILDCARD_QUERY
  • Statistical: CORR, COVAR_POP, COVAR_SAMP
  • String: CONCAT, LENGTH, LOWER, UPPER, TRIM, SUBSTRING, REPLACE, REGEXP, REGEXP_EXTRACT, REGEXP_REPLACE, and more
  • System: TYPEOF

API Endpoints:

  • Query execution: POST /_plugins/_ppl with JSON body {"query": "<ppl_query>"}
  • Query explain: POST /_plugins/_ppl/_explain
  • Grammar metadata: GET /_plugins/_ppl/_grammar

Source: the grammar reference is derived from the opensearch-project/sql repository's docs/user/ppl/ directory.

Requirement 7: Skill File Format Compliance

  • Each skill file is valid markdown with YAML frontmatter delimited by ---
  • Frontmatter contains name, description, and allowed-tools fields
  • Top-level CLAUDE.md references each skill file path with a one-line summary
  • Credentials sourced from .env file (admin / My_password_123!@#), noted as configurable

Requirement 8: Authentication and Connection Details

| Service | Protocol | Port | Auth |
| --- | --- | --- | --- |
| OpenSearch (local) | HTTPS | 9200 | Basic auth (`admin` / `My_password_123!@#`), `-k` flag for cert skip |
| OpenSearch (AWS managed) | HTTPS | 443 | AWS SigV4 (`--aws-sigv4 "aws:amz:REGION:es"`) |
| Prometheus (local) | HTTP | 9090 | None |
| Prometheus (AWS managed) | HTTPS | 443 | AWS SigV4 (`--aws-sigv4 "aws:amz:REGION:aps"`) |
| OTel Collector | gRPC / HTTP | 4317 (gRPC), 4318 (HTTP) | None |
| Data Prepper | HTTP | 21890 | None |
| OpenSearch Dashboards | HTTP | 5601 | Same as OpenSearch |

All credentials are sourced from the repository .env file. The test harness reads .env with fallback to these defaults.

Skill files provide curl command variants for both local and AWS managed endpoints. The CLAUDE.md entry point includes a configuration section where users set $OPENSEARCH_ENDPOINT and $PROMETHEUS_ENDPOINT environment variables to switch between local and managed services. PPL and PromQL query syntax is identical across both profiles; only the endpoint URL and authentication method differ.
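The profile switch might look like the following sketch, which derives the curl flags for each profile. The SigV4 service names follow the table above; the function itself and the `--user` key/secret convention for curl's `--aws-sigv4` are assumptions for illustration.

```python
# Sketch: derive curl flags per connection profile. SigV4 service names
# follow the table above; the function and the --user key:secret pairing
# for curl's --aws-sigv4 are illustrative assumptions.
def opensearch_curl_flags(profile: str, region: str = "us-east-1") -> str:
    if profile == "local":
        return "-sk -u admin:'My_password_123!@#' https://localhost:9200"
    if profile == "aws":
        return (f"-s --aws-sigv4 'aws:amz:{region}:es' "
                "--user \"$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY\" "
                "\"$OPENSEARCH_ENDPOINT\"")
    raise ValueError(f"unknown profile: {profile}")

print(opensearch_curl_flags("local"))
print(opensearch_curl_flags("aws", "eu-west-1"))
```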

Requirement 9: PPL Grammar Source Documentation

  • Grammar reference sourced from opensearch-project/sql repository's docs/user/ppl/ directory
  • Repository URL included: https://github.com/opensearch-project/sql
  • Commands organized into logical categories
  • Functions organized into categories matching the source repository

Requirement 10: Cross-Signal Correlation and GenAI Debugging

As a developer, I want the plugin skills to support cross-signal correlation between traces, logs, and metrics, and provide GenAI-specific debugging capabilities, so that I can perform end-to-end observability investigations across all telemetry signals.

Cross-signal correlation:

  • Trace-to-log joins by matching traceId across Trace Index and Log Index
  • Log-to-span correlation by spanId
  • Full trace tree reconstruction by traceId with parentSpanId hierarchy
  • Latency gap analysis between parent and child spans
  • Root span identification where parentSpanId is empty or null

GenAI operation types (beyond invoke_agent and execute_tool):

  • chat, embeddings, retrieval, create_agent, text_completion, generate_content

Exception and error querying:

  • Span events with exception.type, exception.message, exception.stacktrace
  • Spans with error.type for error categorization
  • Exception-to-log correlation via shared traceId and spanId

Extended GenAI attributes:

  • gen_ai.agent.id, gen_ai.agent.description, gen_ai.agent.version
  • gen_ai.conversation.id for multi-turn conversation tracking
  • gen_ai.tool.call.id, gen_ai.tool.type, gen_ai.tool.call.arguments, gen_ai.tool.call.result

GenAI-specific metrics:

  • gen_ai_client_token_usage histogram grouped by operation and model
  • gen_ai_client_operation_duration histogram grouped by operation and model

Requirement 11: Integration Test Harness

As a developer, I want an integration test suite that validates all skill file commands against a running observability stack, so that I can verify the plugin's queries and health checks produce correct results.

Test infrastructure:

  • pytest test suite in a tests/ directory within the plugin
  • YAML fixture files defining test cases with command, expected_status_code, expected_fields, and tags
  • Pydantic model for strict schema validation (extra="forbid")
  • Session-scoped fixture that checks stack health before tests run
  • All tests skipped with clear message if stack is not running

Test categories:

  • traces: PPL queries against Trace Index, validate schema and datarows in response
  • logs: PPL queries against Log Index, validate response structure
  • metrics: PromQL queries against Prometheus, validate status: "success" and data field
  • stack-health: Health check commands, validate HTTP 200 status codes
  • ppl: PPL system commands (describe, _explain), validate response structure
  • correlation: Cross-signal correlation queries, validate join results and exemplar responses
  • apm_red: RED metric queries against Prometheus and OpenSearch, validate rate/error/duration responses
  • slo_sli: SLO/SLI queries against Prometheus, validate recording rule outputs and burn rate calculations

Test execution:

  • Commands executed via subprocess.run with configurable timeout (default 30s)
  • JSON response parsing with recursive field lookup for expected_fields
  • pytest markers for tag-based filtering (pytest -m traces)
  • before_test and after_test hooks in YAML for setup/teardown scripts
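The recursive field lookup mentioned above can be sketched as follows; this is an illustration of the idea, not the harness's actual implementation.

```python
# Sketch of the recursive expected_fields lookup: return True if a key
# appears anywhere in a nested JSON response. Illustrative only.
def has_field(obj, field: str) -> bool:
    if isinstance(obj, dict):
        if field in obj:
            return True
        return any(has_field(v, field) for v in obj.values())
    if isinstance(obj, list):
        return any(has_field(item, field) for item in obj)
    return False  # scalars cannot contain a field

# A Prometheus-style response contains "result" but no "datarows"
resp = {"status": "success", "data": {"result": [{"metric": {}, "value": [0, "1"]}]}}
assert has_field(resp, "result") and not has_field(resp, "datarows")
```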

Configuration:

  • Connection details read from .env with fallback defaults
  • Dependencies: pytest, pyyaml, pydantic, requests, hypothesis
  • README documenting how to run tests, prerequisites, and how to add new test cases

Requirement 12: Correlation Skill

As a developer, I want a dedicated correlation skill that teaches Claude Code how to join traces, logs, and metrics across all three telemetry signals using OTel semantic convention correlation fields, so that I can perform end-to-end investigations starting from any signal.

OTel correlation fields (sourced from opentelemetry.io):

The OTel specification defines three correlation mechanisms across signals:

| Mechanism | Fields | Signals connected | How it works |
| --- | --- | --- | --- |
| Trace context | `traceId`, `spanId`, `traceFlags` | Traces + Logs | Both span records and log records carry the same `traceId`/`spanId`, enabling direct joins |
| Exemplars | `trace_id`, `span_id`, `filtered_attributes` | Metrics + Traces | Prometheus exemplars attach trace context to individual metric samples |
| Resource attributes | `service.name`, `service.namespace`, `service.version`, `service.instance.id` | All three signals | Every span, metric data point, and log record from the same service carries identical resource attributes |

GenAI resource attributes promoted to Prometheus labels in this stack:

  • gen_ai.agent.id, gen_ai.agent.name, gen_ai.provider.name, gen_ai.request.model, gen_ai.response.model
  • These are configured in docker-compose/prometheus/prometheus.yml under otlp.promote_resource_attributes
  • This enables PromQL queries filtered by agent or model that can then be correlated to traces via exemplars

Trace-to-log correlation (PPL):

  • Find all logs for a trace: source=otel-v1-apm-log-* | WHERE traceId = '<id>'
  • Find logs for a specific span: source=otel-v1-apm-log-* | WHERE spanId = '<id>'
  • Join spans with logs: PPL join across Trace Index and Log Index on traceId
  • Full timeline reconstruction: all spans + all logs for a traceId, sorted by timestamp

Log-to-trace correlation (PPL):

  • From an error log, extract traceId and query the Trace Index for the full trace tree
  • From a log entry, extract spanId and find the exact span that produced it

Metric-to-trace correlation (PromQL + exemplars):

  • Query Prometheus exemplars API: GET /api/v1/query_exemplars?query=<metric>&start=<start>&end=<end>
  • Extract trace_id from exemplar, then query Trace Index via PPL
  • Filter metrics by GenAI labels (gen_ai_agent_name, gen_ai_request_model), then correlate to traces
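The two-step hop from metric to trace can be sketched as below: build the exemplars request, then the follow-up PPL lookup for the extracted `trace_id`. The URL shape follows the endpoint named above; the timestamps and helper names are illustrative.

```python
# Sketch: metric-to-trace correlation in two steps. URL shape follows the
# Prometheus query_exemplars endpoint named above; helper names and
# timestamps are illustrative.
import urllib.parse

def exemplars_url(metric: str, start: int, end: int) -> str:
    qs = urllib.parse.urlencode({"query": metric, "start": start, "end": end})
    return f"http://localhost:9090/api/v1/query_exemplars?{qs}"

def trace_lookup_ppl(trace_id: str) -> str:
    # Step 2: query the Trace Index for the trace_id carried by the exemplar
    return f"source=otel-v1-apm-span-* | WHERE traceId = '{trace_id}'"

print(exemplars_url("gen_ai_client_operation_duration_bucket", 1700000000, 1700003600))
print(trace_lookup_ppl("abc123"))
```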

Resource-level correlation:

  • serviceName in traces/logs maps to service_name label in Prometheus metrics
  • Query all signals for a specific service to get the complete picture

Investigation workflows:

  • Metric spike investigation: PromQL anomaly detection, exemplars, trace tree, correlated logs
  • Error log investigation: find error logs, extract traceId, reconstruct trace, identify root cause span
  • Slow agent investigation: find slow invoke_agent spans, get child spans, correlated logs, token usage metrics

Requirement 13: APM/RED Metrics Skill

As a developer, I want a dedicated APM skill that teaches Claude Code how to construct RED (Rate, Errors, Duration) metrics queries for any service, so that I can quickly assess service health using the standard APM methodology.

  • Rate queries: per-service request rate via PromQL (rate(http_server_duration_seconds_count[5m])), per-endpoint rate, and PPL alternative from trace spans
  • Error queries: error rate as a ratio (5xx / total) via PromQL, error count from trace spans via PPL (status.code = 2)
  • Duration queries: latency percentiles (p50, p95, p99) via PromQL histogram_quantile and PPL percentile() from trace spans
  • Combined RED dashboard query set for all services in a single investigation workflow
  • GenAI-specific RED metrics using gen_ai_client_operation_duration histogram
  • OTel HTTP semantic convention metrics reference: http.server.request.duration (histogram), http.server.active_requests (gauge), and their Prometheus-exported equivalents
  • OTel Collector spanmetrics connector documentation for auto-generating RED metrics from traces
  • Every query template includes the complete curl command with the appropriate endpoint and authentication
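A minimal sketch of the combined RED query set for one service, following the metric and label names used in this document; the `_bucket`/`_count` suffixes are the standard Prometheus histogram convention, and the window default is illustrative.

```python
# Sketch: the three RED PromQL expressions for one service. Metric and label
# names follow this document; suffixes are the standard Prometheus histogram
# convention; the 5m window is illustrative.
def red_queries(service: str, window: str = "5m") -> dict:
    base = f'http_server_duration_seconds_count{{service_name="{service}"}}'
    errs = (f'http_server_duration_seconds_count'
            f'{{service_name="{service}",http_response_status_code=~"5.."}}')
    return {
        "rate": f"sum(rate({base}[{window}]))",
        "errors": f"sum(rate({errs}[{window}])) / sum(rate({base}[{window}]))",
        "duration_p95": (
            "histogram_quantile(0.95, sum(rate("
            f'http_server_duration_seconds_bucket{{service_name="{service}"}}'
            f"[{window}])) by (le))"),
    }

for name, q in red_queries("payment").items():
    print(name, "=>", q)
```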

Requirement 14: SLO/SLI Skill

As a developer, I want a dedicated SLO/SLI skill that teaches Claude Code how to define SLIs, calculate error budgets, and construct burn rate queries using Prometheus recording rules, so that I can implement and monitor service level objectives for my services.

  • SLI definition templates: availability SLI (successful/total ratio), latency SLI (within-threshold/total ratio), GenAI-specific SLI
  • Prometheus recording rule YAML templates for pre-computing SLIs at multiple time windows (5m, 30m, 1h, 6h, 1d, 3d, 30d)
  • Recording rule naming conventions: sli:http_availability:ratio_rate<window>, sli:http_latency:ratio_rate<window>
  • Error budget calculation: remaining budget given an SLO target, consumption rate, common SLO targets (99.9%, 99.5%, 99.0%) with allowed downtime
  • Burn rate queries: single-window and multi-window (Google SRE book pattern: 14.4x fast burn 1h/6h, 1x slow burn 3d/30d)
  • Prometheus alerting rule YAML templates for burn rate alerts
  • SLO compliance reporting: current SLI value, SLO target, error budget remaining, burn rate per service
  • Step-by-step SLO setup workflow: define SLIs, add recording rules, set targets, add burn rate alerts, query compliance
  • Every query template includes the complete curl command with the appropriate Prometheus endpoint and authentication
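The error-budget arithmetic above reduces to two small formulas, sketched here with illustrative function names; the numbers follow the Glossary definitions.

```python
# Sketch of the error-budget arithmetic: budget from an SLO target, and the
# burn rate implied by an observed error ratio. Function names are
# illustrative; the math follows the Glossary definitions.
def error_budget(slo_target: float) -> float:
    """A 99.9% SLO leaves a 0.1% error budget."""
    return 1.0 - slo_target

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """1x means the budget is exhausted exactly at the end of the SLO window."""
    return error_ratio / error_budget(slo_target)

assert abs(error_budget(0.999) - 0.001) < 1e-12
# Fast-burn example: a 1.44% error ratio against a 99.9% SLO burns at 14.4x.
assert round(burn_rate(0.0144, 0.999), 6) == 14.4
```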

Data Models

OpenSearch Trace Index Schema (otel-v1-apm-span-*)

| Field | Type | Description |
| --- | --- | --- |
| `traceId` | keyword | Unique trace identifier |
| `spanId` | keyword | Unique span identifier |
| `parentSpanId` | keyword | Parent span ID (empty for root) |
| `serviceName` | keyword | Service that produced the span |
| `name` | text | Span operation name |
| `kind` | keyword | Span kind (SERVER, CLIENT, INTERNAL, etc.) |
| `startTime` | date | Span start timestamp |
| `endTime` | date | Span end timestamp |
| `durationInNanos` | long | Span duration in nanoseconds |
| `status.code` | integer | Status code (0=Unset, 1=Ok, 2=Error) |
| `attributes.gen_ai.operation.name` | keyword | GenAI operation type |
| `attributes.gen_ai.agent.name` | keyword | Agent name |
| `attributes.gen_ai.agent.id` | keyword | Agent identifier |
| `attributes.gen_ai.request.model` | keyword | Requested model |
| `attributes.gen_ai.usage.input_tokens` | long | Input token count |
| `attributes.gen_ai.usage.output_tokens` | long | Output token count |
| `attributes.gen_ai.tool.name` | keyword | Tool name |
| `attributes.gen_ai.tool.call.id` | keyword | Tool call identifier |
| `attributes.gen_ai.tool.call.arguments` | text | Tool call arguments |
| `attributes.gen_ai.tool.call.result` | text | Tool call result |
| `attributes.gen_ai.conversation.id` | keyword | Conversation identifier |
| `events.attributes.exception.type` | keyword | Exception type |
| `events.attributes.exception.message` | text | Exception message |
| `events.attributes.exception.stacktrace` | text | Exception stacktrace |

OpenSearch Log Index Schema (otel-v1-apm-log-*)

| Field | Type | Description |
| --- | --- | --- |
| `traceId` | keyword | Correlated trace identifier |
| `spanId` | keyword | Correlated span identifier |
| `severityText` | keyword | Log level (ERROR, WARN, INFO, DEBUG) |
| `severityNumber` | integer | Numeric severity |
| `serviceName` | keyword | Service that produced the log |
| `body` | text | Log message body |
| `@timestamp` | date | Log timestamp |

OpenSearch Service Map Index (otel-v2-apm-service-map)

| Field | Type | Description |
| --- | --- | --- |
| `serviceName` | keyword | Source service |
| `destination.domain` | keyword | Destination service |
| `destination.resource` | keyword | Destination resource |
| `traceGroupName` | keyword | Trace group |

Prometheus Metrics

| Metric | Type | Labels |
| --- | --- | --- |
| `http_server_duration_seconds` | histogram | `service_name`, `http_response_status_code` |
| `http_server_active_requests` | gauge | `service_name` |
| `db_client_operation_duration_seconds` | histogram | `service_name` |
| `gen_ai_client_token_usage` | histogram | `gen_ai.operation.name`, `gen_ai.request.model` |
| `gen_ai_client_operation_duration` | histogram | `gen_ai.operation.name`, `gen_ai.request.model` |

Connection Profiles

| Profile | OpenSearch Endpoint | OpenSearch Auth | Prometheus Endpoint | Prometheus Auth |
| --- | --- | --- | --- | --- |
| Local | `https://localhost:9200` | Basic auth (`-u admin:'My_password_123!@#' -k`) | `http://localhost:9090` | None |
| AWS Managed | `https://DOMAIN-ID.REGION.es.amazonaws.com` | AWS SigV4 (`--aws-sigv4 "aws:amz:REGION:es"`) | `https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID` | AWS SigV4 (`--aws-sigv4 "aws:amz:REGION:aps"`) |

Test Fixture YAML Schema

```yaml
- name: "agent_invocations"
  description: "Query all agent invocation spans"
  command: |
    curl -sk -u admin:'My_password_123!@#' \
      -X POST https://localhost:9200/_plugins/_ppl \
      -H 'Content-Type: application/json' \
      -d '{"query": "source=otel-v1-apm-span-* | WHERE `attributes.gen_ai.operation.name` = '\''invoke_agent'\'' | head 10"}'
  expected_status_code: 200
  expected_fields: ["schema", "datarows"]
  tags: ["traces"]
  before_test: null
  after_test: null
```
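As a simplified stand-in for the Pydantic model that validates this schema (the real harness uses pydantic with `extra="forbid"`), the sketch below shows the same behavior with the stdlib only; field names match the YAML above.

```python
# Simplified stand-in for the Pydantic fixture model (the real harness uses
# pydantic with extra="forbid"); stdlib only. Field names match the YAML
# schema above.
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class TestFixture:
    name: str
    description: str
    command: str
    expected_status_code: int
    expected_fields: List[str]
    tags: List[str]
    before_test: Optional[str] = None
    after_test: Optional[str] = None

def load_fixture(raw: dict) -> TestFixture:
    allowed = {f.name for f in fields(TestFixture)}
    extra = set(raw) - allowed
    if extra:  # mimic pydantic's extra="forbid"
        raise ValueError(f"unknown fixture keys: {sorted(extra)}")
    return TestFixture(**raw)

fx = load_fixture({"name": "agent_invocations", "description": "Query agent spans",
                   "command": "curl ...", "expected_status_code": 200,
                   "expected_fields": ["schema", "datarows"], "tags": ["traces"]})
print(fx.name, fx.expected_status_code)
```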

Design Decisions

Why a flat skills/ directory?
Eight files don't need subdirectories. Flat is simpler to reference from CLAUDE.md and easier for contributors to navigate.

Why complete curl commands instead of just query bodies?
Claude Code can execute curl directly via its Bash tool. Including the full command (endpoint, auth, headers, body) means zero assembly required. The skill file is the executable documentation.

Why a dedicated PPL reference file?
The PPL grammar is large (50+ commands, 14 function categories). Inlining it into traces.md or logs.md would bloat those files. As a separate skill, Claude Code loads it on demand when it needs to construct a novel query.

Why YAML test fixtures instead of inline pytest?
Declarative YAML fixtures are easier for contributors to add (no Python knowledge needed to add a test case). The Pydantic schema catches malformed fixtures at load time. This pattern is proven at scale in HolmesGPT's test suite.

Why read credentials from .env?
The observability stack already centralizes configuration in .env. The plugin and test harness reuse the same source of truth rather than duplicating credentials.

Error Handling

Skill File Errors

| Scenario | Handling |
| --- | --- |
| OpenSearch unreachable | Stack health skill provides diagnostic steps: check `docker compose ps`, verify port 9200, check health endpoint |
| Prometheus unreachable | Stack health skill suggests checking container status and port 9090 |
| PPL query syntax error | PPL reference skill provides syntax guidance; `_explain` endpoint helps debug query plans |
| Authentication failure | Skill files document correct credentials from `.env`; stack health skill suggests verifying credentials |
| No data in indices | Stack health skill provides index listing commands and document count verification |
| Data Prepper pipeline errors | Stack health skill suggests checking Data Prepper logs via `docker compose logs data-prepper` |
| OTel Collector export failures | Stack health skill suggests checking collector metrics at port 8888 and logs |

Test Harness Errors

| Scenario | Handling |
| --- | --- |
| Stack not running | Session-scoped fixture detects this and skips all tests with a clear message |
| Curl command timeout | Configurable timeout (default 30s); test fails with timeout error |
| Invalid YAML fixture | Pydantic model with `extra="forbid"` raises validation error at load time |
| Unexpected JSON response | Test reports which `expected_fields` were missing from the response |
| Hook failure | Test reports `before_test`/`after_test` hook failure separately from the main command result |
| Missing `.env` file | Config loader falls back to hardcoded defaults |

Running the Tests

Prerequisites: the observability stack must be running (docker compose up -d).

```shell
cd claude-code-observability-plugin/tests

# Install dependencies
pip install -r requirements.txt

# Run all tests
pytest

# Run by category
pytest -m traces
pytest -m logs
pytest -m metrics
pytest -m stack_health
pytest -m ppl

# Verbose output
pytest -v --tb=short
```

If the stack is not running, all tests are skipped with a clear message.

Open Questions

  1. Plugin location: Should the plugin live at the repo root (claude-code-observability-plugin/) or under a new plugins/ directory?

  2. Versioning: Should the plugin version track the observability stack version, or have its own independent version?

  3. Additional AI assistants: The skill file format is Claude Code-specific (CLAUDE.md convention). Should we also provide equivalent configurations for other AI coding assistants (e.g., Cursor rules, Kiro steering)?

  4. Metrics in OpenSearch: The metrics skill currently targets Prometheus. Should we also include PPL queries for metrics stored in OpenSearch (when metrics are ingested via Data Prepper)?

  5. Example telemetry data: Should the test harness include a script that sends sample telemetry data to the stack, so tests can validate queries return actual results rather than just valid empty responses?

How to Contribute

Adding a new query template to a skill file:

  1. Add the curl command to the appropriate skills/*.md file
  2. Add a corresponding test fixture in tests/fixtures/*.yaml
  3. Run pytest to verify the command works against a running stack

Adding a new test case:

  1. Create a YAML entry in the appropriate tests/fixtures/*.yaml file
  2. Follow the schema: name, description, command, expected_status_code, expected_fields, tags
  3. Run pytest -m <tag> to verify

Feedback Requested

We'd like feedback on:

  • The skill file organization and routing approach
  • Which query templates are most valuable for your workflow
  • The open questions above
  • Any missing capabilities or query patterns you'd want included
  • The integration test approach and fixture format

Please comment on this RFC or open an issue with your thoughts.

Labels: enhancement (New feature or request)