Merged
21 changes: 21 additions & 0 deletions .claude-plugin/marketplace.json
@@ -0,0 +1,21 @@
{
  "name": "observability-stack",
  "owner": {
    "name": "OpenSearch Project",
    "email": "anirudha@nyu.edu"
Member

Should we change this to some OS email alias?

Collaborator Author

Keeping personal email for now until we finalize a proper OS alias. Will update once decided.

  },
  "metadata": {
    "description": "Observability plugins for the OpenSearch stack"
  },
  "plugins": [
    {
      "name": "observability",
      "source": "./claude-code-observability-plugin",
      "description": "Query and investigate traces, logs, and metrics from an OpenSearch-based observability stack using PPL and PromQL",
      "version": "1.0.0",
      "author": {
        "name": "OpenSearch Project"
      }
    }
  ]
}
47 changes: 47 additions & 0 deletions .github/workflows/claude-code-plugin-release.yml
@@ -0,0 +1,47 @@
name: Claude Code Plugin Release

on:
  release:
    types: [published]
  workflow_dispatch:

jobs:
  build-plugin-zips:
    name: Build Plugin ZIPs
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - uses: actions/checkout@v4

      - name: Build skill ZIP files
        run: |
          PLUGIN_DIR=claude-code-observability-plugin
          DIST_DIR=$PLUGIN_DIR/dist
          mkdir -p "$DIST_DIR"

          for skill_dir in "$PLUGIN_DIR"/skills/*/; do
            skill_name=$(basename "$skill_dir")
            if [ -f "$skill_dir/SKILL.md" ]; then
              zip -j "$DIST_DIR/${skill_name}.zip" "$skill_dir/SKILL.md"
              echo "Built $DIST_DIR/${skill_name}.zip"
            fi
          done

          ls -la "$DIST_DIR"

      - name: Upload ZIPs as artifacts
        uses: actions/upload-artifact@v4
        with:
          name: claude-code-plugin-skills
          path: claude-code-observability-plugin/dist/*.zip

      - name: Attach ZIPs to release
        if: github.event_name == 'release'
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          for zip in claude-code-observability-plugin/dist/*.zip; do
            gh release upload "${{ github.event.release.tag_name }}" "$zip" --clobber
            echo "Uploaded $(basename "$zip") to release ${{ github.event.release.tag_name }}"
          done
13 changes: 13 additions & 0 deletions claude-code-observability-plugin/.claude-plugin/plugin.json
@@ -0,0 +1,13 @@
{
  "name": "opensearch@observability",
  "version": "1.0.0",
  "description": "Query and investigate traces, logs, and metrics from an OpenSearch-based observability stack using PPL and PromQL",
  "author": {
    "name": "OpenSearch Project",
    "url": "https://github.com/opensearch-project/observability-stack"
  },
  "homepage": "https://observability.opensearch.org/docs/claude-code/",
  "repository": "https://github.com/opensearch-project/observability-stack",
  "license": "Apache-2.0",
  "keywords": ["observability", "opensearch", "traces", "logs", "metrics", "ppl", "promql", "opentelemetry"]
}
103 changes: 103 additions & 0 deletions claude-code-observability-plugin/CLAUDE.md
@@ -0,0 +1,103 @@
# OpenSearch Observability Plugin for Claude Code

This plugin teaches Claude Code how to query and investigate traces, logs, and metrics from an OpenSearch-based observability stack. It provides nine skill files containing PPL (Piped Processing Language) query templates for OpenSearch, PromQL query templates for Prometheus, and curl-based commands — all ready to execute against a running stack.

## Skill Routing Table

Load the appropriate skill file based on the user's intent:

| Skill | When to Use |
|---|---|
| `skills/traces/SKILL.md` | Use when investigating agent invocations, tool executions, slow spans, error spans, token usage, or trace correlation |
| `skills/logs/SKILL.md` | Use when searching logs by severity, correlating logs with traces, identifying error patterns, or analyzing log volume |
| `skills/metrics/SKILL.md` | Use when querying HTTP request rates, latency percentiles, error rates, active connections, or GenAI metrics |
| `skills/stack-health/SKILL.md` | Use when checking stack component health, troubleshooting data flow issues, or verifying service status |
| `skills/ppl-reference/SKILL.md` | Use when constructing novel PPL queries, looking up PPL syntax, or understanding PPL functions |
| `skills/correlation/SKILL.md` | Use when performing cross-signal correlation between traces, logs, and metrics |
| `skills/apm-red/SKILL.md` | Use when analyzing RED metrics (Rate, Errors, Duration) for service-level monitoring |
| `skills/slo-sli/SKILL.md` | Use when defining SLOs/SLIs, calculating error budgets, or setting up burn rate alerts |
| `skills/osd-config/SKILL.md` | Use when discovering index patterns, workspaces, saved objects, APM configs, or field mappings from OpenSearch Dashboards or OpenSearch APIs |

## Configuration

### Environment Variables

Set these environment variables to override default endpoints:

- `$OPENSEARCH_ENDPOINT` — OpenSearch base URL (default: `https://localhost:9200`)
- `$PROMETHEUS_ENDPOINT` — Prometheus base URL (default: `http://localhost:9090`)
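
As a rough illustration of the override behavior described above, the sketch below resolves both endpoints from the environment and falls back to the documented defaults. The function name `resolve_endpoints` and the trailing-slash handling are assumptions made for this example; the plugin itself may resolve these values differently.

```python
import os

# Defaults copied from the list above.
DEFAULT_OPENSEARCH_ENDPOINT = "https://localhost:9200"
DEFAULT_PROMETHEUS_ENDPOINT = "http://localhost:9090"


def resolve_endpoints(env=None):
    """Return (opensearch_url, prometheus_url), preferring environment overrides.

    `resolve_endpoints` is a hypothetical helper, not part of the plugin.
    """
    env = os.environ if env is None else env
    # `or` also falls back to the default when a variable is set but empty.
    opensearch = env.get("OPENSEARCH_ENDPOINT") or DEFAULT_OPENSEARCH_ENDPOINT
    prometheus = env.get("PROMETHEUS_ENDPOINT") or DEFAULT_PROMETHEUS_ENDPOINT
    # Strip a trailing slash so API paths can be appended with a single "/".
    return opensearch.rstrip("/"), prometheus.rstrip("/")


if __name__ == "__main__":
    print(resolve_endpoints())
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to exercise without touching the real environment.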

### Connection Profiles

#### Local Stack (Default)

| Service | Endpoint | Auth |
|---|---|---|
| OpenSearch | `https://localhost:9200` | `-u admin:'My_password_123!@#' -k` (HTTPS + basic auth, skip TLS verify) |
| Prometheus | `http://localhost:9090` | None (HTTP, no auth) |

Example OpenSearch curl:

```bash
curl -sk -u admin:'My_password_123!@#' \
-X POST https://localhost:9200/_plugins/_ppl \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | head 10"}'
```
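
For readers who prefer to script the same call, here is a minimal Python sketch of the request the curl command above sends. It only builds the request and does not send it; `build_ppl_request` is a hypothetical helper, and actually sending the request against the local stack would additionally need an SSL context that skips certificate verification (the equivalent of `-k`).

```python
import base64
import json
import urllib.request


def build_ppl_request(query, endpoint="https://localhost:9200",
                      username="admin", password="My_password_123!@#"):
    """Build (but do not send) a POST request for the PPL endpoint.

    Hypothetical helper for illustration; defaults mirror the local profile.
    """
    body = json.dumps({"query": query}).encode("utf-8")
    # HTTP basic auth is the base64 encoding of "user:password".
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{endpoint}/_plugins/_ppl",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_ppl_request("source=otel-v1-apm-span-* | head 10")
    print(req.get_method(), req.full_url)
```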

Example Prometheus curl:

```bash
curl -s 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=up'
```
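
The same instant query can be sketched in Python. As with curl's `--data-urlencode`, supplying a form-encoded body turns the call into a POST, which the Prometheus query API accepts. `build_promql_request` is a hypothetical name used only for this example, and the request is built but not sent.

```python
import urllib.parse
import urllib.request


def build_promql_request(query, endpoint="http://localhost:9090"):
    """Build (but do not send) an instant-query request for Prometheus.

    Hypothetical helper for illustration; the default matches the local profile.
    """
    # URL-encode the expression so brackets, braces, and spaces survive transport.
    body = urllib.parse.urlencode({"query": query}).encode("ascii")
    return urllib.request.Request(
        url=f"{endpoint}/api/v1/query",
        data=body,
        method="POST",
    )


if __name__ == "__main__":
    req = build_promql_request("up")
    print(req.get_method(), req.full_url, req.data)
```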

#### AWS Managed Services

##### Amazon OpenSearch Service

- Endpoint format: `https://DOMAIN-ID.REGION.es.amazonaws.com`
- Auth: AWS Signature Version 4

```bash
curl -s --aws-sigv4 "aws:amz:REGION:es" \
--user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
-X POST https://DOMAIN-ID.REGION.es.amazonaws.com/_plugins/_ppl \
-H 'Content-Type: application/json' \
-d '{"query": "source=otel-v1-apm-span-* | head 10"}'
```

##### Amazon Managed Service for Prometheus (AMP)

- Endpoint format: `https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query`
- Auth: AWS Signature Version 4

```bash
curl -s --aws-sigv4 "aws:amz:REGION:aps" \
--user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
'https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query' \
--data-urlencode 'query=up'
```

> **Note:** PPL and PromQL query syntax is identical across local and AWS managed profiles. Only the endpoint URL and authentication method differ.

## Port Reference

| Component | Port | Protocol |
|---|---|---|
| OpenSearch | 9200 | HTTPS |
| OTel Collector (gRPC) | 4317 | gRPC |
| OTel Collector (HTTP) | 4318 | HTTP |
| Data Prepper | 21890 | HTTP |
| Prometheus | 9090 | HTTP |
| OpenSearch Dashboards | 5601 | HTTP |

## Index Patterns

| Signal | Index Pattern | Key Fields |
|---|---|---|
| Traces | `otel-v1-apm-span-*` | `traceId`, `spanId`, `serviceName`, `name`, `durationInNanos`, `status.code`, `attributes.gen_ai.*` |
| Logs | `logs-otel-v1-*` | `traceId`, `spanId`, `severityText`, `body`, `resource.attributes.service.name`, `@timestamp` |
| Service Maps | `otel-v2-apm-service-map-*` | `sourceNode`, `targetNode`, `sourceOperation`, `targetOperation` |

> **Note:** The log index uses `resource.attributes.service.name` (backtick-quoted in PPL) instead of `serviceName`. The trace span index has a top-level `serviceName` field.
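
To make the field difference in the note above concrete, the following sketch builds a service-filtered PPL query for either index. The helper name `service_query`, the `head` limit, and the rejection of quote characters are assumptions for this example rather than anything the plugin ships.

```python
# Index patterns and service-name fields taken from the table and note above.
SIGNALS = {
    "traces": ("otel-v1-apm-span-*", "serviceName"),
    # Dotted field names are backtick-quoted in PPL.
    "logs": ("logs-otel-v1-*", "`resource.attributes.service.name`"),
}


def service_query(signal, service, limit=10):
    """Return a PPL query that filters one signal's index by service name.

    Hypothetical helper for illustration only.
    """
    if "'" in service:
        # Keep the sketch simple: refuse values that would need escaping.
        raise ValueError("service name must not contain single quotes")
    index, field = SIGNALS[signal]
    return f"source={index} | where {field} = '{service}' | head {limit}"


if __name__ == "__main__":
    print(service_query("traces", "weather-agent"))
    print(service_query("logs", "weather-agent"))
```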
151 changes: 151 additions & 0 deletions claude-code-observability-plugin/docs/INSTALL.md
@@ -0,0 +1,151 @@
# Installation Guide

## Prerequisites

1. **Claude Code CLI** — Install from [claude.ai/claude-code](https://claude.ai/claude-code)
2. **Running Observability Stack** — The plugin queries a local OpenSearch + Prometheus stack

### Start the Observability Stack

```bash
git clone https://github.com/opensearch-project/observability-stack.git
cd observability-stack
docker compose up -d
```

Verify services are running:

```bash
# OpenSearch (should return cluster health JSON)
curl -sk -u 'admin:My_password_123!@#' 'https://localhost:9200/_cluster/health?pretty'

# Prometheus (should return "Prometheus Server is Healthy.")
curl -s http://localhost:9090/-/healthy
```

## Install the Plugin

From the `observability-stack` repository root:

```bash
claude install-plugin ./claude-code-observability-plugin
```

Or install directly from GitHub:

```bash
claude install-plugin https://github.com/opensearch-project/observability-stack/tree/main/claude-code-observability-plugin
```

## Verify Installation

Start Claude Code and try a query:

```
claude
> Show me the top 10 services by trace span count
```

Claude should execute a PPL query against OpenSearch and return results. You can also try:

```
> Check the health of the observability stack
> Show me error logs from the last hour
> What is the p95 latency for all services?
```

## Configuration

### Default Endpoints

| Service | Endpoint | Auth |
|---|---|---|
| OpenSearch | `https://localhost:9200` | `admin` / `My_password_123!@#` (HTTPS, `-k` flag) |
| Prometheus | `http://localhost:9090` | None |

### Custom Endpoints

Override defaults with environment variables:

```bash
export OPENSEARCH_ENDPOINT=https://my-opensearch:9200
export PROMETHEUS_ENDPOINT=http://my-prometheus:9090
```

### AWS Managed Services

The plugin supports Amazon OpenSearch Service and Amazon Managed Service for Prometheus. Queries use AWS SigV4 authentication instead of basic auth. See the skill files for AWS-specific curl examples.

## Available Skills

| Skill | Description |
|---|---|
| `traces` | Query trace spans — agent invocations, tool executions, latency, errors |
| `logs` | Search and analyze logs — severity filtering, body search, error patterns |
| `metrics` | Query Prometheus metrics — HTTP rates, latency percentiles, GenAI tokens |
| `stack-health` | Check component health, verify data ingestion, troubleshoot issues |
| `ppl-reference` | Comprehensive PPL syntax reference with observability examples |
| `correlation` | Cross-signal correlation between traces, logs, and metrics |
| `apm-red` | RED metrics (Rate, Errors, Duration) for service monitoring |
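| `slo-sli` | SLO/SLI definitions, error budgets, and burn rate alerting |
| `osd-config` | Discover index patterns, workspaces, saved objects, APM configs, and field mappings |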

## Running Tests

```bash
cd claude-code-observability-plugin/tests
pip install -r requirements.txt

# All tests (requires running stack)
pytest -v

# Property tests only (no stack needed)
pytest test_properties.py -v

# Filter by skill
pytest -m traces
pytest -m logs
pytest -m metrics
```

## Troubleshooting

### "Observability stack is not running"

Tests and skills require OpenSearch and Prometheus to be running locally. Start them with:

```bash
docker compose up -d opensearch prometheus
```

### OpenSearch returns "Unauthorized"

Check that the password in `.env` matches the one you are using. Default: `My_password_123!@#`

### No trace/log data

The observability stack includes example services (canary, weather-agent, travel-planner) that generate telemetry data automatically. Ensure they're running:

```bash
docker compose ps | grep -E "canary|weather|travel"
```

If not running, check that `INCLUDE_COMPOSE_EXAMPLES=docker-compose.examples.yml` is set in `.env`.

### Prometheus OOM / crash-looping

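If Prometheus is crash-looping with exit code 137, the container was killed with SIGKILL, which usually means it ran out of memory; replaying a large or corrupted WAL at startup may be the cause. Clear the volume and restart (this deletes all locally stored metrics):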

```bash
docker compose stop prometheus
docker compose rm -f prometheus
docker volume rm observability-stack_prometheus-data
docker compose up -d prometheus
```
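
Exit code 137 follows the common convention that a status above 128 means "128 plus the number of the signal that ended the process". The small sketch below decodes such a code on a POSIX system; `describe_exit_code` is a hypothetical helper written for this example.

```python
import signal


def describe_exit_code(code):
    """Return the signal name behind an exit code above 128, or None.

    Hypothetical helper; relies on the 128 + signal-number convention.
    """
    if code <= 128:
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        # The remainder does not map to a known signal on this platform.
        return None


if __name__ == "__main__":
    print(describe_exit_code(137))
```

A result of `SIGKILL` is consistent with the container being stopped by the kernel's out-of-memory killer, although other sources of SIGKILL are possible.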

## Index Reference

| Signal | Index Pattern | Key Fields |
|---|---|---|
| Traces | `otel-v1-apm-span-*` | `traceId`, `spanId`, `serviceName`, `name`, `durationInNanos`, `status.code` |
| Logs | `logs-otel-v1-*` | `traceId`, `spanId`, `severityText`, `body`, `resource.attributes.service.name` |
| Service Maps | `otel-v2-apm-service-map-*` | `sourceNode`, `targetNode`, `sourceOperation`, `targetOperation` |