Runbook: Agentic Cloud Operator & Incident Investigator

An AI-powered SRE assistant that investigates incidents, executes runbooks, and manages cloud infrastructure using a research-first, hypothesis-driven methodology.

Core Methodology

Source                    Contribution
Dexter                    Research-first architecture, scratchpad audit trail, skills, graceful limits
Bits AI (Datadog)         Hypothesis branching, causal focus, evidence-based pruning
Organizational Knowledge  Runbooks, post-mortems, architecture docs, service ownership

Investigation Flow

Incident Alert (PagerDuty/OpsGenie)
    ↓
Initial Context Gathering
    ├─ Alert metadata
    ├─ Recent deployments
    ├─ Service dependencies
    └─ Retrieved organizational knowledge
    ↓
Hypothesis Formation (3-5 initial hypotheses)
    ↓
Parallel Hypothesis Testing (targeted queries only)
    ↓
Branch (strong evidence) / Prune (no evidence)
    ↓
Recursive Investigation (max depth: 4)
    ↓
Root Cause Identification + Confidence Score
    ↓
Remediation (with approval for mutations)
    ↓
Scratchpad: Full Audit Trail
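
The branch/prune steps and the depth limit in the flow above can be sketched as a small tree structure. This is a minimal illustration, not the project's actual hypothesis.ts: the type names, the `createHypothesis`/`branch`/`prune`/`openLeaves` helpers, and the choice to count root hypotheses as depth 1 are all assumptions.

```typescript
// Hypothetical types and names for illustration; not taken from the codebase.
type HypothesisStatus = "open" | "pruned" | "confirmed";

interface Hypothesis {
  id: string;
  statement: string;
  depth: number; // root hypotheses are depth 1 (an assumption)
  status: HypothesisStatus;
  children: Hypothesis[];
}

const MAX_DEPTH = 4; // mirrors "max depth: 4" in the flow above

function createHypothesis(id: string, statement: string, depth = 1): Hypothesis {
  return { id, statement, depth, status: "open", children: [] };
}

// Branch: strong evidence spawns narrower sub-hypotheses, unless the depth limit is reached.
function branch(parent: Hypothesis, statements: string[]): Hypothesis[] {
  if (parent.depth >= MAX_DEPTH) return [];
  const children = statements.map((statement, i) =>
    createHypothesis(`${parent.id}.${i + 1}`, statement, parent.depth + 1),
  );
  parent.children.push(...children);
  return children;
}

// Prune: no supporting evidence, so close the hypothesis and its whole subtree.
function prune(hypothesis: Hypothesis): void {
  hypothesis.status = "pruned";
  hypothesis.children.forEach((child) => prune(child));
}

// Open hypotheses without children, i.e. the next candidates to test.
function openLeaves(hypothesis: Hypothesis): Hypothesis[] {
  if (hypothesis.status !== "open") return [];
  if (hypothesis.children.length === 0) return [hypothesis];
  const leaves: Hypothesis[] = [];
  for (const child of hypothesis.children) leaves.push(...openLeaves(child));
  return leaves;
}
```

Returning an empty list from `branch` at the depth limit, rather than throwing, keeps the sketch in line with the "graceful limits" idea: the investigation simply stops descending.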

Implementation Plan

Phase 1: Project Foundation

Initialize project structure
Create PLAN.md
Set up TypeScript + Bun configuration
Set up ESLint + Prettier
Create base directory structure
Add core dependencies (Anthropic SDK, AWS SDK, etc.)

Phase 2: Core Agent Loop

Phase 3: Hypothesis Engine

Phase 4: Cloud Provider Tools (AWS First)

Phase 5: Safety & Approval System

Phase 6: Observability Tools

Phase 7: Incident Management Integration

Phase 8: Knowledge System

Phase 9: Skills System

Phase 10: CLI Interface

Phase 11: Learning & Suggestions

Implement learning module (src/knowledge/learning/)
- Post-investigation analysis
- Runbook suggestion generation
- Known issue detection (recurring patterns)
Implement knowledge update suggestions
- New runbook drafts
- Runbook update patches
- Post-mortem drafts

Phase 12: Multi-Cloud Expansion (Future)

GCP provider (src/providers/gcp/)
Azure provider (src/providers/azure/)
Kubernetes provider (src/providers/kubernetes/)
Terraform integration (src/providers/terraform/)

Project Structure

runbook/
├── src/
│   ├── agent/
│   │   ├── agent.ts              # Main agent loop
│   │   ├── hypothesis.ts         # Hypothesis tree management
│   │   ├── confidence.ts         # Evidence scoring
│   │   ├── prompts.ts            # Prompt templates
│   │   ├── scratchpad.ts         # Audit trail
│   │   ├── safety.ts             # Mutation controls
│   │   └── types.ts              # Event types
│   ├── providers/
│   │   ├── aws/
│   │   │   ├── client.ts         # AWS SDK wrapper
│   │   │   └── tools/            # EC2, ECS, Lambda, etc.
│   │   ├── gcp/                  # Future
│   │   └── kubernetes/           # Future
│   ├── tools/
│   │   ├── registry.ts           # Tool registration
│   │   ├── skill.ts              # Skill invocation
│   │   ├── aws/
│   │   │   ├── aws-query.ts      # Read-only meta-router
│   │   │   └── aws-mutate.ts     # State changes
│   │   ├── observability/
│   │   │   ├── causal-query.ts   # Hypothesis-targeted queries
│   │   │   ├── cloudwatch.ts
│   │   │   └── datadog.ts
│   │   └── incident/
│   │       ├── pagerduty.ts
│   │       ├── opsgenie.ts
│   │       └── slack.ts
│   ├── knowledge/
│   │   ├── types.ts
│   │   ├── sources/
│   │   │   ├── filesystem.ts
│   │   │   ├── confluence.ts     # Future
│   │   │   └── github.ts         # Future
│   │   ├── indexer/
│   │   │   ├── chunker.ts
│   │   │   ├── embedder.ts
│   │   │   └── metadata.ts
│   │   ├── store/
│   │   │   ├── vector-store.ts
│   │   │   ├── graph-store.ts
│   │   │   └── sqlite.ts
│   │   ├── retriever/
│   │   │   ├── hybrid-search.ts
│   │   │   ├── reranker.ts
│   │   │   └── context-builder.ts
│   │   └── learning/
│   │       ├── suggest-updates.ts
│   │       └── auto-enrich.ts
│   ├── skills/
│   │   ├── registry.ts
│   │   ├── investigate-incident/
│   │   │   └── SKILL.md
│   │   ├── deploy-service/
│   │   │   └── SKILL.md
│   │   ├── scale-service/
│   │   │   └── SKILL.md
│   │   ├── troubleshoot-service/
│   │   │   └── SKILL.md
│   │   └── cost-analysis/
│   │       └── SKILL.md
│   ├── model/
│   │   └── llm.ts                # LLM client with caching
│   ├── hooks/
│   │   └── useAgentRunner.ts     # React hook for CLI
│   ├── utils/
│   │   ├── tokens.ts             # Token counting
│   │   └── config.ts             # Configuration loading
│   └── cli.tsx                   # CLI entry point
├── .runbook/                     # User configuration (gitignored)
│   ├── config.yaml
│   ├── runbooks/                 # Local runbooks
│   ├── knowledge.db              # SQLite + vectors
│   ├── service-graph.json
│   ├── scratchpad/               # Investigation logs
│   └── investigations/           # Investigation trees
├── examples/
│   └── runbooks/                 # Example runbooks
├── package.json
├── tsconfig.json
├── bunfig.toml
├── PLAN.md                       # This file
└── README.md

Configuration Schema

.runbook/config.yaml

# LLM Configuration
llm:
  provider: anthropic  # anthropic | openai
  model: claude-sonnet-4-20250514
  api_key: ${ANTHROPIC_API_KEY}

# Cloud Providers
providers:
  aws:
    enabled: true
    regions: [us-east-1, us-west-2]
    profile: default  # AWS profile or use env vars

# Incident Management
incident:
  pagerduty:
    enabled: true
    api_key: ${PAGERDUTY_API_KEY}
  opsgenie:
    enabled: false
  slack:
    enabled: true
    bot_token: ${SLACK_BOT_TOKEN}

# Knowledge Sources
knowledge:
  sources:
    - type: filesystem
      path: .runbook/runbooks/
      watch: true
    - type: filesystem
      path: ~/.runbook/knowledge/

  store:
    type: local
    path: .runbook/knowledge.db
    embedding_model: text-embedding-3-small

  retrieval:
    top_k: 10
    rerank: true

# Safety
safety:
  require_approval:
    - high_risk
    - critical
  max_mutations_per_session: 5
  cooldown_between_critical_ms: 60000

# Agent
agent:
  max_iterations: 10
  max_hypothesis_depth: 4
  context_threshold_tokens: 100000
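
The schema above uses `${NAME}` placeholders for secrets. The plan does not say how these are resolved, so the following is only one plausible approach: walk the parsed config and substitute values from an environment map, failing loudly on a missing variable. The `ConfigValue` type and `resolveEnv` function are hypothetical names.

```typescript
// Hypothetical helper: resolves ${NAME} placeholders in an already-parsed config object.
type ConfigValue =
  | string
  | number
  | boolean
  | null
  | ConfigValue[]
  | { [key: string]: ConfigValue };

function resolveEnv(
  value: ConfigValue,
  env: Record<string, string | undefined>,
): ConfigValue {
  if (typeof value === "string") {
    // Replace every ${NAME} occurrence; a missing variable is treated as a hard error.
    return value.replace(/\$\{([A-Z0-9_]+)\}/g, (_match: string, name: string) => {
      const resolved = env[name];
      if (resolved === undefined) {
        throw new Error(`Missing environment variable: ${name}`);
      }
      return resolved;
    });
  }
  if (Array.isArray(value)) {
    return value.map((item) => resolveEnv(item, env));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: ConfigValue } = {};
    for (const [key, item] of Object.entries(value)) {
      out[key] = resolveEnv(item, env);
    }
    return out;
  }
  return value; // numbers, booleans, and null pass through unchanged
}
```

In use, the second argument would typically be `process.env`; passing it explicitly keeps the function easy to test.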

Key Design Decisions

1. Hypothesis-Driven Investigation

Rather than gathering all available data, we form hypotheses and test them with targeted queries. This reduces noise and focuses on causal relationships.
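
As a rough sketch of what "targeted queries" could mean in practice, the snippet below maps a failure pattern to a small set of metrics for one service and flags queries that look like broad data gathering. The metric names, the 120-minute threshold, and the function names are all illustrative assumptions, not the project's causal query builder.

```typescript
// All names, metric identifiers, and thresholds here are hypothetical.
type FailurePattern = "latency" | "errors" | "memory" | "cpu";

interface TargetedQuery {
  metric: string;
  service: string;
  windowMinutes: number;
}

const METRICS_BY_PATTERN: Record<FailurePattern, string[]> = {
  latency: ["p99_latency_ms", "upstream_latency_ms"],
  errors: ["http_5xx_rate", "exception_count"],
  memory: ["memory_utilization", "oom_kill_count"],
  cpu: ["cpu_utilization", "throttled_seconds"],
};

// Build only the queries that can confirm or refute a hypothesis about one pattern.
function buildQueries(
  pattern: FailurePattern,
  service: string,
  windowMinutes = 30,
): TargetedQuery[] {
  return METRICS_BY_PATTERN[pattern].map((metric) => ({ metric, service, windowMinutes }));
}

// Returns a reason when a query looks like undirected data gathering, otherwise null.
function findAntiPattern(query: TargetedQuery): string | null {
  if (query.service === "" || query.service === "*") return "query is not scoped to a service";
  if (query.metric === "*") return "query requests every metric";
  if (query.windowMinutes > 120) return "time window is wider than 120 minutes";
  return null;
}
```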

2. Graceful Limits

Tool limits warn but never block. The agent can always proceed, but it receives warnings that discourage retry loops.
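
A warn-but-never-block limit might look like the sketch below: repeated identical tool calls are counted, and once a soft limit is passed the result carries a warning while still allowing the call. The `ToolCallTracker` class, the `LimitCheck` shape, and the default limit of 3 are assumptions made for illustration.

```typescript
// Hypothetical sketch of a soft limit: it annotates results but never refuses a call.
interface LimitCheck {
  proceed: true; // always true, because limits never block
  warning?: string;
}

class ToolCallTracker {
  private readonly counts = new Map<string, number>();
  private readonly softLimit: number;

  constructor(softLimit = 3) {
    this.softLimit = softLimit;
  }

  // Count calls per (tool, arguments) pair so only true repeats trigger a warning.
  record(tool: string, args: unknown): LimitCheck {
    const key = `${tool}:${JSON.stringify(args)}`;
    const count = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, count);
    if (count > this.softLimit) {
      return {
        proceed: true,
        warning:
          `${tool} was called ${count} times with identical arguments ` +
          `(soft limit ${this.softLimit}); consider a different query.`,
      };
    }
    return { proceed: true };
  }
}
```

The warning text would be fed back to the model alongside the tool result, so the agent can change approach instead of repeating the same call.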

3. Research-First for Mutations

All state-changing operations require prior research to understand current state and impact.
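
One way to enforce this rule is a gate that every mutation passes through before execution. The sketch below is an assumption-heavy illustration: the `SafetyPolicy` fields loosely mirror `require_approval` and `max_mutations_per_session` from the configuration schema, the risk levels follow the low/medium/high/critical classification mentioned in the progress summary, and the cooldown setting is left out for brevity.

```typescript
// Hypothetical gate; field and function names are illustrative only.
type RiskLevel = "low" | "medium" | "high" | "critical";

interface SafetyPolicy {
  requireApproval: RiskLevel[];
  maxMutationsPerSession: number;
}

interface MutationRequest {
  action: string;
  risk: RiskLevel;
  researchFindings: string[]; // read-only observations gathered before the mutation
}

type MutationDecision =
  | { kind: "reject"; reason: string }
  | { kind: "needs_approval" }
  | { kind: "allow" };

function evaluateMutation(
  request: MutationRequest,
  policy: SafetyPolicy,
  mutationsThisSession: number,
): MutationDecision {
  // Research-first: refuse to mutate when nothing was observed beforehand.
  if (request.researchFindings.length === 0) {
    return { kind: "reject", reason: "no prior research recorded for this mutation" };
  }
  if (mutationsThisSession >= policy.maxMutationsPerSession) {
    return { kind: "reject", reason: "session mutation limit reached" };
  }
  if (policy.requireApproval.includes(request.risk)) {
    return { kind: "needs_approval" };
  }
  return { kind: "allow" };
}
```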

4. Full Audit Trail

Every tool call, hypothesis, and decision is logged to JSONL for compliance and debugging.
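
JSONL means one JSON object per line, which makes the log append-only and easy to stream. The sketch below shows only the serialization side; the `ScratchpadEntry` shape and its three entry types are assumptions, and the real scratchpad format may differ.

```typescript
// Hypothetical entry shape; the actual scratchpad schema is not specified in this plan.
interface ScratchpadEntry {
  ts: string; // ISO 8601 timestamp
  type: "tool_call" | "hypothesis" | "decision";
  data: Record<string, unknown>;
}

// One entry becomes exactly one line, so entries can be appended without rewriting the file.
function toJsonlLine(entry: ScratchpadEntry): string {
  return JSON.stringify(entry) + "\n";
}

// Parse a whole JSONL document back into entries, skipping blank lines.
function parseJsonl(text: string): ScratchpadEntry[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as ScratchpadEntry);
}
```

Each line could then be appended to a file under `.runbook/scratchpad/` with, for example, `appendFileSync` from `node:fs`; `JSON.stringify` escapes embedded newlines, so an entry never spans more than one line.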

5. Knowledge as First-Class Citizen

Organizational runbooks and post-mortems are indexed and retrieved during investigations, not just appended as context.
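
The progress summary mentions hybrid search (FTS + vector) with RRF fusion. Reciprocal rank fusion merges several ranked lists by summing 1 / (k + rank) for each document across lists. The sketch below is a generic version of that technique with the commonly used constant k = 60; the function name and the use of plain string IDs are assumptions, and the project's retriever may weight or filter results differently.

```typescript
// Generic reciprocal rank fusion over ranked lists of document IDs (best match first).
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Example: a document ranked well by both searches beats one ranked first by only one.
const ftsResults = ["runbook-redis", "postmortem-42", "runbook-dns"];
const vectorResults = ["postmortem-42", "runbook-cache", "runbook-redis"];
const fused = reciprocalRankFusion([ftsResults, vectorResults]);
// fused is ["postmortem-42", "runbook-redis", "runbook-cache", "runbook-dns"]
```

Because only ranks are used, the full-text and vector scores never need to be normalized against each other.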

6. Multi-Cloud Ready

Provider abstraction allows adding GCP, Azure, K8s without changing core agent logic.
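
A provider abstraction of this kind could be as small as one interface plus a registry, as in the sketch below. The `CloudProvider` interface, `ProviderRegistry`, and the `StaticProvider` stand-in are hypothetical; the actual provider contract in src/providers/ is not described in this plan.

```typescript
// Hypothetical provider contract; the real interface may expose far more.
interface ResourceSummary {
  provider: string;
  kind: string;
  id: string;
}

interface CloudProvider {
  readonly name: string;
  listResources(kind: string): Promise<ResourceSummary[]>;
}

// The agent talks to the registry only, so adding a provider does not touch agent logic.
class ProviderRegistry {
  private readonly providers = new Map<string, CloudProvider>();

  register(provider: CloudProvider): void {
    this.providers.set(provider.name, provider);
  }

  names(): string[] {
    return Array.from(this.providers.keys());
  }

  // Query every registered provider in parallel and merge the results.
  async listAll(kind: string): Promise<ResourceSummary[]> {
    const perProvider = await Promise.all(
      Array.from(this.providers.values()).map((p) => p.listResources(kind)),
    );
    return ([] as ResourceSummary[]).concat(...perProvider);
  }
}

// Stand-in provider backed by a fixed list; a real one would wrap a cloud SDK.
class StaticProvider implements CloudProvider {
  readonly name: string;
  private readonly resources: ResourceSummary[];

  constructor(name: string, resources: ResourceSummary[]) {
    this.name = name;
    this.resources = resources;
  }

  async listResources(kind: string): Promise<ResourceSummary[]> {
    return this.resources.filter((r) => r.kind === kind);
  }
}
```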

Dependencies

Core

bun - Runtime
typescript - Type safety
@langchain/anthropic - LLM integration
@langchain/core - Agent primitives
zod - Schema validation

AWS

@aws-sdk/client-ec2
@aws-sdk/client-ecs
@aws-sdk/client-lambda
@aws-sdk/client-rds
@aws-sdk/client-elasticache
@aws-sdk/client-cloudwatch
@aws-sdk/client-cloudwatch-logs
@aws-sdk/client-iam

Incident Management

node-pagerduty or raw API
@slack/web-api

Knowledge

better-sqlite3 - Local storage
sqlite-vss - Vector search
openai - Embeddings
gray-matter - Frontmatter parsing
marked - Markdown parsing

CLI

ink - React for CLI
ink-spinner - Loading states
commander - Command parsing
chalk - Colors

Success Metrics

Investigation Accuracy: Root cause correctly identified in >80% of incidents
Time to Resolution: Reduce MTTR by providing faster diagnosis
Runbook Coverage: Track which incidents had matching runbooks
Knowledge Freshness: Alert on stale runbooks (>90 days without validation)
Safety: Zero unauthorized mutations, full audit trail
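
The Knowledge Freshness metric reduces to a date comparison. A minimal sketch, assuming each runbook records when it was last validated (the `RunbookRecord` shape and `findStaleRunbooks` name are hypothetical):

```typescript
// Hypothetical record; in practice the date might come from runbook frontmatter.
interface RunbookRecord {
  path: string;
  lastValidated: Date;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// A runbook is stale when more than maxAgeDays have passed since its last validation.
function findStaleRunbooks(
  runbooks: RunbookRecord[],
  now: Date,
  maxAgeDays = 90,
): RunbookRecord[] {
  return runbooks.filter(
    (r) => (now.getTime() - r.lastValidated.getTime()) / MS_PER_DAY > maxAgeDays,
  );
}
```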

Progress Summary

Phase status:

Phase 1: Project Foundation (100%)
Phase 2: Core Agent Loop (100%)
Phase 3: Hypothesis Engine (100% - causal query builder with anti-pattern detection)
Phase 4: AWS Tools (100% - 40+ services with dynamic loading)
Phase 5: Safety Layer (100% - approval flow with Slack integration)
Phase 6: Observability (100% - CloudWatch, Datadog, Prometheus integration)
Phase 7: Incident Management (100% - PagerDuty, OpsGenie, Slack complete)
Phase 8: Knowledge System (95% - FTS, vector embeddings, hybrid search)
Phase 9: Skills (100% - 7 core skills with executor and registry)
Phase 10: CLI Interface (100% - all commands implemented)

New Features:

Multi-AWS account support with assume-role and profiles
Service configuration system for targeted infrastructure scanning
Quick setup templates (ecs-rds, serverless, enterprise)
Interactive setup wizard (runbook init) with step-by-step configuration
Dynamic AWS Service System (40+ services):
- Declarative service definitions with automatic SDK loading
- Query by service ID, category, or all services
- Parallel execution with unified result formatting
- Automatic pagination handling
- Categories: compute, database, storage, networking, security, analytics, integration, devtools, ml, management
Mutation approval flow with risk classification (low/medium/high/critical)
AWS mutations: ECS scaling, EC2 start/stop/reboot, Lambda config updates
Audit trail for all approved/rejected mutations
Interactive chat interface (runbook chat) with conversation history
Datadog integration (metrics, logs, traces, monitors, events)
Skill system with 7 built-in workflows:
- investigate-incident, deploy-service, scale-service
- troubleshoot-service, rollback-deployment
- cost-analysis, security-audit
Skill executor with templating, conditions, and error handling
User-defined skills via YAML in .runbook/skills/
Causal query builder with pattern-based investigation queries
- Detects failure patterns (latency, errors, memory, CPU, etc.)
- Generates targeted queries per hypothesis
- Prevents broad data gathering with anti-pattern detection
Slack integration for incident communication:
- Post investigation updates with rich Block Kit formatting
- Post root cause identification with evidence
- Read channel/thread context for investigation
- Request approval for mutations via Slack buttons
OpsGenie integration:
- Get/list alerts and incidents
- Add investigation notes
- Acknowledge and close alerts
Prometheus integration:
- Instant and range PromQL queries
- Firing alerts monitoring
- Target health checks
- Common metric shortcuts (CPU, memory, disk, network, K8s)
Knowledge system with semantic search:
- OpenAI embeddings for vector similarity
- Hybrid search (FTS + vector) with RRF fusion
- Batch embedding with caching
- Find similar past incidents and runbooks
Complete CLI with all commands:
- runbook deploy with dry-run support
- runbook knowledge add/validate/stats
- runbook config --set for nested config values
Slack approval integration:
- Send approval requests to Slack with buttons
- Race between Slack and CLI approval
- Auto-approval for configured risk levels

GitHub: https://github.com/manthan787/RunbookAI

Next Steps:

Priority 1: Agent Reasoning System (see docs/AGENT_DESIGN.md) - 100% Complete ✅

Total: 156 tests passing across 5 test files

✅ Implement investigation state machine (triage → hypothesize → investigate → evaluate → conclude → remediate)
- src/agent/state-machine.ts with 8 phases, event emitter pattern
- src/agent/__tests__/state-machine.test.ts with 45 passing tests
- Hypothesis tree with max depth 4, max 10 hypotheses
- Evidence evaluation with branch/prune/confirm/continue actions
✅ Add structured output parsing for LLM hypothesis/evidence responses
- src/agent/llm-parser.ts with Zod schemas for all structured outputs
- src/agent/__tests__/llm-parser.test.ts with 27 passing tests
- Schemas: TriageResponse, HypothesisGeneration, EvidenceEvaluation, Conclusion, RemediationPlan, LogAnalysis
- Prompt templates for each structured output type
✅ Add log analysis with pattern extraction
- src/agent/log-analyzer.ts with pattern matching and log parsing
- src/agent/__tests__/log-analyzer.test.ts with 34 passing tests
- 11 error pattern categories (connection, memory, database, auth, K8s, etc.)
- Time range extraction, service mention detection, log filtering
✅ Integrate causal query builder into investigation loop
- src/agent/investigation-orchestrator.ts ties everything together
- src/agent/__tests__/investigation-orchestrator.test.ts with 16 passing tests
- Orchestrates: state machine, LLM parsing, log analysis, query execution
- Event emitter for progress tracking
✅ Wire hypothesis engine lifecycle (create, update, prune, confirm)
- Full lifecycle in orchestrator: generate → investigate → evaluate → branch/prune/confirm
- Automatic sub-hypothesis creation on branch
- Remediation planning from conclusion
✅ Implement conversation memory for chat mode
- src/agent/conversation-memory.ts with full memory management
- src/agent/__tests__/conversation-memory.test.ts with 34 passing tests
- Store conversation history and investigation context
- Context compression and summarization
- Reference previous findings
- Serialize/deserialize for persistence
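
The investigation state machine described above can be illustrated with a transition table and a small listener list. This sketch covers only the six phases named in the list (the plan mentions 8 phases without naming the other two), and the allowed transitions, class name, and listener signature are assumptions rather than the contents of src/agent/state-machine.ts.

```typescript
// Only the six phases named in the plan; the transition table is an assumption.
type Phase = "triage" | "hypothesize" | "investigate" | "evaluate" | "conclude" | "remediate";

const ALLOWED_TRANSITIONS: Record<Phase, Phase[]> = {
  triage: ["hypothesize"],
  hypothesize: ["investigate"],
  investigate: ["evaluate"],
  // After evaluation: branch (new hypotheses), continue (more evidence), or confirm.
  evaluate: ["hypothesize", "investigate", "conclude"],
  conclude: ["remediate"],
  remediate: [],
};

type TransitionListener = (from: Phase, to: Phase) => void;

class InvestigationStateMachine {
  private current: Phase = "triage";
  private readonly listeners: TransitionListener[] = [];

  get phase(): Phase {
    return this.current;
  }

  // Simple event-emitter style hook for progress tracking.
  onTransition(listener: TransitionListener): void {
    this.listeners.push(listener);
  }

  transition(next: Phase): void {
    if (!ALLOWED_TRANSITIONS[this.current].includes(next)) {
      throw new Error(`Invalid transition: ${this.current} -> ${next}`);
    }
    const previous = this.current;
    this.current = next;
    for (const listener of this.listeners) listener(previous, next);
  }
}
```

Rejecting invalid transitions with an exception makes skipped steps, such as jumping from triage straight to remediation, visible immediately.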

Priority 2: Infrastructure - 100% Complete ✅

Total: 241 tests passing across 8 test files (cumulative; includes the 156 tests from Priority 1)

✅ Add service dependency graph
- src/knowledge/store/graph-store.ts with full graph operations
- src/knowledge/store/__tests__/graph-store.test.ts with 31 passing tests
- Service nodes with type, team, tier, tags, and metadata
- Dependency edges with criticality and type
- Impact analysis (upstream and downstream)
- Path finding, cycle detection, and statistics
✅ Implement Slack webhook server for approval buttons
- src/webhooks/slack-webhook.ts with HTTP server for Slack interactivity
- src/webhooks/__tests__/slack-webhook.test.ts with 20 passing tests
- Slack signature verification for security
- Handle approve/reject button clicks
- Write response files for polling approval flow
- Update Slack message with approval status
- CLI command: runbook webhook to start server
- Health check endpoint at /health
- Pending approval listing and cleanup utilities
✅ Add Kubernetes integration
- src/providers/kubernetes/client.ts with kubectl wrapper
- src/providers/kubernetes/__tests__/client.test.ts with 34 passing tests
- Full K8s resource support: pods, deployments, services, nodes, events, etc.
- Pod status with container details, restarts, and node assignment
- Deployment status with replica counts and image info
- Node status with roles, conditions, capacity, and allocatable resources
- Log retrieval with tail, since, previous, and container options
- Deployment operations: scale, restart, rollback, rollout status
- Resource usage with top pods/nodes
- Cluster info and multi-context support
✅ Add AI provider configuration to init wizard
- Setup wizard now asks for LLM provider (Anthropic, OpenAI, Ollama)
- API key input with provider-specific instructions
- Saves to .runbook/config.yaml with model defaults
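
The Slack signature verification mentioned for the webhook server follows Slack's documented scheme: compute an HMAC-SHA256 over `v0:<timestamp>:<raw body>` with the app's signing secret and compare it to the `X-Slack-Signature` header, rejecting requests whose `X-Slack-Request-Timestamp` is more than about five minutes old. The sketch below is a standalone illustration, not the code in src/webhooks/slack-webhook.ts; the function name and parameter order are assumptions.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verifies a Slack request signature. All inputs are passed in explicitly:
// timestamp is the X-Slack-Request-Timestamp header, signature is X-Slack-Signature,
// and rawBody must be the unparsed request body.
function verifySlackSignature(
  signingSecret: string,
  timestamp: string,
  rawBody: string,
  signature: string,
  nowSeconds: number,
): boolean {
  // Reject stale or malformed timestamps to limit replay attacks.
  const ts = Number(timestamp);
  if (!Number.isFinite(ts) || Math.abs(nowSeconds - ts) > 60 * 5) return false;

  const baseString = `v0:${timestamp}:${rawBody}`;
  const expected =
    "v0=" + createHmac("sha256", signingSecret).update(baseString).digest("hex");

  // Constant-time comparison; lengths must match before timingSafeEqual is called.
  const expectedBytes = Buffer.from(expected, "utf8");
  const receivedBytes = Buffer.from(signature, "utf8");
  return (
    expectedBytes.length === receivedBytes.length &&
    timingSafeEqual(expectedBytes, receivedBytes)
  );
}
```

The check has to run against the raw body, before any JSON or form parsing, because re-serializing the payload can change the bytes and invalidate the signature.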

Usage:

# Quick setup
runbook init --template ecs-rds --regions us-east-1

# Interactive chat mode
runbook chat

# One-shot queries
runbook ask "what's running in prod?"
runbook ask "show me all S3 buckets and Lambda functions"

# Investigate incident
runbook investigate PD-12345

# Check status
runbook status

# Search knowledge
runbook knowledge search "redis timeout"
