Runbook: Agentic Cloud Operator & Incident Investigator

An AI-powered SRE assistant that investigates incidents, executes runbooks, and manages cloud infrastructure using a research-first, hypothesis-driven methodology.


Core Methodology

| Source | Contribution |
|---|---|
| Dexter | Research-first architecture, scratchpad audit trail, skills, graceful limits |
| Bits AI (Datadog) | Hypothesis branching, causal focus, evidence-based pruning |
| Organizational Knowledge | Runbooks, post-mortems, architecture docs, service ownership |

Investigation Flow

Incident Alert (PagerDuty/OpsGenie)
    ↓
Initial Context Gathering
    ├─ Alert metadata
    ├─ Recent deployments
    ├─ Service dependencies
    └─ Retrieved organizational knowledge
    ↓
Hypothesis Formation (3-5 initial hypotheses)
    ↓
Parallel Hypothesis Testing (targeted queries only)
    ↓
Branch (strong evidence) / Prune (no evidence)
    ↓
Recursive Investigation (max depth: 4)
    ↓
Root Cause Identification + Confidence Score
    ↓
Remediation (with approval for mutations)
    ↓
Scratchpad: Full Audit Trail

Implementation Plan

Phase 1: Project Foundation

  • Initialize project structure
  • Create PLAN.md
  • Set up TypeScript + Bun configuration
  • Set up ESLint + Prettier
  • Create base directory structure
  • Add core dependencies (Anthropic SDK, AWS SDK, etc.)

Phase 2: Core Agent Loop

  • Implement base Agent class (src/agent/agent.ts)
    • Async generator pattern for event streaming
    • Iteration loop with max iterations
    • Tool execution pipeline
  • Implement Scratchpad (src/agent/scratchpad.ts)
    • JSONL persistence
    • Tool call tracking
    • Graceful limits (warn, don't block)
    • Similar query detection
  • Implement prompt builder (src/agent/prompts.ts)
    • System prompt with tool descriptions
    • Iteration prompt with accumulated results
    • Final answer prompt
  • Implement event types (src/agent/types.ts)
    • ThinkingEvent, ToolStartEvent, ToolEndEvent, etc.
    • Investigation-specific events
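
The Phase 2 loop (async generator, iteration cap, tool pipeline) could be sketched as follows. This is a minimal illustration, not the code in src/agent/agent.ts: the event shapes, the `Step` type, and the `decide` callback (a stand-in for the LLM call) are all assumptions made for the example.

```typescript
// Hypothetical event shapes; the real ones would live in src/agent/types.ts.
type AgentEvent =
  | { type: "thinking"; iteration: number }
  | { type: "tool_start"; tool: string }
  | { type: "tool_end"; tool: string; result: string }
  | { type: "final"; answer: string };

// What the model decides on each iteration: call a tool or answer.
type Step =
  | { kind: "tool"; tool: string; input: string }
  | { kind: "answer"; text: string };

type Tool = (input: string) => Promise<string>;

async function* runAgent(
  decide: (results: string[]) => Promise<Step>, // stand-in for the LLM call
  tools: Map<string, Tool>,
  maxIterations = 10,
): AsyncGenerator<AgentEvent> {
  const results: string[] = [];
  for (let i = 1; i <= maxIterations; i++) {
    yield { type: "thinking", iteration: i };
    const step = await decide(results);
    if (step.kind === "answer") {
      yield { type: "final", answer: step.text };
      return;
    }
    yield { type: "tool_start", tool: step.tool };
    const tool = tools.get(step.tool);
    // An unknown tool becomes a result the model can react to, not a crash.
    const result = tool ? await tool(step.input) : `unknown tool: ${step.tool}`;
    results.push(result);
    yield { type: "tool_end", tool: step.tool, result };
  }
  // Iteration budget exhausted: report what was gathered instead of looping on.
  yield {
    type: "final",
    answer: `Stopped after ${maxIterations} iterations with ${results.length} tool results`,
  };
}
```

A consumer such as the CLI would iterate with `for await (const event of runAgent(...))` and render each event as it arrives.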

Phase 3: Hypothesis Engine

  • Implement Hypothesis tree (src/agent/hypothesis.ts)
    • Hypothesis interface (id, statement, evidence, children)
    • InvestigationTree class
    • Branch and prune operations
    • Tree serialization for scratchpad
  • Implement confidence scoring (src/agent/confidence.ts)
    • Evidence strength classification (strong/weak/none)
    • Multi-factor confidence calculation
    • Temporal correlation detection
  • Implement causal query builder (src/agent/causal-query.ts)
    • Hypothesis-targeted query generation
    • Anti-pattern detection (prevent broad data gathering)
    • Query prioritization by hypothesis confidence
    • Query refinement suggestions

Phase 4: Cloud Provider Tools (AWS First)

  • Implement AWS client wrapper (src/providers/aws/client.ts)
    • Credential management (assume-role, profiles)
    • Region handling (multi-region support)
    • Multi-account support
  • Dynamic AWS Service System (src/providers/aws/services.ts, executor.ts)
    • Declarative service definitions for 40+ AWS services
    • Dynamic SDK client loading (lazy imports)
    • Automatic pagination handling
    • Unified resource formatting
    • Services by category: compute, database, storage, networking, security, analytics, integration, devtools, ml, management
  • Implement AWS query meta-router (src/tools/registry.ts - aws_query)
    • Natural language to AWS API routing
    • Query by service ID or category
    • Parallel multi-service queries
    • Result aggregation
  • Supported AWS Services (40+):
    • Compute: EC2, ECS, EKS, Lambda, Lightsail, App Runner, Amplify, Batch, ECR
    • Database: RDS, DynamoDB, ElastiCache, DocumentDB, Neptune, Redshift, MemoryDB
    • Storage: S3, EFS, FSx, Backup
    • Networking: VPC, ELB, CloudFront, Route 53, API Gateway, API Gateway V2
    • Security: IAM, Secrets Manager, KMS, ACM, WAF
    • Integration: SQS, SNS, EventBridge, Step Functions, Kinesis
    • Management: CloudWatch, CloudWatch Logs, SSM, CloudFormation
    • DevTools: CodePipeline, CodeBuild, CodeCommit
    • Analytics: Athena, Glue, OpenSearch
    • ML: SageMaker, Bedrock, Comprehend
  • Implement AWS mutation tool (src/tools/registry.ts - aws_mutate)
    • Approval flow integration
    • Rollback command display
    • Risk classification
    • Supported: ECS UpdateService, EC2 RebootInstances/StartInstances/StopInstances, Lambda UpdateFunctionConfiguration

Phase 5: Safety & Approval System

  • Implement safety layer (src/agent/safety.ts)
    • Operation risk classification (read/low/high/critical)
    • Mutation limits per session
    • Cooldown between high-risk operations
  • Implement approval flow (src/agent/approval.ts)
    • CLI confirmation prompts with risk display
    • Risk-based approval (critical ops require typing 'yes')
    • Cooldown enforcement for critical mutations
    • Audit logging to .runbook/audit/approvals.jsonl
    • Slack approval integration
      • Send approval requests to Slack with buttons
      • Race between Slack and CLI approval
      • Configurable timeout
      • Auto-approval for specified risk levels
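
How the safety layer's risk classification, per-session mutation limit, and cooldown could combine is sketched below. The class name, method names, and `Decision` shape are hypothetical; the risk levels come from the list above, and mapping the config's `high_risk` entry to `"high"` is an assumption.

```typescript
type Risk = "read" | "low" | "high" | "critical";

interface SafetyConfig {
  requireApproval: Risk[];           // e.g. ["high", "critical"]
  maxMutationsPerSession: number;    // e.g. 5
  cooldownBetweenCriticalMs: number; // e.g. 60000
}

type Decision =
  | { allowed: true; needsApproval: boolean }
  | { allowed: false; reason: string };

class SafetyGate {
  private mutations = 0;
  private lastCriticalAt: number | null = null;
  private readonly config: SafetyConfig;

  constructor(config: SafetyConfig) {
    this.config = config;
  }

  // Decide whether an operation may proceed and whether a human must approve it.
  check(risk: Risk, now: number = Date.now()): Decision {
    if (risk === "read") return { allowed: true, needsApproval: false };
    if (this.mutations >= this.config.maxMutationsPerSession) {
      return { allowed: false, reason: "session mutation limit reached" };
    }
    if (risk === "critical" && this.lastCriticalAt !== null) {
      const waitMs = this.config.cooldownBetweenCriticalMs - (now - this.lastCriticalAt);
      if (waitMs > 0) return { allowed: false, reason: `cooldown: retry in ${waitMs} ms` };
    }
    return { allowed: true, needsApproval: this.config.requireApproval.includes(risk) };
  }

  // Call only after an approved mutation has actually run.
  record(risk: Risk, now: number = Date.now()): void {
    if (risk === "read") return;
    this.mutations++;
    if (risk === "critical") this.lastCriticalAt = now;
  }
}
```

Passing `now` explicitly keeps the cooldown logic testable without waiting on a real clock.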

Phase 6: Observability Tools

  • Implement CloudWatch tools (src/tools/aws/cloudwatch.ts)
    • Log filtering and search
    • Alarm status
    • Log group listing
  • Implement Datadog tools (src/tools/observability/datadog.ts)
    • Metric queries
    • Log search
    • APM trace search
    • Monitor/alert status
    • Events timeline
    • Service catalog
  • Implement generic metrics interface
    • Prometheus support
      • Instant and range queries
      • Firing alerts
      • Target health monitoring
      • Common metric shortcuts
    • Custom metrics endpoints

Phase 7: Incident Management Integration

  • Implement PagerDuty tools (src/tools/incident/pagerduty.ts)
    • Get incident details
    • List incidents with filters
    • Get alerts for incident
    • Get service configuration
    • Add investigation notes
    • Acknowledge/resolve incidents
  • Implement OpsGenie tools (src/tools/incident/opsgenie.ts)
    • Get alert details
    • List alerts with filters
    • Get incident details
    • List incidents
    • Add notes to alerts
    • Acknowledge/close alerts
  • Implement Slack integration (src/tools/incident/slack.ts)
    • Post investigation updates with rich formatting
    • Post root cause identification
    • Read channel/thread messages
    • Send simple messages
    • Request mutation approval via Slack
    • Handle approval button interactions (requires webhook server)

Phase 8: Knowledge System

  • Implement knowledge types (src/knowledge/types.ts)
    • KnowledgeDocument, KnowledgeChunk interfaces
    • Source configurations
  • Implement filesystem source (src/knowledge/sources/filesystem.ts)
    • Markdown parsing with frontmatter
    • YAML support
    • File watching for hot reload
  • Implement source dispatcher (src/knowledge/sources/index.ts)
    • Unified loadFromSource entry point
    • Routes to filesystem, confluence, or google_drive loaders
  • Implement Confluence source (src/knowledge/sources/confluence.ts)
    • REST API v2 (with v1 fallback)
    • Label filtering for runbooks/postmortems
    • HTML to markdown conversion
    • Incremental sync via lastSyncTime
    • Metadata extraction from labels
  • Implement Google Drive source (src/knowledge/sources/google-drive.ts)
    • OAuth2 authentication flow (google-auth.ts)
    • Google Docs export to plain text
    • Google Sheets export to markdown tables
    • Subfolder traversal
    • Incremental sync via modifiedTime
    • Metadata from file properties
  • Implement chunker (in src/knowledge/sources/filesystem.ts)
    • Markdown-aware chunking by sections
    • Section title preservation
    • Chunk type inference (context, procedure, command, etc.)
  • Implement SQLite store (src/knowledge/store/sqlite.ts)
    • FTS5 full-text search
    • Document and chunk storage
    • Type and service filtering
  • Implement knowledge retriever (src/knowledge/retriever/index.ts)
    • Sync from filesystem sources
    • Search with type/service filters
    • Organized results by knowledge type
  • Implement embedder (src/knowledge/indexer/embedder.ts)
    • OpenAI embeddings integration (text-embedding-3-small)
    • Batch processing for efficiency
    • In-memory caching
    • Cost estimation
  • Implement vector store (src/knowledge/store/vector-store.ts)
    • SQLite storage for embeddings
    • Cosine similarity search
    • Type and service filtering
  • Implement hybrid retriever (src/knowledge/retriever/hybrid-search.ts)
    • Combines FTS and vector search
    • Reciprocal Rank Fusion (RRF) for merging
    • Configurable weights for each approach
  • Implement service graph (src/knowledge/store/graph-store.ts)
    • Service nodes and edges
    • Dependency traversal (upstream/downstream impact)
    • Ownership lookup (by team, owner)
    • Service filtering (by type, tier, tag, team)
    • Path finding and cycle detection
  • Implement reranker (src/knowledge/retriever/reranker.ts)
    • LLM-based relevance scoring
    • Hypothesis-aware ranking
  • Implement context builder (src/knowledge/retriever/context-builder.ts)
    • Assemble retrieved knowledge for prompts
    • Token budget management
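
For the hybrid retriever, Reciprocal Rank Fusion scores each document as the sum, over every ranking it appears in, of weight / (k + rank). The sketch below shows that calculation; the function name and signature are illustrative, and k = 60 is a commonly used default rather than a value taken from this project.

```typescript
interface FusedResult {
  id: string;
  score: number;
}

// rankings: one list of document IDs per retriever, best match first.
// weights: optional per-retriever weight, aligned with rankings by index.
function reciprocalRankFusion(
  rankings: string[][],
  weights: number[] = rankings.map(() => 1),
  k = 60,
): FusedResult[] {
  const scores = new Map<string, number>();
  rankings.forEach((ranking, i) => {
    const weight = weights[i] ?? 1;
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank));
    });
  });
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}
```

Because only ranks are used, the FTS5 scores and cosine similarities never need to be normalized onto a common scale before merging.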

Phase 9: Skills System

  • Implement skill types (src/skills/types.ts)
    • SkillDefinition, SkillStep, SkillParameter interfaces
    • Execution context and result types
  • Implement skill registry (src/skills/registry.ts)
    • Built-in skill registration
    • User skill loading from .runbook/skills/
    • Skill lookup by ID, tag, or service
  • Implement skill executor (src/skills/executor.ts)
    • Step-by-step execution
    • Parameter substitution with templates
    • Conditional step execution
    • Error handling (continue/abort/retry)
    • Approval flow integration
  • Create core skills (src/skills/builtin/)
    • investigate-incident - Hypothesis-driven investigation
    • deploy-service - Safe deployment with pre/post checks
    • scale-service - Capacity planning and scaling
    • troubleshoot-service - Diagnose and fix issues
    • rollback-deployment - Quick and safe rollback
    • cost-analysis - Spending analysis and optimization
    • security-audit - IAM and security review
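
The executor's conditional steps and error handling (continue/abort/retry) could work along these lines. This is a synchronous sketch for brevity, whereas real steps would call tools asynchronously; the `SkillStep` fields shown here (`run`, `when`, `onError`, `maxRetries`) are assumptions and may not match src/skills/types.ts.

```typescript
type OnError = "continue" | "abort" | "retry";
type Context = Record<string, unknown>;

interface SkillStep {
  id: string;
  run: (ctx: Context) => unknown;   // hypothetical: a real step would invoke a tool
  when?: (ctx: Context) => boolean; // step is skipped when this returns false
  onError?: OnError;                // defaults to "abort" in this sketch
  maxRetries?: number;              // extra attempts when onError is "retry"
}

interface StepResult {
  id: string;
  status: "ok" | "skipped" | "failed";
  attempts: number;
}

function executeSkill(steps: SkillStep[], ctx: Context = {}): StepResult[] {
  const results: StepResult[] = [];
  for (const step of steps) {
    if (step.when && !step.when(ctx)) {
      results.push({ id: step.id, status: "skipped", attempts: 0 });
      continue;
    }
    const policy = step.onError ?? "abort";
    const maxAttempts = policy === "retry" ? 1 + (step.maxRetries ?? 2) : 1;
    let attempts = 0;
    let ok = false;
    while (attempts < maxAttempts && !ok) {
      attempts++;
      try {
        ctx[step.id] = step.run(ctx); // later steps can read earlier outputs
        ok = true;
      } catch {
        // Failure is handled after the loop according to the step's policy.
      }
    }
    results.push({ id: step.id, status: ok ? "ok" : "failed", attempts });
    // "continue" moves on; "abort" and exhausted retries stop the skill.
    if (!ok && policy !== "continue") break;
  }
  return results;
}
```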

Phase 10: CLI Interface

  • Implement CLI entry point (src/cli.tsx)
    • Ink-based React CLI
    • Command parsing
    • Configuration loading
  • Implement core commands
    • runbook investigate <incident-id> - Investigate incident
    • runbook ask <query> - Natural language cloud queries
    • runbook chat - Interactive conversation mode
    • runbook deploy <service> - Deploy workflow with dry-run option
    • runbook status - Current infrastructure status
  • Implement knowledge commands
    • runbook knowledge sync - Sync from sources
    • runbook knowledge search <query> - Search knowledge base with filters
    • runbook knowledge add <file> - Add local knowledge
    • runbook knowledge validate - Check for stale content
    • runbook knowledge stats - Show knowledge base statistics
    • runbook knowledge auth google - OAuth2 flow for Google Drive
  • Implement config commands
    • runbook init - Interactive setup wizard with step-by-step configuration
    • runbook config - Show current configuration
    • runbook config --set key=value - Set config values (supports nested keys)
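
One way `runbook config --set key=value` could handle nested keys is shown below. The dot-separated key syntax (for example `agent.max_iterations=20`) and the scalar coercion rules are assumptions for illustration; the plan only states that nested keys are supported.

```typescript
type ConfigTree = { [key: string]: ConfigValue };
type ConfigValue = string | number | boolean | ConfigTree;

// Coerce "true"/"false" and numeric strings; everything else stays a string.
function parseScalar(raw: string): string | number | boolean {
  if (raw === "true") return true;
  if (raw === "false") return false;
  const n = Number(raw);
  return raw.trim() !== "" && Number.isFinite(n) ? n : raw;
}

// Apply one "a.b.c=value" assignment, creating intermediate objects as needed.
function setConfigValue(config: ConfigTree, assignment: string): void {
  const eq = assignment.indexOf("=");
  if (eq <= 0) throw new Error(`expected key=value, got: ${assignment}`);
  const keys = assignment.slice(0, eq).split(".");
  const last = keys.pop();
  if (last === undefined || last === "") throw new Error(`invalid key in: ${assignment}`);
  let node = config;
  for (const key of keys) {
    const next = node[key];
    if (typeof next === "object") {
      node = next;
    } else {
      // Missing or scalar intermediate: replace it with a fresh object.
      const created: ConfigTree = {};
      node[key] = created;
      node = created;
    }
  }
  node[last] = parseScalar(assignment.slice(eq + 1));
}
```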

Phase 11: Learning & Suggestions

  • Implement learning module (src/knowledge/learning/)
    • Post-investigation analysis
    • Runbook suggestion generation
    • Known issue detection (recurring patterns)
  • Implement knowledge update suggestions
    • New runbook drafts
    • Runbook update patches
    • Post-mortem drafts

Phase 12: Multi-Cloud Expansion (Future)

  • GCP provider (src/providers/gcp/)
  • Azure provider (src/providers/azure/)
  • Kubernetes provider (src/providers/kubernetes/)
  • Terraform integration (src/providers/terraform/)

Project Structure

runbook/
├── src/
│   ├── agent/
│   │   ├── agent.ts              # Main agent loop
│   │   ├── hypothesis.ts         # Hypothesis tree management
│   │   ├── confidence.ts         # Evidence scoring
│   │   ├── prompts.ts            # Prompt templates
│   │   ├── scratchpad.ts         # Audit trail
│   │   ├── safety.ts             # Mutation controls
│   │   └── types.ts              # Event types
│   ├── providers/
│   │   ├── aws/
│   │   │   ├── client.ts         # AWS SDK wrapper
│   │   │   └── tools/            # EC2, ECS, Lambda, etc.
│   │   ├── gcp/                  # Future
│   │   └── kubernetes/           # Future
│   ├── tools/
│   │   ├── registry.ts           # Tool registration
│   │   ├── skill.ts              # Skill invocation
│   │   ├── aws/
│   │   │   ├── aws-query.ts      # Read-only meta-router
│   │   │   └── aws-mutate.ts     # State changes
│   │   ├── observability/
│   │   │   ├── causal-query.ts   # Hypothesis-targeted queries
│   │   │   ├── cloudwatch.ts
│   │   │   └── datadog.ts
│   │   └── incident/
│   │       ├── pagerduty.ts
│   │       ├── opsgenie.ts
│   │       └── slack.ts
│   ├── knowledge/
│   │   ├── types.ts
│   │   ├── sources/
│   │   │   ├── filesystem.ts
│   │   │   ├── confluence.ts     # Future
│   │   │   └── github.ts         # Future
│   │   ├── indexer/
│   │   │   ├── chunker.ts
│   │   │   ├── embedder.ts
│   │   │   └── metadata.ts
│   │   ├── store/
│   │   │   ├── vector-store.ts
│   │   │   ├── graph-store.ts
│   │   │   └── sqlite.ts
│   │   ├── retriever/
│   │   │   ├── hybrid-search.ts
│   │   │   ├── reranker.ts
│   │   │   └── context-builder.ts
│   │   └── learning/
│   │       ├── suggest-updates.ts
│   │       └── auto-enrich.ts
│   ├── skills/
│   │   ├── registry.ts
│   │   ├── investigate-incident/
│   │   │   └── SKILL.md
│   │   ├── deploy-service/
│   │   │   └── SKILL.md
│   │   ├── scale-service/
│   │   │   └── SKILL.md
│   │   ├── troubleshoot-service/
│   │   │   └── SKILL.md
│   │   └── cost-analysis/
│   │       └── SKILL.md
│   ├── model/
│   │   └── llm.ts                # LLM client with caching
│   ├── hooks/
│   │   └── useAgentRunner.ts     # React hook for CLI
│   ├── utils/
│   │   ├── tokens.ts             # Token counting
│   │   └── config.ts             # Configuration loading
│   └── cli.tsx                   # CLI entry point
├── .runbook/                     # User configuration (gitignored)
│   ├── config.yaml
│   ├── runbooks/                 # Local runbooks
│   ├── knowledge.db              # SQLite + vectors
│   ├── service-graph.json
│   ├── scratchpad/               # Investigation logs
│   └── investigations/           # Investigation trees
├── examples/
│   └── runbooks/                 # Example runbooks
├── package.json
├── tsconfig.json
├── bunfig.toml
├── PLAN.md                       # This file
└── README.md

Configuration Schema

.runbook/config.yaml

# LLM Configuration
llm:
  provider: anthropic  # anthropic | openai
  model: claude-sonnet-4-20250514
  api_key: ${ANTHROPIC_API_KEY}

# Cloud Providers
providers:
  aws:
    enabled: true
    regions: [us-east-1, us-west-2]
    profile: default  # AWS profile or use env vars

# Incident Management
incident:
  pagerduty:
    enabled: true
    api_key: ${PAGERDUTY_API_KEY}
  opsgenie:
    enabled: false
  slack:
    enabled: true
    bot_token: ${SLACK_BOT_TOKEN}

# Knowledge Sources
knowledge:
  sources:
    - type: filesystem
      path: .runbook/runbooks/
      watch: true
    - type: filesystem
      path: ~/.runbook/knowledge/

  store:
    type: local
    path: .runbook/knowledge.db
    embedding_model: text-embedding-3-small

  retrieval:
    top_k: 10
    rerank: true

# Safety
safety:
  require_approval:
    - high_risk
    - critical
  max_mutations_per_session: 5
  cooldown_between_critical_ms: 60000

# Agent
agent:
  max_iterations: 10
  max_hypothesis_depth: 4
  context_threshold_tokens: 100000
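
The `${ANTHROPIC_API_KEY}`-style placeholders in the schema above imply an expansion step after the YAML is parsed. A minimal sketch of that step follows; the function names are hypothetical, and failing on an unset variable (rather than substituting an empty string) is a design assumption.

```typescript
type Env = Record<string, string | undefined>;

// Replace every ${NAME} in a string with env[NAME]; fail loudly when unset.
function expandEnv(value: string, env: Env): string {
  return value.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_match: string, name: string) => {
    const resolved = env[name];
    if (resolved === undefined) {
      throw new Error(`environment variable ${name} is not set`);
    }
    return resolved;
  });
}

// Walk a parsed config (objects, arrays, scalars) and expand every string.
function expandConfig(node: unknown, env: Env): unknown {
  if (typeof node === "string") return expandEnv(node, env);
  if (Array.isArray(node)) return node.map((item) => expandConfig(item, env));
  if (node !== null && typeof node === "object") {
    const source = node as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) out[key] = expandConfig(source[key], env);
    return out;
  }
  return node; // numbers, booleans, null pass through unchanged
}
```

In practice the `env` argument would be the process environment; taking it as a parameter keeps secrets out of tests.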

Key Design Decisions

1. Hypothesis-Driven Investigation

Rather than gathering all available data, we form hypotheses and test them with targeted queries. This reduces noise and focuses on causal relationships.

2. Graceful Limits

Tool-call limits warn but never block: the agent can always proceed, and the warnings steer it away from retry loops.
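
A sketch of "warn, don't block" combined with similar-query detection is given below. The class name, the soft limit of 3, and the use of Jaccard token overlap with a 0.8 threshold are all assumptions; the scratchpad's actual similarity measure is not specified in this plan.

```typescript
// Token-set overlap between two queries, from 0 (disjoint) to 1 (identical sets).
function jaccard(a: string, b: string): number {
  const tokens = (s: string): Set<string> =>
    new Set(s.toLowerCase().split(/\s+/).filter((t) => t.length > 0));
  const ta = tokens(a);
  const tb = tokens(b);
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  const union = ta.size + tb.size - shared;
  return union === 0 ? 1 : shared / union;
}

class ToolCallTracker {
  private readonly calls = new Map<string, string[]>(); // tool name -> past queries
  private readonly softLimit: number;

  constructor(softLimit = 3) {
    this.softLimit = softLimit;
  }

  // Records the call and returns warnings for the next prompt.
  // It never refuses the call: the agent decides what to do with the warnings.
  record(tool: string, query: string): string[] {
    const past = this.calls.get(tool) ?? [];
    const warnings: string[] = [];
    if (past.length >= this.softLimit) {
      warnings.push(
        `${tool} has already been called ${past.length} times (soft limit ${this.softLimit})`,
      );
    }
    const similar = past.find((p) => jaccard(p, query) >= 0.8);
    if (similar !== undefined) {
      warnings.push(`a similar query was already run: "${similar}"`);
    }
    this.calls.set(tool, [...past, query]);
    return warnings;
  }
}
```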

3. Research-First for Mutations

All state-changing operations require prior research to understand current state and impact.

4. Full Audit Trail

Every tool call, hypothesis, and decision is logged to JSONL for compliance and debugging.

5. Knowledge as First-Class Citizen

Organizational runbooks and post-mortems are indexed and retrieved during investigations, not just appended as context.

6. Multi-Cloud Ready

The provider abstraction allows adding GCP, Azure, and Kubernetes without changing core agent logic.


Dependencies

Core

  • bun - Runtime
  • typescript - Type safety
  • @langchain/anthropic - LLM integration
  • @langchain/core - Agent primitives
  • zod - Schema validation

AWS

  • @aws-sdk/client-ec2
  • @aws-sdk/client-ecs
  • @aws-sdk/client-lambda
  • @aws-sdk/client-rds
  • @aws-sdk/client-elasticache
  • @aws-sdk/client-cloudwatch
  • @aws-sdk/client-cloudwatch-logs
  • @aws-sdk/client-iam

Incident Management

  • node-pagerduty or raw API
  • @slack/web-api

Knowledge

  • better-sqlite3 - Local storage
  • sqlite-vss - Vector search
  • openai - Embeddings
  • gray-matter - Frontmatter parsing
  • marked - Markdown parsing

CLI

  • ink - React for CLI
  • ink-spinner - Loading states
  • commander - Command parsing
  • chalk - Colors

Success Metrics

  1. Investigation Accuracy: Root cause correctly identified in >80% of incidents
  2. Time to Resolution: Reduce MTTR by providing faster diagnosis
  3. Runbook Coverage: Track which incidents had matching runbooks
  4. Knowledge Freshness: Alert on stale runbooks (>90 days without validation)
  5. Safety: Zero unauthorized mutations, full audit trail

Progress Summary

Completed:

  • Phase 1: Project Foundation (100%)
  • Phase 2: Core Agent Loop (100%)
  • Phase 3: Hypothesis Engine (100% - causal query builder with anti-pattern detection)
  • Phase 4: AWS Tools (100% - 40+ services with dynamic loading)
  • Phase 5: Safety Layer (100% - approval flow with Slack integration)
  • Phase 6: Observability (100% - CloudWatch, Datadog, Prometheus integration)
  • Phase 7: Incident Management (100% - PagerDuty, OpsGenie, Slack complete)
  • Phase 8: Knowledge System (95% - FTS, vector embeddings, hybrid search)
  • Phase 9: Skills (100% - 7 core skills with executor and registry)
  • Phase 10: CLI Interface (100% - all commands implemented)

New Features:

  • Multi-AWS account support with assume-role and profiles
  • Service configuration system for targeted infrastructure scanning
  • Quick setup templates (ecs-rds, serverless, enterprise)
  • Interactive setup wizard (runbook init) with step-by-step configuration
  • Dynamic AWS Service System (40+ services):
    • Declarative service definitions with automatic SDK loading
    • Query by service ID, category, or all services
    • Parallel execution with unified result formatting
    • Automatic pagination handling
    • Categories: compute, database, storage, networking, security, analytics, integration, devtools, ml, management
  • Mutation approval flow with risk classification (low/medium/high/critical)
  • AWS mutations: ECS scaling, EC2 start/stop/reboot, Lambda config updates
  • Audit trail for all approved/rejected mutations
  • Interactive chat interface (runbook chat) with conversation history
  • Datadog integration (metrics, logs, traces, monitors, events)
  • Skill system with 7 built-in workflows:
    • investigate-incident, deploy-service, scale-service
    • troubleshoot-service, rollback-deployment
    • cost-analysis, security-audit
  • Skill executor with templating, conditions, and error handling
  • User-defined skills via YAML in .runbook/skills/
  • Causal query builder with pattern-based investigation queries
    • Detects failure patterns (latency, errors, memory, CPU, etc.)
    • Generates targeted queries per hypothesis
    • Prevents broad data gathering with anti-pattern detection
  • Slack integration for incident communication:
    • Post investigation updates with rich Block Kit formatting
    • Post root cause identification with evidence
    • Read channel/thread context for investigation
    • Request approval for mutations via Slack buttons
  • OpsGenie integration:
    • Get/list alerts and incidents
    • Add investigation notes
    • Acknowledge and close alerts
  • Prometheus integration:
    • Instant and range PromQL queries
    • Firing alerts monitoring
    • Target health checks
    • Common metric shortcuts (CPU, memory, disk, network, K8s)
  • Knowledge system with semantic search:
    • OpenAI embeddings for vector similarity
    • Hybrid search (FTS + vector) with RRF fusion
    • Batch embedding with caching
    • Find similar past incidents and runbooks
  • Complete CLI with all commands:
    • runbook deploy with dry-run support
    • runbook knowledge add/validate/stats
    • runbook config --set for nested config values
  • Slack approval integration:
    • Send approval requests to Slack with buttons
    • Race between Slack and CLI approval
    • Auto-approval for configured risk levels

GitHub: https://github.com/manthan787/RunbookAI

Next Steps:

Priority 1: Agent Reasoning System (see docs/AGENT_DESIGN.md) - 100% Complete ✅

Total: 156 tests passing across 5 test files

  1. ✅ Implement investigation state machine (triage → hypothesize → investigate → evaluate → conclude → remediate)

    • src/agent/state-machine.ts with 8 phases, event emitter pattern
    • src/agent/__tests__/state-machine.test.ts with 45 passing tests
    • Hypothesis tree with max depth 4, max 10 hypotheses
    • Evidence evaluation with branch/prune/confirm/continue actions
  2. ✅ Add structured output parsing for LLM hypothesis/evidence responses

    • src/agent/llm-parser.ts with Zod schemas for all structured outputs
    • src/agent/__tests__/llm-parser.test.ts with 27 passing tests
    • Schemas: TriageResponse, HypothesisGeneration, EvidenceEvaluation, Conclusion, RemediationPlan, LogAnalysis
    • Prompt templates for each structured output type
  3. ✅ Add log analysis with pattern extraction

    • src/agent/log-analyzer.ts with pattern matching and log parsing
    • src/agent/__tests__/log-analyzer.test.ts with 34 passing tests
    • 11 error pattern categories (connection, memory, database, auth, K8s, etc.)
    • Time range extraction, service mention detection, log filtering
  4. ✅ Integrate causal query builder into investigation loop

    • src/agent/investigation-orchestrator.ts ties everything together
    • src/agent/__tests__/investigation-orchestrator.test.ts with 16 passing tests
    • Orchestrates: state machine, LLM parsing, log analysis, query execution
    • Event emitter for progress tracking
  5. ✅ Wire hypothesis engine lifecycle (create, update, prune, confirm)

    • Full lifecycle in orchestrator: generate → investigate → evaluate → branch/prune/confirm
    • Automatic sub-hypothesis creation on branch
    • Remediation planning from conclusion
  6. ✅ Implement conversation memory for chat mode

    • src/agent/conversation-memory.ts with full memory management
    • src/agent/__tests__/conversation-memory.test.ts with 34 passing tests
    • Store conversation history and investigation context
    • Context compression and summarization
    • Reference previous findings
    • Serialize/deserialize for persistence

Priority 2: Infrastructure - 100% Complete ✅

Total: 241 tests passing across 8 test files

  1. ✅ Add service dependency graph

    • src/knowledge/store/graph-store.ts with full graph operations
    • src/knowledge/store/__tests__/graph-store.test.ts with 31 passing tests
    • Service nodes with type, team, tier, tags, and metadata
    • Dependency edges with criticality and type
    • Impact analysis (upstream and downstream)
    • Path finding, cycle detection, and statistics
  2. ✅ Implement Slack webhook server for approval buttons

    • src/webhooks/slack-webhook.ts with HTTP server for Slack interactivity
    • src/webhooks/__tests__/slack-webhook.test.ts with 20 passing tests
    • Slack signature verification for security
    • Handle approve/reject button clicks
    • Write response files for polling approval flow
    • Update Slack message with approval status
    • CLI command: runbook webhook to start server
    • Health check endpoint at /health
    • Pending approval listing and cleanup utilities
  3. ✅ Add Kubernetes integration

    • src/providers/kubernetes/client.ts with kubectl wrapper
    • src/providers/kubernetes/__tests__/client.test.ts with 34 passing tests
    • Full K8s resource support: pods, deployments, services, nodes, events, etc.
    • Pod status with container details, restarts, and node assignment
    • Deployment status with replica counts and image info
    • Node status with roles, conditions, capacity, and allocatable resources
    • Log retrieval with tail, since, previous, and container options
    • Deployment operations: scale, restart, rollback, rollout status
    • Resource usage with top pods/nodes
    • Cluster info and multi-context support
  4. ✅ Add AI provider configuration to init wizard

    • Setup wizard now asks for LLM provider (Anthropic, OpenAI, Ollama)
    • API key input with provider-specific instructions
    • Saves to .runbook/config.yaml with model defaults
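
The Slack signature verification listed under the webhook server (item 2 above) follows Slack's documented signing scheme: an HMAC-SHA256 over `v0:<timestamp>:<raw body>` keyed with the app's signing secret, compared against the `X-Slack-Signature` header, with the timestamp taken from `X-Slack-Request-Timestamp`. The sketch below assumes a Node-compatible runtime for `node:crypto` and `Buffer`; the function name and parameter order are illustrative and not taken from src/webhooks/slack-webhook.ts.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const MAX_SKEW_SECONDS = 300; // reject requests older than five minutes

function verifySlackSignature(
  signingSecret: string,
  timestamp: string, // X-Slack-Request-Timestamp header
  rawBody: string,   // the unparsed request body, exactly as received
  signature: string, // X-Slack-Signature header, e.g. "v0=abc123..."
  nowSeconds: number = Math.floor(Date.now() / 1000),
): boolean {
  // A stale or malformed timestamp is rejected to limit replay attacks.
  const ts = Number(timestamp);
  if (!Number.isFinite(ts) || Math.abs(nowSeconds - ts) > MAX_SKEW_SECONDS) {
    return false;
  }
  const expected =
    "v0=" +
    createHmac("sha256", signingSecret)
      .update(`v0:${timestamp}:${rawBody}`)
      .digest("hex");
  const a = Buffer.from(expected, "utf8");
  const b = Buffer.from(signature, "utf8");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

The check has to run against the raw body before any JSON or form parsing, since re-serializing the payload would change the bytes being signed.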

Usage:

# Quick setup
runbook init --template ecs-rds --regions us-east-1

# Interactive chat mode
runbook chat

# One-shot queries
runbook ask "what's running in prod?"
runbook ask "show me all S3 buckets and Lambda functions"

# Investigate incident
runbook investigate PD-12345

# Check status
runbook status

# Search knowledge
runbook knowledge search "redis timeout"