An AI-powered SRE assistant that investigates incidents, executes runbooks, and manages cloud infrastructure using a research-first, hypothesis-driven methodology.
| Source | Contribution |
|---|---|
| Dexter | Research-first architecture, scratchpad audit trail, skills, graceful limits |
| Bits AI (Datadog) | Hypothesis branching, causal focus, evidence-based pruning |
| Organizational Knowledge | Runbooks, post-mortems, architecture docs, service ownership |
Incident Alert (PagerDuty/OpsGenie)
↓
Initial Context Gathering
├─ Alert metadata
├─ Recent deployments
├─ Service dependencies
└─ Retrieved organizational knowledge
↓
Hypothesis Formation (3-5 initial hypotheses)
↓
Parallel Hypothesis Testing (targeted queries only)
↓
Branch (strong evidence) / Prune (no evidence)
↓
Recursive Investigation (max depth: 4)
↓
Root Cause Identification + Confidence Score
↓
Remediation (with approval for mutations)
↓
Scratchpad: Full Audit Trail
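The branch and prune steps in the flow above can be sketched as a small tree structure. This is a minimal illustration, not the project's implementation: the field names, the `applyEvidence` method, and the choice to start root hypotheses at depth 1 are all assumptions.

```typescript
// Hypothetical types for illustration; the project's real interfaces may differ.
type EvidenceStrength = "strong" | "weak" | "none";
type HypothesisStatus = "open" | "pruned";

interface Hypothesis {
  id: string;
  statement: string;
  depth: number; // assumption: root hypotheses start at depth 1
  status: HypothesisStatus;
  children: Hypothesis[];
}

const MAX_DEPTH = 4; // mirrors "max depth: 4" in the flow above

class InvestigationTree {
  readonly roots: Hypothesis[] = [];
  private nextId = 1;

  addRoot(statement: string): Hypothesis {
    const root = this.create(statement, 1);
    this.roots.push(root);
    return root;
  }

  // Strong evidence branches into sub-hypotheses (up to MAX_DEPTH),
  // no evidence prunes the node, weak evidence leaves it open.
  applyEvidence(
    node: Hypothesis,
    strength: EvidenceStrength,
    subStatements: string[] = [],
  ): Hypothesis[] {
    if (strength === "none") {
      node.status = "pruned";
      return [];
    }
    if (strength === "weak" || node.depth >= MAX_DEPTH) return [];
    const children = subStatements.map((s) => this.create(s, node.depth + 1));
    node.children.push(...children);
    return children;
  }

  private create(statement: string, depth: number): Hypothesis {
    return { id: `h${this.nextId++}`, statement, depth, status: "open", children: [] };
  }
}
```

Capping depth inside `applyEvidence` keeps the recursion bounded regardless of how the caller drives the investigation.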
- Initialize project structure
- Create PLAN.md
- Set up TypeScript + Bun configuration
- Set up ESLint + Prettier
- Create base directory structure
- Add core dependencies (Anthropic SDK, AWS SDK, etc.)
- Implement base Agent class (`src/agent/agent.ts`)
  - Async generator pattern for event streaming
  - Iteration loop with max iterations
  - Tool execution pipeline
- Implement Scratchpad (`src/agent/scratchpad.ts`)
  - JSONL persistence
  - Tool call tracking
  - Graceful limits (warn, don't block)
  - Similar query detection
- Implement prompt builder (`src/agent/prompts.ts`)
  - System prompt with tool descriptions
  - Iteration prompt with accumulated results
  - Final answer prompt
- Implement event types (`src/agent/types.ts`)
  - ThinkingEvent, ToolStartEvent, ToolEndEvent, etc.
  - Investigation-specific events
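The async generator pattern with a bounded iteration loop could look roughly like the sketch below. The event shapes, the `Planner` type, and `runAgent` are hypothetical stand-ins; in particular, `planNext` takes the place of the LLM call that decides the next step.

```typescript
// Hypothetical event union, loosely modeled on the event types listed above.
type AgentEvent =
  | { type: "thinking"; iteration: number }
  | { type: "tool_start"; tool: string }
  | { type: "tool_end"; tool: string; result: string }
  | { type: "done"; answer: string };

type Tool = (input: string) => Promise<string>;

// Stand-in for the LLM: returns either a tool call or a final answer.
type Planner = (history: string[]) => { tool: string; input: string } | { answer: string };

async function* runAgent(
  query: string,
  tools: Record<string, Tool>,
  planNext: Planner,
  maxIterations = 10,
): AsyncGenerator<AgentEvent> {
  const history: string[] = [query];
  for (let i = 1; i <= maxIterations; i++) {
    yield { type: "thinking", iteration: i };
    const step = planNext(history);
    if ("answer" in step) {
      yield { type: "done", answer: step.answer };
      return;
    }
    yield { type: "tool_start", tool: step.tool };
    const tool = tools[step.tool];
    const result = tool ? await tool(step.input) : `unknown tool: ${step.tool}`;
    history.push(result);
    yield { type: "tool_end", tool: step.tool, result };
  }
  // Iteration budget exhausted: finish with what has been gathered so far.
  yield { type: "done", answer: "Iteration limit reached; returning partial findings." };
}
```

A consumer such as the CLI would iterate with `for await (const event of runAgent(...))` and render each event as it arrives.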
- Implement Hypothesis tree (`src/agent/hypothesis.ts`)
  - Hypothesis interface (id, statement, evidence, children)
  - InvestigationTree class
  - Branch and prune operations
  - Tree serialization for scratchpad
- Implement confidence scoring (`src/agent/confidence.ts`)
  - Evidence strength classification (strong/weak/none)
  - Multi-factor confidence calculation
  - Temporal correlation detection
- Implement causal query builder (`src/agent/causal-query.ts`)
  - Hypothesis-targeted query generation
  - Anti-pattern detection (prevent broad data gathering)
  - Query prioritization by hypothesis confidence
  - Query refinement suggestions
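One way to combine evidence strength with temporal correlation into a single confidence score is sketched below. The weights (0.7 and 0.3), the per-strength scores, the five-minute window, and the `offsetFromIncidentMs` field are illustrative assumptions, not values taken from the project.

```typescript
type EvidenceStrength = "strong" | "weak" | "none";

interface Evidence {
  strength: EvidenceStrength;
  // Assumed field: milliseconds between the observed signal and the incident start.
  offsetFromIncidentMs: number;
}

// Illustrative numbers only.
const STRENGTH_SCORE: Record<EvidenceStrength, number> = { strong: 1, weak: 0.4, none: 0 };
const CORRELATION_WINDOW_MS = 300000; // 5 minutes

function isTemporallyCorrelated(e: Evidence): boolean {
  return Math.abs(e.offsetFromIncidentMs) <= CORRELATION_WINDOW_MS;
}

// Blends average evidence strength with the share of evidence that falls
// close to the incident start, clamped to [0, 1].
function confidence(evidence: Evidence[]): number {
  if (evidence.length === 0) return 0;
  const avgStrength =
    evidence.reduce((sum, e) => sum + STRENGTH_SCORE[e.strength], 0) / evidence.length;
  const correlatedShare = evidence.filter(isTemporallyCorrelated).length / evidence.length;
  return Math.min(1, Math.max(0, 0.7 * avgStrength + 0.3 * correlatedShare));
}
```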
- Implement AWS client wrapper (`src/providers/aws/client.ts`)
  - Credential management (assume-role, profiles)
  - Region handling (multi-region support)
  - Multi-account support
- Dynamic AWS Service System (`src/providers/aws/services.ts`, `executor.ts`)
  - Declarative service definitions for 40+ AWS services
  - Dynamic SDK client loading (lazy imports)
  - Automatic pagination handling
  - Unified resource formatting
  - Services by category: compute, database, storage, networking, security, analytics, integration, devtools, ml, management
- Implement AWS query meta-router (`src/tools/registry.ts` - `aws_query`)
  - Natural language to AWS API routing
  - Query by service ID or category
  - Parallel multi-service queries
  - Result aggregation
- Supported AWS Services (40+):
  - Compute: EC2, ECS, EKS, Lambda, Lightsail, App Runner, Amplify, Batch, ECR
  - Database: RDS, DynamoDB, ElastiCache, DocumentDB, Neptune, Redshift, MemoryDB
  - Storage: S3, EFS, FSx, Backup
  - Networking: VPC, ELB, CloudFront, Route 53, API Gateway, API Gateway V2
  - Security: IAM, Secrets Manager, KMS, ACM, WAF
  - Integration: SQS, SNS, EventBridge, Step Functions, Kinesis
  - Management: CloudWatch, CloudWatch Logs, SSM, CloudFormation
  - DevTools: CodePipeline, CodeBuild, CodeCommit
  - Analytics: Athena, Glue, OpenSearch
  - ML: SageMaker, Bedrock, Comprehend
- Implement AWS mutation tool (`src/tools/registry.ts` - `aws_mutate`)
  - Approval flow integration
  - Rollback command display
  - Risk classification
  - Supported: ECS UpdateService, EC2 Reboot/Start/Stop, Lambda UpdateConfig
- Implement safety layer (`src/agent/safety.ts`)
  - Operation risk classification (read/low/high/critical)
  - Mutation limits per session
  - Cooldown between high-risk operations
- Implement approval flow (`src/agent/approval.ts`)
  - CLI confirmation prompts with risk display
  - Risk-based approval (critical ops require typing 'yes')
  - Cooldown enforcement for critical mutations
  - Audit logging to `.runbook/audit/approvals.jsonl`
  - Slack approval integration
    - Send approval requests to Slack with buttons
    - Race between Slack and CLI approval
    - Configurable timeout
    - Auto-approval for specified risk levels
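The safety checks listed above (risk classification, per-session mutation limits, and a cooldown between critical operations) might be combined in a single gate, as in this sketch. `SafetyGate`, its method names, and the config field names are hypothetical; the risk levels follow the read/low/high/critical classification, and the caller passes the current time explicitly so the logic stays deterministic.

```typescript
type Risk = "read" | "low" | "high" | "critical";

interface SafetyConfig {
  requireApproval: Risk[];
  maxMutationsPerSession: number;
  cooldownBetweenCriticalMs: number;
}

type Decision =
  | { allowed: true; needsApproval: boolean }
  | { allowed: false; reason: string };

class SafetyGate {
  private readonly config: SafetyConfig;
  private mutations = 0;
  private lastCriticalAt: number | null = null;

  constructor(config: SafetyConfig) {
    this.config = config;
  }

  // Decides whether an operation may proceed and whether a human must approve it.
  check(risk: Risk, nowMs: number): Decision {
    if (risk === "read") return { allowed: true, needsApproval: false };
    if (this.mutations >= this.config.maxMutationsPerSession) {
      return { allowed: false, reason: "session mutation limit reached" };
    }
    if (
      risk === "critical" &&
      this.lastCriticalAt !== null &&
      nowMs - this.lastCriticalAt < this.config.cooldownBetweenCriticalMs
    ) {
      return { allowed: false, reason: "critical-operation cooldown active" };
    }
    return { allowed: true, needsApproval: this.config.requireApproval.includes(risk) };
  }

  // Call after a mutation has actually been executed.
  record(risk: Risk, nowMs: number): void {
    if (risk === "read") return;
    this.mutations++;
    if (risk === "critical") this.lastCriticalAt = nowMs;
  }
}
```

Separating `check` from `record` means a rejected or unapproved mutation never counts against the session budget.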
- Implement CloudWatch tools (`src/tools/aws/cloudwatch.ts`)
  - Log filtering and search
  - Alarm status
  - Log group listing
- Implement Datadog tools (`src/tools/observability/datadog.ts`)
  - Metric queries
  - Log search
  - APM trace search
  - Monitor/alert status
  - Events timeline
  - Service catalog
- Implement generic metrics interface
  - Prometheus support
    - Instant and range queries
    - Firing alerts
    - Target health monitoring
    - Common metric shortcuts
  - Custom metrics endpoints
- Implement PagerDuty tools (`src/tools/incident/pagerduty.ts`)
  - Get incident details
  - List incidents with filters
  - Get alerts for incident
  - Get service configuration
  - Add investigation notes
  - Acknowledge/resolve incidents
- Implement OpsGenie tools (`src/tools/incident/opsgenie.ts`)
  - Get alert details
  - List alerts with filters
  - Get incident details
  - List incidents
  - Add notes to alerts
  - Acknowledge/close alerts
- Implement Slack integration (`src/tools/incident/slack.ts`)
  - Post investigation updates with rich formatting
  - Post root cause identification
  - Read channel/thread messages
  - Send simple messages
  - Request mutation approval via Slack
  - Handle approval button interactions (requires webhook server)
- Implement knowledge types (`src/knowledge/types.ts`)
  - KnowledgeDocument, KnowledgeChunk interfaces
  - Source configurations
- Implement filesystem source (`src/knowledge/sources/filesystem.ts`)
  - Markdown parsing with frontmatter
  - YAML support
  - File watching for hot reload
- Implement source dispatcher (`src/knowledge/sources/index.ts`)
  - Unified loadFromSource entry point
  - Routes to filesystem, confluence, or google_drive loaders
- Implement Confluence source (`src/knowledge/sources/confluence.ts`)
  - REST API v2 (with v1 fallback)
  - Label filtering for runbooks/postmortems
  - HTML to markdown conversion
  - Incremental sync via lastSyncTime
  - Metadata extraction from labels
- Implement Google Drive source (`src/knowledge/sources/google-drive.ts`)
  - OAuth2 authentication flow (`google-auth.ts`)
  - Google Docs export to plain text
  - Google Sheets export to markdown tables
  - Subfolder traversal
  - Incremental sync via modifiedTime
  - Metadata from file properties
- Implement chunker (in `src/knowledge/sources/filesystem.ts`)
  - Markdown-aware chunking by sections
  - Section title preservation
  - Chunk type inference (context, procedure, command, etc.)
- Implement SQLite store (`src/knowledge/store/sqlite.ts`)
  - FTS5 full-text search
  - Document and chunk storage
  - Type and service filtering
- Implement knowledge retriever (`src/knowledge/retriever/index.ts`)
  - Sync from filesystem sources
  - Search with type/service filters
  - Organized results by knowledge type
- Implement embedder (`src/knowledge/indexer/embedder.ts`)
  - OpenAI embeddings integration (text-embedding-3-small)
  - Batch processing for efficiency
  - In-memory caching
  - Cost estimation
- Implement vector store (`src/knowledge/store/vector-store.ts`)
  - SQLite storage for embeddings
  - Cosine similarity search
  - Type and service filtering
- Implement hybrid retriever (`src/knowledge/retriever/hybrid-search.ts`)
  - Combines FTS and vector search
  - Reciprocal Rank Fusion (RRF) for merging
  - Configurable weights for each approach
- Implement service graph (`src/knowledge/store/graph-store.ts`)
  - Service nodes and edges
  - Dependency traversal (upstream/downstream impact)
  - Ownership lookup (by team, owner)
  - Service filtering (by type, tier, tag, team)
  - Path finding and cycle detection
- Implement reranker (`src/knowledge/retriever/reranker.ts`)
  - LLM-based relevance scoring
  - Hypothesis-aware ranking
- Implement context builder (`src/knowledge/retriever/context-builder.ts`)
  - Assemble retrieved knowledge for prompts
  - Token budget management
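The hybrid retriever's merge step, Reciprocal Rank Fusion with a weight per approach, can be sketched as follows. Each document scores `weight / (k + rank)` in every list where it appears, and the scores are summed. The function name and input shape are assumptions; `k = 60` is the constant commonly used with RRF, not a value confirmed for this project.

```typescript
interface RankedList {
  ids: string[]; // document IDs, best match first
  weight: number; // configurable weight for this retrieval approach
}

function reciprocalRankFusion(lists: RankedList[], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const { ids, weight } of lists) {
    ids.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort((a, b) => b.score - a.score);
}

// Example: merge a full-text result list with a vector-search result list.
const fused = reciprocalRankFusion([
  { ids: ["a", "b", "c"], weight: 1 }, // FTS results
  { ids: ["b", "d", "a"], weight: 1 }, // vector results
]);
```

Because RRF only uses ranks, it avoids having to normalize FTS scores and cosine similarities onto a common scale.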
- Implement skill types (`src/skills/types.ts`)
  - SkillDefinition, SkillStep, SkillParameter interfaces
  - Execution context and result types
- Implement skill registry (`src/skills/registry.ts`)
  - Built-in skill registration
  - User skill loading from `.runbook/skills/`
  - Skill lookup by ID, tag, or service
- Implement skill executor (`src/skills/executor.ts`)
  - Step-by-step execution
  - Parameter substitution with templates
  - Conditional step execution
  - Error handling (continue/abort/retry)
  - Approval flow integration
- Create core skills (`src/skills/builtin/`)
  - `investigate-incident` - Hypothesis-driven investigation
  - `deploy-service` - Safe deployment with pre/post checks
  - `scale-service` - Capacity planning and scaling
  - `troubleshoot-service` - Diagnose and fix issues
  - `rollback-deployment` - Quick and safe rollback
  - `cost-analysis` - Spending analysis and optimization
  - `security-audit` - IAM and security review
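Parameter substitution in the skill executor could be as simple as the sketch below. The `{{name}}` placeholder syntax and the `renderTemplate` helper are assumptions made for illustration; the project's actual template format is not specified here.

```typescript
type SkillParams = Record<string, string | number | boolean>;

// Replaces {{name}} placeholders with parameter values.
// Throws on a missing parameter so a step never runs with a half-filled command.
function renderTemplate(template: string, params: SkillParams): string {
  return template.replace(/\{\{\s*([\w.-]+)\s*\}\}/g, (_match: string, name: string) => {
    const value = params[name];
    if (value === undefined) throw new Error(`Missing skill parameter: ${name}`);
    return String(value);
  });
}

const rendered = renderTemplate("Scale {{service}} to {{ count }} tasks", {
  service: "checkout-api",
  count: 6,
});
```

Failing fast on unknown parameters is a deliberate choice in this sketch: silently leaving a placeholder in a command that mutates infrastructure would be harder to notice.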
- Implement CLI entry point (`src/cli.tsx`)
  - Ink-based React CLI
  - Command parsing
  - Configuration loading
- Implement core commands
  - `runbook investigate <incident-id>` - Investigate incident
  - `runbook ask <query>` - Natural language cloud queries
  - `runbook chat` - Interactive conversation mode
  - `runbook deploy <service>` - Deploy workflow with dry-run option
  - `runbook status` - Current infrastructure status
- Implement knowledge commands
  - `runbook knowledge sync` - Sync from sources
  - `runbook knowledge search <query>` - Search knowledge base with filters
  - `runbook knowledge add <file>` - Add local knowledge
  - `runbook knowledge validate` - Check for stale content
  - `runbook knowledge stats` - Show knowledge base statistics
  - `runbook knowledge auth google` - OAuth2 flow for Google Drive
- Implement config commands
  - `runbook init` - Interactive setup wizard with step-by-step configuration
  - `runbook config` - Show current configuration
  - `runbook config --set key=value` - Set config values (supports nested keys)
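Nested-key support for `runbook config --set key=value` might work along these lines. The dot-separated path syntax (for example `agent.max_iterations=20`) and the helper names are assumptions; the sketch only shows how an assignment string could be applied to an in-memory config object.

```typescript
type ConfigNode = { [key: string]: ConfigValue };
type ConfigValue = string | number | boolean | ConfigNode;

// Interprets "true"/"false" and numeric strings; everything else stays a string.
function parseScalar(raw: string): string | number | boolean {
  if (raw === "true") return true;
  if (raw === "false") return false;
  const asNumber = Number(raw);
  return raw.trim() !== "" && Number.isFinite(asNumber) ? asNumber : raw;
}

function setConfigValue(config: ConfigNode, assignment: string): ConfigNode {
  const eq = assignment.indexOf("=");
  if (eq <= 0) throw new Error(`Expected key=value, got "${assignment}"`);
  const keys = assignment.slice(0, eq).split(".");
  // Reject empty segments and keys that could tamper with object prototypes.
  const forbidden = ["", "__proto__", "constructor", "prototype"];
  if (keys.some((k) => forbidden.includes(k))) {
    throw new Error(`Invalid config key in "${assignment}"`);
  }
  const last = keys.pop() as string; // split() always yields at least one key
  let node = config;
  for (const key of keys) {
    const existing = node[key];
    // Create intermediate objects as needed (overwrites a scalar in the way).
    const child: ConfigNode = typeof existing === "object" ? existing : {};
    node[key] = child;
    node = child;
  }
  node[last] = parseScalar(assignment.slice(eq + 1));
  return config;
}
```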
- Implement learning module (`src/knowledge/learning/`)
  - Post-investigation analysis
  - Runbook suggestion generation
  - Known issue detection (recurring patterns)
- Implement knowledge update suggestions
  - New runbook drafts
  - Runbook update patches
  - Post-mortem drafts
- GCP provider (`src/providers/gcp/`)
- Azure provider (`src/providers/azure/`)
- Kubernetes provider (`src/providers/kubernetes/`)
- Terraform integration (`src/providers/terraform/`)
runbook/
├── src/
│   ├── agent/
│   │   ├── agent.ts              # Main agent loop
│   │   ├── hypothesis.ts         # Hypothesis tree management
│   │   ├── confidence.ts         # Evidence scoring
│   │   ├── prompts.ts            # Prompt templates
│   │   ├── scratchpad.ts         # Audit trail
│   │   ├── safety.ts             # Mutation controls
│   │   └── types.ts              # Event types
│   ├── providers/
│   │   ├── aws/
│   │   │   ├── client.ts         # AWS SDK wrapper
│   │   │   └── tools/            # EC2, ECS, Lambda, etc.
│   │   ├── gcp/                  # Future
│   │   └── kubernetes/           # Future
│   ├── tools/
│   │   ├── registry.ts           # Tool registration
│   │   ├── skill.ts              # Skill invocation
│   │   ├── aws/
│   │   │   ├── aws-query.ts      # Read-only meta-router
│   │   │   └── aws-mutate.ts     # State changes
│   │   ├── observability/
│   │   │   ├── causal-query.ts   # Hypothesis-targeted queries
│   │   │   ├── cloudwatch.ts
│   │   │   └── datadog.ts
│   │   └── incident/
│   │       ├── pagerduty.ts
│   │       ├── opsgenie.ts
│   │       └── slack.ts
│   ├── knowledge/
│   │   ├── types.ts
│   │   ├── sources/
│   │   │   ├── filesystem.ts
│   │   │   ├── confluence.ts     # Future
│   │   │   └── github.ts         # Future
│   │   ├── indexer/
│   │   │   ├── chunker.ts
│   │   │   ├── embedder.ts
│   │   │   └── metadata.ts
│   │   ├── store/
│   │   │   ├── vector-store.ts
│   │   │   ├── graph-store.ts
│   │   │   └── sqlite.ts
│   │   ├── retriever/
│   │   │   ├── hybrid-search.ts
│   │   │   ├── reranker.ts
│   │   │   └── context-builder.ts
│   │   └── learning/
│   │       ├── suggest-updates.ts
│   │       └── auto-enrich.ts
│   ├── skills/
│   │   ├── registry.ts
│   │   ├── investigate-incident/
│   │   │   └── SKILL.md
│   │   ├── deploy-service/
│   │   │   └── SKILL.md
│   │   ├── scale-service/
│   │   │   └── SKILL.md
│   │   ├── troubleshoot-service/
│   │   │   └── SKILL.md
│   │   └── cost-analysis/
│   │       └── SKILL.md
│   ├── model/
│   │   └── llm.ts                # LLM client with caching
│   ├── hooks/
│   │   └── useAgentRunner.ts     # React hook for CLI
│   ├── utils/
│   │   ├── tokens.ts             # Token counting
│   │   └── config.ts             # Configuration loading
│   └── cli.tsx                   # CLI entry point
├── .runbook/                     # User configuration (gitignored)
│   ├── config.yaml
│   ├── runbooks/                 # Local runbooks
│   ├── knowledge.db              # SQLite + vectors
│   ├── service-graph.json
│   ├── scratchpad/               # Investigation logs
│   └── investigations/           # Investigation trees
├── examples/
│   └── runbooks/                 # Example runbooks
├── package.json
├── tsconfig.json
├── bunfig.toml
├── PLAN.md                       # This file
└── README.md
.runbook/config.yaml
# LLM Configuration
llm:
  provider: anthropic  # anthropic | openai
  model: claude-sonnet-4-20250514
  api_key: ${ANTHROPIC_API_KEY}
# Cloud Providers
providers:
  aws:
    enabled: true
    regions: [us-east-1, us-west-2]
    profile: default  # AWS profile or use env vars
# Incident Management
incident:
  pagerduty:
    enabled: true
    api_key: ${PAGERDUTY_API_KEY}
  opsgenie:
    enabled: false
  slack:
    enabled: true
    bot_token: ${SLACK_BOT_TOKEN}
# Knowledge Sources
knowledge:
  sources:
    - type: filesystem
      path: .runbook/runbooks/
      watch: true
    - type: filesystem
      path: ~/.runbook/knowledge/
  store:
    type: local
    path: .runbook/knowledge.db
    embedding_model: text-embedding-3-small
  retrieval:
    top_k: 10
    rerank: true
# Safety
safety:
  require_approval:
    - high_risk
    - critical
  max_mutations_per_session: 5
  cooldown_between_critical_ms: 60000
# Agent
agent:
  max_iterations: 10
  max_hypothesis_depth: 4
  context_threshold_tokens: 100000

Rather than gathering all available data, we form hypotheses and test them with targeted queries. This reduces noise and keeps the investigation focused on causal relationships.
Tool limits warn but never block. The agent can always proceed, but gets warnings to prevent retry loops.
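A warn-but-never-block limit, combined with similar query detection, could be sketched like this. The soft limit of 3, the 0.8 similarity threshold, and the word-level Jaccard measure are illustrative assumptions; the class here is a simplified stand-in for the real scratchpad.

```typescript
interface ToolCall {
  tool: string;
  query: string;
}

function tokenSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\s+/).filter((t) => t.length > 0));
}

// Word-level Jaccard similarity in [0, 1].
function similarity(a: string, b: string): number {
  const setA = tokenSet(a);
  const setB = tokenSet(b);
  if (setA.size === 0 && setB.size === 0) return 1;
  const shared = Array.from(setA).filter((t) => setB.has(t)).length;
  return shared / (setA.size + setB.size - shared);
}

class Scratchpad {
  private readonly calls: ToolCall[] = [];
  private readonly softLimit: number;

  constructor(softLimit = 3) {
    this.softLimit = softLimit;
  }

  // Always records the call. Returns warnings for the agent; never throws or refuses.
  record(call: ToolCall): string[] {
    const warnings: string[] = [];
    const sameTool = this.calls.filter((c) => c.tool === call.tool);
    if (sameTool.length >= this.softLimit) {
      warnings.push(`${call.tool} was already called ${sameTool.length} times; consider another approach.`);
    }
    if (sameTool.some((c) => similarity(c.query, call.query) >= 0.8)) {
      warnings.push(`A similar ${call.tool} query was already run; reuse earlier results if possible.`);
    }
    this.calls.push(call);
    return warnings;
  }
}
```

The warnings would be fed back into the next iteration prompt so the model can change course on its own.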
All state-changing operations require prior research to understand current state and impact.
Every tool call, hypothesis, and decision is logged to JSONL for compliance and debugging.
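As a sketch of what JSONL audit entries might look like, the snippet below serializes each event as one JSON object per line. The entry shapes and field names (`kind`, `ts`, and so on) are assumptions; a real implementation would append each line to a file under `.runbook/scratchpad/`, which is omitted here to keep the example self-contained.

```typescript
// Hypothetical entry shapes covering the three things the audit trail logs.
type ScratchpadEntry =
  | { kind: "tool_call"; tool: string; args: Record<string, unknown> }
  | { kind: "hypothesis"; id: string; statement: string }
  | { kind: "decision"; action: "branch" | "prune" | "confirm"; hypothesisId: string };

type LoggedEntry = ScratchpadEntry & { ts: string };

// One entry becomes exactly one line: JSON.stringify never emits raw newlines.
function toJsonlLine(entry: ScratchpadEntry, timestamp: Date): string {
  return JSON.stringify({ ts: timestamp.toISOString(), ...entry });
}

// Reads a JSONL document back into entries, skipping blank lines.
function parseJsonl(text: string): LoggedEntry[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as LoggedEntry);
}
```

Append-only lines make the log easy to tail during an investigation and cheap to replay afterwards.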
Organizational runbooks and post-mortems are indexed and retrieved during investigations, not just appended as context.
Provider abstraction allows adding GCP, Azure, K8s without changing core agent logic.
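The provider abstraction might be expressed as a small interface plus a registry, so that the agent only ever talks to the registry. Everything named here (`CloudProvider`, `ProviderRegistry`, `stubProvider`) is hypothetical, and the stub stands in for a real SDK-backed provider.

```typescript
interface Resource {
  provider: string;
  type: string;
  id: string;
}

interface CloudProvider {
  readonly name: string;
  listResources(type: string): Promise<Resource[]>;
}

class ProviderRegistry {
  private readonly providers = new Map<string, CloudProvider>();

  register(provider: CloudProvider): void {
    this.providers.set(provider.name, provider);
  }

  names(): string[] {
    return Array.from(this.providers.keys());
  }

  // Queries every registered provider in parallel and merges the results.
  async listAll(type: string): Promise<Resource[]> {
    const perProvider = await Promise.all(
      Array.from(this.providers.values(), (p) => p.listResources(type)),
    );
    return perProvider.reduce<Resource[]>((all, list) => all.concat(list), []);
  }
}

// Stub provider for illustration; a real one would call the cloud SDK.
function stubProvider(name: string, ids: string[]): CloudProvider {
  return {
    name,
    listResources: async (type) => ids.map((id) => ({ provider: name, type, id })),
  };
}
```

Adding a new cloud then amounts to implementing `CloudProvider` and calling `register`, with no change to the code that consumes `listAll`.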
- `bun` - Runtime
- `typescript` - Type safety
- `@langchain/anthropic` - LLM integration
- `@langchain/core` - Agent primitives
- `zod` - Schema validation

- `@aws-sdk/client-ec2`
- `@aws-sdk/client-ecs`
- `@aws-sdk/client-lambda`
- `@aws-sdk/client-rds`
- `@aws-sdk/client-elasticache`
- `@aws-sdk/client-cloudwatch`
- `@aws-sdk/client-cloudwatch-logs`
- `@aws-sdk/client-iam`

- `node-pagerduty` or raw API
- `@slack/web-api`

- `better-sqlite3` - Local storage
- `sqlite-vss` - Vector search
- `openai` - Embeddings
- `gray-matter` - Frontmatter parsing
- `marked` - Markdown parsing

- `ink` - React for CLI
- `ink-spinner` - Loading states
- `commander` - Command parsing
- `chalk` - Colors
- Investigation Accuracy: Root cause correctly identified in >80% of incidents
- Time to Resolution: Reduce MTTR by providing faster diagnosis
- Runbook Coverage: Track which incidents had matching runbooks
- Knowledge Freshness: Alert on stale runbooks (>90 days without validation)
- Safety: Zero unauthorized mutations, full audit trail
Completed:
- Phase 1: Project Foundation (100%)
- Phase 2: Core Agent Loop (100%)
- Phase 3: Hypothesis Engine (100% - causal query builder with anti-pattern detection)
- Phase 4: AWS Tools (100% - 40+ services with dynamic loading)
- Phase 5: Safety Layer (100% - approval flow with Slack integration)
- Phase 6: Observability (100% - CloudWatch, Datadog, Prometheus integration)
- Phase 7: Incident Management (100% - PagerDuty, OpsGenie, Slack complete)
- Phase 8: Knowledge System (95% - FTS, vector embeddings, hybrid search)
- Phase 9: Skills (100% - 7 core skills with executor and registry)
- Phase 10: CLI Interface (100% - all commands implemented)
New Features:
- Multi-AWS account support with assume-role and profiles
- Service configuration system for targeted infrastructure scanning
- Quick setup templates (ecs-rds, serverless, enterprise)
- Interactive setup wizard (`runbook init`) with step-by-step configuration
- Dynamic AWS Service System (40+ services):
  - Declarative service definitions with automatic SDK loading
  - Query by service ID, category, or all services
  - Parallel execution with unified result formatting
  - Automatic pagination handling
  - Categories: compute, database, storage, networking, security, analytics, integration, devtools, ml, management
- Mutation approval flow with risk classification (low/medium/high/critical)
- AWS mutations: ECS scaling, EC2 start/stop/reboot, Lambda config updates
- Audit trail for all approved/rejected mutations
- Interactive chat interface (`runbook chat`) with conversation history
- Datadog integration (metrics, logs, traces, monitors, events)
- Skill system with 7 built-in workflows:
  - investigate-incident, deploy-service, scale-service
  - troubleshoot-service, rollback-deployment
  - cost-analysis, security-audit
- Skill executor with templating, conditions, and error handling
- User-defined skills via YAML in `.runbook/skills/`
- Causal query builder with pattern-based investigation queries
  - Detects failure patterns (latency, errors, memory, CPU, etc.)
  - Generates targeted queries per hypothesis
  - Prevents broad data gathering with anti-pattern detection
- Slack integration for incident communication:
  - Post investigation updates with rich Block Kit formatting
  - Post root cause identification with evidence
  - Read channel/thread context for investigation
  - Request approval for mutations via Slack buttons
- OpsGenie integration:
  - Get/list alerts and incidents
  - Add investigation notes
  - Acknowledge and close alerts
- Prometheus integration:
  - Instant and range PromQL queries
  - Firing alerts monitoring
  - Target health checks
  - Common metric shortcuts (CPU, memory, disk, network, K8s)
- Knowledge system with semantic search:
  - OpenAI embeddings for vector similarity
  - Hybrid search (FTS + vector) with RRF fusion
  - Batch embedding with caching
  - Find similar past incidents and runbooks
- Complete CLI with all commands:
  - `runbook deploy` with dry-run support
  - `runbook knowledge add/validate/stats`
  - `runbook config --set` for nested config values
- Slack approval integration:
  - Send approval requests to Slack with buttons
  - Race between Slack and CLI approval
  - Auto-approval for configured risk levels
GitHub: https://github.com/manthan787/RunbookAI
Next Steps:
Total: 156 tests passing across 5 test files
- ✅ Implement investigation state machine (triage → hypothesize → investigate → evaluate → conclude → remediate)
  - `src/agent/state-machine.ts` with 8 phases, event emitter pattern
  - `src/agent/__tests__/state-machine.test.ts` with 45 passing tests
  - Hypothesis tree with max depth 4, max 10 hypotheses
  - Evidence evaluation with branch/prune/confirm/continue actions
- ✅ Add structured output parsing for LLM hypothesis/evidence responses
  - `src/agent/llm-parser.ts` with Zod schemas for all structured outputs
  - `src/agent/__tests__/llm-parser.test.ts` with 27 passing tests
  - Schemas: TriageResponse, HypothesisGeneration, EvidenceEvaluation, Conclusion, RemediationPlan, LogAnalysis
  - Prompt templates for each structured output type
- ✅ Add log analysis with pattern extraction
  - `src/agent/log-analyzer.ts` with pattern matching and log parsing
  - `src/agent/__tests__/log-analyzer.test.ts` with 34 passing tests
  - 11 error pattern categories (connection, memory, database, auth, K8s, etc.)
  - Time range extraction, service mention detection, log filtering
- ✅ Integrate causal query builder into investigation loop
  - `src/agent/investigation-orchestrator.ts` ties everything together
  - `src/agent/__tests__/investigation-orchestrator.test.ts` with 16 passing tests
  - Orchestrates: state machine, LLM parsing, log analysis, query execution
  - Event emitter for progress tracking
- ✅ Wire hypothesis engine lifecycle (create, update, prune, confirm)
  - Full lifecycle in orchestrator: generate → investigate → evaluate → branch/prune/confirm
  - Automatic sub-hypothesis creation on branch
  - Remediation planning from conclusion
- ✅ Implement conversation memory for chat mode
  - `src/agent/conversation-memory.ts` with full memory management
  - `src/agent/__tests__/conversation-memory.test.ts` with 34 passing tests
  - Store conversation history and investigation context
  - Context compression and summarization
  - Reference previous findings
  - Serialize/deserialize for persistence
Total: 241 tests passing across 8 test files
- ✅ Add service dependency graph
  - `src/knowledge/store/graph-store.ts` with full graph operations
  - `src/knowledge/store/__tests__/graph-store.test.ts` with 31 passing tests
  - Service nodes with type, team, tier, tags, and metadata
  - Dependency edges with criticality and type
  - Impact analysis (upstream and downstream)
  - Path finding, cycle detection, and statistics
- ✅ Implement Slack webhook server for approval buttons
  - `src/webhooks/slack-webhook.ts` with HTTP server for Slack interactivity
  - `src/webhooks/__tests__/slack-webhook.test.ts` with 20 passing tests
  - Slack signature verification for security
  - Handle approve/reject button clicks
  - Write response files for polling approval flow
  - Update Slack message with approval status
  - CLI command: `runbook webhook` to start server
  - Health check endpoint at /health
  - Pending approval listing and cleanup utilities
- ✅ Add Kubernetes integration
  - `src/providers/kubernetes/client.ts` with kubectl wrapper
  - `src/providers/kubernetes/__tests__/client.test.ts` with 34 passing tests
  - Full K8s resource support: pods, deployments, services, nodes, events, etc.
  - Pod status with container details, restarts, and node assignment
  - Deployment status with replica counts and image info
  - Node status with roles, conditions, capacity, and allocatable resources
  - Log retrieval with tail, since, previous, and container options
  - Deployment operations: scale, restart, rollback, rollout status
  - Resource usage with top pods/nodes
  - Cluster info and multi-context support
- ✅ Add AI provider configuration to init wizard
  - Setup wizard now asks for LLM provider (Anthropic, OpenAI, Ollama)
  - API key input with provider-specific instructions
  - Saves to `.runbook/config.yaml` with model defaults
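The Slack signature verification mentioned for the webhook server follows Slack's published signing scheme: an HMAC-SHA256 over `v0:{timestamp}:{raw body}` keyed with the app's signing secret, compared against the `X-Slack-Signature` header. The sketch below is a minimal version of that check; the function name and the explicit `nowSeconds` parameter are assumptions, and it is not taken from `slack-webhook.ts`.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const MAX_SKEW_SECONDS = 300; // reject requests older than five minutes (replay protection)

function verifySlackSignature(
  signingSecret: string,
  timestamp: string, // X-Slack-Request-Timestamp header
  rawBody: string, // request body exactly as received, before parsing
  signature: string, // X-Slack-Signature header, e.g. "v0=abc123..."
  nowSeconds: number,
): boolean {
  const ts = Number(timestamp);
  if (!Number.isFinite(ts) || Math.abs(nowSeconds - ts) > MAX_SKEW_SECONDS) return false;

  const expected =
    "v0=" + createHmac("sha256", signingSecret).update(`v0:${timestamp}:${rawBody}`).digest("hex");

  // Constant-time comparison; lengths must match before timingSafeEqual is called.
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  return a.length === b.length && timingSafeEqual(a, b);
}
```

The raw body matters: re-serializing a parsed payload can change byte order or escaping and make a valid signature fail.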
Usage:
# Quick setup
runbook init --template ecs-rds --regions us-east-1
# Interactive chat mode
runbook chat
# One-shot queries
runbook ask "what's running in prod?"
runbook ask "show me all S3 buckets and Lambda functions"
# Investigate incident
runbook investigate PD-12345
# Check status
runbook status
# Search knowledge
runbook knowledge search "redis timeout"