Implement quadrant-based lambda logic for ChatOps commands#17
Merged
Conversation
Copilot
AI
changed the title
[WIP] Refactor ChatOps architecture for command isolation
Implement quadrant-based IAM permission boundaries for ChatOps commands
Dec 31, 2025
- Created command-registry.ts with CommandMetadata interface and COMMAND_REGISTRY - Categorized all commands into four quadrants (short-read, short-write, long-read, long-write) - Added helper functions for querying commands by category and permissions - Created comprehensive unit tests for command registry (23 tests, all passing) - Updated infrastructure to define four IAM policy statements (one per quadrant) - Added region restrictions and tag-based ABAC conditions to policies - Applied appropriate policies to Lambda handlers based on their roles - All tests pass, code builds successfully Co-authored-by: llama90 <6668548+llama90@users.noreply.github.com>
- Updated write permission detection to use specific action names - Fixed logic for distinguishing read vs write operations (e.g., athena:StartQueryExecution is read-only) - All 32 tests now passing Co-authored-by: llama90 <6668548+llama90@users.noreply.github.com>
- Extracted isWriteAction helper function to avoid code duplication - Fixed incorrect 'StartDeploy' check to 'CreateDeployment' to match actual permissions - All 32 tests passing Co-authored-by: llama90 <6668548+llama90@users.noreply.github.com>
Replace command-specific workers with unified quadrant workers for better scalability and maintainability. **Workers (routing layer):** - Remove: echo, build, deploy, status workers (command-specific) - Add: SR (short-read) and LW (long-write) unified workers - SR worker handles: /echo and future fast read commands - LW worker handles: /build, /deploy and future write commands **Handlers (business logic layer):** - Extract command logic into reusable handlers - handlers/echo.ts - Echo command logic - handlers/build.ts - Build command logic - Workers route to handlers based on command registry 1. **Extensibility**: New commands just need handler registration 2. **DRY**: Shared worker infrastructure for similar command types 3. **Performance**: Optimized timeouts and concurrency per quadrant 4. **Maintainability**: Clear separation of routing vs business logic - Build system: package.sh, Makefile, component-config.sh - CI/CD: slack-build.yml workflow - Local dev: LocalStack setup, .env.local.example - Documentation: CONFIGURATION.md, LOCAL-TESTING.md Aligns with cloud-sandbox PR #14 which provisions: - laco-plt-chatbot-command-sr-sqs queue - laco-plt-chatbot-command-lw-sqs queue - laco-plt-chatbot-command-sr-worker Lambda - laco-plt-chatbot-command-lw-worker Lambda
a04461b to
86f7017
Compare
The deploy-%-local pattern rule was being shadowed by the deploy-% rule, causing 'make deploy-sr-local' to incorrectly try building 'sr-local' instead of 'sr'. Make's pattern matching doesn't guarantee evaluation order, so explicit targets are more reliable than trying to order pattern rules. Changes: - Remove deploy-%-local pattern rule - Add explicit deploy-sr-local, deploy-lw-local, deploy-router-local targets - All use same logic: build component, then deploy with --local flag - deploy-all-local still works correctly Fixes error: 'Unknown component: sr-local'
Update all performance test scripts and documentation to reference the new SR (short-read) worker and queue names instead of the deprecated echo worker. Changes: - Lambda: chatbot-echo-worker → chatbot-command-sr-worker - Queue: chatbot-echo → chatbot-command-sr-queue - Component name: echo-worker → sr-worker - Updated analyze-performance.sh CloudWatch queries - Updated analyze-e2e-json.sh log group references - Updated README.md examples - Fixed X-Ray segment handling in echo handler (test compatibility) - Updated unit tests to test SR worker instead of echo worker - Added performance-tests/results/*.png to .gitignore This ensures E2E metrics are collected from the correct Lambda function and SQS queue after the quadrant-based refactor.
Restore structured performance metrics logging that was present in the original echo worker, enabling E2E latency tracking and component breakdown analysis in CloudWatch. Implementation matches the original echo/index.ts pattern from PR #18: - SR worker collects timing metrics and logs 'Performance metrics' - Handler returns syncResponseMs and asyncResponseMs - Worker calculates E2E, queue wait, and total duration - Metrics logged for both success and failure cases Performance metrics fields: - totalE2eMs: API Gateway → final response (end-to-end) - workerDurationMs: Lambda execution time - queueWaitMs: Time message spent in SQS (calculated) - syncResponseMs: Sync Slack response time (from handler) - asyncResponseMs: Async Slack response time (from handler) - component: 'sr-worker' for CloudWatch filtering - correlationId, command, success, errorType, errorMessage Changes: - Removed artificial 2-second sleep delay from echo handler - Echo handler now returns HandlerResult with timing metrics - SR worker logs structured metrics via logWorkerMetrics() This restores server-side metrics collection after the quadrant-based refactor, enabling performance test analysis scripts to work correctly.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
ChatOps commands currently share a single IAM role, giving read-only commands unnecessary write permissions and preventing blast radius isolation. This PR implements Phase 1: distinct IAM policies per command quadrant (execution time × side effects).
Changes
Command Registry (
src/shared/command-registry.ts)short-read,short-write,long-read,long-writegetCommandsByCategory(),getCategoryPermissions(),getCommandsRequiringApproval()Infrastructure (
infrastructure/lib/slack-bot-stack.ts)Replaced single
parameterStorePolicywith four quadrant-specific policies:Short-Read: CloudWatch/Lambda/ECS describe operations (read-only)
Short-Write: ECS/Lambda configuration updates with tag-based ABAC
Long-Read: Athena, S3, Glue, Cost Explorer (analytics)
Long-Write: CodeDeploy, CodeBuild, RDS/DynamoDB migrations with ABAC
Security controls added to all policies:
aws:RequestedRegionconditionaws:ResourceTag/ManagedBy=ChatOps) for write operationsApplied to Lambda handlers:
commandHandler: Parameter Store only (routes commands)processorHandler: Parameter Store + Long-Read (analytics)executorHandler: Parameter Store + all quadrants (executes all command types)Testing (
tests/unit/command-registry.test.ts)23 tests validating:
Security Impact
ManagedBy=ChatOpstagged resources onlyFuture Work
Phase 2: Separate SQS queues per quadrant
Phase 3: Dedicated Lambda functions per quadrant
Phase 4: Per-quadrant CloudWatch dashboards and SLOs
Original prompt
Problem Statement
Our ChatOps platform currently uses a one-size-fits-all architecture where all commands share the same infrastructure (queues, IAM roles, timeouts). This creates security and performance issues:
Command Execution Matrix
Commands are classified along two dimensions:
1. Execution Time
2. Side Effects
This creates four quadrants:
/status,/health,/metrics/scale,/restart/analyze,/report/deploy,/migratePhase 1 Requirements: Permission Boundaries (This PR)
Implement distinct IAM roles for each quadrant with least-privilege permissions:
Architecture Changes
Create four IAM roles (one per quadrant):
ShortReadRole: Read-only CloudWatch, Lambda metadataShortWriteRole: Scoped write permissions (ECS, Lambda config)LongReadRole: Read-only cost data, metrics, Athena queriesLongWriteRole: Deployment permissions (CodeDeploy, ECS)Update infrastructure (
applications/chatops/slack-bot/infrastructure/lib/slack-bot-stack.ts):parameterStorePolicywith quadrant-specific policiesCreate command registry (
src/shared/command-registry.ts):Update router (
src/router/index.ts):commandCategoryin EventBridge detailUpdate workers to reference appropriate roles:
ShortReadRoleShortReadRoleLongWriteRoleLongWriteRoleSecurity Requirements
✅ DO:
SECURITY.md❌ DON'T:
Testing Requirements
Unit tests (
tests/unit/command-classifier.test.ts):Integration tests:
Documentation Requirements
Update the following files:
applications/chatops/slack-bot/README.md: Document quadrant modelARCHITECTURE.md: Add section on permission boundariesSECURITY.md: Add quadrant-based IAM examplesFiles to Modify
Core Infrastructure:
applications/chatops/slack-bot/infrastructure/lib/slack-bot-stack.tsShared Logic:
applications/chatops/slack-bot/src/shared/command-registry.ts(new file)applications/chatops/slack-bot/src/shared/types.ts(addCommandCategorytype)Router:
applications/chatops/slack-bot/src/router/index.tsDocumentation:
applications/chatops/slack-bot/README.mdARCHITECTURE.mdSECURITY.mdTests:
applications/chatops/slack-bot/tests/unit/command-classifier.test.ts(new file)Success Criteria
Future Phases (Not in this PR)
This pull request was created from Copilot chat.
✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.