Add E2E Performance Tracking and Testing Infrastructure#18
Merged
Conversation
- Add performance-tests directory with Artillery configurations - Add CloudWatch Logs analysis scripts (analyze-performance.sh, analyze-e2e-json.sh) - Add Slack signature processor for load testing - Add test scenarios: echo-only, echo-light, full config 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Add correlation_id and api_gateway_start_time to SlackCommand and WorkerMessage types - Track request start time in router Lambda (API Gateway entry point) - Calculate and log comprehensive performance metrics in echo worker: - totalE2eMs: End-to-end latency from API Gateway to worker completion - workerDurationMs: Worker processing time - queueWaitMs: SQS queue wait time (difference between E2E and worker duration) - syncResponseMs: Synchronous Slack response time - asyncResponseMs: Asynchronous Slack response time - Enable CloudWatch Insights analysis of latency breakdown across system components 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Update analyze-performance.sh to query Performance metrics log - Replace correlation-based E2E calculation with structured metrics - Add component breakdown: queueWaitMs, workerDurationMs, syncResponseMs, asyncResponseMs - Update summary table to show actual metrics instead of estimates - Adjust Key Metrics thresholds based on real data - Update analyze-e2e-json.sh to extract all performance metrics - Add syncResponseMs and asyncResponseMs to E2E query - Add p50 percentile for better distribution analysis - Filter by 'Performance metrics' message for accurate data Both scripts now leverage the structured performance logging added in previous commit. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Create logRouterMetrics() helper function for router Lambda - Track statusCode, duration, success/failure - Log authentication errors (401) with error type - Log server errors (500) with error details - Log successful requests (200) with command info - Create logWorkerMetrics() helper function for worker Lambda - Add 'success' boolean field to all performance metrics - Add 'errorType' and 'errorMessage' for failed requests - Log metrics even when processing fails - Always include correlationId and command when available - Enable CloudWatch Insights queries for error analysis: - Error rate calculation: count(success=false) / count(*) - Error type distribution - Performance comparison: success vs failure cases - Router vs Worker error breakdown Example queries enabled: - fields success, errorType | filter message = "Performance metrics" | stats count() by success, errorType - fields statusCode, duration | filter message = "Router performance metrics" | stats avg(duration) by statusCode 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Makefile improvements: - Consolidate 11 perf-test targets into 5 clean commands - Add PERF_PROFILE variable (minimal|light|full) with default=minimal - Simplify command names: perf-test, perf-analyze, perf-summary, perf-report, perf-clean - Fix perf-analyze-quiet to output JSON metrics file instead of suppressing everything - Reduce code duplication by 60% (116 lines → 64 lines) New minimal test profile (artillery-echo-minimal.yml): - Duration: 60 seconds (vs 420s for light) - Requests: ~121 (vs ~5,460 for light) - Cost reduction: 97.8% - Provides statistically valid P50/P95/P99 metrics - Perfect for quick validation and CI/CD Performance script improvements: - Skip CloudWatch Metrics in quiet mode (prevents failures) - Output only essential progress messages in quiet mode - Generate .metrics.json file for programmatic access Usage examples: make perf-test # Run minimal (1 min, 121 reqs) make perf-test PERF_PROFILE=light # Run light (7 min, 5,460 reqs) make perf-test PERF_PROFILE=full # Run full (12 min, all commands) make perf-analyze # Analyze with full output make perf-analyze-quiet # Analyze quietly, save to JSON make perf-summary # Quick summary 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Problem: - Performance tests use fake Slack URLs - Fake URLs return 404 errors - Lambda fails before logging complete performance metrics Solution: - Use special mock URL: /test/perf-test-mock - Worker detects and skips Slack API call for this URL - Lambda completes successfully with full metrics Changes: - slack-signature-processor.js: Generate mock URL for tests - slack-client.ts: Skip API call if URL contains /test/perf-test-mock - No Lambda environment variable changes needed - Real Slack URLs unaffected Benefits: - All e2e metrics logged: totalE2eMs, queueWaitMs, syncResponseMs, asyncResponseMs - DLQ no longer fills with test failures - Performance tests now generate complete data 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Enhanced E2E metrics visualization with highlighted cards - Reorganized report layout: E2E metrics → Component breakdown → Service metrics - Added responsive design for mobile/tablet (768px, 480px breakpoints) - Implemented consistent color palette across all charts - Added E2E Timeline breakdown chart with stacked bars and percentage display - Improved chart tooltips with 'index' mode for better UX - Added Service Latency Distribution chart with percentile comparison - Removed MAX column from comparison (data not available for E2E) - Added note about independent percentile calculations - Created E2E Component Details table showing time breakdown - Improved chart interaction: larger point radius, hover effects - Removed 2000ms artificial delay from echo worker for realistic testing 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Add capture-report.js script using Puppeteer to convert HTML reports to PNG - Add 'make perf-capture' command to Makefile for easy screenshot generation - Fix Artillery aggregate data parsing in analyze-e2e-json.sh - Use aggregate.counters instead of summary (Artillery v2 format) - Calculate errorRate, avgRps, and duration correctly - Fix duplicate error counting in render-report.js - Only count top-level 'errors.*' to avoid duplicates - Artillery creates both 'errors.ETIMEDOUT' and 'scenario.errors.ETIMEDOUT' - Add puppeteer as dev dependency for screenshot generation Screenshot captures full HTML report at 1400px width with 2x scale for high quality. Error counts now accurate (was showing 4406 instead of 2203 for timeouts).
- Add http.timeout: 30 to prevent connection timeouts during high load - Add http.pool: 50 for connection pool management - Fixes 2,203 ETIMEDOUT errors observed in previous test run - Enables reliable testing at 40 req/s peak load
applications/chatops/slack-bot/performance-tests/slack-signature-processor.js
Fixed
Show fixed
Hide fixed
applications/chatops/slack-bot/performance-tests/slack-signature-processor.js
Fixed
Show fixed
Hide fixed
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR adds comprehensive end-to-end performance tracking and testing infrastructure for the Slack chatops bot, enabling systematic performance monitoring through correlation ID tracking, structured metrics logging, and automated Artillery-based load testing with CloudWatch integration.
Key Changes:
- Correlation ID tracking throughout the request lifecycle (API Gateway → Router → Worker)
- Structured performance metrics logging with E2E latency breakdown
- Artillery load testing infrastructure with multiple test profiles (minimal/light/standard/full)
- Interactive HTML performance reports with Chart.js visualizations
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
src/workers/echo/index.ts |
Added E2E metrics logging with latency breakdown (sync/async response, queue wait time) |
src/router/index.ts |
Implemented correlation ID tracking and router performance metrics |
src/shared/types.ts |
Added correlation_id and api_gateway_start_time fields for E2E tracking |
src/shared/slack-client.ts |
Added mock URL detection to skip Slack API calls during performance tests |
performance-tests/analyze-performance.sh |
CloudWatch Logs Insights queries for aggregating performance metrics |
performance-tests/analyze-e2e-json.sh |
JSON output generation for dashboard integration |
performance-tests/view-results.sh |
Terminal-based test results viewer |
performance-tests/render-report.js |
HTML report generator with Chart.js visualizations |
performance-tests/slack-signature-processor.js |
Artillery processor for Slack signature generation |
performance-tests/artillery-*.yml |
Multiple test configurations for different load profiles |
Makefile |
Added performance testing targets (perf-test, perf-analyze, perf-report) |
package.json |
Added puppeteer for screenshot capture functionality |
.gitignore |
Excluded test results and environment files |
Files not reviewed (1)
- applications/chatops/slack-bot/package-lock.json: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
applications/chatops/slack-bot/performance-tests/analyze-performance.sh
Outdated
Show resolved
Hide resolved
- Changed all Korean comments to English for better code readability - Improved international collaboration - No functional changes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Contributor
Contributor
…ries (#20) - Initial plan - Replace hardcoded sleep with polling mechanism for CloudWatch Logs queries - Improve error handling in wait_for_query_completion function Co-authored-by: llama90 <6668548+llama90@users.noreply.github.com> * fix: fix analyze-performance.sh timestamp and CloudWatch query issues - Fix .metrics.json being selected instead of test results - Fix timestamp conversion from milliseconds to seconds for AWS CLI - Add dynamic CloudWatch period calculation to avoid 1440 datapoint limit - Fix AWS statistics parameter format (space-separated instead of comma) - Add error handling for CloudWatch queries 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: llama90 <6668548+llama90@users.noreply.github.com> Co-authored-by: Hyunseok Seo <hsseo0501@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>
llama90
added a commit
that referenced
this pull request
Jan 11, 2026
Restore structured performance metrics logging that was present in the original echo worker, enabling E2E latency tracking and component breakdown analysis in CloudWatch. Implementation matches the original echo/index.ts pattern from PR #18: - SR worker collects timing metrics and logs 'Performance metrics' - Handler returns syncResponseMs and asyncResponseMs - Worker calculates E2E, queue wait, and total duration - Metrics logged for both success and failure cases Performance metrics fields: - totalE2eMs: API Gateway → final response (end-to-end) - workerDurationMs: Lambda execution time - queueWaitMs: Time message spent in SQS (calculated) - syncResponseMs: Sync Slack response time (from handler) - asyncResponseMs: Async Slack response time (from handler) - component: 'sr-worker' for CloudWatch filtering - correlationId, command, success, errorType, errorMessage Changes: - Removed artificial 2-second sleep delay from echo handler - Echo handler now returns HandlerResult with timing metrics - SR worker logs structured metrics via logWorkerMetrics() This restores server-side metrics collection after the quadrant-based refactor, enabling performance test analysis scripts to work correctly.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds comprehensive end-to-end (E2E) performance tracking and testing infrastructure for the Slack chatops bot, enabling systematic performance monitoring and regression testing.
Key Features
🎯 E2E Performance Tracking
📊 Artillery Load Testing
📈 Performance Report Dashboard
🔧 Testing Tools
Changes
New Files
performance-tests/: Complete testing infrastructure.env.example: Environment configuration templateREADME.md: Comprehensive testing documentationModified Files
src/router/index.ts: Added performance logging, correlation ID trackingsrc/workers/echo/index.ts: Added E2E metrics, removed 2s test delaysrc/shared/types.ts: Addedapi_gateway_start_timefor E2E trackingsrc/shared/slack-client.ts: Enhanced error handling with retry logicMakefile: Added perf-test targets (minimal/light/standard/full).gitignore: Excluded test results and.envfilesPerformance Impact
Removed artificial delay: Echo worker previously had a 2000ms sleep for "simulating async work" - now removed for realistic testing.
Current baseline (140 requests over 59s):
Usage
Test Coverage
Security
.env(gitignored)Screenshots
See attached HTML report screenshots showing:
🤖 Generated with Claude Code
Co-Authored-By: Claude noreply@anthropic.com