
Add E2E Performance Tracking and Testing Infrastructure#18

Merged
llama90 merged 15 commits into main from feat/e2e-performance-tracking
Jan 4, 2026
Conversation


llama90 (Contributor) commented Jan 3, 2026

Summary

Adds comprehensive end-to-end (E2E) performance tracking and testing infrastructure for the Slack chatops bot, enabling systematic performance monitoring and regression testing.


Key Features

🎯 E2E Performance Tracking

  • Added correlation ID tracking from API Gateway → Router → Worker
  • Implemented performance metrics logging at each stage:
    • Router: API Gateway to EventBridge duration
    • Worker: Queue wait time + processing duration
    • E2E: Total time from user request to worker completion
  • Captures detailed breakdown: sync response, queue wait, worker processing, async response
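The breakdown above can be sketched in TypeScript. This is illustrative, not the PR's actual code: the message field names (`correlation_id`, `api_gateway_start_time`) come from the PR description, while `computeWorkerMetrics` is a hypothetical helper standing in for the worker-side logging.

```typescript
// Illustrative worker-side E2E breakdown. Field names follow the PR
// description; the real logWorkerMetrics() implementation may differ.
interface WorkerMessage {
  correlation_id: string;
  api_gateway_start_time: number; // epoch ms, stamped by the router
}

interface WorkerMetrics {
  correlationId: string;
  totalE2eMs: number;      // API Gateway entry -> worker completion
  workerDurationMs: number;
  queueWaitMs: number;     // time spent in EventBridge/SQS before the worker ran
}

function computeWorkerMetrics(
  msg: WorkerMessage,
  workerStart: number, // epoch ms when the worker began processing
  workerEnd: number,   // epoch ms when the worker finished
): WorkerMetrics {
  const totalE2eMs = workerEnd - msg.api_gateway_start_time;
  const workerDurationMs = workerEnd - workerStart;
  return {
    correlationId: msg.correlation_id,
    totalE2eMs,
    workerDurationMs,
    // Queue wait is inferred: everything between API Gateway entry and
    // worker start that the worker itself did not spend.
    queueWaitMs: totalE2eMs - workerDurationMs,
  };
}
```

This matches the commit note below that queue wait is calculated as the difference between E2E and worker duration rather than measured directly.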

📊 Artillery Load Testing

  • Multiple test profiles: minimal (5 VUs), light (20 VUs), standard (50 VUs), full (100 VUs)
  • Automated test execution via Makefile targets
  • CloudWatch metrics collection during test window
  • Structured metrics output (JSON + HTML reports)

📈 Performance Report Dashboard

  • Interactive HTML reports with Chart.js visualizations:
    • Service Latency Distribution (Router/Worker/E2E comparison)
    • E2E Timeline Breakdown (stacked bar with percentages)
    • Throughput metrics (RPS, total processed)
    • Percentile charts (P50/P95/P99)
  • Test execution info: environment, timestamp, duration
  • Responsive UI: mobile/tablet/desktop support
  • E2E metrics emphasized with highlighted cards
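The P50/P95/P99 figures shown in these charts are standard order statistics. As an illustration only (the actual report relies on Artillery's aggregates and CloudWatch Insights percentiles), a nearest-rank percentile over recorded latency samples looks like this:

```typescript
// Nearest-rank percentile over latency samples, for illustration.
// Artillery's and CloudWatch's internal estimators may differ slightly.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank: smallest value such that at least p% of samples are <= it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

Note the report's own caveat below: each series' percentiles are computed independently, so a P95 for E2E is not the sum of the Router and Worker P95s.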

🔧 Testing Tools

  • Slack signature generation: Proper request signing for realistic tests
  • Mock response URL: Captures Slack webhook responses for validation
  • CloudWatch Insights integration: Automated metric queries
  • Result viewing: Quick HTML report viewer

Changes

New Files

  • performance-tests/: Complete testing infrastructure
    • Artillery YAML configs (4 profiles)
    • Analysis scripts (CloudWatch Insights queries)
    • Report rendering (HTML + Chart.js)
    • Helper scripts (curl test, signature generation)
  • .env.example: Environment configuration template
  • README.md: Comprehensive testing documentation

Modified Files

  • src/router/index.ts: Added performance logging, correlation ID tracking
  • src/workers/echo/index.ts: Added E2E metrics, removed 2s test delay
  • src/shared/types.ts: Added api_gateway_start_time for E2E tracking
  • src/shared/slack-client.ts: Enhanced error handling with retry logic
  • Makefile: Added perf-test targets (minimal/light/standard/full)
  • .gitignore: Excluded test results and .env files

Performance Impact

Removed artificial delay: the echo worker previously slept for 2000ms to "simulate async work"; this delay has been removed so tests measure realistic latency.

Current baseline (140 requests over 59s):

  • E2E Throughput: 2.36 req/s
  • E2E Avg Latency: 170ms (P50: 142ms, P95: 211ms, P99: 1352ms)
  • Router Avg: 40ms (P95: 71ms)
  • Worker Avg: 6ms (P95: 13ms)
  • Queue Wait: 169ms average (main bottleneck)

Usage

# Run minimal test (5 VUs)
make perf-test-minimal

# Run standard test (50 VUs)
make perf-test

# View latest results
./performance-tests/view-results.sh

Test Coverage

  • ✅ E2E request flow validation
  • ✅ Slack signature verification
  • ✅ Router → EventBridge → SQS → Worker flow
  • ✅ Concurrent load handling
  • ✅ CloudWatch metrics correlation
  • ✅ Error tracking and reporting

Security

  • All sensitive values (Slack signing secret) stored in SSM Parameter Store
  • Environment variables managed via .env (gitignored)
  • Test results and artifacts excluded from git
  • No hardcoded credentials or secrets

Screenshots

See attached HTML report screenshots showing:

  • Throughput metrics dashboard
  • Service latency comparisons
  • E2E component breakdown
  • Responsive layout on different screen sizes

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

llama90 and others added 11 commits January 1, 2026 21:39
- Add performance-tests directory with Artillery configurations
- Add CloudWatch Logs analysis scripts (analyze-performance.sh, analyze-e2e-json.sh)
- Add Slack signature processor for load testing
- Add test scenarios: echo-only, echo-light, full config

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add correlation_id and api_gateway_start_time to SlackCommand and WorkerMessage types
- Track request start time in router Lambda (API Gateway entry point)
- Calculate and log comprehensive performance metrics in echo worker:
  - totalE2eMs: End-to-end latency from API Gateway to worker completion
  - workerDurationMs: Worker processing time
  - queueWaitMs: SQS queue wait time (difference between E2E and worker duration)
  - syncResponseMs: Synchronous Slack response time
  - asyncResponseMs: Asynchronous Slack response time
- Enable CloudWatch Insights analysis of latency breakdown across system components

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Update analyze-performance.sh to query Performance metrics log
  - Replace correlation-based E2E calculation with structured metrics
  - Add component breakdown: queueWaitMs, workerDurationMs, syncResponseMs, asyncResponseMs
  - Update summary table to show actual metrics instead of estimates
  - Adjust Key Metrics thresholds based on real data

- Update analyze-e2e-json.sh to extract all performance metrics
  - Add syncResponseMs and asyncResponseMs to E2E query
  - Add p50 percentile for better distribution analysis
  - Filter by 'Performance metrics' message for accurate data

Both scripts now leverage the structured performance logging added in previous commit.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Create logRouterMetrics() helper function for router Lambda
  - Track statusCode, duration, success/failure
  - Log authentication errors (401) with error type
  - Log server errors (500) with error details
  - Log successful requests (200) with command info

- Create logWorkerMetrics() helper function for worker Lambda
  - Add 'success' boolean field to all performance metrics
  - Add 'errorType' and 'errorMessage' for failed requests
  - Log metrics even when processing fails
  - Always include correlationId and command when available

- Enable CloudWatch Insights queries for error analysis:
  - Error rate calculation: count(success=false) / count(*)
  - Error type distribution
  - Performance comparison: success vs failure cases
  - Router vs Worker error breakdown

Example queries enabled:
- fields success, errorType | filter message = "Performance metrics" | stats count() by success, errorType
- fields statusCode, duration | filter message = "Router performance metrics" | stats avg(duration) by statusCode

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Makefile improvements:
- Consolidate 11 perf-test targets into 5 clean commands
- Add PERF_PROFILE variable (minimal|light|full) with default=minimal
- Simplify command names: perf-test, perf-analyze, perf-summary, perf-report, perf-clean
- Fix perf-analyze-quiet to output JSON metrics file instead of suppressing everything
- Reduce code duplication: 116 lines → 64 lines (~45% fewer)

New minimal test profile (artillery-echo-minimal.yml):
- Duration: 60 seconds (vs 420s for light)
- Requests: ~121 (vs ~5,460 for light)
- Cost reduction: 97.8%
- Provides statistically valid P50/P95/P99 metrics
- Perfect for quick validation and CI/CD

Performance script improvements:
- Skip CloudWatch Metrics in quiet mode (prevents failures)
- Output only essential progress messages in quiet mode
- Generate .metrics.json file for programmatic access

Usage examples:
  make perf-test                    # Run minimal (1 min, 121 reqs)
  make perf-test PERF_PROFILE=light # Run light (7 min, 5,460 reqs)
  make perf-test PERF_PROFILE=full  # Run full (12 min, all commands)
  make perf-analyze                 # Analyze with full output
  make perf-analyze-quiet           # Analyze quietly, save to JSON
  make perf-summary                 # Quick summary

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Problem:
- Performance tests use fake Slack URLs
- Fake URLs return 404 errors
- Lambda fails before logging complete performance metrics

Solution:
- Use special mock URL: /test/perf-test-mock
- Worker detects and skips Slack API call for this URL
- Lambda completes successfully with full metrics

Changes:
- slack-signature-processor.js: Generate mock URL for tests
- slack-client.ts: Skip API call if URL contains /test/perf-test-mock
- No Lambda environment variable changes needed
- Real Slack URLs unaffected

Benefits:
- All e2e metrics logged: totalE2eMs, queueWaitMs, syncResponseMs, asyncResponseMs
- DLQ no longer fills with test failures
- Performance tests now generate complete data

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
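The mock-URL short circuit described in this commit can be sketched as follows, folding in the follow-up fix (#19 below) that tightened the check from includes() to startsWith(). Function names are illustrative; the real change lives in slack-client.ts, and `fetch` here stands in for whatever HTTP client the worker actually uses.

```typescript
// Illustrative sketch of the perf-test short circuit in slack-client.ts.
const MOCK_PATH = "/test/perf-test-mock";

function isPerfTestMockUrl(responseUrl: string): boolean {
  try {
    // startsWith on the parsed path, per the follow-up fix: a query string
    // containing the mock path must not trigger the skip.
    return new URL(responseUrl).pathname.startsWith(MOCK_PATH);
  } catch {
    return false; // not a parseable URL: treat as a real endpoint
  }
}

async function postToSlack(responseUrl: string, payload: unknown): Promise<void> {
  if (isPerfTestMockUrl(responseUrl)) {
    // Performance-test traffic: skip the real Slack call so the Lambda
    // completes, full metrics are logged, and the DLQ stays empty.
    return;
  }
  await fetch(responseUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
}
```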
- Enhanced E2E metrics visualization with highlighted cards
- Reorganized report layout: E2E metrics → Component breakdown → Service metrics
- Added responsive design for mobile/tablet (768px, 480px breakpoints)
- Implemented consistent color palette across all charts
- Added E2E Timeline breakdown chart with stacked bars and percentage display
- Improved chart tooltips with 'index' mode for better UX
- Added Service Latency Distribution chart with percentile comparison
- Removed MAX column from comparison (data not available for E2E)
- Added note about independent percentile calculations
- Created E2E Component Details table showing time breakdown
- Improved chart interaction: larger point radius, hover effects
- Removed 2000ms artificial delay from echo worker for realistic testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add capture-report.js script using Puppeteer to convert HTML reports to PNG
- Add 'make perf-capture' command to Makefile for easy screenshot generation
- Fix Artillery aggregate data parsing in analyze-e2e-json.sh
  - Use aggregate.counters instead of summary (Artillery v2 format)
  - Calculate errorRate, avgRps, and duration correctly
- Fix duplicate error counting in render-report.js
  - Only count top-level 'errors.*' to avoid duplicates
  - Artillery creates both 'errors.ETIMEDOUT' and 'scenario.errors.ETIMEDOUT'
- Add puppeteer as dev dependency for screenshot generation

Screenshot captures full HTML report at 1400px width with 2x scale for high quality.
Error counts now accurate (was showing 4406 instead of 2203 for timeouts).
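The double-counting fix can be illustrated with a small filter over Artillery's counters (shown in TypeScript for consistency, though render-report.js is plain JavaScript). The counter names mirror the ones quoted above; only top-level `errors.*` keys are summed, since Artillery also emits mirrored `scenario.errors.*` entries.

```typescript
// Sum only top-level "errors.*" counters; "scenario.errors.*" entries are
// duplicates of the same failures and must be excluded.
function countErrors(counters: Record<string, number>): number {
  return Object.entries(counters)
    .filter(([name]) => name.startsWith("errors."))
    .reduce((sum, [, count]) => sum + count, 0);
}
```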
- Add http.timeout: 30 to prevent connection timeouts during high load
- Add http.pool: 50 for connection pool management
- Fixes 2,203 ETIMEDOUT errors observed in previous test run
- Enables reliable testing at 40 req/s peak load
llama90 added the perf label Jan 3, 2026
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

This PR adds comprehensive end-to-end performance tracking and testing infrastructure for the Slack chatops bot, enabling systematic performance monitoring through correlation ID tracking, structured metrics logging, and automated Artillery-based load testing with CloudWatch integration.

Key Changes:

  • Correlation ID tracking throughout the request lifecycle (API Gateway → Router → Worker)
  • Structured performance metrics logging with E2E latency breakdown
  • Artillery load testing infrastructure with multiple test profiles (minimal/light/standard/full)
  • Interactive HTML performance reports with Chart.js visualizations

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 5 comments.

Summary per file:

  • src/workers/echo/index.ts: Added E2E metrics logging with latency breakdown (sync/async response, queue wait time)
  • src/router/index.ts: Implemented correlation ID tracking and router performance metrics
  • src/shared/types.ts: Added correlation_id and api_gateway_start_time fields for E2E tracking
  • src/shared/slack-client.ts: Added mock URL detection to skip Slack API calls during performance tests
  • performance-tests/analyze-performance.sh: CloudWatch Logs Insights queries for aggregating performance metrics
  • performance-tests/analyze-e2e-json.sh: JSON output generation for dashboard integration
  • performance-tests/view-results.sh: Terminal-based test results viewer
  • performance-tests/render-report.js: HTML report generator with Chart.js visualizations
  • performance-tests/slack-signature-processor.js: Artillery processor for Slack signature generation
  • performance-tests/artillery-*.yml: Multiple test configurations for different load profiles
  • Makefile: Added performance testing targets (perf-test, perf-analyze, perf-report)
  • package.json: Added puppeteer for screenshot capture functionality
  • .gitignore: Excluded test results and environment files
Files not reviewed (1)
  • applications/chatops/slack-bot/package-lock.json: Language not supported


- Changed all Korean comments to English for better code readability
- Improved international collaboration
- No functional changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Copilot AI commented Jan 3, 2026

@llama90 I've opened a new pull request, #19, to work on those changes. Once the pull request is ready, I'll request review from you.


- Initial plan
- Fix URL check to use startsWith instead of includes for better specificity

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: llama90 <6668548+llama90@users.noreply.github.com>

Copilot AI commented Jan 3, 2026

@llama90 I've opened a new pull request, #20, to work on those changes. Once the pull request is ready, I'll request review from you.

…ries (#20)

- Initial plan
- Replace hardcoded sleep with polling mechanism for CloudWatch Logs queries
- Improve error handling in wait_for_query_completion function

Co-authored-by: llama90 <6668548+llama90@users.noreply.github.com>

* fix: fix analyze-performance.sh timestamp and CloudWatch query issues

- Fix .metrics.json being selected instead of test results
- Fix timestamp conversion from milliseconds to seconds for AWS CLI
- Add dynamic CloudWatch period calculation to avoid 1440 datapoint limit
- Fix AWS statistics parameter format (space-separated instead of comma)
- Add error handling for CloudWatch queries

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: llama90 <6668548+llama90@users.noreply.github.com>
Co-authored-by: Hyunseok Seo <hsseo0501@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
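The polling change to wait_for_query_completion (a shell function in analyze-performance.sh) can be sketched generically, here in TypeScript for consistency with the rest of the repo. The status strings follow CloudWatch Logs Insights conventions; the `getStatus` callback is an assumption standing in for an `aws logs get-query-results` call.

```typescript
// Generic replacement for a hardcoded sleep: poll a status callback until
// the query completes, fails, or a deadline passes.
async function waitForCompletion(
  getStatus: () => Promise<string>,
  { intervalMs = 2000, timeoutMs = 60000 } = {},
): Promise<string> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const status = await getStatus();
    if (status === "Complete") return status;
    if (status === "Failed" || status === "Cancelled") {
      throw new Error(`query ended with status ${status}`);
    }
    if (Date.now() >= deadline) throw new Error("query timed out");
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Compared with a fixed sleep, this returns as soon as results are ready and surfaces failed or cancelled queries instead of silently reading empty output.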
llama90 merged commit 829c35b into main Jan 4, 2026
5 checks passed
llama90 deleted the feat/e2e-performance-tracking branch January 4, 2026 03:56
llama90 added a commit that referenced this pull request Jan 11, 2026
Restore structured performance metrics logging that was present in
the original echo worker, enabling E2E latency tracking and component
breakdown analysis in CloudWatch.

Implementation matches the original echo/index.ts pattern from PR #18:
- SR worker collects timing metrics and logs 'Performance metrics'
- Handler returns syncResponseMs and asyncResponseMs
- Worker calculates E2E, queue wait, and total duration
- Metrics logged for both success and failure cases

Performance metrics fields:
- totalE2eMs: API Gateway → final response (end-to-end)
- workerDurationMs: Lambda execution time
- queueWaitMs: Time message spent in SQS (calculated)
- syncResponseMs: Sync Slack response time (from handler)
- asyncResponseMs: Async Slack response time (from handler)
- component: 'sr-worker' for CloudWatch filtering
- correlationId, command, success, errorType, errorMessage

Changes:
- Removed artificial 2-second sleep delay from echo handler
- Echo handler now returns HandlerResult with timing metrics
- SR worker logs structured metrics via logWorkerMetrics()

This restores server-side metrics collection after the quadrant-based
refactor, enabling performance test analysis scripts to work correctly.
