Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
188 changes: 188 additions & 0 deletions docs/plans/FREE_TIER_SCALABILITY_EXECUTION_REQUIREMENTS_APRIL_2026.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,188 @@
# Free-Tier Scalability Execution Requirements (April 2026)

## Purpose

This document converts the Free-Tier Scalability Plan into an execution contract for an implementation teammate.

Primary outcome:
- Production-safe rollout for 500+ users on free-tier constraints.
- Clear Definition of Done with mandatory passing tests.

Source plan:
- docs/plans/FREE_TIER_SCALABILITY_PLAN_APRIL_2026.md

## Implementation Context Snapshot

Current state observed in codebase:
- Raw body parsing is already scoped to webhook routes.
- Webhook signature middleware does not fail closed when secret is missing.
- In-memory caches exist, but env values are parsed without strict clamp/validation.
- Webhook push handling is per-commit and emits repeated updates.
- Socket presence and delivery maps are process-local.
- Queue worker path and saturation metrics are not yet complete.

## Scope

In scope:
1. Phase 1 hardening: webhook fail-closed, config validation and clamps, pagination safety checks.
2. Phase 2 throughput improvements: commit aggregation, fanout reduction, reconnect efficiency, log-level guards.
3. Phase 3 async isolation: queue-backed heavy processing for webhook and AI architecture analysis.
4. Phase 4 guardrails: runtime metrics, latency, queue depth, alerts, and overload controls.
Comment on lines +26 to +30
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The document mixes phase naming schemes: Scope refers to “Phase 1–4”, but the execution sections are labeled “Phase A–D”. This makes it harder to cross-reference with the source plan; consider using one consistent phase numbering/lettering throughout (and ideally match the source plan’s Phase 1–4).

Copilot uses AI. Check for mistakes.

Out of scope for this cycle:
1. Full horizontal multi-instance socket state coordination.
2. Major domain model rewrites.
3. Vendor/platform migration.

## Work Plan (Teammate Execution)

## Phase A: Security and Memory Safety Gate

Tasks:
1. Make GitHub webhook verification fail closed in production if secret is absent.
2. Reject unsigned requests and invalid signatures with 401.
3. Add strict env validation utilities for cache and catchup settings:
- Positive integer only.
- Hard min and max bounds.
- Safe fallback defaults.
4. Apply validated values to:
- ARCHITECTURE_CACHE_MAX_ENTRIES
- ARCHITECTURE_CACHE_TTL_MS
- DELIVERY_CATCHUP_BATCH_SIZE
- DELIVERY_CATCHUP_MAX_BATCHES
5. Confirm pagination boundary caps for any endpoint accepting page size style inputs.

Acceptance criteria:
1. Production webhook endpoints cannot process events when secret/config is missing.
2. Invalid env values cannot disable pruning/TTL behavior.
3. Invalid signatures never enter heavy processing path.

## Phase B: Throughput per MB

Tasks:
1. Aggregate webhook commit effects by project and task before DB writes.
2. Replace per-commit projectUpdate fanout with one consolidated project update per webhook event.
3. Minimize repeated reads for project/step/task lookup during webhook bursts.
4. Add log-level checks so debug logs do not flood hot paths.

Acceptance criteria:
1. Push bursts trigger bounded DB operations instead of commit-amplified loops.
2. One webhook event yields one project-level broadcast per affected project.
3. p95 latency and peak memory improve under synthetic push bursts.

## Phase C: Async Queue Isolation

Tasks:
1. Move heavy webhook processing to internal queue worker path.
2. Move AI architecture analysis to async jobs with status endpoint.
3. Keep API handlers lightweight and immediately acknowledge accepted work.
4. Implement idempotency for webhook deliveries.

Acceptance criteria:
1. API remains responsive during webhook and AI spikes.
2. Queue backlog, retry state, and failure reasons are visible.
3. Duplicate webhook deliveries do not duplicate side effects.

## Phase D: Operational Guardrails

Tasks:
1. Expose metrics for rss, heapUsed, heapTotal, event-loop lag, route latency.
2. Expose queue depth and queue lag metrics.
3. Add pressure controls:
- Defer non-critical work when memory threshold is crossed.
- Preserve core auth/session/chat behavior.
4. Add alert thresholds tied to free-tier limits.

Acceptance criteria:
1. Saturation is detected before user-visible failure.
2. Degradation mode is controlled and reversible.

## Definition of Done

All items below are mandatory:
1. Phase A complete and merged.
2. Phase B complete and merged.
3. Queue-based processing path merged for at least webhook heavy work.
4. Metrics endpoint/logging added with memory, latency, and queue indicators.
Comment on lines +103 to +106
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phase C tasks include moving AI architecture analysis to async jobs, but the Definition of Done only mandates a queue-based path for webhook work (“at least webhook heavy work”). Either update DoD to explicitly require the AI offload too, or clarify that the AI portion of Phase C is optional/out of scope for this execution cycle so expectations are unambiguous.

Copilot uses AI. Check for mistakes.
5. Mandatory test gate passes.
6. Verification report attached to PR.

## Mandatory Test Gate (Must Pass)

Run from repository root:

1. Unit and integration baseline
- npm run test:jest

2. Type safety
- npm run typecheck

3. Frontend lint baseline
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The label “Frontend lint baseline” is misleading: npm run lint runs eslint . (repo-wide), not frontend-only. Rename this step (or split into separate frontend/backend lint commands if that’s the intent) so the test gate matches what actually runs.

Suggested change
3. Frontend lint baseline
3. Repository-wide lint baseline

Copilot uses AI. Check for mistakes.
- npm run lint

4. Build validation
- npm run build

Expected current Jest-discovered suites include:
1. backend/tests/user_sync_security.test.js
2. backend/tests/project_security.test.js
3. backend/tests/session_security.test.js
4. backend/tests/regexUtils.test.js

PR cannot be marked complete if any command above fails.

## Recommended Additional Tests for This Workstream

These should be added in the same implementation branch:
1. Webhook verification tests:
- Missing secret in production returns reject status.
- Missing signature rejected.
- Invalid signature rejected.
- Valid signature accepted.
2. Config clamp tests:
- NaN, zero, negative, huge values are clamped/fallback correctly.
3. Webhook aggregation tests:
- Multi-commit payload yields consolidated DB update behavior.
- Single projectUpdate event emitted per project per webhook payload.
4. Queue idempotency tests:
- Duplicate delivery id does not re-run side effects.

## Suggested PR Breakdown (Rollback Safe)

PR 1: Security and config hardening
1. Fail-closed webhook behavior.
2. Shared env clamp utility.
3. Tests for verification and clamp behavior.

PR 2: Webhook aggregation and fanout reduction
1. Consolidated processing path.
2. Socket emit dedup/consolidation.
3. Burst scenario tests.

PR 3: Queue offload for webhook and AI
1. Producer plus worker implementation.
2. Job status endpoints.
3. Idempotency tests.

PR 4: Metrics and overload guardrails
1. Runtime and queue metrics.
2. Alert thresholds and controlled degradation.
3. Smoke verification under synthetic load.

## PR Verification Template (Attach to Every PR)

1. Scope completed:
2. Non-goals untouched:
3. Risks introduced:
4. Test commands run:
5. Test results summary:
6. Rollback plan:
7. Follow-up tasks:

## Owner and Timeline

Owner: Assigned teammate
Reviewer: Engineering lead
Target window: 6 to 10 working days across four PRs

Status: Ready for execution
Loading