feat: Scale & Ops Maturity — agent-first infrastructure #86

Merged
jeremylongshore merged 3 commits into main from feat/scale-ops-maturity
Feb 20, 2026

Conversation

@jeremylongshore
Contributor

@jeremylongshore jeremylongshore commented Feb 20, 2026

Summary

  • Dependabot + SBOM (gwi-nx0): Grouped dependency updates with gwi-triage labels for agent auto-scoring; CycloneDX SBOM generation on release via anchore/sbom-action
  • VPC + Budget Alerts (gwi-keq): Enable VPC connector in prod, GCP billing budget with Pub/Sub alerts, agent-queryable GET /tenants/:tenantId/budget with recommendation field (normal/conserve/stop)
  • Circuit Breaker for LLM (gwi-d1k): Wrap LLM providers with existing CircuitBreaker + ExponentialBackoffRetryHandler, provider health registry singleton, selection policy skips providers with open circuits, GET /health/providers endpoint
  • Run Steps → Subcollection (gwi-o06): StepStore interface + Firestore/InMemory implementations, cursor-based pagination, GET /tenants/:tenantId/runs/:runId/steps and .../steps/:stepId API endpoints
  • Distributed Rate Limiting (gwi-5a6): Agent-queryable GET /tenants/:tenantId/quota with per-action rate limit status and recommendation field

All new API endpoints return agent-actionable recommendation fields so autonomous agents can make cost-aware and reliability-aware decisions.
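
As a rough illustration of how such a recommendation field might be derived and consumed, here is a minimal TypeScript sketch. The PR names the three values (normal/conserve/stop) but not the thresholds, so the 80% and 100% cut-offs and the function names `recommendFromUsage` and `shouldStartRun` are assumptions for illustration only.

```typescript
// Hypothetical sketch: the PR names the three recommendation values but not
// the thresholds, so the 80% / 100% cut-offs below are assumptions.
type Recommendation = "normal" | "conserve" | "stop";

function recommendFromUsage(
  percentUsed: number,
  conserveAt = 80, // assumed threshold
  stopAt = 100, // assumed threshold
): Recommendation {
  if (percentUsed >= stopAt) return "stop";
  if (percentUsed >= conserveAt) return "conserve";
  return "normal";
}

// An agent might gate expensive work on the returned value:
// "stop" blocks everything, "conserve" lets only critical work through.
function shouldStartRun(rec: Recommendation, isCritical: boolean): boolean {
  if (rec === "stop") return false;
  if (rec === "conserve") return isCritical;
  return true;
}
```

Keeping the thresholds as parameters would let each endpoint tune them per plan without changing how agents interpret the field.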

Test plan

  • npx tsc --noEmit passes for @gwi/core, apps/api, apps/gateway
  • 19 new tests pass: 10 step-store + 9 resilient-provider
  • YAML validation: dependabot.yml, sbom.yml valid
  • tofu fmt -check passes on budget.tf
  • npm run arv full gate pass
  • Verify /health/providers, /quota, /budget endpoints return valid JSON
  • Confirm Dependabot PRs arrive with gwi-triage label after merge

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Provider health monitoring endpoint
    • Budget tracking with spending alerts and recommendations
    • Rate limit quota management per action
    • Run steps tracking with pagination support
    • LLM provider resilience with circuit breaker and automatic retry
  • Infrastructure

    • Budget alerting system in production
    • VPC networking configuration
    • Automated dependency updates via Dependabot
    • SBOM generation for releases
  • Documentation

    • Agent ecosystem and integration strategy documentation

jeremylongshore and others added 3 commits February 18, 2026 23:01
…eries)

Five-document analysis covering agent ecosystem inventory, integration
strategy (standalone/medium/deep), engineering risks with mitigations,
Bob's Brain refactor options, and agent network vision.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Five audit-driven improvements making GWI infrastructure agent-operable:

- Dependabot + SBOM (gwi-nx0): grouped dependency updates with gwi-triage
  labels, CycloneDX SBOM generation on release
- VPC + Budget Alerts (gwi-keq): enable VPC connector, GCP billing budget
  with Pub/Sub alerts, agent-queryable GET /tenants/:tenantId/budget
- Circuit Breaker for LLM (gwi-d1k): wrap LLM providers with existing CB +
  retry, provider health registry, selection policy skips open circuits,
  GET /health/providers endpoint
- Run Steps Subcollection (gwi-o06): StepStore interface + Firestore/InMemory
  implementations, paginated step listing API endpoints
- Distributed Rate Limiting (gwi-5a6): agent-queryable quota endpoint
  GET /tenants/:tenantId/quota with per-action rate limit status

All endpoints return agent-actionable recommendation fields.
19 new tests pass (step-store + resilient-provider).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai bot commented Feb 20, 2026

Walkthrough

This PR introduces automated dependency management via Dependabot, SBOM generation, comprehensive strategic documentation (five new planning documents), new API endpoints for health monitoring, budget and quota tracking, run steps retrieval, LLM provider circuit breaker resilience with exponential backoff retry logic, and step-level storage abstraction with both in-memory and Firestore implementations.

Changes

| Cohort / File(s) | Summary |
|---|---|
| GitHub Automation & CI: .github/dependabot.yml, .github/workflows/sbom.yml, README.md | Adds Dependabot configuration for automated npm, Docker, and GitHub Actions updates on weekly schedule; introduces SBOM generation workflow on release/manual trigger with CycloneDX output and release asset upload. |
| Strategic Documentation: 000-docs/241-...agent-ecosystem-inventory.md, 000-docs/242-...automaton-integration-strategy.md, 000-docs/243-...engineering-risks-mitigations.md, 000-docs/244-...bob-refactor-analysis.md, 000-docs/245-...agent-network-vision.md | Five new comprehensive planning documents covering agent ecosystem inventory, Automaton integration options with roadmap, engineering risk/mitigation analysis, Bob's Brain refactor decision framework, and Intent Agent Network architecture vision. |
| API Health & Observability Routes: apps/api/src/index.ts, apps/api/src/routes/budget.ts, apps/api/src/routes/quota.ts | Adds GET /health/providers endpoint, GET /tenants/:tenantId/budget endpoint with plan-based limits and metering integration, and GET /tenants/:tenantId/quota endpoint with per-action rate limit aggregation and recommendations. |
| Run Steps Storage & Retrieval: apps/api/src/index.ts | Introduces GET /tenants/:tenantId/runs/:runId/steps (paginated listing with cursor) and GET /tenants/:tenantId/runs/:runId/steps/:stepId (single step fetch) endpoints with tenant/run ownership verification. |
| LLM Provider Resilience & Health Registry: packages/core/src/llm/provider-health.ts, packages/core/src/llm/resilient-provider.ts, packages/core/src/llm/index.ts, packages/core/src/llm/selection-policy.ts, packages/core/src/llm/__tests__/resilient-provider.test.ts | Adds circuit breaker-based health registry with per-provider state tracking, resilient provider wrapper with exponential backoff retry handler for transient errors, health-aware provider selection filtering, and comprehensive test coverage. |
| Step Store Abstraction & Implementations: packages/core/src/storage/interfaces.ts, packages/core/src/storage/inmemory.ts, packages/core/src/storage/firestore-step.ts, packages/core/src/storage/index.ts, packages/core/src/storage/__tests__/step-store.test.ts | Introduces StepStore interface and PaginatedResult type; adds InMemoryStepStore and FirestoreStepStore implementations supporting addStep, getStep, listSteps (with cursor pagination), and updateStepStatus operations; provides singleton accessor with backend selection. |
| Infrastructure Budget Alerting: infra/budget.tf, infra/envs/prod.tfvars | Adds Terraform budget alert infrastructure with Pub/Sub topic, Google Cloud Billing Budget resource gated by enable_budget_alerts flag, configurable thresholds, and alert outputs; enables VPC connector for prod with private-ranges-only egress. |

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client/Agent
    participant Resilient as Resilient Provider<br/>(Wrapper)
    participant CircuitBreaker as Circuit Breaker
    participant RetryHandler as Exponential Backoff<br/>Retry Handler
    participant OriginalProvider as Original LLM<br/>Provider
    participant HealthRegistry as Health Registry &<br/>Failure Tracking

    Client->>Resilient: completeJson(request)
    Resilient->>CircuitBreaker: execute(providerKey, retryFn)
    
    alt Circuit OPEN
        CircuitBreaker-->>Resilient: reject with circuit open
        Resilient->>HealthRegistry: recordProviderFailure()
        Resilient-->>Client: error
    else Circuit CLOSED or HALF_OPEN
        CircuitBreaker->>RetryHandler: retry(fn)
        
        loop Retry Logic (up to maxRetries)
            RetryHandler->>OriginalProvider: call provider
            alt Retryable Error (429, 5xx, network)
                OriginalProvider-->>RetryHandler: error
                RetryHandler->>HealthRegistry: recordProviderFailure()
                RetryHandler->>RetryHandler: exponential backoff + jitter
            else Success
                OriginalProvider-->>RetryHandler: response
                RetryHandler-->>CircuitBreaker: return response
            else Non-Retryable Error
                OriginalProvider-->>RetryHandler: error
                RetryHandler->>HealthRegistry: recordProviderFailure()
                RetryHandler-->>CircuitBreaker: rethrow error
            end
        end
        
        CircuitBreaker-->>Resilient: final result or error
        Resilient-->>Client: response or error
    end
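
The state transitions and the backoff step in the diagram above can be sketched in TypeScript as follows. This is an illustration only: `SketchCircuitBreaker`, `backoffDelayMs`, and `isRetryableStatus` are hypothetical names, not the `CircuitBreaker` or `ExponentialBackoffRetryHandler` from `@gwi/core`, and the failure threshold, reset timeout, and delay defaults are assumptions. The clock is passed in explicitly so the transitions are easy to test.

```typescript
// Illustrative only: these are NOT the CircuitBreaker or
// ExponentialBackoffRetryHandler classes from @gwi/core.
type SketchCircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

class SketchCircuitBreaker {
  private state: SketchCircuitState = "CLOSED";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;

  // Defaults are assumed values, not taken from the PR.
  constructor(failureThreshold = 3, resetTimeoutMs = 30_000) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
  }

  // An OPEN circuit moves to HALF_OPEN once the reset timeout has elapsed.
  getState(now: number): SketchCircuitState {
    if (this.state === "OPEN" && now - this.openedAt >= this.resetTimeoutMs) {
      this.state = "HALF_OPEN";
    }
    return this.state;
  }

  canAttempt(now: number): boolean {
    return this.getState(now) !== "OPEN";
  }

  recordSuccess(): void {
    this.state = "CLOSED";
    this.failures = 0;
  }

  recordFailure(now: number): void {
    this.failures += 1;
    // A failed probe in HALF_OPEN reopens the circuit immediately.
    if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = now;
    }
  }
}

// Exponential backoff with full jitter:
// a random delay in [0, min(maxMs, baseMs * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs = 200,
  maxMs = 10_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Retry only transient failures, mirroring the diagram's "429, 5xx" branch.
function isRetryableStatus(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}
```

A wrapper would check `canAttempt` before each call, sleep for `backoffDelayMs(attempt)` between retryable failures, and call `recordSuccess` or `recordFailure` with the outcome.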

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Through resilient hops we go,
With circuits dancing to and fro,
Steps stored, budgets tracked with care,
Health checks blooming everywhere,
The agent network's destiny clear,
A brighter, safer network near! 🌟

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat: Scale & Ops Maturity — agent-first infrastructure' clearly summarizes the main focus of the changeset, which spans multiple operational maturity improvements (Dependabot, SBOM, VPC, budget alerts, circuit breaker, step storage, and rate limiting) centered around enabling autonomous agent capabilities. |
| Description check | ✅ Passed | The PR description covers the required template sections: Summary (with five feature areas and their technical details), Type of Change (Infrastructure/New feature implied), Testing (with specific test commands and results), and addresses security/breaking changes implicitly. The description is comprehensive and well-structured, detailing the agent-first architecture focus. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 90.91% which is sufficient. The required threshold is 80.00%. |


@qodo-code-review

Review Summary by Qodo

Scale & Ops Maturity — Agent-first infrastructure with circuit breakers, step subcollections, and cost-aware APIs

✨ Enhancement 🧪 Tests


Walkthroughs

Description
• Circuit breaker for LLM providers with health registry and selection policy integration
• Run steps subcollection with Firestore/in-memory stores and paginated API endpoints
• Budget and rate limit quota endpoints for agent-aware cost and reliability decisions
• Dependabot configuration with SBOM generation for automated dependency management
• VPC connector enabled in production and budget alerts via Pub/Sub
• 19 new tests covering step store and resilient provider implementations
• Strategic documentation on Automaton integration, engineering risks, and agent network vision
Diagram
flowchart LR
  LLM["LLM Providers"]
  CB["Circuit Breaker<br/>Health Registry"]
  SP["Selection Policy<br/>Skip Open Circuits"]
  
  Runs["Run Document"]
  Steps["Steps Subcollection<br/>Firestore/InMemory"]
  StepAPI["GET /steps<br/>GET /steps/:id"]
  
  Tenant["Tenant Store"]
  Budget["Budget API<br/>GET /budget"]
  Quota["Quota API<br/>GET /quota"]
  
  Deps["Dependabot<br/>gwi-triage label"]
  SBOM["SBOM Generation<br/>CycloneDX"]
  
  LLM --> CB
  CB --> SP
  SP -.->|skip unhealthy| LLM
  
  Runs --> Steps
  Steps --> StepAPI
  
  Tenant --> Budget
  Tenant --> Quota
  
  Deps --> SBOM
  
  Budget -.->|recommendation| SP
  Quota -.->|recommendation| SP
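
The "skip unhealthy" edge in the flowchart above might look roughly like the following TypeScript sketch. The shapes here (`BreakerState`, `ProviderCandidate`, `selectProvider`, the `priority` field) are hypothetical; the real health registry and selection policy in packages/core/src/llm may expose different names and fields.

```typescript
// Hypothetical shapes: the real health registry and selection policy in
// packages/core/src/llm may expose different names and fields.
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

interface ProviderCandidate {
  key: string; // made-up identifier such as "provider-a"
  priority: number; // lower is preferred (assumed ordering rule)
}

function selectProvider(
  candidates: ProviderCandidate[],
  circuitStates: Map<string, BreakerState>,
): ProviderCandidate | undefined {
  return candidates
    // Providers with no recorded state are treated as healthy (CLOSED).
    .filter((c) => (circuitStates.get(c.key) ?? "CLOSED") !== "OPEN")
    // filter() returned a fresh array, so sorting does not mutate the input.
    .sort((a, b) => a.priority - b.priority)[0];
}
```

In this sketch HALF_OPEN providers stay eligible so that a probe request can close the circuit again; returning `undefined` when every circuit is open leaves the fallback decision to the caller.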


File Changes

1. apps/api/src/index.ts ✨ Enhancement +89/-0

Add provider health and step listing endpoints

apps/api/src/index.ts


2. apps/api/src/routes/budget.ts ✨ Enhancement +147/-0

New budget status API with agent recommendations

apps/api/src/routes/budget.ts


3. apps/api/src/routes/quota.ts ✨ Enhancement +132/-0

New rate limit quota API with per-action status

apps/api/src/routes/quota.ts


4. packages/core/src/llm/provider-health.ts ✨ Enhancement +180/-0

Provider health registry with circuit breaker state

packages/core/src/llm/provider-health.ts


5. packages/core/src/llm/resilient-provider.ts ✨ Enhancement +113/-0

Resilient wrapper with retry and circuit breaker

packages/core/src/llm/resilient-provider.ts


6. packages/core/src/llm/index.ts ✨ Enhancement +19/-0

Export provider health and resilient provider APIs

packages/core/src/llm/index.ts


7. packages/core/src/llm/selection-policy.ts ✨ Enhancement +7/-0

Skip providers with open circuit breakers

packages/core/src/llm/selection-policy.ts


8. packages/core/src/storage/interfaces.ts ✨ Enhancement +40/-0

Define StepStore interface and PaginatedResult type

packages/core/src/storage/interfaces.ts


9. packages/core/src/storage/firestore-step.ts ✨ Enhancement +153/-0

Firestore implementation of step subcollection store

packages/core/src/storage/firestore-step.ts


10. packages/core/src/storage/inmemory.ts ✨ Enhancement +75/-0

In-memory implementation of step store with pagination

packages/core/src/storage/inmemory.ts


11. packages/core/src/storage/index.ts ✨ Enhancement +38/-2

Export step store implementations and getter function

packages/core/src/storage/index.ts


12. packages/core/src/llm/__tests__/resilient-provider.test.ts 🧪 Tests +195/-0

Tests for circuit breaker and retry resilience

packages/core/src/llm/__tests__/resilient-provider.test.ts


13. packages/core/src/storage/__tests__/step-store.test.ts 🧪 Tests +123/-0

Tests for step store pagination and status updates

packages/core/src/storage/__tests__/step-store.test.ts


14. .github/dependabot.yml ⚙️ Configuration changes +77/-0

Dependabot configuration with gwi-triage labels

.github/dependabot.yml


15. .github/workflows/sbom.yml ⚙️ Configuration changes +70/-0

SBOM generation workflow on release

.github/workflows/sbom.yml


16. infra/budget.tf ⚙️ Configuration changes +93/-0

GCP billing budget with Pub/Sub alert notifications

infra/budget.tf


17. infra/envs/prod.tfvars ⚙️ Configuration changes +10/-2

Enable VPC connector and budget alerts in production

infra/envs/prod.tfvars


18. README.md 📝 Documentation +4/-0

Add CI, Tests, and ARV status badges

README.md


19. 000-docs/241-PP-RMAP-agent-ecosystem-inventory.md 📝 Documentation +217/-0

Comprehensive inventory of Intent Solutions agent projects

000-docs/241-PP-RMAP-agent-ecosystem-inventory.md


20. 000-docs/242-PP-RMAP-automaton-integration-strategy.md 📝 Documentation +195/-0

Three integration strategies for Automaton with recommendations

000-docs/242-PP-RMAP-automaton-integration-strategy.md


21. 000-docs/243-PP-RMAP-engineering-risks-mitigations.md 📝 Documentation +197/-0

Risk assessment and mitigation strategies for Automaton

000-docs/243-PP-RMAP-engineering-risks-mitigations.md


22. 000-docs/244-PP-RMAP-bob-refactor-analysis.md 📝 Documentation +158/-0

Language and framework options for Bob's Brain refactor

000-docs/244-PP-RMAP-bob-refactor-analysis.md


23. 000-docs/245-PP-RMAP-agent-network-vision.md 📝 Documentation +200/-0

Long-term vision for Intent Agent Network economy

000-docs/245-PP-RMAP-agent-network-vision.md



@github-actions

🤖 Gemini 2.5 Flash Code Review

Files Reviewed: 15

Summary

This PR introduces significant "agent-first" infrastructure for scaling and operational maturity, including Dependabot configuration, SBOM generation, new API endpoints for LLM provider health, budget, and rate limits, and a new StepStore for granular run step management. The changes enhance observability, resilience, and agent decision-making capabilities.

Critical Issues 🔴

  1. Security: Inconsistent Authorization for Step Endpoints

    • File: apps/api/src/index.ts
    • Lines: 1753, 1772
    • Issue: The new /tenants/:tenantId/runs/:runId/steps and /tenants/:tenantId/runs/:runId/steps/:stepId endpoints correctly use authMiddleware, tenantAuthMiddleware, and requirePermission('run:read'). However, the existing /tenants/:tenantId/runs/:runId endpoint (line 1753) only uses authMiddleware and tenantAuthMiddleware, lacking the explicit requirePermission('run:read'). This inconsistency could lead to an authorization bypass for the full run document if tenantAuthMiddleware doesn't implicitly enforce run:read or if its implementation changes.
    • Action: Add requirePermission('run:read') to the /tenants/:tenantId/runs/:runId endpoint for consistency and explicit security.
  2. Correctness: getEngine() Call in Step Endpoints

    • File: apps/api/src/index.ts
    • Lines: 1763, 1782
    • Issue: Both new step endpoints call getEngine().getRun(tenantId, runId) solely to verify that the run belongs to the tenant and exists. This is an unnecessary database read if the tenantAuthMiddleware already verifies tenant ownership for the runId parameter. If tenantAuthMiddleware only verifies tenantId and not the relationship between tenantId and runId, then this check is necessary but potentially inefficient (see Performance).
    • Action: Clarify the responsibilities of tenantAuthMiddleware. If it already ensures req.params.runId belongs to req.params.tenantId, remove the redundant getEngine().getRun() call. If not, consider enhancing tenantAuthMiddleware or creating a specific runAuthMiddleware to handle this relationship more efficiently, potentially caching the run object if it's needed later in the request.
  3. Correctness: InMemoryStepStore Cursor Logic

    • File: packages/core/src/storage/inmemory.ts
    • Lines: 1079-1083
    • Issue: The cursor logic for InMemoryStepStore.listSteps might be incorrect. allSteps.findIndex(s => s.id === opts.cursor) finds the first occurrence. If allSteps is not guaranteed to be sorted by id (which it isn't, it's Array.from(runSteps.values())), then allSteps.slice(cursorIdx + 1) might skip or include incorrect items, leading to inconsistent pagination results. Firestore's startAfter relies on the orderBy clause.
    • Action: Ensure allSteps is consistently sorted by the same field used for the cursor (e.g., startedAt as in FirestoreStepStore) before applying the cursor logic. This might require adding a sort call: allSteps.sort((a, b) => a.startedAt.getTime() - b.startedAt.getTime()); or similar.
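
The fix proposed in Critical Issue 3 (sort first, then apply the cursor) could be sketched as below. `StepRecord`, `StepPage`, and `listStepsPage` are illustrative names and the fields are assumptions, not the PR's actual Step type or `InMemoryStepStore` code. Treating an unknown cursor as "start from the top" is also an assumption; a real store might prefer to reject it.

```typescript
// Minimal sketch; StepRecord and listStepsPage are illustrative names, and the
// fields are assumptions rather than the PR's actual Step type.
interface StepRecord {
  id: string;
  startedAt: Date;
}

interface StepPage {
  items: StepRecord[];
  nextCursor: string | null;
}

function listStepsPage(
  steps: Iterable<StepRecord>,
  opts: { limit: number; cursor?: string },
): StepPage {
  // Sort a copy by startedAt, breaking ties on id so the order is total
  // and stable from one request to the next.
  const ordered = Array.from(steps).sort(
    (a, b) =>
      a.startedAt.getTime() - b.startedAt.getTime() || a.id.localeCompare(b.id),
  );

  // Resume strictly after the cursor; an unknown cursor starts from the top.
  const cursorIdx = opts.cursor
    ? ordered.findIndex((s) => s.id === opts.cursor)
    : -1;
  const start = cursorIdx + 1;
  const items = ordered.slice(start, start + opts.limit);

  const last = items[items.length - 1];
  const hasMore = start + items.length < ordered.length;
  return { items, nextCursor: hasMore && last ? last.id : null };
}
```

Sorting on every call costs O(n log n), which is usually acceptable for an in-memory test double; the point is that the cursor is resolved against the same ordering that produced it.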

Suggestions 🟡

  1. Architecture: Centralize API Route Definitions

    • File: apps/api/src/index.ts
    • Lines: 684, 693, 1759, 1778
    • Issue: New routes are being added directly to apps/api/src/index.ts. While budget.ts and quota.ts are correctly extracted into separate modules, the /health/providers and /tenants/:tenantId/runs/:runId/steps routes are not. As the API grows, index.ts will become a monolithic file.
    • Action: Extract the /health/providers route into a health.ts module and the /tenants/:tenantId/runs/:runId/steps routes into a run-steps.ts module (or similar) within apps/api/src/routes/. Then, import and app.use() these routers in index.ts, similar to how createBudgetRoutes() and createQuotaRateLimitRoutes() are handled.
  2. Performance: N+1 Query Potential in Step Endpoints

    • File: apps/api/src/index.ts
    • Lines: 1763, 1782
    • Issue: As noted in Critical Issue #2, the getEngine().getRun(tenantId, runId) call performs a database read for each request to the step endpoints. If tenantAuthMiddleware doesn't already load and cache the run, this is an N+1 query scenario where N is the number of step requests for a given run.
    • Action: If the getRun call is deemed necessary for authorization, consider modifying tenantAuthMiddleware to load the Run object once and attach it to req.context or req.run for subsequent use within the request lifecycle. This would turn N separate getRun calls into 1 per request.
  3. Correctness: estimatedCostCents Calculation Fallback

    • File: apps/api/src/routes/budget.ts
    • Lines: 100, 103
    • Issue: When isMeteringEnabled() is false, estimatedCostCents is calculated as runsUsed * 10. The comment suggests this is a "Rough cost estimate: $0.10 per run average". This hardcoded value might become outdated or inaccurate.
    • Action: Consider making the $0.10 per run a configurable constant (e.g., in packages/core/src/config.ts or an environment variable) to allow for easier updates and better transparency.
  4. Correctness: PLAN_BUDGET_CENTS for 'free' tier

    • File: apps/api/src/routes/budget.ts
    • Line: 48
    • Issue: PLAN_BUDGET_CENTS.free is set to 0. This means percentUsed calculation for free tier will always be 0 if budgetLimitCents is used, or it will fall back to runsUsed / runsLimit if budgetLimitCents is 0. This is logically sound, but it might be clearer to explicitly handle the free tier's budget calculation if it's truly "unlimited" in terms of cost but limited by runs.
    • Action: Add a comment explaining the free: 0 budget and how percentUsed is derived for the free tier (i.e., primarily based on runsLimit).
  5. Architecture: Dynamic Imports in API Routes

    • File: apps/api/src/index.ts
    • Lines: 696, 1766, 1785
    • Issue: The use of await import('@gwi/core') inside route handlers can introduce a slight overhead on the first request to that route as the module is loaded. While this is often acceptable for less frequently accessed routes or to reduce initial bundle size, for core functionality like health checks or run steps, it might be better to import these modules at the top level of index.ts if they are always needed.
    • Action: Evaluate the performance impact. If these modules are frequently accessed, consider moving the import statements to the top of index.ts to ensure they are loaded at application startup. For budget.ts and quota.ts, the imports are already at the top of their respective files, which is good.
  6. Security: Dependabot Auto-Merge Scope

    • File: .github/dependabot.yml
    • Lines: 15-17
    • Issue: The comment mentions "reviewer agent can auto-merge patch/minor". While the groups section is defined, there's no explicit auto-merge configuration in the Dependabot YAML itself. This implies the auto-merge logic is handled by an external agent based on the gwi-triage label. It's crucial that this external agent is robustly secured and configured to only auto-merge minor and patch updates, and only after CI checks pass.
    • Action: Ensure the external reviewer agent has strict policies to only auto-merge minor and patch updates, and only if all required CI checks (including security scans) pass. Document this policy clearly.
  7. Correctness: SBOM Action Version Pinning

    • File: .github/workflows/sbom.yml
    • Line: 34
    • Issue: The anchore/sbom-action@v0 action is pinned to v0. While actions/checkout@v4 and actions/setup-node@v4 are pinned to specific SHAs, v0 for sbom-action is less specific and could potentially introduce breaking changes if v0 is updated by Anchore.
    • Action: Pin anchore/sbom-action to a specific SHA for better supply chain security and reproducibility, similar to the other actions.
  8. Architecture: COLLECTIONS.STEPS Definition

    • File: packages/core/src/storage/firestore-step.ts
    • Line: 23
    • Issue: The COLLECTIONS.STEPS constant is used for the subcollection name. It's good that it's a constant, but its definition is not shown in the PR. It should be defined in firestore-client.ts alongside other collection names for consistency.
    • Action: Verify that COLLECTIONS.STEPS is correctly defined in packages/core/src/storage/firestore-client.ts and follows the naming conventions for other collections.
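
Suggestions 3 and 4 above (a configurable per-run estimate, and an explicit free-tier path for percentUsed) could be combined along these lines. This is a sketch under assumptions: the field names echo the review comments, but `UsageSnapshot`, `costPerRunCents`, `estimateCostCents`, and `percentUsedFor` are hypothetical, as is any environment variable used to supply the raw value.

```typescript
// Sketch only: field names echo the review comments, but the types, function
// names, and configuration source are assumptions.
interface UsageSnapshot {
  runsUsed: number;
  runsLimit: number;
  budgetLimitCents: number; // 0 for the free plan, per the review comment
  meteredCostCents?: number; // present only when metering is enabled
}

// Parse the per-run estimate from configuration, defaulting to the 10 cents
// ($0.10 per run) mentioned in the review.
function costPerRunCents(raw: string | undefined, fallback = 10): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : fallback;
}

// Prefer metered cost when available; otherwise estimate from run count.
function estimateCostCents(u: UsageSnapshot, perRunCents: number): number {
  return u.meteredCostCents ?? u.runsUsed * perRunCents;
}

function percentUsedFor(u: UsageSnapshot, perRunCents: number): number {
  // With a budget, measure spend; without one (free tier), measure runs.
  if (u.budgetLimitCents > 0) {
    return (estimateCostCents(u, perRunCents) / u.budgetLimitCents) * 100;
  }
  return u.runsLimit > 0 ? (u.runsUsed / u.runsLimit) * 100 : 0;
}
```

A caller could pass something like `costPerRunCents(process.env.COST_PER_RUN_CENTS)`, where the variable name is made up for this example.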

Good Practices ✅

  1. Agent-First Design: The PR clearly articulates an "agent-first" approach, providing dedicated endpoints and health checks for agents to make informed decisions (e.g., /health/providers, /tenants/:tenantId/budget, /tenants/:tenantId/quota, and the StepStore). This is excellent for building a scalable and resilient platform.
  2. Robust Dependabot Configuration: The dependabot.yml is well-structured, covers multiple package ecosystems (npm, Docker, GitHub Actions), uses groups for npm updates, and applies consistent labels and commit message prefixes. This significantly improves dependency management and security.
  3. SBOM Generation: The sbom.yml workflow for generating CycloneDX SBOMs on release is a strong step towards supply chain security and compliance, allowing agents to audit dependencies. The use of workflow_dispatch for manual triggering is also a good practice.
  4. Circuit Breaker and Retry Pattern: The implementation of resilient-provider.ts and provider-health.ts for LLM providers demonstrates a mature approach to handling external service dependencies. The global circuit breaker and retry logic enhance the system's resilience against transient failures and provider outages.
  5. Clear Separation of Concerns: The new budget.ts and quota.ts modules for API routes, and the StepStore interface with InMemoryStepStore and FirestoreStepStore implementations, show good architectural design and separation of concerns.
  6. Comprehensive Testing: The addition of dedicated test files (resilient-provider.test.ts, step-store.test.ts) for new core logic and storage

Powered by Gemini 2.5 Flash

@gemini-code-assist

Summary of Changes

Hello @jeremylongshore, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly advances the platform's 'agent-first' infrastructure by introducing several key features focused on scalability, operational maturity, and autonomous agent decision-making. The changes enable agents to make more intelligent, cost-aware, and reliability-aware choices by providing them with real-time data on budget usage, LLM provider health, and rate limits. Additionally, it lays foundational documentation for the broader agent ecosystem and strategic roadmap.

Highlights

  • Dependabot + SBOM Integration: Implemented Dependabot configuration to automate dependency updates, grouping them and applying gwi-triage labels for agent auto-scoring. CycloneDX SBOM generation is now configured to run on release via anchore/sbom-action.
  • VPC + Budget Alerts: Enabled VPC connector in production and configured GCP billing budget alerts with Pub/Sub integration. A new agent-queryable API endpoint GET /tenants/:tenantId/budget was added, providing current spend, limits, and a recommendation field (normal/conserve/stop) to guide cost-aware agent decisions.
  • Circuit Breaker for LLM Providers: Integrated CircuitBreaker and ExponentialBackoffRetryHandler to wrap LLM providers, enhancing reliability. A provider health registry singleton was introduced, and the selection policy now skips providers with open circuits. A GET /health/providers endpoint was added to expose provider health status.
  • Run Steps as Subcollection: Introduced a StepStore interface with Firestore and InMemory implementations to manage run steps as a subcollection. New API endpoints GET /tenants/:tenantId/runs/:runId/steps and .../steps/:stepId support cursor-based pagination for agent-first querying of individual steps.
  • Distributed Rate Limiting: Added an agent-queryable API endpoint GET /tenants/:tenantId/quota which provides per-action rate limit status and a recommendation field, enabling autonomous agents to make reliability-aware decisions based on their current quota.
  • Agent Ecosystem Documentation: New documentation was added outlining the agent ecosystem inventory, a strategic analysis for Automaton integration, engineering risks and mitigations for AI agents, a refactor analysis for Bob's Brain, and the long-term vision for the Intent Agent Network.
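
For the quota endpoint described in the highlights above, one plausible way to fold per-action rate limit status into a single recommendation is to report the most severe per-action result. The response shape, the 20% "conserve" rule, and the names `ActionQuota`, `adviseForAction`, and `overallAdvice` are assumptions for illustration; the real /quota payload may differ.

```typescript
// Hypothetical response shape for illustration; the real /quota payload may differ.
type QuotaAdvice = "normal" | "conserve" | "stop";

interface ActionQuota {
  action: string; // made-up action names are used in examples
  limit: number;
  remaining: number;
}

const ADVICE_RANK: Record<QuotaAdvice, number> = {
  normal: 0,
  conserve: 1,
  stop: 2,
};

function adviseForAction(q: ActionQuota): QuotaAdvice {
  if (q.remaining <= 0) return "stop";
  // Assumed rule: start conserving once less than 20% of the window remains.
  return q.remaining / q.limit < 0.2 ? "conserve" : "normal";
}

// The overall recommendation is the most severe per-action one.
function overallAdvice(quotas: ActionQuota[]): QuotaAdvice {
  return quotas
    .map((q) => adviseForAction(q))
    .reduce<QuotaAdvice>(
      (worst, next) => (ADVICE_RANK[next] > ADVICE_RANK[worst] ? next : worst),
      "normal",
    );
}
```

Taking the worst case is deliberately conservative: an agent that only needs one action could instead read that action's own status.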
Changelog
  • .github/dependabot.yml
    • Added Dependabot configuration for npm, Docker, and GitHub Actions dependencies.
    • Configured Dependabot to apply gwi-triage labels to PRs for automated processing.
  • 000-docs/241-PP-RMAP-agent-ecosystem-inventory.md
    • Added a new document detailing a comprehensive inventory of agent projects across the Intent Solutions ecosystem.
  • 000-docs/242-PP-RMAP-automaton-integration-strategy.md
    • Added a new document evaluating three integration strategies for Conway's Automaton and recommending a deep integration approach.
  • 000-docs/243-PP-RMAP-engineering-risks-mitigations.md
    • Added a new document identifying, ranking, and proposing mitigations for engineering risks related to Automaton integration.
  • 000-docs/244-PP-RMAP-bob-refactor-analysis.md
    • Added a new document analyzing language/framework options for refactoring Bob's Brain, recommending to stay with Python for now.
  • 000-docs/245-PP-RMAP-agent-network-vision.md
    • Added a new document articulating the long-term vision for the Intent Agent Network, focusing on agent coordination and economy.
  • README.md
    • Updated README to include CI/CD, Tests, and ARV workflow badges.
  • apps/api/src/index.ts
    • Added a new GET endpoint /health/providers to expose LLM provider circuit breaker states.
    • Introduced new GET endpoints /tenants/:tenantId/runs/:runId/steps and .../steps/:stepId for querying run steps as a subcollection with pagination.
  • apps/api/src/routes/budget.ts
    • Added a new file defining API routes for budget status, including an agent-actionable recommendation.
  • apps/api/src/routes/quota.ts
    • Added a new file defining API routes for rate limit quota status, including an agent-actionable recommendation.
  • infra/budget.tf
    • Added new Terraform configuration to provision GCP billing budget alerts and a Pub/Sub topic for notifications.
  • infra/envs/prod.tfvars
    • Enabled VPC connector for production deployments.
    • Configured budget alerts for production with a monthly budget and alert thresholds.
  • packages/core/src/llm/__tests__/resilient-provider.test.ts
    • Added new test file for the resilient LLM provider wrapper, verifying circuit breaker and retry logic.
  • packages/core/src/llm/index.ts
    • Exported new modules for LLM provider health and resilient provider wrapping.
  • packages/core/src/llm/provider-health.ts
    • Added a new file implementing a singleton health registry for LLM providers, including circuit breaker state management and health reporting.
  • packages/core/src/llm/resilient-provider.ts
    • Added a new file providing a wrapper for LLM providers that integrates circuit breaker and exponential backoff retry logic.
  • packages/core/src/llm/selection-policy.ts
    • Modified the LLM provider selection policy to exclude providers with open circuit breakers.
  • packages/core/src/storage/__tests__/step-store.test.ts
    • Added new test file for the InMemoryStepStore, verifying step addition, retrieval, pagination, and status updates.
  • packages/core/src/storage/firestore-step.ts
    • Added a new file implementing the StepStore interface using Firestore subcollections for run steps.
  • packages/core/src/storage/index.ts
    • Updated the storage index to export InMemoryStepStore and FirestoreStepStore.
    • Modified getStepStore function to provide the correct step store implementation based on configuration.
  • packages/core/src/storage/inmemory.ts
    • Updated the in-memory storage implementation to include InMemoryStepStore for managing run steps.
  • packages/core/src/storage/interfaces.ts
    • Extended storage interfaces to include StepStore and PaginatedResult for managing run steps as a subcollection.
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/sbom.yml
Activity
  • Confirmed npx tsc --noEmit passes for @gwi/core, apps/api, apps/gateway.
  • Verified 19 new tests pass: 10 for step-store and 9 for resilient-provider.
  • Validated YAML for dependabot.yml and sbom.yml.
  • Ensured tofu fmt -check passes on budget.tf.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This is a substantial pull request that introduces several significant features under the umbrella of "Scale & Ops Maturity." The changes, including the addition of Dependabot for automated dependency management, GCP budget alerts, a circuit breaker for LLM providers, and the refactoring of run steps into a subcollection, are well-implemented and significantly enhance the project's robustness, observability, and operational maturity. The new API endpoints are thoughtfully designed to be agent-actionable. My review focuses on a few minor areas where maintainability can be further improved by reducing hardcoded values and configuration duplication.

Comment on lines +25 to +68
  # Docker — apps/api
  - package-ecosystem: docker
    directory: "/apps/api"
    schedule:
      interval: weekly
    labels: ["dependencies", "docker", "gwi-triage"]
    commit-message:
      prefix: "chore(deps)"

  # Docker — apps/gateway
  - package-ecosystem: docker
    directory: "/apps/gateway"
    schedule:
      interval: weekly
    labels: ["dependencies", "docker", "gwi-triage"]
    commit-message:
      prefix: "chore(deps)"

  # Docker — apps/worker
  - package-ecosystem: docker
    directory: "/apps/worker"
    schedule:
      interval: weekly
    labels: ["dependencies", "docker", "gwi-triage"]
    commit-message:
      prefix: "chore(deps)"

  # Docker — apps/github-webhook
  - package-ecosystem: docker
    directory: "/apps/github-webhook"
    schedule:
      interval: weekly
    labels: ["dependencies", "docker", "gwi-triage"]
    commit-message:
      prefix: "chore(deps)"

  # Docker — apps/webhook-receiver
  - package-ecosystem: docker
    directory: "/apps/webhook-receiver"
    schedule:
      interval: weekly
    labels: ["dependencies", "docker", "gwi-triage"]
    commit-message:
      prefix: "chore(deps)"

medium

The configuration for the Docker package ecosystem is repeated for each application (apps/api, apps/gateway, etc.). This repetition makes the file harder to maintain, as any change to the schedule or labels needs to be applied in multiple places.

To improve maintainability and reduce duplication, you can use YAML anchors. Define a common configuration block once and then reuse it for each Docker entry.

Here's an example of how you could refactor it:

# Define a reusable config block at the top of your file
x-docker-dependabot-config: &docker-config
  package-ecosystem: docker
  schedule:
    interval: weekly
  labels: ["dependencies", "docker", "gwi-triage"]
  commit-message:
    prefix: "chore(deps)"

updates:
  # ... (npm and other configs)

  # Docker — apps/api
  - <<: *docker-config
    directory: "/apps/api"

  # Docker — apps/gateway
  - <<: *docker-config
    directory: "/apps/gateway"

  # ... and so on for other Docker services

const status = await meteringService.getPlanUsageStatus(tenantId);
runsUsed = status.plan.run_limit - status.runs_remaining;
// Rough cost estimate: $0.10 per run average
estimatedCostCents = runsUsed * 10;

medium

The value 10 used for the rough cost estimate is a magic number. It appears again on lines 101 and 104. Using a named constant would improve readability and make it easier to update this value in the future.

I suggest defining a constant at the top of the file:

const ROUGH_COST_PER_RUN_CENTS = 10;

And then using it here and in the other locations:

estimatedCostCents = runsUsed * ROUGH_COST_PER_RUN_CENTS;
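
One way to apply this suggestion, sketched under assumptions: the three `runsUsed * 10` sites can collapse into a single helper built on the named constant. The helper names `estimateCostCents` and `computePercentUsed` are hypothetical and do not appear in the PR; the percentage formula mirrors the one in budget.ts.

```typescript
// Hypothetical constant and helpers; the names are illustrative, not from the PR.
const ROUGH_COST_PER_RUN_CENTS = 10; // $0.10 per run, the PR's rough average

function estimateCostCents(runsUsed: number): number {
  return runsUsed * ROUGH_COST_PER_RUN_CENTS;
}

// Mirrors the budget route's formula: cost-based when the plan has a budget,
// otherwise run-count-based, and always capped at 100.
function computePercentUsed(
  runsUsed: number,
  runsLimit: number,
  budgetLimitCents: number,
): number {
  if (budgetLimitCents > 0) {
    const ratio = estimateCostCents(runsUsed) / budgetLimitCents;
    return Math.min(100, Math.round(ratio * 100));
  }
  return runsLimit > 0
    ? Math.min(100, Math.round((runsUsed / runsLimit) * 100))
    : 0;
}
```

With the PR's team-plan numbers (500 runs, 5000 cents), 400 runs cost an estimated 4000 cents, which is 80% of the budget and lands exactly on the `conserve` threshold.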

*/
export function getProviderHealth(): ProviderHealthReport {
const cb = getLLMCircuitBreaker();
const providerTypes: LLMProviderType[] = ['anthropic', 'google', 'openai', 'vertex', 'openai_compat'];

medium

The list of providerTypes is hardcoded. This can lead to maintenance issues, as new providers added to the system won't be automatically included in health reports. To make this more robust and extensible, you should dynamically retrieve the list of providers from the providerRegistry.

You'll need to import it first:

import { providerRegistry } from './index.js';

Then you can use it to get the list of providers.

Suggested change
- const providerTypes: LLMProviderType[] = ['anthropic', 'google', 'openai', 'vertex', 'openai_compat'];
+ const providerTypes: LLMProviderType[] = providerRegistry.list();

greptile-apps bot commented Feb 20, 2026

Greptile Summary

This PR implements critical scale and ops maturity features for agent-first infrastructure: circuit breakers for LLM providers, step subcollection storage, budget/quota APIs, and Dependabot automation.

Key Changes

  • LLM Circuit Breaker (gwi-d1k): Wraps providers with circuit breaker + retry logic, exposes health via GET /health/providers
  • Step Subcollection (gwi-o06): Moves run steps to Firestore subcollection with cursor pagination via GET /tenants/:id/runs/:id/steps
  • Budget & Quota APIs (gwi-keq, gwi-5a6): Agent-queryable endpoints with recommendation fields (normal/conserve/stop/throttle)
  • Infrastructure: Dependabot with gwi-triage labels, SBOM generation, GCP budget alerts, VPC connector in prod
  • Documentation: Five new PP-RMAP strategic docs (241-245)

Critical Issues

  • Budget and quota route files created but not mounted in apps/api/src/index.ts — the /tenants/:tenantId/budget and /tenants/:tenantId/quota endpoints won't be accessible until integrated
  • Test coverage for circuit breaker trip behavior could be more comprehensive

Positive Highlights

  • Circuit breaker integration with selection policy is well-designed
  • Step store pagination correctly implements cursor-based approach
  • All new features include agent-actionable recommendation fields as promised
  • Test coverage is solid (19 new tests)
  • Infrastructure changes follow best practices (Terraform, no hardcoded values)

Confidence Score: 3/5

  • Safe to merge after fixing budget/quota route integration
  • Score reflects two critical integration gaps: budget and quota routes are defined but not mounted in the API, making them inaccessible. These are documented in the test plan as needing verification but weren't completed. The core implementations are sound, tests pass, and infrastructure changes are properly configured. Once the route integration is fixed, this would be a 4/5.
  • apps/api/src/index.ts requires integration of budget and quota routes. apps/api/src/routes/budget.ts and apps/api/src/routes/quota.ts need to be imported and mounted.

Important Files Changed

Filename Overview
infra/budget.tf Adds GCP budget alerts with Pub/Sub notifications for cost management
packages/core/src/llm/resilient-provider.ts Implements circuit breaker and retry logic for LLM providers; circuit breaker config looks good but test coverage could verify actual circuit trips
packages/core/src/llm/tests/resilient-provider.test.ts Tests for resilient provider wrapper; good coverage of retry logic and health registry, missing actual circuit trip scenarios
packages/core/src/storage/firestore-step.ts Implements Firestore-backed step store with cursor-based pagination
apps/api/src/routes/budget.ts Budget API route with agent recommendations; not yet integrated into main API index.ts
apps/api/src/routes/quota.ts Rate limit quota API route with agent recommendations; not yet integrated into main API index.ts
apps/api/src/index.ts Adds /health/providers endpoint and step subcollection routes; missing integration of budget and quota routes

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Agent/API Request] --> B{LLM Provider Call}
    B --> C[ResilientProvider Wrapper]
    C --> D{Circuit Breaker Check}
    D -->|Circuit Open| E[Fast Fail]
    D -->|Circuit Closed/Half-Open| F[Retry Handler]
    F --> G{Retryable Error?}
    G -->|Yes 429/5xx| H[Exponential Backoff]
    H --> F
    G -->|No| I[Propagate Error]
    F -->|Success| J[Update Health Registry]
    F -->|Max Retries| K[Trip Circuit]
    K --> L[Record Provider Failure]
    
    M[Agent Budget Check] --> N[GET /tenants/:id/budget]
    N --> O[Tenant Store]
    O --> P{Budget Status}
    P -->|< 80%| Q[Recommendation: normal]
    P -->|80-95%| R[Recommendation: conserve]
    P -->|>= 95%| S[Recommendation: stop]
    
    T[Agent Quota Check] --> U[GET /tenants/:id/quota]
    U --> V[Rate Limiter]
    V --> W{Quota Status}
    W -->|Available| X[Recommendation: normal]
    W -->|Low| Y[Recommendation: throttle]
    
    Z[Run Step Query] --> AA[GET /runs/:id/steps]
    AA --> AB[StepStore Subcollection]
    AB --> AC[Firestore/InMemory]
    AC --> AD[Paginated Results]
    
    AE[Dependabot PR] --> AF[gwi-triage Label]
    AF --> AG[Auto-Score Agent]
    
    AH[Budget Alert] --> AI[GCP Billing Budget]
    AI --> AJ[Pub/Sub Topic]
    AJ --> AK[Agent Subscriber]

Last reviewed commit: a4a8739

@greptile-apps greptile-apps bot left a comment

23 files reviewed, 4 comments

Comment on lines +1 to +147
/**
 * Budget API Route — Scale & Ops Maturity (gwi-keq)
 *
 * Agent-first: Agents query budget status to make cost-aware decisions
 * (e.g., switch to cheaper models when budget is 80% consumed).
 *
 * GET /tenants/:tenantId/budget — Returns current spend vs limit with
 * an agent-actionable recommendation field.
 *
 * @module @gwi/api/routes/budget
 */

import { Router } from 'express';
import {
  expressRequireAuth,
  expressRequirePermission,
  type RBACRequest,
} from '@gwi/core';
import { createLogger } from '@gwi/core';

const logger = createLogger('budget-routes');

/**
 * Determine agent-actionable recommendation based on budget usage
 */
function getRecommendation(percentUsed: number): 'normal' | 'conserve' | 'stop' {
  if (percentUsed >= 95) return 'stop';
  if (percentUsed >= 80) return 'conserve';
  return 'normal';
}

/**
 * Plan tier run limits (matches tenant plan limits)
 */
const PLAN_RUN_LIMITS: Record<string, number> = {
  free: 50,
  team: 500,
  pro: 2000,
  enterprise: 10000,
};

const PLAN_BUDGET_CENTS: Record<string, number> = {
  free: 0,
  team: 5000, // $50
  pro: 20000, // $200
  enterprise: 100000, // $1000
};

/**
 * Create budget API routes
 */
export function createBudgetRoutes(): Router {
  const router = Router();

  /**
   * GET /tenants/:tenantId/budget
   *
   * Returns current budget status with agent-actionable recommendation.
   * Agents use this to decide whether to use cheaper models or stop.
   */
  router.get(
    '/tenants/:tenantId/budget',
    expressRequireAuth(),
    expressRequirePermission('tenant:read'),
    async (req: RBACRequest, res) => {
      const { tenantId } = req.params;

      try {
        const { getTenantStore, getMeteringService, isMeteringEnabled } = await import('@gwi/core');

        const tenantStore = getTenantStore();
        const tenant = await tenantStore.getTenant(tenantId);

        if (!tenant) {
          return res.status(404).json({ error: 'Tenant not found' });
        }

        const plan = tenant.plan || 'free';
        const runsLimit = tenant.planLimits?.runsPerMonth ?? PLAN_RUN_LIMITS[plan] ?? 50;
        const budgetLimitCents = PLAN_BUDGET_CENTS[plan] ?? 0;

        // Get current period key (YYYY-MM)
        const now = new Date();
        const period = `${now.getFullYear()}-${String(now.getMonth() + 1).padStart(2, '0')}`;

        // Get runs used this month
        let runsUsed = tenant.runsThisMonth ?? 0;
        let estimatedCostCents = 0;

        // If metering is enabled, use actual usage data
        if (isMeteringEnabled()) {
          try {
            const meteringService = getMeteringService();
            meteringService.setTenantPlan(tenantId, plan);
            const status = await meteringService.getPlanUsageStatus(tenantId);
            runsUsed = status.plan.run_limit - status.runs_remaining;
            // Rough cost estimate: $0.10 per run average
            estimatedCostCents = runsUsed * 10;
          } catch {
            // Fall back to tenant.runsThisMonth
            estimatedCostCents = runsUsed * 10;
          }
        } else {
          estimatedCostCents = runsUsed * 10;
        }

        const percentUsed = budgetLimitCents > 0
          ? Math.min(100, Math.round((estimatedCostCents / budgetLimitCents) * 100))
          : (runsLimit > 0 ? Math.min(100, Math.round((runsUsed / runsLimit) * 100)) : 0);

        const recommendation = getRecommendation(percentUsed);

        logger.info('Budget status queried', {
          eventName: 'budget.query',
          tenantId,
          plan,
          percentUsed,
          recommendation,
        });

        res.json({
          tenantId,
          plan,
          period,
          runsUsed,
          runsLimit,
          estimatedCostCents,
          budgetLimitCents,
          percentUsed,
          recommendation,
        });
      } catch (error) {
        logger.error('Failed to get budget status', {
          eventName: 'budget.error',
          tenantId,
          error: error instanceof Error ? error.message : 'Unknown error',
        });
        res.status(500).json({
          error: 'Failed to get budget status',
          message: error instanceof Error ? error.message : 'Unknown error',
        });
      }
    },
  );

  return router;
}

Budget route created but not integrated into apps/api/src/index.ts. Need to import createBudgetRoutes() and mount it with app.use() to make the /tenants/:tenantId/budget endpoint accessible.

Comment on lines +1 to +132
/**
 * Rate Limit Quota API Route — Scale & Ops Maturity (gwi-5a6)
 *
 * Agent-first: Per-agent rate limits with metering. Agents authenticate,
 * get rate-limited based on their tier. Agents can query their own rate
 * limit status and remaining quota via this endpoint.
 *
 * GET /tenants/:tenantId/quota — Returns rate limit status per action
 * with agent-actionable recommendation.
 *
 * Distinct from /quotas (resource quotas) — this is about rate limiting.
 *
 * @module @gwi/api/routes/quota
 */

import { Router } from 'express';
import {
  expressRequireAuth,
  expressRequirePermission,
  type RBACRequest,
} from '@gwi/core';
import {
  getRateLimiter,
  DEFAULT_RATE_LIMITS,
  type RateLimitResult,
} from '@gwi/core';
import { createLogger } from '@gwi/core';

const logger = createLogger('quota-routes');

/**
 * Agent-queryable rate limit status for a single action
 */
interface ActionQuotaStatus {
  limit: number;
  remaining: number;
  resetsAt: string;
}

/**
 * Determine agent-actionable recommendation
 */
function getRecommendation(
  statuses: Record<string, ActionQuotaStatus>,
): 'normal' | 'throttle' {
  for (const status of Object.values(statuses)) {
    if (status.remaining <= 0) return 'throttle';
    const percentUsed = ((status.limit - status.remaining) / status.limit) * 100;
    if (percentUsed >= 80) return 'throttle';
  }
  return 'normal';
}

/**
 * Create rate limit quota API routes
 */
export function createQuotaRateLimitRoutes(): Router {
  const router = Router();

  /**
   * GET /tenants/:tenantId/quota
   *
   * Returns current rate limit status per action for the tenant.
   * Agent-actionable: includes recommendation field.
   */
  router.get(
    '/tenants/:tenantId/quota',
    expressRequireAuth(),
    expressRequirePermission('tenant:read'),
    async (req: RBACRequest, res) => {
      const { tenantId } = req.params;

      try {
        const { getTenantStore } = await import('@gwi/core');
        const tenantStore = getTenantStore();
        const tenant = await tenantStore.getTenant(tenantId);

        if (!tenant) {
          return res.status(404).json({ error: 'Tenant not found' });
        }

        const limiter = getRateLimiter();
        const plan = tenant.plan || 'free';

        // Check key actions
        const actions = ['run:create', 'api:read', 'signal:create', 'candidate:generate', 'api:write'];
        const limits: Record<string, ActionQuotaStatus> = {};

        for (const action of actions) {
          const config = DEFAULT_RATE_LIMITS[action];
          if (!config) continue;

          const status: RateLimitResult = await limiter.status(tenantId, action);
          const resetTime = new Date(Date.now() + status.resetInMs);

          limits[action] = {
            limit: status.limit,
            remaining: status.remaining,
            resetsAt: resetTime.toISOString(),
          };
        }

        const recommendation = getRecommendation(limits);

        logger.info('Rate limit quota queried', {
          eventName: 'quota.ratelimit.query',
          tenantId,
          plan,
          recommendation,
        });

        res.json({
          limits,
          plan,
          recommendation,
        });
      } catch (error) {
        logger.error('Failed to get rate limit quota', {
          eventName: 'quota.ratelimit.error',
          tenantId,
          error: error instanceof Error ? error.message : 'Unknown error',
        });
        res.status(500).json({
          error: 'Failed to get rate limit quota',
          message: error instanceof Error ? error.message : 'Unknown error',
        });
      }
    },
  );

  return router;
}

Quota route created but not integrated into apps/api/src/index.ts. Need to import createQuotaRateLimitRoutes() and mount it with app.use() to make the /tenants/:tenantId/quota endpoint accessible.

Comment on lines +133 to +195
describe('Resilient Provider Wrapper', () => {
  beforeEach(() => {
    resetLLMCircuitBreaker();
  });

  it('passes through successful calls', async () => {
    const mock = createMockProvider();
    const resilient = wrapWithResilience(mock, { maxRetries: 1 });

    const result = await resilient.completeJson(dummyRequest);
    expect(result.json).toEqual({ result: 'ok' });
    expect(mock.callCount).toBe(1);
  });

  it('preserves provider type and name', () => {
    const mock = createMockProvider();
    const resilient = wrapWithResilience(mock);

    expect(resilient.type).toBe('anthropic');
    expect(resilient.name).toBe('mock-anthropic');
    expect(resilient.getModel()).toBe('claude-sonnet-4-20250514');
    expect(resilient.isAvailable()).toBe(true);
  });

  it('retries on retryable errors', async () => {
    // Fails first 2 calls, succeeds on 3rd
    let attempt = 0;
    const mock = createMockProvider();
    const originalCompleteJson = mock.completeJson.bind(mock);
    mock.completeJson = async (req) => {
      attempt++;
      if (attempt <= 2) {
        throw new Error('Service unavailable (status 503)');
      }
      return originalCompleteJson(req);
    };

    const resilient = wrapWithResilience(mock, { maxRetries: 3, initialDelayMs: 10 });
    const result = await resilient.completeJson(dummyRequest);
    expect(result.json).toEqual({ result: 'ok' });
  });

  it('does not retry on non-retryable errors', async () => {
    let attempts = 0;
    const mock = createMockProvider();
    mock.completeJson = async () => {
      attempts++;
      throw new Error('Invalid API key (status 401)');
    };

    const resilient = wrapWithResilience(mock, { maxRetries: 3, initialDelayMs: 10 });
    await expect(resilient.completeJson(dummyRequest)).rejects.toThrow('Invalid API key');
    expect(attempts).toBe(1);
  });

  it('works with completeText too', async () => {
    const mock = createMockProvider();
    const resilient = wrapWithResilience(mock, { maxRetries: 1 });

    const result = await resilient.completeText(dummyTextRequest);
    expect(result.text).toBe('Hello');
  });
});

Missing test for actual circuit breaker trip behavior. The current tests verify retry logic and health registry state, but don't test that after N consecutive failures (per circuit breaker config), subsequent calls fail fast without attempting the provider call.
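
The shape of the missing assertion can be sketched in isolation. The `TinyBreaker` class below is a hypothetical stand-in written for this illustration; it is not the `CircuitBreaker` from `@gwi/core`, whose API and thresholds are not shown in this review. The point is only the check itself: once the failure threshold is reached, the next call must reject without the provider being invoked.

```typescript
// Hypothetical minimal breaker, for illustration only. The PR's CircuitBreaker
// has its own configuration and half-open handling that this sketch omits.
class TinyBreaker {
  private consecutiveFailures = 0;
  private readonly failureThreshold: number;

  constructor(failureThreshold: number) {
    this.failureThreshold = failureThreshold;
  }

  get isOpen(): boolean {
    return this.consecutiveFailures >= this.failureThreshold;
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    // Fail fast while open: the wrapped function is never called.
    if (this.isOpen) throw new Error('circuit open');
    try {
      const result = await fn();
      this.consecutiveFailures = 0; // a success resets the count
      return result;
    } catch (err) {
      this.consecutiveFailures++;
      throw err;
    }
  }
}

// Drives threshold + 1 calls against an always-failing provider and reports
// how many of them actually reached the provider.
async function circuitTripsAndFailsFast(
  threshold: number,
): Promise<{ providerCalls: number; lastError: string }> {
  const breaker = new TinyBreaker(threshold);
  let providerCalls = 0;
  const failingProvider = async (): Promise<string> => {
    providerCalls++;
    throw new Error('Service unavailable (status 503)');
  };

  let lastError = '';
  for (let i = 0; i < threshold + 1; i++) {
    try {
      await breaker.execute(failingProvider);
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { providerCalls, lastError };
}
```

In the repository's test file the same check would presumably go through `wrapWithResilience` with retries disabled, so that each call counts as one failure; that wiring depends on the real breaker's configuration and is left out here.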

  description = "GCP Billing Account ID (required if enable_budget_alerts = true)"
  type        = string
  default     = ""
}

The billing_account_id variable should be marked as sensitive since it's account infrastructure metadata.

Suggested change
}
sensitive = true

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@qodo-code-review

Code Review by Qodo

🐞 Bugs (4) 📘 Rule violations (4) 📎 Requirement gaps (0)


Action required

1. limit/cursor not Zod-validated 📘 Rule violation ⛨ Security
Description
The new steps listing endpoint consumes query parameters without Zod schema validation or bounded
constraints. This enables unbounded limit values and malformed cursors, risking performance and
reliability issues.
Code

apps/api/src/index.ts[R1786-1789]

+  const { tenantId, runId } = req.params;
+  const limit = parseInt(req.query.limit as string) || 50;
+  const cursor = req.query.cursor as string | undefined;
+
Evidence
The validation rule requires using Zod with explicit bounds for all external/untrusted inputs; here
req.query.limit and req.query.cursor are parsed/used directly with no schema or max constraints.

CLAUDE.md
apps/api/src/index.ts[1786-1789]
apps/api/src/index.ts[1801-1803]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The endpoint `GET /tenants/:tenantId/runs/:runId/steps` parses `limit` and `cursor` directly from the querystring and uses them without Zod validation or bounded constraints.

## Issue Context
Compliance requires all external inputs be validated with Zod and have reasonable bounds (e.g., `limit` max 100).

## Fix Focus Areas
- apps/api/src/index.ts[1785-1803]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
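
The bounds this issue asks for can be illustrated without any library. In the repository the check would be a Zod schema, as the rule requires; the plain-TypeScript sketch below only shows the constraints such a schema would encode. `parseStepsQuery` is a hypothetical name, the default of 50 comes from the diff, the maximum of 100 comes from the issue context, and the cursor length cap of 512 is an assumption.

```typescript
interface StepsQuery {
  limit: number;
  cursor?: string;
}

type ParseResult =
  | { ok: true; value: StepsQuery }
  | { ok: false; error: string };

const DEFAULT_LIMIT = 50; // default used by the PR's handler
const MAX_LIMIT = 100; // upper bound named in the issue context
const MAX_CURSOR_LENGTH = 512; // assumed cap, not taken from the PR

// Hypothetical validator: rejects out-of-range or malformed input instead of
// silently coercing it.
function parseStepsQuery(query: Record<string, unknown>): ParseResult {
  let limit = DEFAULT_LIMIT;
  const rawLimit = query['limit'];
  if (rawLimit !== undefined) {
    if (typeof rawLimit !== 'string' || !/^\d+$/.test(rawLimit)) {
      return { ok: false, error: 'limit must be an integer' };
    }
    limit = Number(rawLimit);
    if (limit < 1 || limit > MAX_LIMIT) {
      return { ok: false, error: `limit must be between 1 and ${MAX_LIMIT}` };
    }
  }

  const rawCursor = query['cursor'];
  if (rawCursor === undefined) {
    return { ok: true, value: { limit } };
  }
  if (
    typeof rawCursor !== 'string' ||
    rawCursor.length === 0 ||
    rawCursor.length > MAX_CURSOR_LENGTH
  ) {
    return { ok: false, error: 'cursor must be a non-empty string' };
  }
  return { ok: true, value: { limit, cursor: rawCursor } };
}
```

For comparison, `parseInt(req.query.limit as string) || 50` passes `limit=100000` through unchanged and also accepts `limit=-5`, because `-5` is truthy.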


2. meteringService error swallowed 📘 Rule violation ⛯ Reliability
Description
Budget usage retrieval swallows metering exceptions without logging or contextual error details.
This makes production debugging and incident forensics difficult and can hide recurring metering
failures.
Code

apps/api/src/routes/budget.ts[R91-102]

+        if (isMeteringEnabled()) {
+          try {
+            const meteringService = getMeteringService();
+            meteringService.setTenantPlan(tenantId, plan);
+            const status = await meteringService.getPlanUsageStatus(tenantId);
+            runsUsed = status.plan.run_limit - status.runs_remaining;
+            // Rough cost estimate: $0.10 per run average
+            estimatedCostCents = runsUsed * 10;
+          } catch {
+            // Fall back to tenant.runsThisMonth
+            estimatedCostCents = runsUsed * 10;
+          }
Evidence
The robust error handling rule forbids swallowed exceptions without logging; the metering block
catches and ignores all errors, providing no diagnostic context.

Rule 3: Generic: Robust Error Handling and Edge Case Management
apps/api/src/routes/budget.ts[91-102]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The metering lookup in the budget endpoint catches errors with a bare `catch {}` and does not log anything. This is a swallowed exception and removes necessary context for debugging.

## Issue Context
The endpoint is production-facing and the metering layer is an external dependency/failure point. Failures should be observable (structured log) and still gracefully degrade to the fallback behavior.

## Fix Focus Areas
- apps/api/src/routes/budget.ts[91-105]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
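
A minimal sketch of the remediation, assuming the fallback behaviour should stay as it is and only observability is missing. The `Logger` interface, the `warn` method, the `resolveRunsUsed` helper, and the `budget.metering_fallback` event name are all hypothetical; the `status.plan.run_limit` and `status.runs_remaining` fields are the ones used in the diff.

```typescript
// Hypothetical logger shape; the PR's createLogger may expose different methods.
interface Logger {
  warn(message: string, fields: Record<string, unknown>): void;
}

// Only the fields the budget route reads from the metering status.
interface UsageStatus {
  plan: { run_limit: number };
  runs_remaining: number;
}

// Returns metered usage when available; on failure, logs the error with
// context and degrades to the tenant's stored run count.
async function resolveRunsUsed(
  tenantId: string,
  fallbackRunsUsed: number,
  getPlanUsageStatus: (tenantId: string) => Promise<UsageStatus>,
  logger: Logger,
): Promise<number> {
  try {
    const status = await getPlanUsageStatus(tenantId);
    return status.plan.run_limit - status.runs_remaining;
  } catch (error) {
    logger.warn('Metering lookup failed; falling back to tenant.runsThisMonth', {
      eventName: 'budget.metering_fallback', // assumed event name
      tenantId,
      error: error instanceof Error ? error.message : 'Unknown error',
    });
    return fallbackRunsUsed;
  }
}
```

Passing the metering call and the logger in as parameters keeps the sketch self-contained; the route itself could do the same logging inline in its existing `catch` block.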


3. Budget/quota routes unintegrated 🐞 Bug ✓ Correctness
Description
Budget/quota endpoints are implemented as Router factories using RBAC middleware that depends on
req.context, but apps/api’s entrypoint uses a different per-route auth + tenant scoping chain; as a
result these endpoints are likely unreachable or will 401/403 even if mounted.
Code

apps/api/src/routes/budget.ts[R52-66]

+export function createBudgetRoutes(): Router {
+  const router = Router();
+
+  /**
+   * GET /tenants/:tenantId/budget
+   *
+   * Returns current budget status with agent-actionable recommendation.
+   * Agents use this to decide whether to use cheaper models or stop.
+   */
+  router.get(
+    '/tenants/:tenantId/budget',
+    expressRequireAuth(),
+    expressRequirePermission('tenant:read'),
+    async (req: RBACRequest, res) => {
+      const { tenantId } = req.params;
Evidence
The new budget/quota routes use expressRequireAuth/expressRequirePermission which derive
identity/role from req.context; however, the main API app populates req.context via its own
authMiddleware/tenantAuthMiddleware pattern on each route, and apps/api runs dist/index.js (compiled
from src/index.ts). Without explicit mounting + compatible middleware ordering, these route modules
won’t behave like the rest of the API.

apps/api/src/routes/budget.ts[52-66]
packages/core/src/security/rbac.ts[456-466]
apps/api/src/index.ts[1092-1133]
apps/api/package.json[6-11]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`createBudgetRoutes()` / `createQuotaRateLimitRoutes()` are implemented with `expressRequireAuth()` / `expressRequirePermission()`, but the API app primarily uses `authMiddleware` + `tenantAuthMiddleware` + `requirePermission()` to populate `req.context` and scope access. Without explicit mounting and consistent middleware usage, the endpoints may be unreachable or always return auth errors.

## Issue Context
- apps/api runtime entrypoint is `dist/index.js` built from `apps/api/src/index.ts`.
- Existing tenant-scoped endpoints use `authMiddleware` and `tenantAuthMiddleware`.
- RBAC middleware in `@gwi/core` extracts identity from `req.context`.

## Fix Focus Areas
- apps/api/src/routes/budget.ts[52-66]
- apps/api/src/routes/quota.ts[57-70]
- apps/api/src/index.ts[1092-1133]
- packages/core/src/security/rbac.ts[456-466]

## Suggested approach
1. In `apps/api/src/index.ts`, add route registration for budget/quota endpoints using the same middleware chain as other `/tenants/:tenantId/*` routes.
2. Alternatively, change `createBudgetRoutes` / `createQuotaRateLimitRoutes` to accept middleware dependencies (or expose a shared auth/tenant middleware) so they behave consistently.
3. Add a minimal integration test hitting `/tenants/:tenantId/budget` and `/tenants/:tenantId/quota` with an authenticated user context to ensure they return 200/403 appropriately.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


4. Budget plan ID mismatch 🐞 Bug ✓ Correctness
Description
The budget endpoint sets the metering plan using tenant.plan values (free/team/pro/enterprise), but
metering only recognizes free/starter/professional/enterprise; this can silently fall back to free
and produce incorrect runsUsed/percentUsed and recommendations.
Code

apps/api/src/routes/budget.ts[R78-99]

+        const plan = tenant.plan || 'free';
+        const runsLimit = tenant.planLimits?.runsPerMonth ?? PLAN_RUN_LIMITS[plan] ?? 50;
+        const budgetLimitCents = PLAN_BUDGET_CENTS[plan] ?? 0;
+
+        // Get current period key (YYYY-MM)
+        const now = new Date();
+        const period = `${now.getFullYear()}-${String(now.getMonth() + 1).padStart(2, '0')}`;
+
+        // Get runs used this month
+        let runsUsed = tenant.runsThisMonth ?? 0;
+        let estimatedCostCents = 0;
+
+        // If metering is enabled, use actual usage data
+        if (isMeteringEnabled()) {
+          try {
+            const meteringService = getMeteringService();
+            meteringService.setTenantPlan(tenantId, plan);
+            const status = await meteringService.getPlanUsageStatus(tenantId);
+            runsUsed = status.plan.run_limit - status.runs_remaining;
+            // Rough cost estimate: $0.10 per run average
+            estimatedCostCents = runsUsed * 10;
+          } catch {
Evidence
Tenant lifecycle uses plan tiers including 'team' and 'pro'. Metering plans are defined with IDs
'starter' and 'professional' (not 'team'/'pro'). The budget route passes tenant.plan directly into
MeteringService.setTenantPlan(), so 'team'/'pro' will not match metering plan IDs and usage
computations can be wrong.

apps/api/src/routes/budget.ts[78-99]
packages/core/src/tenants/lifecycle.ts[81-99]
packages/core/src/metering/types.ts[221-269]
packages/core/src/billing/metering-bridge.ts[231-243]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`apps/api/src/routes/budget.ts` passes `tenant.plan` directly to `meteringService.setTenantPlan()`, but metering plan IDs differ from tenant plan tiers (`team`/`pro` vs `starter`/`professional`). This can make metering-based budget calculations incorrect.

## Issue Context
- Tenant plan tiers: `free | team | pro | enterprise`
- Metering plan IDs: `free | starter | professional | enterprise`
- A mapping function already exists in `packages/core/src/billing/metering-bridge.ts`.

## Fix Focus Areas
- apps/api/src/routes/budget.ts[78-99]
- packages/core/src/metering/types.ts[221-269]
- packages/core/src/tenants/lifecycle.ts[81-99]
- packages/core/src/billing/metering-bridge.ts[231-243]

## Suggested approach
1. Add a small mapping in the budget route (or import a shared mapper) such as:
  - free -> free
  - team -> starter
  - pro -> professional
  - enterprise -> enterprise
2. Call `meteringService.setTenantPlan(tenantId, mappedPlanId)`.
3. Add a unit test for the mapping (team/pro) to avoid regressions.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
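
The mapping in the suggested approach could look like the sketch below. The issue notes that a mapper already exists in `packages/core/src/billing/metering-bridge.ts`; its name and signature are not shown here, so `toMeteringPlanId` and the type aliases are hypothetical. Throwing on an unknown tier, rather than defaulting to `free`, is a design choice of this sketch intended to avoid the silent fallback the issue describes.

```typescript
type TenantPlan = 'free' | 'team' | 'pro' | 'enterprise';
type MeteringPlanId = 'free' | 'starter' | 'professional' | 'enterprise';

// Tier-to-plan mapping as listed in the review's suggested approach.
const TENANT_TO_METERING_PLAN: Record<TenantPlan, MeteringPlanId> = {
  free: 'free',
  team: 'starter',
  pro: 'professional',
  enterprise: 'enterprise',
};

function isTenantPlan(value: string): value is TenantPlan {
  return Object.prototype.hasOwnProperty.call(TENANT_TO_METERING_PLAN, value);
}

// Hypothetical helper: converts a tenant tier to a metering plan ID and
// fails loudly on anything it does not recognize.
function toMeteringPlanId(plan: string): MeteringPlanId {
  if (!isTenantPlan(plan)) {
    throw new Error(`Unknown tenant plan: ${plan}`);
  }
  return TENANT_TO_METERING_PLAN[plan];
}
```

The budget route would then pass `toMeteringPlanId(plan)` to `meteringService.setTenantPlan()` instead of the raw tenant tier.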


5. Circuit breaker is a no-op 🐞 Bug ⛯ Reliability
Description
Selection now excludes providers with OPEN circuits, but provider calls are never wrapped with the
new circuit-breaker+retry wrapper; therefore circuits won’t open and /health/providers will always
report default states in real usage.
Code

packages/core/src/llm/selection-policy.ts[R336-340]

+    // Check circuit breaker health (gwi-d1k: skip providers with open circuits)
+    if (!isProviderHealthy(provider)) {
+      reasons.push(`Excluded: circuit breaker OPEN for ${provider}`);
+      return null;
+    }
Evidence
Provider selection consults isProviderHealthy(), but the only place the LLM circuit breaker is
executed is inside wrapWithResilience(). The default provider factory getLLMProvider() returns a
raw provider from the factory without wrapping, so cb.execute(providerKey, ...) is never invoked
and circuit state is never updated.

packages/core/src/llm/selection-policy.ts[336-340]
packages/core/src/llm/resilient-provider.ts[49-90]
packages/core/src/llm/index.ts[220-258]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Circuit breaker health is checked during provider selection, but provider calls are not wrapped with the new resilience layer, so circuits never transition to OPEN in real traffic.

## Issue Context
- `ProviderSelectionPolicy` now calls `isProviderHealthy()`.
- Circuit breaker state/metrics are only updated when `getLLMCircuitBreaker().execute(...)` is used (currently in `wrapWithResilience`).
- `getLLMProvider()` returns raw providers without wrapping.

## Fix Focus Areas
- packages/core/src/llm/index.ts[220-258]
- packages/core/src/llm/resilient-provider.ts[49-90]
- packages/core/src/llm/selection-policy.ts[336-340]
- packages/core/src/llm/provider-health.ts[116-143]

## Suggested approach
1. In `getLLMProvider()`, after factory creation + availability check, return `wrapWithResilience(provider)`.
2. Add/adjust tests to ensure a failing provider trips the circuit and `isProviderHealthy()` returns false.
3. (Optional) Update `getProviderHealth()` to return `recommendation: null` when all providers are OPEN, or to only recommend providers with state != 'open'.
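
The wiring gap can be illustrated with a self-contained sketch. `MiniCircuitBreaker`, `SketchProvider`, and `wrapProvider` are hypothetical stand-ins, not the repository's `CircuitBreaker` or `wrapWithResilience`, and the sketch leaves out retries, half-open state, and reset timeouts. It only shows that circuit state changes when calls go through `execute`, which is why an unwrapped provider can never trip the circuit.

```typescript
type CircuitState = 'closed' | 'open';

// Hypothetical, simplified breaker: opens after N consecutive failures per key.
class MiniCircuitBreaker {
  private readonly failures = new Map<string, number>();
  private readonly failureThreshold: number;

  constructor(failureThreshold = 3) {
    this.failureThreshold = failureThreshold;
  }

  recordSuccess(key: string): void {
    this.failures.set(key, 0);
  }

  recordFailure(key: string): void {
    this.failures.set(key, (this.failures.get(key) ?? 0) + 1);
  }

  state(key: string): CircuitState {
    return (this.failures.get(key) ?? 0) >= this.failureThreshold ? 'open' : 'closed';
  }

  // State is only updated for calls routed through execute().
  async execute<T>(key: string, fn: () => Promise<T>): Promise<T> {
    if (this.state(key) === 'open') {
      throw new Error(`circuit open for ${key}`);
    }
    try {
      const result = await fn();
      this.recordSuccess(key);
      return result;
    } catch (err) {
      this.recordFailure(key);
      throw err;
    }
  }
}

// Hypothetical provider shape, used only for this illustration.
interface SketchProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

// Returns a provider whose calls are tracked by the breaker.
function wrapProvider(provider: SketchProvider, cb: MiniCircuitBreaker): SketchProvider {
  return {
    name: provider.name,
    complete: (prompt: string) => cb.execute(provider.name, () => provider.complete(prompt)),
  };
}
```

Under these assumptions, the factory would return the wrapped provider rather than the raw one, so that the health check used by selection reflects real call outcomes.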



6. Step pagination inconsistent 🐞 Bug ⛯ Reliability
Description
StepStore pagination/order differs between Firestore (orderBy startedAt) and in-memory (insertion
order), and Firestore pagination uses cursor doc snapshots while startedAt is optional—this can lead
to inconsistent ordering across environments or brittle cursor behavior.
Code

packages/core/src/storage/firestore-step.ts[R114-122]

+    const limit = opts?.limit ?? 100;
+    let query = this.stepsCollection(runId).orderBy('startedAt').limit(limit + 1);
+
+    if (opts?.cursor) {
+      const cursorDoc = await this.stepsCollection(runId).doc(opts.cursor).get();
+      if (cursorDoc.exists) {
+        query = query.startAfter(cursorDoc);
+      }
+    }
Evidence
FirestoreStepStore orders results by startedAt and paginates with startAfter(cursorDoc), but
both RunStep.startedAt and StepDoc.startedAt are optional. In-memory listSteps does no sorting, so
results/pagination will not match Firestore’s ordering, making dev/prod behavior diverge.

packages/core/src/storage/firestore-step.ts[114-122]
packages/core/src/storage/firestore-step.ts[31-43]
packages/core/src/storage/inmemory.ts[1091-1110]
packages/core/src/storage/interfaces.ts[114-126]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`FirestoreStepStore` orders by `startedAt` and uses doc-snapshot cursors, while `InMemoryStepStore` doesn’t sort at all. Since `startedAt` is optional, ordering and pagination can diverge or become brittle.

## Issue Context
- Firestore uses `orderBy('startedAt')`.
- In-memory uses Map insertion order.
- `RunStep.startedAt` and `StepDoc.startedAt` are optional.

## Fix Focus Areas
- packages/core/src/storage/firestore-step.ts[114-122]
- packages/core/src/storage/inmemory.ts[1091-1110]
- packages/core/src/storage/interfaces.ts[114-126]
- packages/core/src/storage/firestore-step.ts[31-43]

## Suggested approach
1. Decide on a required ordering field (e.g., `startedAt` becomes required at creation time, or add `createdAt` that is always set).
2. Update both Firestore and InMemory implementations to sort by the same keys.
3. Consider adding a tie-breaker (e.g., also orderBy document ID) for deterministic pagination.
4. Add tests that assert ordering + cursor pagination stability.
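
The approach above can be sketched for the in-memory side as follows. This is an illustration under stated assumptions: `SketchStep`, `compareSteps`, and `listStepsSorted` are hypothetical names, `createdAt` is assumed to be a numeric timestamp that is always set, and the cursor is assumed to be the ID of the last step on the previous page. An unknown cursor restarts from the beginning, which mirrors the quoted Firestore code skipping `startAfter` when the cursor document does not exist.

```typescript
// Hypothetical step shape with an always-set ordering field.
interface SketchStep {
  id: string;
  createdAt: number;
}

interface SketchPage {
  steps: SketchStep[];
  nextCursor: string | undefined;
}

// Order by createdAt, then by id as a deterministic tie-breaker.
function compareSteps(a: SketchStep, b: SketchStep): number {
  if (a.createdAt !== b.createdAt) {
    return a.createdAt - b.createdAt;
  }
  return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
}

function listStepsSorted(all: SketchStep[], limit: number, cursor?: string): SketchPage {
  const sorted = [...all].sort(compareSteps);

  let start = 0;
  if (cursor !== undefined) {
    const idx = sorted.findIndex((s) => s.id === cursor);
    start = idx >= 0 ? idx + 1 : 0;
  }

  // Fetch one extra item to learn whether another page exists.
  const window = sorted.slice(start, start + limit + 1);
  const hasMore = window.length > limit;
  const steps = hasMore ? window.slice(0, limit) : window;

  return {
    steps,
    nextCursor: hasMore ? steps[steps.length - 1]?.id : undefined,
  };
}
```

Because both keys are total and stable, the same sequence of pages comes back regardless of insertion order, which is the property the ordering tests in step 4 would assert.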




Remediation recommended

7. console.error in new routes 📘 Rule violation ✧ Quality
Description
The new run-steps endpoints use console.error, producing unstructured logs and potentially
printing full error objects/stack traces. This reduces log parseability and can increase the risk of
sensitive details being emitted.
Code

apps/api/src/index.ts[R1805-1809]

+    console.error('Failed to list steps:', error);
+    res.status(500).json({
+      error: 'Failed to list steps',
+      message: error instanceof Error ? error.message : 'Unknown error',
+    });
Evidence
Secure logging and observability requirements expect structured logs; these new handlers emit
unstructured console.error(...) logs for failures.

Rule 5: Generic: Secure Logging Practices
CLAUDE.md
apps/api/src/index.ts[1805-1809]
apps/api/src/index.ts[1837-1841]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
New endpoints log errors with `console.error`, which is unstructured and may emit sensitive/internal details.

## Issue Context
The repository already uses structured logging via `createLogger(...)` in new route modules; these endpoints should follow the same approach to support parsing, alerting, and safe logging practices.

## Fix Focus Areas
- apps/api/src/index.ts[1785-1810]
- apps/api/src/index.ts[1816-1842]



8. New endpoints lack OpenAPI 📘 Rule violation ✓ Correctness
Description
This PR adds multiple new API endpoints but does not add/update an OpenAPI schema to document them.
This can break API discoverability and the OpenAPI validation gate.
Code

apps/api/src/routes/quota.ts[R66-70]

+  router.get(
+    '/tenants/:tenantId/quota',
+    expressRequireAuth(),
+    expressRequirePermission('tenant:read'),
+    async (req: RBACRequest, res) => {
Evidence
The compliance rule requires API changes to be accompanied by OpenAPI updates; the diff introduces
new routes (/health/providers, /tenants/:tenantId/budget, /tenants/:tenantId/quota, and run
step endpoints). The existing OpenAPI file in the repo is for a different surface area and does not
contain tenant/run endpoints.

CLAUDE.md
apps/api/src/index.ts[693-698]
apps/api/src/index.ts[1785-1786]
apps/api/src/routes/budget.ts[61-65]
apps/api/src/routes/quota.ts[66-70]
apps/gateway/openapi.yaml[1-12]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
New API endpoints were added in this PR but no corresponding OpenAPI documentation update is included.

## Issue Context
The repo includes an OpenAPI schema (and an ARV OpenAPI gate). Endpoints should be documented in the correct spec for the surface (gateway vs api) and validated.

## Fix Focus Areas
- apps/api/src/index.ts[693-705]
- apps/api/src/index.ts[1785-1844]
- apps/api/src/routes/budget.ts[61-144]
- apps/api/src/routes/quota.ts[66-129]




Comment on lines +1786 to +1789
const { tenantId, runId } = req.params;
const limit = parseInt(req.query.limit as string) || 50;
const cursor = req.query.cursor as string | undefined;


Action required

1. limit/cursor not Zod-validated 📘 Rule violation ⛨ Security

The new steps listing endpoint consumes query parameters without Zod schema validation or bounded
constraints. This enables unbounded limit values and malformed cursors, risking performance and
reliability issues.
Agent Prompt
## Issue description
The endpoint `GET /tenants/:tenantId/runs/:runId/steps` parses `limit` and `cursor` directly from the querystring and uses them without Zod validation or bounded constraints.

## Issue Context
Compliance requires all external inputs be validated with Zod and have reasonable bounds (e.g., `limit` max 100).

## Fix Focus Areas
- apps/api/src/index.ts[1785-1803]
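
The bounds being asked for can be sketched without dependencies. The review calls for a Zod schema; this hand-rolled parser is a stand-in that only illustrates the constraints, not the required implementation. The default of 50 comes from the quoted handler and the maximum of 100 from the compliance note; `parseListStepsQuery`, the result types, and the 256-character cursor cap are assumptions.

```typescript
type ListStepsQuery = { limit: number; cursor: string | undefined };

type ParseResult =
  | { ok: true; value: ListStepsQuery }
  | { ok: false; error: string };

// Validate and bound the raw query object before it reaches the step store.
function parseListStepsQuery(query: Record<string, unknown>): ParseResult {
  let limit = 50; // default taken from the quoted handler

  if (query.limit !== undefined) {
    const raw = query.limit;
    // Only plain decimal digits are accepted; this rejects '', '-1', '1e9', and arrays.
    if (typeof raw !== 'string' || !/^\d+$/.test(raw)) {
      return { ok: false, error: 'limit must be an integer' };
    }
    limit = Number(raw);
    if (limit < 1 || limit > 100) {
      return { ok: false, error: 'limit must be between 1 and 100' };
    }
  }

  let cursor: string | undefined;
  if (query.cursor !== undefined) {
    const raw = query.cursor;
    // Assumption: cursors are opaque, non-empty strings of bounded length.
    if (typeof raw !== 'string' || raw.length === 0 || raw.length > 256) {
      return { ok: false, error: 'cursor must be a non-empty string of at most 256 characters' };
    }
    cursor = raw;
  }

  return { ok: true, value: { limit, cursor } };
}
```

A handler would return HTTP 400 with the `error` text when `ok` is false and pass `value.limit` and `value.cursor` to `listSteps` otherwise. A Zod schema expressing the same bounds would replace this function in the actual fix.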


Comment on lines +91 to +102
if (isMeteringEnabled()) {
  try {
    const meteringService = getMeteringService();
    meteringService.setTenantPlan(tenantId, plan);
    const status = await meteringService.getPlanUsageStatus(tenantId);
    runsUsed = status.plan.run_limit - status.runs_remaining;
    // Rough cost estimate: $0.10 per run average
    estimatedCostCents = runsUsed * 10;
  } catch {
    // Fall back to tenant.runsThisMonth
    estimatedCostCents = runsUsed * 10;
  }


Action required

2. MeteringService error swallowed 📘 Rule violation ⛯ Reliability

Budget usage retrieval swallows metering exceptions without logging or contextual error details.
This makes production debugging and incident forensics difficult and can hide recurring metering
failures.
Agent Prompt
## Issue description
The metering lookup in the budget endpoint catches errors with a bare `catch {}` and does not log anything. This is a swallowed exception and removes necessary context for debugging.

## Issue Context
The endpoint is production-facing and the metering layer is an external dependency/failure point. Failures should be observable (structured log) and still gracefully degrade to the fallback behavior.

## Fix Focus Areas
- apps/api/src/routes/budget.ts[91-105]
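
One way to keep the graceful fallback while making the failure observable is sketched below. `SketchLogger`, `describeError`, and `logMeteringFallback` are hypothetical; the repository's structured logger may expose a different interface. The sketch logs only the error name and message, on the assumption that stack traces should stay out of the log entry.

```typescript
// Hypothetical logger shape; the repository's structured logger may differ.
interface SketchLogger {
  warn(message: string, fields: Record<string, unknown>): void;
}

// Reduce an unknown thrown value to fields that are safe to log (no stack trace).
function describeError(err: unknown): { errorName: string; errorMessage: string } {
  if (err instanceof Error) {
    return { errorName: err.name, errorMessage: err.message };
  }
  return { errorName: 'UnknownError', errorMessage: String(err) };
}

// Record the metering failure; the caller then continues with its fallback value.
function logMeteringFallback(logger: SketchLogger, tenantId: string, err: unknown): void {
  logger.warn('Metering lookup failed; falling back to tenant.runsThisMonth', {
    tenantId,
    ...describeError(err),
  });
}
```

In the budget route, the bare `catch {` would become `catch (err) {` and call `logMeteringFallback(logger, tenantId, err)` before computing the fallback estimate.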


Comment on lines +52 to +66
export function createBudgetRoutes(): Router {
  const router = Router();

  /**
   * GET /tenants/:tenantId/budget
   *
   * Returns current budget status with agent-actionable recommendation.
   * Agents use this to decide whether to use cheaper models or stop.
   */
  router.get(
    '/tenants/:tenantId/budget',
    expressRequireAuth(),
    expressRequirePermission('tenant:read'),
    async (req: RBACRequest, res) => {
      const { tenantId } = req.params;


Action required

3. Budget/quota routes unintegrated 🐞 Bug ✓ Correctness

Budget/quota endpoints are implemented as Router factories using RBAC middleware that depends on
req.context, but apps/api’s entrypoint uses a different per-route auth + tenant scoping chain; as a
result these endpoints are likely unreachable or will 401/403 even if mounted.
Agent Prompt
## Issue description
`createBudgetRoutes()` / `createQuotaRateLimitRoutes()` are implemented with `expressRequireAuth()` / `expressRequirePermission()`, but the API app primarily uses `authMiddleware` + `tenantAuthMiddleware` + `requirePermission()` to populate `req.context` and scope access. Without explicit mounting and consistent middleware usage, the endpoints may be unreachable or always return auth errors.

## Issue Context
- apps/api runtime entrypoint is `dist/index.js` built from `apps/api/src/index.ts`.
- Existing tenant-scoped endpoints use `authMiddleware` and `tenantAuthMiddleware`.
- RBAC middleware in `@gwi/core` extracts identity from `req.context`.

## Fix Focus Areas
- apps/api/src/routes/budget.ts[52-66]
- apps/api/src/routes/quota.ts[57-70]
- apps/api/src/index.ts[1092-1133]
- packages/core/src/security/rbac.ts[456-466]

## Suggested approach
1. In `apps/api/src/index.ts`, add route registration for budget/quota endpoints using the same middleware chain as other `/tenants/:tenantId/*` routes.
2. Alternatively, change `createBudgetRoutes` / `createQuotaRateLimitRoutes` to accept middleware dependencies (or expose a shared auth/tenant middleware) so they behave consistently.
3. Add a minimal integration test hitting `/tenants/:tenantId/budget` and `/tenants/:tenantId/quota` with an authenticated user context to ensure they return 200/403 appropriately.


Comment on lines +78 to +99
const plan = tenant.plan || 'free';
const runsLimit = tenant.planLimits?.runsPerMonth ?? PLAN_RUN_LIMITS[plan] ?? 50;
const budgetLimitCents = PLAN_BUDGET_CENTS[plan] ?? 0;

// Get current period key (YYYY-MM)
const now = new Date();
const period = `${now.getFullYear()}-${String(now.getMonth() + 1).padStart(2, '0')}`;

// Get runs used this month
let runsUsed = tenant.runsThisMonth ?? 0;
let estimatedCostCents = 0;

// If metering is enabled, use actual usage data
if (isMeteringEnabled()) {
  try {
    const meteringService = getMeteringService();
    meteringService.setTenantPlan(tenantId, plan);
    const status = await meteringService.getPlanUsageStatus(tenantId);
    runsUsed = status.plan.run_limit - status.runs_remaining;
    // Rough cost estimate: $0.10 per run average
    estimatedCostCents = runsUsed * 10;
  } catch {


Action required

4. Budget plan id mismatch 🐞 Bug ✓ Correctness

The budget endpoint sets the metering plan using tenant.plan values (free/team/pro/enterprise), but
metering only recognizes free/starter/professional/enterprise; this can silently fall back to free
and produce incorrect runsUsed/percentUsed and recommendations.

Comment on lines +336 to +340
// Check circuit breaker health (gwi-d1k: skip providers with open circuits)
if (!isProviderHealthy(provider)) {
  reasons.push(`Excluded: circuit breaker OPEN for ${provider}`);
  return null;
}


Action required

5. Circuit breaker is a no-op 🐞 Bug ⛯ Reliability

Selection now excludes providers with OPEN circuits, but provider calls are never wrapped with the
new circuit-breaker+retry wrapper; therefore circuits won’t open and /health/providers will always
report default states in real usage.

Comment on lines +114 to +122
const limit = opts?.limit ?? 100;
let query = this.stepsCollection(runId).orderBy('startedAt').limit(limit + 1);

if (opts?.cursor) {
  const cursorDoc = await this.stepsCollection(runId).doc(opts.cursor).get();
  if (cursorDoc.exists) {
    query = query.startAfter(cursorDoc);
  }
}


Action required

6. Step pagination inconsistent 🐞 Bug ⛯ Reliability

StepStore pagination/order differs between Firestore (orderBy startedAt) and in-memory (insertion
order), and Firestore pagination uses cursor doc snapshots while startedAt is optional—this can lead
to inconsistent ordering across environments or brittle cursor behavior.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (1)
000-docs/242-PP-RMAP-automaton-integration-strategy.md (1)

23-30: Consider adding language specifiers to fenced code blocks.

The ASCII architecture diagrams lack language specifiers, which triggers markdownlint MD040 warnings. Adding `text` or `plaintext` would satisfy linters and improve rendering consistency.

📝 Suggested fix
-```
+```text
 Conway Cloud
   └── Automaton (standalone)

Apply similarly to the diagrams at lines 54 and 89.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@000-docs/242-PP-RMAP-automaton-integration-strategy.md` around lines 23 - 30,
The fenced ASCII diagrams (e.g., the block showing "Conway Cloud └── Automaton
(standalone) ├── SOUL.md ...") should include a language specifier to satisfy
markdownlint MD040; update those triple-backtick fences around that diagram and
the other similar diagrams (the ones later in the doc) to use ```text or
```plaintext so linters and renderers treat them as plain text.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/api/src/index.ts`:
- Around line 1785-1810: Add Zod input validation and bounds to the run-steps
endpoints: define Zod schemas that validate and coerce the path params
(tenantId, runId) and query params (limit and cursor) with min/max constraints
(e.g. limit coerced to number with a max like 100 and a sensible min like 1,
cursor optional string) and apply them at the start of the handlers for the
'/tenants/:tenantId/runs/:runId/steps' list route (where getEngine() and
stepStore.listSteps are used) and the corresponding single-step route; replace
the current direct parseInt(req.query.limit) and unchecked req.params access
with validated/coerced values and return 400 on validation failure before
calling eng.getRun or stepStore methods.
- Around line 687-704: The /health/providers route is currently public; protect
it by applying the existing internalOnlyMiddleware to the route so only internal
callers can access provider circuit-breaker data. Update the app.get handler
that calls core.getProviderHealth() to include internalOnlyMiddleware (similar
to the /metrics route) before the async handler (or alternatively wrap the
handler with authMiddleware/internalOnlyMiddleware), ensuring the same
middleware symbol name internalOnlyMiddleware is imported/available in the file
and used on the app.get('/health/providers', ...) registration.

In `@apps/api/src/routes/budget.ts`:
- Around line 13-31: Add a bounded Zod schema to validate req.params.tenantId
before calling getTenant, setTenantPlan, or getPlanUsageStatus: import z from
'zod', define a schema (e.g., const tenantIdSchema = z.string().min(1).max(64)
or z.string().uuid() depending on expected format) and call
tenantIdSchema.safeParse(req.params.tenantId); if validation fails return
res.status(400).json({ error: 'Invalid tenantId' }); otherwise use the parsed
value (e.g., parsed.tenantId) when calling getTenant, setTenantPlan, and
getPlanUsageStatus so no raw external input is passed directly to those
functions.

In `@apps/api/src/routes/quota.ts`:
- Around line 43-49: getRecommendation currently computes percentUsed by
dividing by status.limit which can be 0/undefined and yield NaN/Infinity causing
false throttles; update getRecommendation to first guard that status.limit is a
positive number (e.g., if (!status.limit || status.limit <= 0) continue) before
computing percentUsed, so the remaining check still applies but percentage logic
only runs when limit > 0; reference the function getRecommendation and the
fields ActionQuotaStatus.limit and .remaining when making this change.
- Around line 105-110: The current logger.info call ('Rate limit quota queried')
in apps/api/src/routes/quota.ts is missing required A2A audit fields; update the
handler (the route function that calls logger.info) to emit an audit-style log
entry including timestamp (new Date().toISOString()), spiffeId (from request
context/agent auth), taskId (request or generated trace id), operation (e.g.,
'quota.query' or 'quota.ratelimit.query'), and duration (compute using a start
time captured at handler entry and end time before logging), either by extending
the existing logger.info payload with these keys or by emitting a separate audit
logger call with those fields alongside tenantId, plan, and recommendation so
the endpoint conforms to the A2A audit requirements.
- Around line 16-40: Validate req.params.tenantId with a bounded Zod schema
before any storage or rate-limiter calls: define a z.string().min(1).max(64) (or
project-bound) schema and parse tenantId at the top of the route handler,
returning HTTP 400 on parse failure; then use the validated value (not raw
req.params.tenantId) when calling getTenantStore().getTenant(tenantId) and
limiter.status(tenantId, action). Ensure both the tenant lookup and rate-limiter
usage reference the validated variable and add tests for invalid tenantId cases.

In `@infra/budget.tf`:
- Around line 9-31: There are duplicate Terraform declarations for variables
enable_budget_alerts, billing_account_id, budget_amount_usd,
budget_alert_thresholds and the resource google_pubsub_topic.budget_alerts;
remove the duplicates by consolidating the variable definitions into a single
shared variables file (e.g., variables.tf) and keep only one declaration of
google_pubsub_topic.budget_alerts (or move it to the intended monitoring/budget
module), then update any references to those symbols to point to the single
source of truth so OpenTofu can plan/apply without duplicate resource/variable
errors.

In `@packages/core/src/llm/__tests__/resilient-provider.test.ts`:
- Around line 11-19: The test's mock provider currently hard-codes model names;
update the mock used in resilient-provider.test.ts (the provider passed into
wrapWithResilience) to read its model identifier from a test configuration or
environment variable instead of using literal strings, e.g., import or reference
a shared TEST_MODEL constant or process.env.TEST_MODEL (fallback to a single
default for CI) and use that variable everywhere the mock returns or checks
model names so all occurrences (including the areas corresponding to the earlier
mock sections and later assertions) remain in sync with environment/model
rotation.

In `@packages/core/src/llm/provider-health.ts`:
- Around line 116-143: getProviderHealth currently iterates a hard-coded
providerTypes array which can include providers not registered (e.g., "vertex")
and may recommend an OPEN circuit; change it to obtain the provider list from
the provider registry (use the registry accessor used where providers are
registered) instead of the hard-coded LLMProviderType array, call
buildProviderStatus(cb, pt) for each registered provider, compute the stateScore
as before but exclude providers whose status.state === 'open' from being
considered for recommendation, and keep the rest of the scoring logic so
recommendation only points to a registered, non-open provider; ensure variables
bestProvider and providers still populate correctly and timestamp remains.

---

Nitpick comments:
In `@000-docs/242-PP-RMAP-automaton-integration-strategy.md`:
- Around line 23-30: The fenced ASCII diagrams (e.g., the block showing "Conway
Cloud └── Automaton (standalone) ├── SOUL.md ...") should include a language
specifier to satisfy markdownlint MD040; update those triple-backtick fences
around that diagram and the other similar diagrams (the ones later in the doc)
to use ```text or ```plaintext so linters and renderers treat them as plain
text.

Comment on lines +687 to +704
/**
 * GET /health/providers — Circuit breaker state per LLM provider (gwi-d1k)
 *
 * Agent-queryable: returns which providers are healthy, degraded, or tripped.
 * Agents check this before requesting runs to pick a healthy provider.
 */
app.get('/health/providers', async (_req, res) => {
  try {
    const core = await import('@gwi/core');
    const report = core.getProviderHealth();
    res.json(report);
  } catch (error) {
    res.status(500).json({
      error: 'Failed to get provider health',
      message: error instanceof Error ? error.message : 'Unknown error',
    });
  }
});

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Search for documentation about /health/providers endpoint
rg -n "health/providers" -g "*.md" -g "*.yaml" -g "*.yml" -g "*.json" -g "*.ts" -g "*.tsx"

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 262


🏁 Script executed:

# Check for middleware patterns and authentication in the same file
fd "index.ts" | xargs grep -l "app.get\|internalOnlyMiddleware\|authMiddleware" | head -5

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 184


🏁 Script executed:

# Look for openapi or swagger documentation
fd -e yaml -e yml -e json -e ts | xargs grep -l "openapi\|swagger\|paths.*health" | head -10

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 442


🏁 Script executed:

# Check how other health endpoints are implemented
rg -B5 -A5 "app.get.*health" apps/api/src/index.ts | head -50

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 507


🏁 Script executed:

# Check openapi.yaml for health endpoint documentation
cat apps/gateway/openapi.yaml | grep -A 10 -B 2 "health" | head -50

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 65


🏁 Script executed:

# Search for internalOnlyMiddleware definition and usage
rg -n "internalOnlyMiddleware\|authMiddleware" apps/api/src/index.ts | head -20

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 65


🏁 Script executed:

# Check if there are other health endpoints and how they're protected
rg -B 2 -A 8 "app\.get\('\/health" apps/api/src/index.ts

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 179


🏁 Script executed:

# Look for SPIFFE ID checks or authentication patterns in the file
rg -n "SPIFFE\|auth\|authenticated\|internal" apps/api/src/index.ts | head -30

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 65


🏁 Script executed:

# Fix regex and search for health endpoints
rg -B 2 -A 8 "app.get.*health" apps/api/src/index.ts

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 495


🏁 Script executed:

# Check if there's middleware setup at the top of the file
head -100 apps/api/src/index.ts | grep -n "middleware\|app.use\|Express"

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 118


🏁 Script executed:

# Search for middleware definitions anywhere in the project
fd -e ts -e tsx | xargs rg -l "internalOnlyMiddleware|authMiddleware" | head -10

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 135


🏁 Script executed:

# Check the broader context of the endpoint - is there any authentication setup?
sed -n '680,710p' apps/api/src/index.ts

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 1027


🏁 Script executed:

# Look for any documentation or comments about who can access health endpoints
rg -B 5 "health" apps/api/src/index.ts | head -80

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 2105


🏁 Script executed:

# Check how internalOnlyMiddleware is defined and used elsewhere in the file
rg -A 10 "function internalOnlyMiddleware" apps/api/src/index.ts

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 671


🏁 Script executed:

# Search for SPIFFE validation patterns in the codebase
rg -n "SPIFFE\|spiffe://" apps/api/src/index.ts | head -10

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 65


🏁 Script executed:

# Check if there are other agent-facing endpoints and how they're protected
rg -B 3 -A 3 "agent|Agent" apps/api/src/index.ts | grep -A 3 -B 3 "app\\.get\|middleware" | head -40

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 588


🏁 Script executed:

# Check complete internalOnlyMiddleware implementation
sed -n '730,760p' apps/api/src/index.ts

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 1095


🏁 Script executed:

# Verify if authMiddleware or similar is used for agent endpoints
rg -B 2 "authMiddleware.*agent|Agent.*authMiddleware" apps/api/src/index.ts

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 65


🏁 Script executed:

# Check what the `/tenants/:tenantId/runs/:runId/steps` endpoint looks like (found in results)
sed -n '1775,1843p' apps/api/src/index.ts

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 2542


Gate /health/providers with internalOnlyMiddleware to prevent public exposure of operational health data.

This endpoint exposes circuit breaker states per LLM provider without authentication. Although intended for agents, it should be restricted to internal or authenticated access—similar to the /metrics endpoint in the same file, which uses internalOnlyMiddleware. Agent-facing endpoints elsewhere in the codebase (e.g., /tenants/:tenantId/runs/:runId/steps) are protected with authMiddleware.

-app.get('/health/providers', async (_req, res) => {
+app.get('/health/providers', internalOnlyMiddleware, async (_req, res) => {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/index.ts` around lines 687 - 704, The /health/providers route is
currently public; protect it by applying the existing internalOnlyMiddleware to
the route so only internal callers can access provider circuit-breaker data.
Update the app.get handler that calls core.getProviderHealth() to include
internalOnlyMiddleware (similar to the /metrics route) before the async handler
(or alternatively wrap the handler with authMiddleware/internalOnlyMiddleware),
ensuring the same middleware symbol name internalOnlyMiddleware is
imported/available in the file and used on the app.get('/health/providers', ...)
registration.

Comment on lines +1785 to +1810
app.get('/tenants/:tenantId/runs/:runId/steps', authMiddleware, tenantAuthMiddleware, requirePermission('run:read'), async (req, res) => {
  const { tenantId, runId } = req.params;
  const limit = parseInt(req.query.limit as string) || 50;
  const cursor = req.query.cursor as string | undefined;

  try {
    // Verify run belongs to tenant
    const eng = await getEngine();
    const run = await eng.getRun(tenantId, runId);
    if (!run) {
      return res.status(404).json({ error: 'Run not found', runId });
    }

    // Use step store (Firestore subcollection in prod, in-memory in dev)
    const { getStepStore } = await import('@gwi/core');
    const stepStore = getStepStore();
    const result = await stepStore.listSteps(runId, { limit, cursor });

    res.json({ ...result, source: 'subcollection' });
  } catch (error) {
    console.error('Failed to list steps:', error);
    res.status(500).json({
      error: 'Failed to list steps',
      message: error instanceof Error ? error.message : 'Unknown error',
    });
  }

⚠️ Potential issue | 🟠 Major

Add Zod validation and bounds for the run-steps endpoints.
limit/cursor (and path params) are accepted directly; without bounds this allows unbounded reads and violates the input-validation rule. Suggest Zod schemas with min/max and coercion, applied to both list and single-step routes.

✅ Proposed validation (list + single-step)
 app.get('/tenants/:tenantId/runs/:runId/steps', authMiddleware, tenantAuthMiddleware, requirePermission('run:read'), async (req, res) => {
-  const { tenantId, runId } = req.params;
-  const limit = parseInt(req.query.limit as string) || 50;
-  const cursor = req.query.cursor as string | undefined;
+  const paramsSchema = z.object({
+    tenantId: z.string().min(1),
+    runId: z.string().min(1),
+  });
+  const querySchema = z.object({
+    limit: z.coerce.number().int().min(1).max(100).default(50),
+    cursor: z.string().min(1).optional(),
+  });
+  const paramsParse = paramsSchema.safeParse(req.params);
+  const queryParse = querySchema.safeParse(req.query);
+  if (!paramsParse.success || !queryParse.success) {
+    return res.status(400).json({
+      error: 'Invalid request parameters',
+      details: [...(paramsParse.error?.errors ?? []), ...(queryParse.error?.errors ?? [])],
+    });
+  }
+  const { tenantId, runId } = paramsParse.data;
+  const { limit, cursor } = queryParse.data;
@@
 app.get('/tenants/:tenantId/runs/:runId/steps/:stepId', authMiddleware, tenantAuthMiddleware, requirePermission('run:read'), async (req, res) => {
-  const { tenantId, runId, stepId } = req.params;
+  const paramsSchema = z.object({
+    tenantId: z.string().min(1),
+    runId: z.string().min(1),
+    stepId: z.string().min(1),
+  });
+  const paramsParse = paramsSchema.safeParse(req.params);
+  if (!paramsParse.success) {
+    return res.status(400).json({
+      error: 'Invalid request parameters',
+      details: paramsParse.error.errors,
+    });
+  }
+  const { tenantId, runId, stepId } = paramsParse.data;

As per coding guidelines: “Use Zod schemas for validation of all external data” and “Use Zod schemas with bounded constraints for input validation”.

Also applies to: 1816-1842

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/index.ts` around lines 1785 - 1810, Add Zod input validation and
bounds to the run-steps endpoints: define Zod schemas that validate and coerce
the path params (tenantId, runId) and query params (limit and cursor) with
min/max constraints (e.g. limit coerced to number with a max like 100 and a
sensible min like 1, cursor optional string) and apply them at the start of the
handlers for the '/tenants/:tenantId/runs/:runId/steps' list route (where
getEngine() and stepStore.listSteps are used) and the corresponding single-step
route; replace the current direct parseInt(req.query.limit) and unchecked
req.params access with validated/coerced values and return 400 on validation
failure before calling eng.getRun or stepStore methods.
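
To make the concern concrete, here is a minimal, dependency-free sketch of what the current `parseInt(...) || 50` expression accepts, next to a bounded parser. The names `legacyLimit`, `parseLimit`, `DEFAULT_LIMIT`, and `MAX_LIMIT` are hypothetical and exist only for illustration; the review itself recommends Zod, and this helper is stricter than `z.coerce.number()` (it only accepts plain digit strings).

```typescript
// Hypothetical constants mirroring the bounds proposed in the review (1..100, default 50).
const DEFAULT_LIMIT = 50;
const MAX_LIMIT = 100;

// Reproduces the unvalidated expression from the handler excerpt.
function legacyLimit(raw: string | undefined): number {
  return parseInt(raw as string) || 50;
}

// Bounded alternative: returns null when the caller should answer with HTTP 400.
function parseLimit(raw: unknown): number | null {
  if (raw === undefined) return DEFAULT_LIMIT;
  // Only plain non-negative integer strings are accepted (an assumption of this sketch).
  if (typeof raw !== 'string' || !/^\d+$/.test(raw)) return null;
  const value = Number(raw);
  if (value < 1 || value > MAX_LIMIT) return null;
  return value;
}
```

With these definitions, `legacyLimit('-5')` passes a negative limit through and `legacyLimit('1000000')` passes an effectively unbounded one, while `parseLimit` rejects both and still falls back to the default when the parameter is absent.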

Comment on lines +13 to +31
import { Router } from 'express';
import {
  expressRequireAuth,
  expressRequirePermission,
  type RBACRequest,
} from '@gwi/core';
import { createLogger } from '@gwi/core';

const logger = createLogger('budget-routes');

/**
 * Determine agent-actionable recommendation based on budget usage
 */
function getRecommendation(percentUsed: number): 'normal' | 'conserve' | 'stop' {
  if (percentUsed >= 95) return 'stop';
  if (percentUsed >= 80) return 'conserve';
  return 'normal';
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the budget.ts file
find . -name "budget.ts" -type f | head -5

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 114


🏁 Script executed:

# Search for tenantId usage in the codebase
rg -n "tenantId" --type ts --type tsx -B2 -A2 | head -100

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 110


🏁 Script executed:

# Search for existing Zod schema patterns and validation
rg -n "z\.string\(\)|z\.object\(|safeParse|zod" --type ts --type tsx apps/api/src -C2 | head -150

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 110


🏁 Script executed:

# Check for existing tenantId schemas or validation helpers
rg -n "TenantId|tenant.*schema|tenant.*validation" --type ts --type tsx -i | head -50

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 110


🏁 Script executed:

# Read the budget.ts file to check actual code
wc -l ./apps/api/src/routes/budget.ts

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 118


🏁 Script executed:

# Read the budget.ts file
cat -n ./apps/api/src/routes/budget.ts

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 5422


🏁 Script executed:

# Search for tenantId usage without tsx type
rg -n "tenantId" ./apps/api/src -B2 -A2 | head -150

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 8460


🏁 Script executed:

# Search for existing Zod patterns without tsx restriction
rg -n "z\.string\(\)|z\.object\(|safeParse|from 'zod'|import.*zod" ./apps/api/src -B1 -A1 | head -100

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 6144


Validate tenantId with a bounded Zod schema before use.

req.params.tenantId is external input passed directly to getTenant(), setTenantPlan(), and getPlanUsageStatus() without validation. Add a bounded Zod schema and return 400 on invalid input.

🛡️ Suggested fix
 import { Router } from 'express';
+import { z } from 'zod';
 import {
   expressRequireAuth,
   expressRequirePermission,
   type RBACRequest,
 } from '@gwi/core';
 import { createLogger } from '@gwi/core';
 
 const logger = createLogger('budget-routes');
+
+const TenantIdSchema = z.string().min(1).max(128);
 
 /**
  * Determine agent-actionable recommendation based on budget usage
  */
@@ -63,6 +67,11 @@ export function createBudgetRoutes(): Router {
     expressRequirePermission('tenant:read'),
     async (req: RBACRequest, res) => {
-      const { tenantId } = req.params;
+      const tenantIdResult = TenantIdSchema.safeParse(req.params.tenantId);
+      if (!tenantIdResult.success) {
+        return res.status(400).json({ error: 'Invalid tenantId' });
+      }
+      const tenantId = tenantIdResult.data;
 
       try {

Per coding guidelines: "Use Zod schemas for validation of all external data" and "Use Zod schemas with bounded constraints for input validation".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/routes/budget.ts` around lines 13 - 31, Add a bounded Zod schema
to validate req.params.tenantId before calling getTenant, setTenantPlan, or
getPlanUsageStatus: import { z } from 'zod', define a schema (e.g., const
TenantIdSchema = z.string().min(1).max(128) or z.string().uuid() depending on
expected format) and call TenantIdSchema.safeParse(req.params.tenantId); if
validation fails return res.status(400).json({ error: 'Invalid tenantId' });
otherwise use the parsed value (the result's .data property) when calling
getTenant, setTenantPlan, and getPlanUsageStatus so no raw external input is
passed directly to those functions.

Comment on lines +16 to +40
import { Router } from 'express';
import {
  expressRequireAuth,
  expressRequirePermission,
  type RBACRequest,
} from '@gwi/core';
import {
  getRateLimiter,
  DEFAULT_RATE_LIMITS,
  type RateLimitResult,
} from '@gwi/core';
import { createLogger } from '@gwi/core';

const logger = createLogger('quota-routes');

/**
 * Agent-queryable rate limit status for a single action
 */
interface ActionQuotaStatus {
  limit: number;
  remaining: number;
  resetsAt: string;
}

/**

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the actual file and look for tenantId validation
echo "=== Reading quota.ts file ==="
wc -l apps/api/src/routes/quota.ts
head -80 apps/api/src/routes/quota.ts

echo ""
echo "=== Search for tenantId usage patterns ==="
rg "tenantId" apps/api/src/routes/quota.ts -n -B2 -A2

echo ""
echo "=== Search for existing Zod schemas in routes directory ==="
rg "z\.(string|object|number)" apps/api/src/routes -g '*.ts' -n | head -20

echo ""
echo "=== Search for validation patterns in api ==="
rg "safeParse|parse|validate" apps/api/src/routes -g '*.ts' -n | head -20

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 7090


🏁 Script executed:

#!/bin/bash
echo "=== Search for existing tenantId schema or validation definitions ==="
rg -n "tenantId.*Schema|TenantId.*z\." apps/ -g '*.ts' -B2 -A2

echo ""
echo "=== Search for common tenant validation patterns ==="
rg -n "z\.string\(\).*max\(" apps/ -g '*.ts' | grep -i tenant

echo ""
echo "=== Check if `@gwi/core` exports tenantId validation ==="
rg -n "export.*tenant|TenantId" apps/ -g '*.ts' | head -15

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 1654


🏁 Script executed:

#!/bin/bash
echo "=== Check lines 66-72 in quota.ts for the second issue ==="
sed -n '60,80p' apps/api/src/routes/quota.ts

echo ""
echo "=== Search for how other routes validate params (not body) ==="
rg -n "req\.params" apps/api/src/routes -g '*.ts' -B3 -A3 | head -50

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 3849


🏁 Script executed:

#!/bin/bash
echo "=== Check if there are any middleware that validate tenantId globally ==="
rg -n "middleware|Middleware" apps/api/src/routes/quota.ts -i

echo ""
echo "=== Verify that no validation happens before tenantId use ==="
sed -n '66,90p' apps/api/src/routes/quota.ts

echo ""
echo "=== Check how quotas.ts handles tenantId for comparison ==="
rg -n "req\.params\.tenantId" apps/api/src/routes/quotas.ts -B2 -A5 | head -30

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 1140


🏁 Script executed:

#!/bin/bash
echo "=== Read the quotas.ts file to see full validation pattern ==="
sed -n '1,150p' apps/api/src/routes/quotas.ts | head -100

echo ""
echo "=== Check if quotas.ts validates tenantId in its routes ==="
rg -n "tenantId" apps/api/src/routes/quotas.ts -B4 -A4 | head -40

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 3936


🏁 Script executed:

#!/bin/bash
echo "=== Check core package for tenant validation or size constraints ==="
find . -path '*/node_modules' -prune -o -name '@gwi' -type d -print 2>/dev/null | head -5

echo ""
echo "=== Look for typical tenant ID length constraints in docs or config ==="
rg -n "tenant.*size|tenant.*length|tenant.*max|tenant.*constraint" --type-list | head -10
find . -path '*/node_modules' -prune -o -name '*.md' -type f -exec grep -l "tenant.*id\|tenantId" {} \; 2>/dev/null | head -5

echo ""
echo "=== Check if there's any validation middleware in the API setup ==="
sed -n '1,100p' apps/api/src/index.ts

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 4161


🏁 Script executed:

#!/bin/bash
echo "=== Final verification: confirm tenantId is used in storage/limiter without prior validation ==="
sed -n '66,100p' apps/api/src/routes/quota.ts | grep -n "tenantId\|getTenant\|limiter"

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 562


Add Zod validation for tenantId before storage queries.

req.params.tenantId is external input used directly in getTenantStore().getTenant(tenantId) and limiter.status(tenantId, action) without validation. Validate with a bounded Zod schema and return 400 on invalid input.

🛡️ Suggested fix
 import { Router } from 'express';
+import { z } from 'zod';
 import {
   expressRequireAuth,
   expressRequirePermission,
   type RBACRequest,
 } from '@gwi/core';
@@
 interface ActionQuotaStatus {
   limit: number;
   remaining: number;
   resetsAt: string;
 }
+
+const TenantIdSchema = z.string().min(1).max(128);
@@
-      const { tenantId } = req.params;
+      const tenantIdResult = TenantIdSchema.safeParse(req.params.tenantId);
+      if (!tenantIdResult.success) {
+        return res.status(400).json({ error: 'Invalid tenantId' });
+      }
+      const tenantId = tenantIdResult.data;

Aligns with "Use Zod schemas for validation of all external data" and "Use Zod schemas with bounded constraints for input validation".

Also applies to: 66-72

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/routes/quota.ts` around lines 16 - 40, Validate
req.params.tenantId with a bounded Zod schema before any storage or rate-limiter
calls: define a z.string().min(1).max(128) (or project-bound) schema and run
safeParse on tenantId at the top of the route handler, returning HTTP 400 on
failure; then use the validated value (not raw req.params.tenantId) when calling
getTenantStore().getTenant(tenantId) and limiter.status(tenantId, action).
Ensure both the tenant lookup and rate-limiter usage reference the validated
variable and add tests for invalid tenantId cases.

Comment on lines +43 to +49
function getRecommendation(
  statuses: Record<string, ActionQuotaStatus>,
): 'normal' | 'throttle' {
  for (const status of Object.values(statuses)) {
    if (status.remaining <= 0) return 'throttle';
    const percentUsed = ((status.limit - status.remaining) / status.limit) * 100;
    if (percentUsed >= 80) return 'throttle';

⚠️ Potential issue | 🟡 Minor

Guard against zero/undefined limits to avoid NaN/Infinity throttle.

percentUsed divides by status.limit; if the limiter reports a limit of 0 (unlimited or misconfigured), the division produces NaN or ±Infinity, and a remaining value of 0 on such an action already returns throttle through the preceding check. Add a non‑positive limit guard ahead of both checks.

🛠️ Suggested fix
-  for (const status of Object.values(statuses)) {
-    if (status.remaining <= 0) return 'throttle';
-    const percentUsed = ((status.limit - status.remaining) / status.limit) * 100;
+  for (const status of Object.values(statuses)) {
+    if (status.limit <= 0) continue;
+    if (status.remaining <= 0) return 'throttle';
+    const percentUsed = ((status.limit - status.remaining) / status.limit) * 100;
     if (percentUsed >= 80) return 'throttle';
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/routes/quota.ts` around lines 43 - 49, getRecommendation
currently computes percentUsed by dividing by status.limit which can be
0/undefined and yield NaN/Infinity causing false throttles; update
getRecommendation to first guard that status.limit is a positive number (e.g.,
if (!status.limit || status.limit <= 0) continue) before computing percentUsed,
so the remaining check still applies but percentage logic only runs when limit >
0; reference the function getRecommendation and the fields
ActionQuotaStatus.limit and .remaining when making this change.
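
As a self-contained illustration of the guard, the sketch below applies the suggested fix to a standalone copy of the function. It assumes that a non-positive or non-finite limit means "unlimited" and should be skipped, and that the function falls through to 'normal'; neither assumption is confirmed by the excerpt, which is cut off before the loop ends.

```typescript
interface ActionQuotaStatus {
  limit: number;
  remaining: number;
  resetsAt: string;
}

function getRecommendation(
  statuses: Record<string, ActionQuotaStatus>,
): 'normal' | 'throttle' {
  for (const status of Object.values(statuses)) {
    // Assumption: limit <= 0 (or NaN/Infinity) marks an unlimited or misconfigured
    // action, so it must not influence the recommendation.
    if (!Number.isFinite(status.limit) || status.limit <= 0) continue;
    if (status.remaining <= 0) return 'throttle';
    const percentUsed = ((status.limit - status.remaining) / status.limit) * 100;
    if (percentUsed >= 80) return 'throttle';
  }
  // Assumed fallthrough; the original excerpt does not show it.
  return 'normal';
}
```

Placing the guard first means an action reported as `{ limit: 0, remaining: 0 }` no longer forces a throttle, while real limits behave exactly as before.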

Comment on lines +105 to +110
logger.info('Rate limit quota queried', {
  eventName: 'quota.ratelimit.query',
  tenantId,
  plan,
  recommendation,
});

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, let's run the provided search pattern to find audit logging patterns
rg -n -C2 "audit|A2A|spiffe|taskId|operationType|duration" apps/api/src packages -g '*.ts' 2>&1 | head -100

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 5875


🏁 Script executed:

#!/bin/bash
# Let's examine the quota.ts file to understand the context
cat -n apps/api/src/routes/quota.ts 2>/dev/null | head -150

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 4678


🏁 Script executed:

#!/bin/bash
# Check if there's a middleware directory or logging utilities
find apps/api/src -type f -name '*middleware*' -o -name '*audit*' -o -name '*log*' | head -20

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 65


🏁 Script executed:

#!/bin/bash
# Search for the operation type definition and see if there's a global logging pattern
rg -n "operationType|logger\.(info|debug|audit)" apps/api/src/routes -A 2 -B 2 | head -150

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 4186


Add required A2A audit log fields for agent-facing quota endpoint.

The endpoint is explicitly agent-first (per the module comment), making this an A2A operation. The log entry at lines 105-110 lacks required audit fields: timestamp (ISO 8601), SPIFFE ID, task ID, operation type, and duration. Either emit a dedicated audit log entry with these fields or confirm a global middleware adds them for this endpoint.

Per coding guidelines: "Implement audit logging for all A2A operations with timestamp (ISO 8601), SPIFFE ID, task ID, operation type, and duration".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/routes/quota.ts` around lines 105 - 110, The current logger.info
call ('Rate limit quota queried') in apps/api/src/routes/quota.ts is missing
required A2A audit fields; update the handler (the route function that calls
logger.info) to emit an audit-style log entry including timestamp (new
Date().toISOString()), spiffeId (from request context/agent auth), taskId
(request or generated trace id), operation (e.g., 'quota.query' or
'quota.ratelimit.query'), and duration (compute using a start time captured at
handler entry and end time before logging), either by extending the existing
logger.info payload with these keys or by emitting a separate audit logger call
with those fields alongside tenantId, plan, and recommendation so the endpoint
conforms to the A2A audit requirements.

Comment on lines +9 to +31
variable "enable_budget_alerts" {
  description = "Enable GCP billing budget alerts"
  type        = bool
  default     = false
}

variable "budget_amount_usd" {
  description = "Monthly budget amount in USD"
  type        = number
  default     = 500
}

variable "budget_alert_thresholds" {
  description = "Alert threshold percentages (0.0-1.0)"
  type        = list(number)
  default     = [0.5, 0.8, 0.9, 1.0]
}

variable "billing_account_id" {
  description = "GCP Billing Account ID (required if enable_budget_alerts = true)"
  type        = string
  default     = ""
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, check if infra/budget.tf exists and get its line count
ls -lh infra/budget.tf
wc -l infra/budget.tf

# Find duplicate variable declarations
echo "=== Searching for duplicate: enable_budget_alerts ==="
rg -n 'variable "enable_budget_alerts"' infra -g '*.tf'

echo "=== Searching for duplicate: billing_account_id ==="
rg -n 'variable "billing_account_id"' infra -g '*.tf'

echo "=== Searching for duplicate resource: google_pubsub_topic budget_alerts ==="
rg -n 'resource "google_pubsub_topic" "budget_alerts"' infra -g '*.tf'

# List all .tf files in infra to understand structure
echo "=== All Terraform files in infra/ ==="
find infra -name '*.tf' -type f

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 1116


🏁 Script executed:

# Read the full infra/budget.tf file to see all content
cat -n infra/budget.tf

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 3587


🏁 Script executed:

# Read the duplicate definitions in monitoring.tf
sed -n '831,860p' infra/monitoring.tf

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 879


Resolve duplicate variable and resource declarations blocking OpenTofu plan.

Duplicates for enable_budget_alerts, billing_account_id, and google_pubsub_topic.budget_alerts are declared in both infra/monitoring.tf (lines 831, 837, 849) and infra/budget.tf (lines 9, 27, 37). These will prevent OpenTofu plan and apply from succeeding. Deduplicate by consolidating definitions into a single file, or if keeping both files, move the duplicates to the shared variables.tf and remove from here.

Also applies to: 37-45, 51-78

🧰 Tools
🪛 GitHub Actions: OpenTofu Plan

[error] 9-9: OpenTofu: Duplicate variable declaration for "enable_budget_alerts". A variable named "enable_budget_alerts" is defined multiple times within the module.


[error] 27-27: OpenTofu: Duplicate variable declaration for "billing_account_id". A variable named "billing_account_id" is defined multiple times within the module.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/budget.tf` around lines 9 - 31, There are duplicate Terraform
declarations for variables enable_budget_alerts, billing_account_id,
budget_amount_usd, budget_alert_thresholds and the resource
google_pubsub_topic.budget_alerts; remove the duplicates by consolidating the
variable definitions into a single shared variables file (e.g., variables.tf)
and keep only one declaration of google_pubsub_topic.budget_alerts (or move it
to the intended monitoring/budget module), then update any references to those
symbols to point to the single source of truth so OpenTofu can plan/apply
without duplicate resource/variable errors.

Comment on lines +11 to +19
import { describe, it, expect, beforeEach } from 'vitest';
import { wrapWithResilience } from '../resilient-provider.js';
import {
  getProviderHealth,
  isProviderHealthy,
  tripProvider,
  resetProvider,
  resetLLMCircuitBreaker,
} from '../provider-health.js';

⚠️ Potential issue | 🟡 Minor

Avoid hard-coded model names in the mock provider.

Use a config-driven model identifier so tests stay aligned with environment/model rotation.

🔧 Suggested update
-import { wrapWithResilience } from '../resilient-provider.js';
+import { wrapWithResilience } from '../resilient-provider.js';
+import { getDefaultModel as getDefaultModelForProvider } from '../provider-discovery.js';

+const MODEL_ID = process.env.GWI_LLM_MODEL ?? getDefaultModelForProvider('anthropic');

 function createMockProvider(opts?: {
   failAfter?: number;
   failWith?: Error;
 }): LLMProvider & { callCount: number } {
   let callCount = 0;
   const failAfter = opts?.failAfter ?? Infinity;
   const failWith = opts?.failWith ?? new Error('Provider error (status 500)');
@@
     getModel() {
-      return 'claude-sonnet-4-20250514';
+      return MODEL_ID;
     },
@@
       return {
         json: { result: 'ok' },
         raw: '{"result":"ok"}',
         provider: 'anthropic',
-        model: 'claude-sonnet-4-20250514',
+        model: MODEL_ID,
         latencyMs: 100,
       };
@@
       return {
         text: 'Hello',
         provider: 'anthropic',
-        model: 'claude-sonnet-4-20250514',
+        model: MODEL_ID,
         latencyMs: 100,
       };
@@
-    expect(resilient.getModel()).toBe('claude-sonnet-4-20250514');
+    expect(resilient.getModel()).toBe(MODEL_ID);

As per coding guidelines, "Do not hard-code model names; use configuration and environment variables instead".

Also applies to: 32-79, 147-154

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/llm/__tests__/resilient-provider.test.ts` around lines 11 -
19, The test's mock provider currently hard-codes model names; update the mock
used in resilient-provider.test.ts (the provider passed into wrapWithResilience)
to read its model identifier from a test configuration or environment variable
instead of using literal strings, e.g., import or reference a shared TEST_MODEL
constant or process.env.TEST_MODEL (fallback to a single default for CI) and use
that variable everywhere the mock returns or checks model names so all
occurrences (including the areas corresponding to the earlier mock sections and
later assertions) remain in sync with environment/model rotation.

Comment on lines +116 to +143
export function getProviderHealth(): ProviderHealthReport {
  const cb = getLLMCircuitBreaker();
  const providerTypes: LLMProviderType[] = ['anthropic', 'google', 'openai', 'vertex', 'openai_compat'];

  const providers: Partial<Record<LLMProviderType, ProviderHealthStatus>> = {};
  let bestProvider: LLMProviderType | null = null;
  let bestScore = -1;

  for (const pt of providerTypes) {
    const status = buildProviderStatus(cb, pt);
    providers[pt] = status;

    // Score: closed = 100, half_open = 50, open = 0, minus failure rate
    const stateScore = status.state === 'closed' ? 100 : status.state === 'half_open' ? 50 : 0;
    const score = stateScore - status.failureRate;

    if (score > bestScore) {
      bestScore = score;
      bestProvider = pt;
    }
  }

  return {
    providers,
    recommendation: bestProvider,
    timestamp: new Date().toISOString(),
  };
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check the LLMProviderType definition
rg -n "type LLMProviderType|enum LLMProviderType" packages/core/src/llm/ -A 5

# Check provider-capabilities.ts for the full provider list
rg -n "PROVIDER_CAPABILITIES|export.*Provider" packages/core/src/llm/provider-capabilities.ts -A 3 | head -100

# Check what providers are registered in index.ts or similar
rg -n "register|supported.*provider|Provider.*list" packages/core/src/llm/index.ts -i -A 2 | head -50

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 3181


🏁 Script executed:

# Check if there's a getAllProviders function or how the registry list is accessed
rg -n "getAllProviders|registry\.list|vertex" packages/core/src/llm/ -A 2 | head -80

# Check where getProviderHealth is called
rg -n "getProviderHealth" packages/core/src/ -B 2 -A 2

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 6313


🏁 Script executed:

# Check if vertex should be registered or is intentionally excluded
rg -n "createVertexProvider" packages/core/src/llm/ -B 2 -A 2

# Check if the provider registry initialization should register vertex
git log --oneline -n 20 -- packages/core/src/llm/index.ts 2>/dev/null | head -10

Repository: intent-solutions-io/iam-git-with-intent

Length of output: 1000


Recommendation can point to an unregistered provider.

The hard-coded provider list includes 'vertex', which exists in the LLMProviderType type definition but is not registered in the provider registry (only google, anthropic, openai, and openai_compat are registered at lines 181-184 in index.ts). If getProviderHealth() recommends vertex, agents that try to use it will fail at runtime. The provider list should also be derived from the registry so the two stay in sync.

For the OPEN state concern: an open circuit scores 0 minus its failure rate, so it normally loses to any closed or half-open provider. Because bestScore starts at -1, though, an open provider with no recorded failures still scores 0 and can be recommended when no other provider scores higher. It is safer to skip open circuits explicitly when choosing a recommendation.

⚙️ Suggested fix
   for (const pt of providerTypes) {
     const status = buildProviderStatus(cb, pt);
     providers[pt] = status;

+    if (status.state === 'open') {
+      continue;
+    }
-    // Score: closed = 100, half_open = 50, open = 0, minus failure rate
-    const stateScore = status.state === 'closed' ? 100 : status.state === 'half_open' ? 50 : 0;
+    // Score: closed = 100, half_open = 50, minus failure rate
+    const stateScore = status.state === 'closed' ? 100 : 50;
     const score = stateScore - status.failureRate;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/llm/provider-health.ts` around lines 116 - 143,
getProviderHealth currently iterates a hard-coded providerTypes array which can
include providers not registered (e.g., "vertex") and may recommend an OPEN
circuit; change it to obtain the provider list from the provider registry (use
the registry accessor used where providers are registered) instead of the
hard-coded LLMProviderType array, call buildProviderStatus(cb, pt) for each
registered provider, compute the stateScore as before but exclude providers
whose status.state === 'open' from being considered for recommendation, and keep
the rest of the scoring logic so recommendation only points to a registered,
non-open provider; ensure variables bestProvider and providers still populate
correctly and timestamp remains.
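
The selection rule described here can be sketched in isolation as follows. `recommendProvider`, `ProviderStatus`, and `CircuitState` are hypothetical names for this example; the sketch assumes the caller passes only providers that are actually registered and that `failureRate` uses the same scale as in the original scoring. It keeps the initial `bestScore = -1` from the excerpt.

```typescript
type CircuitState = 'closed' | 'half_open' | 'open';

interface ProviderStatus {
  state: CircuitState;
  failureRate: number; // assumed to be on the same scale as the original scoring
}

// Picks the highest-scoring provider whose circuit is not open.
// `statuses` is assumed to contain only registered providers.
function recommendProvider(statuses: Record<string, ProviderStatus>): string | null {
  let best: string | null = null;
  let bestScore = -1;
  for (const [name, status] of Object.entries(statuses)) {
    // Open circuits are never eligible for recommendation.
    if (status.state === 'open') continue;
    const stateScore = status.state === 'closed' ? 100 : 50;
    const score = stateScore - status.failureRate;
    if (score > bestScore) {
      bestScore = score;
      best = name;
    }
  }
  return best;
}
```

When every circuit is open the function returns `null`, which gives callers an explicit "no healthy provider" signal instead of a recommendation that would fail on use.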

@jeremylongshore jeremylongshore merged commit 10989d0 into main Feb 20, 2026
11 of 12 checks passed
@jeremylongshore jeremylongshore deleted the feat/scale-ops-maturity branch February 20, 2026 00:14
jeremylongshore added a commit that referenced this pull request Feb 20, 2026
Release prepared by: jeremy
Approval timestamp: 2026-02-20T00:21:16Z
Approval SHA: 10989d0

Changes in this release:
- Scale & Ops Maturity epic: agent-first infrastructure (#86)
- Harness engineering hooks + README currency + BSL 1.1 (#85)
- Code Factory gap closure: incident-to-harness + browser evidence (#84)

Co-Authored-By: Claude <noreply@anthropic.com>