Description
Context
In the current M2M Credentials architecture (multi-tenant.md § "M2M Credentials via Secret Manager"), the credential cache is in-memory only (sync.Map per pod instance). This creates two gaps in production multi-tenant deployments:
- Cache inconsistency between pods — each pod has its own TTL cycle. After a credential rotation or re-fetch, other pods continue using stale cached credentials until their own TTL expires independently.
- Cache-bust not propagated — if a credential is revoked and one pod detects a 401 on token exchange, the other pods have no way to know and keep using the revoked credential for up to 5 minutes (default TTL).
Proposal
Two-level cache architecture
| Level | Store | TTL | Purpose |
|---|---|---|---|
| L1 | In-memory (sync.Map) | Short (~30s) | Fast path, avoid Redis round-trip per request |
| L2 | Redis/Valkey (distributed) | M2M_CREDENTIAL_CACHE_TTL_SEC (default 300s) | Source of truth, shared across all pods |
Fallback: if Redis is not available (dev, single-tenant), in-memory becomes the only level (current behavior preserved).
Cache-bust on auth failure (401)
When the token exchange (client_credentials grant) returns 401:
- Delete the entry from L2 (Redis) — propagates to all pods
- Delete the entry from L1 (local) — immediate effect on current pod
- Re-fetch from AWS Secrets Manager on next request
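This removes the up-to-5-minute window of using revoked credentials: the detecting pod stops immediately, and other pods stop as soon as their short-lived L1 entry (~30s) expires and they find the L2 entry gone.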
Key structure
tenant:{tenantOrgID}:m2m:{targetService}:credentials
Uses the existing valkey.GetKeyFromContext for tenant key prefixing.
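As an illustration of the key layout only, a hypothetical helper could format the key as shown below. `M2MCredentialKey` is an assumed name. In the real package the tenant prefix would presumably come from valkey.GetKeyFromContext, whose signature is not shown here, so this sketch builds the full key by hand.

```go
package main

import "fmt"

// M2MCredentialKey builds the cache key for one tenant/target-service pair.
// Hypothetical helper: the tenant prefix is hard-coded here for illustration.
func M2MCredentialKey(tenantOrgID, targetService string) string {
	return fmt.Sprintf("tenant:%s:m2m:%s:credentials", tenantOrgID, targetService)
}
```

For example, `M2MCredentialKey("org_123", "ledger")` yields `tenant:org_123:m2m:ledger:credentials`, which keeps each tenant's credentials in a separate key space.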
Configuration
M2M_CREDENTIAL_CACHE_MODE=distributed | local
default: distributed (when Redis/Valkey is available)
fallback: local (when Redis is not configured)
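One way the mode could be resolved at startup is sketched below. `CacheMode` and `ResolveCacheMode` are hypothetical names. Silently falling back to local when distributed is requested but Redis is not configured is an assumption drawn from the fallback rule above, and rejecting unknown values is a design choice of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

type CacheMode string

const (
	ModeDistributed CacheMode = "distributed"
	ModeLocal       CacheMode = "local"
)

// ResolveCacheMode maps the raw M2M_CREDENTIAL_CACHE_MODE value and Redis
// availability to the effective mode.
func ResolveCacheMode(raw string, redisConfigured bool) (CacheMode, error) {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "", "distributed":
		// Unset defaults to distributed; both cases need Redis to be configured.
		if redisConfigured {
			return ModeDistributed, nil
		}
		return ModeLocal, nil // fallback: Redis is not configured
	case "local":
		return ModeLocal, nil
	default:
		return "", fmt.Errorf("invalid M2M_CREDENTIAL_CACHE_MODE %q", raw)
	}
}
```

A caller would typically pass `os.Getenv("M2M_CREDENTIAL_CACHE_MODE")` as `raw` and derive `redisConfigured` from its existing Redis/Valkey settings.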
Changes required
- lib-commons v3 secretsmanager package: update M2MCredentialProvider to support two-level cache with Redis as L2
- multi-tenant.md: update "M2M Credentials via Secret Manager" section with distributed cache architecture and cache-bust on 401 pattern
- dev-multi-tenant skill (Gate 5.5): update implementation instructions to wire Redis cache when available
Additional: M2M metrics as mandatory
Currently the 4 M2M metrics (m2m_credential_cache_hits, m2m_credential_cache_misses, m2m_credential_fetch_errors, m2m_credential_fetch_duration_seconds) are listed as recommended. They should be mandatory in Gate 5.5, matching the 4 multi-tenant metrics that are already mandatory in Gate 7. Without them, diagnosing per-tenant M2M issues in production with many tenants is not feasible.
Origin
Discussion between Jeff and Gandalf (2026-03-16) reviewing the M2M credential architecture for plugin-to-product authentication in multi-tenant mode.