M2M Credential Cache: distributed (Redis/Valkey) as default, in-memory as fallback #279

@gandalf-at-lerian

Context

In the current M2M Credentials architecture (multi-tenant.md § "M2M Credentials via Secret Manager"), the credential cache is in-memory only (sync.Map per pod instance). This creates two gaps in production multi-tenant deployments:

  1. Cache inconsistency between pods — each pod has its own TTL cycle. After a credential rotation or re-fetch, other pods continue using stale cached credentials until their own TTL expires independently.
  2. Cache-bust not propagated — if a credential is revoked and one pod detects a 401 on token exchange, the other pods have no way to know and keep using the revoked credential for up to 5 minutes (default TTL).

Proposal

Two-level cache architecture

| Level | Store | TTL | Purpose |
| --- | --- | --- | --- |
| L1 | In-memory (sync.Map) | Short (~30s) | Fast path, avoid Redis round-trip per request |
| L2 | Redis/Valkey (distributed) | M2M_CREDENTIAL_CACHE_TTL_SEC (default 300s) | Source of truth, shared across all pods |

Fallback: if Redis/Valkey is not available (dev, single-tenant), the in-memory cache becomes the only level, preserving current behavior.
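
The lookup order described above could be sketched roughly as follows. This is a minimal illustration, not the lib-commons implementation: `L2Store`, `Fetcher`, `TwoLevelCache`, and `memoryL2` are hypothetical names, and the in-process `memoryL2` only stands in for Redis/Valkey so the example runs without a server. With no L2 configured, the sketch assumes L1 simply keeps the full TTL, which matches the current in-memory behavior.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// L2Store is an assumed abstraction over Redis/Valkey, not a lib-commons type.
type L2Store interface {
	Get(ctx context.Context, key string) (value string, found bool, err error)
	Set(ctx context.Context, key, value string, ttl time.Duration) error
}

// Fetcher loads a credential from the secret manager on a full miss.
type Fetcher func(ctx context.Context, key string) (string, error)

type l1Entry struct {
	value     string
	expiresAt time.Time
}

// TwoLevelCache reads L1 (sync.Map), then L2, then the fetcher.
// With l2 == nil it runs in local-only mode and L1 keeps the full TTL.
type TwoLevelCache struct {
	l1           sync.Map
	l2           L2Store
	l1TTL, l2TTL time.Duration
	fetch        Fetcher
}

func (c *TwoLevelCache) Get(ctx context.Context, key string) (string, error) {
	if v, ok := c.l1.Load(key); ok {
		if e := v.(l1Entry); time.Now().Before(e.expiresAt) {
			return e.value, nil // L1 hit: no network round-trip
		}
	}
	localTTL := c.l2TTL
	if c.l2 != nil {
		localTTL = c.l1TTL
		// An L2 miss or outage falls through to the fetcher.
		if val, found, err := c.l2.Get(ctx, key); err == nil && found {
			c.l1.Store(key, l1Entry{value: val, expiresAt: time.Now().Add(localTTL)})
			return val, nil
		}
	}
	val, err := c.fetch(ctx, key)
	if err != nil {
		return "", err
	}
	if c.l2 != nil {
		_ = c.l2.Set(ctx, key, val, c.l2TTL) // best effort; this pod still has L1
	}
	c.l1.Store(key, l1Entry{value: val, expiresAt: time.Now().Add(localTTL)})
	return val, nil
}

// memoryL2 is a single-goroutine stand-in for Redis/Valkey; it ignores TTLs.
type memoryL2 map[string]string

func (m memoryL2) Get(_ context.Context, key string) (string, bool, error) {
	v, ok := m[key]
	return v, ok, nil
}

func (m memoryL2) Set(_ context.Context, key, value string, _ time.Duration) error {
	m[key] = value
	return nil
}

// countingFetcher returns a fixed value and counts its invocations.
func countingFetcher(value string, calls *int) Fetcher {
	return func(context.Context, string) (string, error) {
		*calls++
		return value, nil
	}
}

func main() {
	shared, calls := memoryL2{}, 0 // shared plays the role of Redis/Valkey
	newPod := func() *TwoLevelCache {
		return &TwoLevelCache{l2: shared, l1TTL: 30 * time.Second,
			l2TTL: 300 * time.Second, fetch: countingFetcher("secret-v1", &calls)}
	}
	ctx, key := context.Background(), "tenant:org-1:m2m:ledger:credentials"
	a, _ := newPod().Get(ctx, key) // full miss: fetcher runs, then L2 and L1 are filled
	b, _ := newPod().Get(ctx, key) // a second pod is served from the shared L2
	fmt.Println(a, b, calls)       // secret-v1 secret-v1 1
}
```

Treating an L2 outage as a miss keeps requests flowing at the cost of extra Secrets Manager calls; whether that trade-off is acceptable is a design decision this sketch does not settle.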

Cache-bust on auth failure (401)

When the token exchange (client_credentials grant) returns 401:

  1. Delete the entry from L2 (Redis) — propagates to all pods
  2. Delete the entry from L1 (local) — immediate effect on current pod
  3. Re-fetch from AWS Secrets Manager on next request

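This eliminates the up-to-5-minute window of using revoked credentials: the pod that saw the 401 stops immediately, and the other pods stop as soon as their own L1 entry expires (~30s).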
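
A minimal sketch of the cache-bust steps above, under stated assumptions: `ErrUnauthorized`, `L2Deleter`, `ExchangeFunc`, `CredentialCache`, and `rejectAll` are hypothetical names introduced for illustration, and `memoryL2` is an in-process stand-in for Redis/Valkey. L2 is deleted before L1, as in the list, so a pod that misses L1 cannot reload the revoked entry from the shared store.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// ErrUnauthorized stands in for a 401 from the client_credentials token exchange.
var ErrUnauthorized = errors.New("token exchange: 401 unauthorized")

// L2Deleter is the assumed subset of a Redis/Valkey client needed here.
type L2Deleter interface {
	Del(ctx context.Context, key string) error
}

// ExchangeFunc performs the token exchange with a cached credential.
type ExchangeFunc func(ctx context.Context, credential string) (token string, err error)

type CredentialCache struct {
	l1 sync.Map  // local entries, keyed by cache key
	l2 L2Deleter // nil in local-only mode
}

// Invalidate deletes from L2 first, so pods that miss L1 cannot reload the
// revoked entry, then from L1. L1 is cleared even if the L2 delete fails.
func (c *CredentialCache) Invalidate(ctx context.Context, key string) error {
	var err error
	if c.l2 != nil {
		err = c.l2.Del(ctx, key)
	}
	c.l1.Delete(key)
	return err
}

// ExchangeWithBust runs the exchange and busts both levels on a 401.
// The caller re-fetches from the secret manager on its next request.
func (c *CredentialCache) ExchangeWithBust(ctx context.Context, key, credential string, exchange ExchangeFunc) (string, error) {
	token, err := exchange(ctx, credential)
	if errors.Is(err, ErrUnauthorized) {
		if delErr := c.Invalidate(ctx, key); delErr != nil {
			return "", errors.Join(err, delErr)
		}
	}
	return token, err
}

// memoryL2 is a single-goroutine stand-in for Redis/Valkey.
type memoryL2 map[string]string

func (m memoryL2) Del(_ context.Context, key string) error {
	delete(m, key)
	return nil
}

// rejectAll simulates an identity provider that has revoked the credential.
func rejectAll(context.Context, string) (string, error) {
	return "", ErrUnauthorized
}

func main() {
	key := "tenant:org-1:m2m:ledger:credentials"
	shared := memoryL2{key: "revoked-secret"}
	cache := &CredentialCache{l2: shared}
	cache.l1.Store(key, "revoked-secret")

	_, err := cache.ExchangeWithBust(context.Background(), key, "revoked-secret", rejectAll)
	_, inL1 := cache.l1.Load(key)
	_, inL2 := shared[key]
	fmt.Println(errors.Is(err, ErrUnauthorized), inL1, inL2) // true false false
}
```

Note that deleting the L2 entry does not reach into the L1 of other pods; they pick up the change when their short L1 TTL lapses.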

Key structure

tenant:{tenantOrgID}:m2m:{targetService}:credentials

Uses the existing valkey.GetKeyFromContext for tenant key prefixing.
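
For illustration, the key layout could be rendered as below. `credentialKey` is a hypothetical helper, not part of lib-commons. In the real implementation the tenant prefix would presumably come from `valkey.GetKeyFromContext`, whose signature is not shown in this issue, so the sketch builds the whole string itself. Rejecting empty parts and `:` is an added assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// credentialKey renders tenant:{tenantOrgID}:m2m:{targetService}:credentials.
// Rejecting empty parts and ':' is an assumption of this sketch, made so that
// two different (tenant, service) pairs can never collide on one key.
func credentialKey(tenantOrgID, targetService string) (string, error) {
	for _, part := range []string{tenantOrgID, targetService} {
		if part == "" || strings.Contains(part, ":") {
			return "", errors.New("key part must be non-empty and must not contain ':'")
		}
	}
	return fmt.Sprintf("tenant:%s:m2m:%s:credentials", tenantOrgID, targetService), nil
}

func main() {
	key, err := credentialKey("org-123", "ledger")
	fmt.Println(key, err) // tenant:org-123:m2m:ledger:credentials <nil>
}
```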

Configuration

M2M_CREDENTIAL_CACHE_MODE=distributed | local
  default: distributed (when Redis/Valkey is available)
  fallback: local (when Redis is not configured)
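
One possible way to resolve these settings at startup, sketched under assumptions: `CacheMode`, `CacheConfig`, `resolveCacheConfig`, and `mapLookup` are hypothetical names, and the rule that an explicit `distributed` quietly falls back to `local` when Redis is not configured is this sketch's reading of the fallback line above, not confirmed behavior.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

type CacheMode string

const (
	ModeDistributed CacheMode = "distributed"
	ModeLocal       CacheMode = "local"
)

type CacheConfig struct {
	Mode CacheMode
	TTL  time.Duration
}

// resolveCacheConfig reads M2M_CREDENTIAL_CACHE_MODE and
// M2M_CREDENTIAL_CACHE_TTL_SEC through lookup, which has the same
// signature as os.LookupEnv so production code can pass that directly.
func resolveCacheConfig(lookup func(string) (string, bool), redisConfigured bool) (CacheConfig, error) {
	cfg := CacheConfig{Mode: ModeLocal, TTL: 300 * time.Second}
	if redisConfigured {
		cfg.Mode = ModeDistributed // default when Redis/Valkey is available
	}
	if raw, ok := lookup("M2M_CREDENTIAL_CACHE_MODE"); ok && raw != "" {
		switch CacheMode(raw) {
		case ModeLocal:
			cfg.Mode = ModeLocal
		case ModeDistributed:
			// Keep the availability-based default: local when Redis is not configured.
		default:
			return cfg, fmt.Errorf("M2M_CREDENTIAL_CACHE_MODE: unknown mode %q", raw)
		}
	}
	if raw, ok := lookup("M2M_CREDENTIAL_CACHE_TTL_SEC"); ok && raw != "" {
		secs, err := strconv.Atoi(raw)
		if err != nil || secs <= 0 {
			return cfg, fmt.Errorf("M2M_CREDENTIAL_CACHE_TTL_SEC: want a positive integer, got %q", raw)
		}
		cfg.TTL = time.Duration(secs) * time.Second
	}
	return cfg, nil
}

// mapLookup adapts a map to the os.LookupEnv signature for demos and tests.
func mapLookup(env map[string]string) func(string) (string, bool) {
	return func(k string) (string, bool) {
		v, ok := env[k]
		return v, ok
	}
}

func main() {
	env := map[string]string{"M2M_CREDENTIAL_CACHE_TTL_SEC": "60"}
	cfg, err := resolveCacheConfig(mapLookup(env), true)
	fmt.Println(cfg.Mode, cfg.TTL, err) // distributed 1m0s <nil>
}
```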

Changes required

  1. lib-commons v3 secretsmanager package — update M2MCredentialProvider to support two-level cache with Redis as L2
  2. multi-tenant.md — update "M2M Credentials via Secret Manager" section with distributed cache architecture and cache-bust on 401 pattern
  3. dev-multi-tenant skill (Gate 5.5) — update implementation instructions to wire Redis cache when available

Additional: M2M metrics as mandatory

Currently the 4 M2M metrics (m2m_credential_cache_hits, m2m_credential_cache_misses, m2m_credential_fetch_errors, m2m_credential_fetch_duration_seconds) are listed as recommended. They should be mandatory in Gate 5.5, matching the 4 multi-tenant metrics that are already mandatory in Gate 7. Without them, diagnosing per-tenant M2M issues in production with many tenants is not feasible.

Origin

Discussion between Jeff and Gandalf (2026-03-16) reviewing the M2M credential architecture for plugin-to-product authentication in multi-tenant mode.

Metadata

Labels: enhancement
