M2M Credential Cache: distributed (Redis/Valkey) as default, in-memory as fallback

## Context

In the current M2M Credentials architecture (multi-tenant.md § "M2M Credentials via Secret Manager"), the credential cache is **in-memory only** (`sync.Map` per pod instance). This creates two gaps in production multi-tenant deployments:

1. **Cache inconsistency between pods** — each pod has its own TTL cycle. After a credential rotation or re-fetch, other pods continue using stale cached credentials until their own TTL expires independently.
2. **Cache-bust not propagated** — if a credential is revoked and one pod detects a 401 on token exchange, the other pods have no way to know and keep using the revoked credential for up to 5 minutes (default TTL).

## Proposal

### Two-level cache architecture

| Level | Store | TTL | Purpose |
|-------|-------|-----|---------|
| L1 | In-memory (`sync.Map`) | Short (~30s) | Fast path, avoid Redis round-trip per request |
| L2 | Redis/Valkey (distributed) | `M2M_CREDENTIAL_CACHE_TTL_SEC` (default 300s) | Source of truth, shared across all pods |

**Fallback:** if Redis is not available (dev, single-tenant), in-memory becomes the only level (current behavior preserved).

### Cache-bust on auth failure (401)

When the token exchange (client_credentials grant) returns 401:
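1. Delete the entry from L2 (Redis), so no pod can reload the revoked credential from the shared cache
2. Delete the entry from L1 (local), which takes effect immediately on the current pod
3. Re-fetch from AWS Secrets Manager on next request

This shrinks the up-to-5-minute window of using revoked credentials. Other pods may still serve the entry from their own L1 until its short TTL (~30s) expires, unless the invalidation is also broadcast to them.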

### Key structure

```
tenant:{tenantOrgID}:m2m:{targetService}:credentials
```

Uses the existing `valkey.GetKeyFromContext` for tenant key prefixing.
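For illustration, the key layout can be produced by a small helper like the one below. `CredentialKey` is a hypothetical function that builds the whole key by hand; the signature of `valkey.GetKeyFromContext` is not shown in this issue, so it is not called here, and in the real code the `tenant:{tenantOrgID}:` prefix would presumably come from it. Rejecting `:` and whitespace inside the parts is an extra assumption to keep keys unambiguous.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// CredentialKey builds "tenant:{tenantOrgID}:m2m:{targetService}:credentials".
// It is a hand-rolled illustration of the layout, not the lib-commons helper.
func CredentialKey(tenantOrgID, targetService string) (string, error) {
	if tenantOrgID == "" || targetService == "" {
		return "", errors.New("tenantOrgID and targetService must be non-empty")
	}
	// Assumption: ':' is the segment separator, so it may not appear inside a part.
	if strings.ContainsAny(tenantOrgID+targetService, ": \t\r\n") {
		return "", errors.New("key parts must not contain ':' or whitespace")
	}
	return fmt.Sprintf("tenant:%s:m2m:%s:credentials", tenantOrgID, targetService), nil
}
```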

### Configuration

```
M2M_CREDENTIAL_CACHE_MODE=distributed | local
  default: distributed (when Redis/Valkey is available)
  fallback: local (when Redis is not configured)
```
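One possible way to resolve the effective mode at startup is sketched below. `CacheMode` and `ResolveCacheMode` are hypothetical names; the caller would pass `os.Getenv("M2M_CREDENTIAL_CACHE_MODE")` and a flag saying whether a Redis/Valkey client was configured. Silently falling back to `local` when `distributed` is requested without Redis follows the fallback rule above, though failing fast would be an equally defensible choice.

```go
package main

import (
	"fmt"
	"strings"
)

// CacheMode is the effective credential cache mode.
type CacheMode string

const (
	ModeDistributed CacheMode = "distributed"
	ModeLocal       CacheMode = "local"
)

// ResolveCacheMode maps the raw env value and Redis availability to a mode.
// An unset value defaults to distributed when Redis/Valkey is configured.
func ResolveCacheMode(raw string, redisConfigured bool) (CacheMode, error) {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "", "distributed":
		if redisConfigured {
			return ModeDistributed, nil
		}
		return ModeLocal, nil // fallback: Redis is not configured
	case "local":
		return ModeLocal, nil
	default:
		return "", fmt.Errorf("invalid M2M_CREDENTIAL_CACHE_MODE %q", raw)
	}
}
```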

## Changes required

1. **lib-commons v3 `secretsmanager` package** — update `M2MCredentialProvider` to support two-level cache with Redis as L2
2. **multi-tenant.md** — update "M2M Credentials via Secret Manager" section with distributed cache architecture and cache-bust on 401 pattern
3. **dev-multi-tenant skill (Gate 5.5)** — update implementation instructions to wire Redis cache when available

## Additional: M2M metrics as mandatory

Currently the 4 M2M metrics (`m2m_credential_cache_hits`, `m2m_credential_cache_misses`, `m2m_credential_fetch_errors`, `m2m_credential_fetch_duration_seconds`) are listed as recommended. They should be **mandatory** in Gate 5.5, matching the 4 multi-tenant metrics that are already mandatory in Gate 7. Without them, diagnosing per-tenant M2M issues in production with many tenants is not feasible.

## Origin

Discussion between Jeff and Gandalf (2026-03-16) reviewing the M2M credential architecture for plugin-to-product authentication in multi-tenant mode.
