
feat(control-plane): CP /token endpoint for runner gRPC auth RHOAIENG-56711#1213

Merged
markturansky merged 5 commits into alpha from feat/RHOAIENG-56711-cp-token-endpoint
Apr 4, 2026

Conversation

@markturansky
Contributor

@markturansky markturansky commented Apr 4, 2026

Summary

  • Eliminates the async BOT_TOKEN Secret push/refresh loop — replaces it with a synchronous pull from a new CP HTTP endpoint
  • New internal/tokenserver package — GET /token validates the caller's K8s SA token via TokenReview, checks the system:serviceaccount:*:session-*-sa pattern, and returns a fresh API token via OIDCTokenProvider
  • Runner updated — _fetch_token_from_cp() reads the pod SA token from the standard K8s mount and calls CP /token on startup and on every gRPC reconnect; falls back to the BOT_TOKEN env var for local dev when AMBIENT_CP_TOKEN_URL is unset
  • kube_reconciler cleaned up — removes ensureSecret, StartTokenRefreshLoop, refreshRunnerToken, and refreshAllRunningTokens; sets automountServiceAccountToken: true; injects AMBIENT_CP_TOKEN_URL instead of the BOT_TOKEN secret ref
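The runner-side flow described above can be sketched roughly as follows. This is a minimal sketch, not the actual _grpc_client.py code: the helper name and JSON shape follow the PR description, and the SA token path is the standard Kubernetes mount.

```python
import json
import os
import urllib.request

# Standard Kubernetes ServiceAccount token mount.
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"

def fetch_api_token() -> str:
    """Pull a fresh API token from the CP, or fall back to BOT_TOKEN for local dev."""
    cp_token_url = os.environ.get("AMBIENT_CP_TOKEN_URL")
    if not cp_token_url:
        # AMBIENT_CP_TOKEN_URL unset: local dev path, use the static BOT_TOKEN.
        return os.environ["BOT_TOKEN"]
    with open(SA_TOKEN_PATH) as f:
        sa_token = f.read().strip()
    req = urllib.request.Request(
        cp_token_url, headers={"Authorization": f"Bearer {sa_token}"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())["token"]
```

Calling this at startup and on every gRPC reconnect is what replaces the Secret push/refresh loop.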

Design rationale

The CP is the component rotating the secret — if it is down for /token, it is down for secret cycling too, so the pull model adds no new failure mode. Synchronous pull eliminates the 3-way race (CP ticker → kubelet propagation → runner read) and removes a class of stale-token failures.

Spec: docs/internal/design/control-plane.spec.md

Test plan

  • Deploy CP with CP_TOKEN_LISTEN_ADDR=:8080 and CP_TOKEN_URL=http://<cp-svc>:8080/token
  • Start a session; confirm runner pod has automountServiceAccountToken: true and AMBIENT_CP_TOKEN_URL env set
  • Confirm runner fetches token on startup (log: [GRPC CLIENT] Fetched fresh API token from CP token endpoint)
  • Force gRPC reconnect; confirm fresh token fetched via CP endpoint
  • Verify GET /healthz on token server returns 200
  • Confirm non-runner SA token is rejected (403)
  • Local dev: unset AMBIENT_CP_TOKEN_URL, set BOT_TOKEN; confirm fallback path works
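The 403 rejection in the test plan hinges on the runner-SA username pattern. A Python mirror of that check (hypothetical; the real check is the Go isRunnerSA() in internal/tokenserver) could look like:

```python
import fnmatch

def is_runner_sa(username: str) -> bool:
    """TokenReview usernames look like system:serviceaccount:<namespace>:<name>;
    only names matching session-*-sa pass. Mirrors the CP-side pattern check."""
    parts = username.split(":")
    if len(parts) != 4 or parts[:2] != ["system", "serviceaccount"]:
        return False
    return fnmatch.fnmatch(parts[3], "session-*-sa")
```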

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Control plane exposes a new /token endpoint that issues API tokens on demand.
    • Runners now obtain tokens from the control plane (AMBIENT_CP_TOKEN_URL) using their ServiceAccount tokens; BOT_TOKEN secret is no longer used.
    • Control plane can optionally ensure a NetworkPolicy to allow API-server access between namespaces.
  • Configuration

    • Added CP token settings (CP_TOKEN_LISTEN_ADDR, CP_TOKEN_URL) to control plane config.
  • Documentation

    • Design doc updated to describe the CP token endpoint, runner auth flow, and security implications.

Ambient Code Bot and others added 3 commits April 4, 2026 13:03
…wal RHOAIENG-56711

Documents the synchronous token fetch endpoint on the CP as the
replacement for async Secret-write renewal. Covers the problem
(three-way async race against OIDC TTL), solution (GET /token with
K8s SA TokenReview auth), bootstrap vs renewal split, new
internal/tokenserver packages, runner reconnect changes, and
AMBIENT_CP_TOKEN_URL env var injection.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
CP creates the runner pod, so it is always up before the runner's first
token request. Retaining a Secret write loop adds complexity and a second
failure mode with identical blast radius. SA token automount flipped to
true; BOT_TOKEN removed from env vars table and bootstrap split removed.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…h RHOAIENG-56711

Eliminates the async BOT_TOKEN Secret push+refresh loop in favour of a
synchronous pull model:

- Add internal/tokenserver package: HTTP server on :8080 with GET /token
  handler that validates the caller's K8s SA token via TokenReview, checks
  the runner SA pattern (system:serviceaccount:*:session-*-sa), then mints
  and returns a fresh API token via the existing OIDCTokenProvider.
- Wire token server goroutine into runKubeMode() alongside the informer.
- Add CPTokenListenAddr / CPTokenURL fields to ControlPlaneConfig and
  KubeReconcilerConfig; read from CP_TOKEN_LISTEN_ADDR / CP_TOKEN_URL env.
- kube_reconciler: remove ensureSecret, StartTokenRefreshLoop,
  refreshRunnerToken, refreshAllRunningTokens; replace BOT_TOKEN secret
  env injection with AMBIENT_CP_TOKEN_URL; set automountServiceAccountToken
  true on SA and pod so the K8s-mounted SA token is available at the
  standard path.
- runner _grpc_client.py: add _fetch_token_from_cp() that reads the pod SA
  token and calls the CP /token endpoint; from_env() calls it when
  AMBIENT_CP_TOKEN_URL is set, falls back to BOT_TOKEN env var for local
  dev; reconnect() refreshes via CP endpoint on every gRPC reconnect.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented Apr 4, 2026

📝 Walkthrough

Adds a control-plane HTTP token server (GET /token) that validates Kubernetes ServiceAccount bearer tokens via TokenReview and mints API tokens. Removes per-session secret-based token injection/refresh; runners use their mounted SA token to synchronously request API tokens from the CP endpoint.

Changes

  • Token Server (components/ambient-control-plane/internal/tokenserver/server.go, components/ambient-control-plane/internal/tokenserver/handler.go): New HTTP server and handler exposing /token and /healthz. Validates SA bearer tokens with TokenReview, enforces runner SA naming, and mints/returns API tokens via TokenProvider. Includes lifecycle (start/shutdown) and a health endpoint.
  • Control Plane Main (components/ambient-control-plane/cmd/ambient-control-plane/main.go): Builds the Kubernetes REST config, starts the token server in its own goroutine, surfaces token-server errors via channels, and removes prior per-session token-refresh special-casing.
  • Reconciler / Kube (components/ambient-control-plane/internal/reconciler/kube_reconciler.go): Removed per-session Secret creation, token volumes/mounts, and token-refresh loops. Added CPTokenURL to the reconciler config; injects AMBIENT_CP_TOKEN_URL into runner and MCP sidecar env. Enabled automountServiceAccountToken and added optional runtime-namespace NetworkPolicy reconciliation.
  • Config (components/ambient-control-plane/internal/config/config.go): Added CPTokenListenAddr (default :8080) and CPTokenURL fields populated from CP_TOKEN_LISTEN_ADDR and CP_TOKEN_URL env vars.
  • KubeClient (components/ambient-control-plane/internal/kubeclient/kubeclient.go): Added NetworkPolicyGVR and methods to Get/Create NetworkPolicy via the dynamic client to support network-policy reconciliation.
  • Dependencies (components/ambient-control-plane/go.mod): Added k8s.io/api v0.34.0 and related indirect OpenAPI/spec dependencies.
  • Runner gRPC Client, Python (components/runners/ambient-runner/ambient_runner/_grpc_client.py): Replaced the secret-file token read with a CP exchange: read the pod SA token, call the CP AMBIENT_CP_TOKEN_URL to obtain an API token, add retries and validation, and update reconnect/from_env flows to use the CP endpoint when configured.
  • Docs (docs/internal/design/control-plane.spec.md): Documented the CP token endpoint design, runner SA token flow, and the shift from secret-based async renewal to synchronous CP token polling.

Sequence Diagram

sequenceDiagram
    participant Runner as Runner Pod
    participant K8s as Kubernetes API
    participant CP as CP Token Server
    participant TP as TokenProvider

    Runner->>Runner: Read SA token (mounted)
    Runner->>CP: GET /token\nAuthorization: Bearer {SA_token}
    CP->>K8s: TokenReview(SA_token)
    K8s-->>CP: {authenticated, user: runner-sa}
    CP->>CP: Validate runner SA name
    CP->>TP: Token(ctx)
    TP-->>CP: {api_token, expires_at}
    CP-->>Runner: {"token": api_token}
    Runner->>Runner: Use api_token for gRPC auth

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 2 warnings)

  • Security And Secret Handling — ❌ Error: ServiceAccount token proxy leakage vulnerability unresolved: urllib.request.urlopen() at line 75 in _grpc_client.py still honors *_PROXY environment variables, exposing the pod SA token to interception. Resolution: implement a proxyless opener using urllib.request.build_opener(urllib.request.ProxyHandler({})) and replace urlopen() calls with _CP_TOKEN_OPENER.open() for token endpoint requests.
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 26.32%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Kubernetes Resource Safety — ⚠️ Warning: pod resources lack CPU/memory limits and security contexts, and NetworkPolicy objects lack OwnerReferences. Resolution: add resource limits/requests to containers, set a pod SecurityContext with RunAsNonRoot and ReadOnlyRootFilesystem, and set OwnerReferences on NetworkPolicy objects.
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped — CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: title follows the Conventional Commits format (feat(control-plane): ...), includes the JIRA key RHOAIENG-56711, and accurately describes the main changeset: introducing a new CP /token endpoint for runner authentication.
  • Performance And Algorithmic Complexity — ✅ Passed: the PR introduces no performance regressions or algorithmic complexity issues. All K8s API operations are properly scoped, token fetch retries are bounded to 3 attempts with exponential backoff, there are no N+1 patterns, and goroutines are correctly managed.
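The failed security check asks for a proxyless opener so the pod SA token never transits an egress proxy configured via *_PROXY environment variables. A minimal sketch of that suggestion (names are illustrative, not the PR's actual code):

```python
import urllib.request

# Opener built with an empty ProxyHandler: it ignores HTTP_PROXY/HTTPS_PROXY/
# ALL_PROXY, so the bearer token only ever travels directly to the CP endpoint.
_CP_TOKEN_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))

def fetch_direct(url: str, sa_token: str, timeout: float = 10.0) -> bytes:
    """Fetch the token endpoint without consulting any proxy settings."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {sa_token}"})
    with _CP_TOKEN_OPENER.open(req, timeout=timeout) as resp:
        return resp.read()
```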

Contributor Author

@markturansky markturansky left a comment


Summary

This is a well-architected solution to a real production issue. The shift from async Secret-push to synchronous token-pull eliminates the 3-way race (CP write → kubelet propagate → runner read) that was causing UNAUTHENTICATED loops. The implementation is clean, removes significant complexity (~200 LOC deleted), and uses K8s-native authentication properly.

Strengths

  • Clear problem statement: the async race between OIDC token TTL, Secret propagation delay, and runner read timing is well-documented and motivated
  • Significant cleanup: removes ensureSecret, StartTokenRefreshLoop, refreshRunnerToken, and refreshAllRunningTokens — the entire refresh mechanism is eliminated
  • Security: the K8s TokenReview API is the correct approach for SA token verification
  • Developer experience: fallback to the BOT_TOKEN env var preserves the local dev workflow
  • Documentation: excellent spec in control-plane.spec.md explaining rationale and design decisions
  • Production ready: proper timeouts (TokenReview: 10s, HTTP server read/write: 10s), health endpoint, graceful shutdown

Issues & Questions

1. Goroutine error handling (main.go:180-191)

errCh := make(chan error, 1)
go func() {
    if err := startTokenServer(ctx, cfg, tokenProvider); err != nil {
        errCh <- err
    }
}()

infErr := inf.Run(ctx)
select {
case tsErr := <-errCh:
    return fmt.Errorf("token server: %w", tsErr)
default:
}

Issue: This only catches startup errors. If the token server starts successfully but crashes later (e.g., K8s client fails after 5 minutes), that error is silently lost. The select runs once after inf.Run() returns, but by then the token server goroutine may have exited long ago.

Suggestion: Either:

  • Use an error group pattern (golang.org/x/sync/errgroup) to properly coordinate goroutine lifetimes
  • Or accept that token server crashes are non-fatal and just log them (since runners will retry)

2. MCP sidecar compatibility (kube_reconciler.go:453)

envVar("AMBIENT_CP_TOKEN_URL", r.cfg.CPTokenURL),

The MCP sidecar now receives AMBIENT_CP_TOKEN_URL instead of BOT_TOKEN, but I don't see corresponding changes to the MCP sidecar implementation. Does the MCP sidecar code support fetching tokens from the CP endpoint, or does it still expect AMBIENT_TOKEN env var to contain the actual token?

Suggestion: Verify MCP sidecar compatibility or document that this is a follow-up PR.

3. No retry logic in runner (_grpc_client.py:15-42)

def _fetch_token_from_cp(cp_token_url: str) -> str:
    ...
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = json.loads(resp.read())

If the CP token endpoint is temporarily unavailable (e.g., during a rolling update), the runner will crash on startup or fail on reconnect with no retry. This makes deployments brittle.

Suggestion: Add exponential backoff retry (e.g., 3 attempts with 1s/2s/4s delays) for transient network issues.
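A minimal sketch of the suggested backoff, wrapping whatever fetch callable the runner uses. The attempt count and delays here are this review's suggestion, not shipped behavior:

```python
import time
import urllib.error

def fetch_with_retry(fetch, attempts: int = 3, base_delay: float = 1.0):
    """Call fetch() up to `attempts` times, sleeping base_delay * 2**n between
    tries (1s, 2s, ... with the defaults). Re-raises the last error."""
    for attempt in range(attempts):
        try:
            return fetch()
        except (urllib.error.URLError, TimeoutError):
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

This keeps transient CP unavailability (e.g. a rolling update) from crashing the runner on startup or reconnect.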

4. Unused token expiration field (tokenserver/handler.go:597)

type tokenResponse struct {
    Token     string `json:"token"`
    ExpiresAt string `json:"expires_at,omitempty"`
}

The response includes expires_at, but it is always empty and the runner never reads it. If populated, it could enable proactive token refresh before expiry.

Suggestion: Either populate expires_at from OIDCTokenProvider metadata, or remove the field if not needed.

5. Missing test coverage

No unit tests for:

  • internal/tokenserver package (TokenReview validation, SA pattern matching, error cases)
  • Python _fetch_token_from_cp() function
  • Integration test for token server → runner auth flow

The test plan in the PR description is entirely manual.

Suggestion: Add at minimum:

  • Unit tests for isRunnerSA() with various username patterns
  • Unit test for handleToken() with mocked TokenReview responses
  • Python unit test for token fetch with mocked HTTP responses
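For the Python side, a mocked-HTTP unit test along the suggested lines might look like this (fetch_token stands in for _fetch_token_from_cp; the JSON shape follows the PR description):

```python
import json
import unittest
from unittest import mock
import urllib.request

def fetch_token(url: str) -> str:
    """Stand-in for _fetch_token_from_cp: GET url, parse {"token": ...}."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read())["token"]

class FetchTokenTest(unittest.TestCase):
    def test_parses_token_from_json_body(self):
        # Fake response object usable as a context manager.
        fake = mock.MagicMock()
        fake.read.return_value = json.dumps({"token": "abc123"}).encode()
        fake.__enter__.return_value = fake
        with mock.patch("urllib.request.urlopen", return_value=fake):
            self.assertEqual(fetch_token("http://cp:8080/token"), "abc123")
```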

6. HTTP error handling in Python (_grpc_client.py:34)

with urllib.request.urlopen(req, timeout=10) as resp:
    body = json.loads(resp.read())

This will raise urllib.error.HTTPError for 4xx/5xx responses, but the exception message won't include the response body (which likely contains the CP's error message like "forbidden" or "unauthorized").

Suggestion:

try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = json.loads(resp.read())
except urllib.error.HTTPError as e:
    error_body = e.read().decode('utf-8', errors='ignore')
    raise RuntimeError(f"CP token endpoint returned {e.code}: {error_body}") from e

7. Hardcoded token fetch timeout

urllib.request.urlopen(req, timeout=10) uses a 10s timeout, but this isn't configurable. In slow clusters or during CP restarts, this might be too aggressive.

Suggestion: Make timeout configurable via env var (e.g., AMBIENT_CP_TOKEN_TIMEOUT_SECS) with 10s default.
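A sketch of the suggested env override (AMBIENT_CP_TOKEN_TIMEOUT_SECS is this review's proposed name, not an existing variable):

```python
import os

def token_fetch_timeout(default: float = 10.0) -> float:
    """Read AMBIENT_CP_TOKEN_TIMEOUT_SECS, falling back to the 10s default
    on an unset, non-numeric, or non-positive value."""
    raw = os.environ.get("AMBIENT_CP_TOKEN_TIMEOUT_SECS", "")
    try:
        value = float(raw)
    except ValueError:
        return default
    return value if value > 0 else default
```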

Minor observations

  • go.mod growth: Adds k8s.io/api and ~10 transitive deps. Necessary but worth noting for dependency audits.
  • Documentation accuracy: The spec says "Status: 🔲 planned — RHOAIENG-56711" but should be updated to "✅ implemented" when merged.
  • Service pattern brittleness: isRunnerSA() assumes session-*-sa naming. If SA naming changes, update this pattern or make it configurable.
  • Token server port: Hardcoded :8080 conflicts with many default ports. Configurable via CP_TOKEN_LISTEN_ADDR but worth documenting the port choice.

Recommendation

Approve with minor fixes for issues #3 (retry logic) and #6 (HTTP error messages). Issues #1, #2, and #5 can be follow-up PRs if time-constrained.

This PR significantly improves production reliability by eliminating a real race condition. The architecture is sound and the implementation is clean. Great work! 🎉


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/ambient-control-plane/cmd/ambient-control-plane/main.go`:
- Around line 185-198: runKubeMode currently starts startTokenServer in a
goroutine and then calls inf.Run(ctx) but never observes errCh while inf.Run is
running, so a token server failure is dropped; change runKubeMode to propagate
token server errors by selecting between the token error channel and the
informer run (or cancelling the context) — for example, start startTokenServer
as you do but then either (a) run inf.Run in a goroutine and select on errCh and
a result channel from inf.Run, or (b) after starting both goroutines, use a
select that waits for tsErr from errCh and returns fmt.Errorf("token server:
%w", tsErr) (or cancels ctx) if received, and otherwise returns infErr when
inf.Run completes; reference runKubeMode, startTokenServer, errCh, and inf.Run
to locate the change.

In `@components/ambient-control-plane/internal/config/config.go`:
- Around line 76-77: The CPTokenURL field (CPTokenURL) is currently optional
(os.Getenv) which leaves pods without a credential source; change the config to
require AMBIENT_CP_TOKEN_URL by replacing the optional read with a required
lookup and failing fast when empty: use the existing envOrDefault helper or add
an envMust helper to read "CP_TOKEN_URL" (or "AMBIENT_CP_TOKEN_URL" if that's
the reconciler-injected name) and return an error from the config
constructor/newConfig function when the value is empty so the process exits
early; update any callers of the config constructor to handle the error
accordingly.

In `@components/ambient-control-plane/internal/tokenserver/handler.go`:
- Around line 56-58: The current check in the token handler (isRunnerSA(...) and
the branches that log and return forbidden) is too weak; instead resolve the
namespace/name from username, fetch the ServiceAccount via the Kubernetes client
available on the handler (e.g., h.kubeClient or the handler method that
processes the request), and verify this SA is a CP-managed session SA before
minting the token. Concretely: in the token handling code paths that call
isRunnerSA (and the separate block around lines 99-108), parse username to get
namespace and name, use the Kubernetes API to GET the ServiceAccount, and
validate it was created/owned by the control plane (check a reconciler-set label
or ownerReference) before proceeding to mint; if the SA is missing or not
CP-managed, return http.Error(w, "forbidden", http.StatusForbidden) and log via
h.logger.

In `@components/runners/ambient-runner/ambient_runner/_grpc_client.py`:
- Around line 3-6: Validate the AMBIENT_CP_TOKEN_URL before calling
urllib.request.urlopen: parse the AMBIENT_CP_TOKEN_URL variable (use
urllib.parse.urlparse) and ensure the scheme is a safe allowed value (e.g.,
"https") and host is non-empty; if the scheme is not allowed, log an error via
logging and skip/abort sending the Kubernetes SA token instead of calling
urllib.request.urlopen. Update the code paths that reference
AMBIENT_CP_TOKEN_URL and urllib.request.urlopen to perform this check and fail
closed to prevent token exfiltration.
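A fail-closed validation along these lines might look like the following (a sketch of the suggested check, not the shipped code):

```python
from urllib.parse import urlparse

def validate_cp_token_url(url: str) -> bool:
    """Fail closed: only plain http/https URLs with a host and no embedded
    credentials may ever receive the pod SA token."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # rejects file://, ftp://, etc.
    if not parsed.hostname:
        return False
    if parsed.username or parsed.password:
        return False  # rejects http://user:pass@host exfiltration tricks
    return True
```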

📥 Commits

Reviewing files that changed from the base of the PR and between f7ef7d7 and 660c0f7.

⛔ Files ignored due to path filters (1)
  • components/ambient-control-plane/go.sum is excluded by !**/*.sum, !**/go.sum
📒 Files selected for processing (8)
  • components/ambient-control-plane/cmd/ambient-control-plane/main.go
  • components/ambient-control-plane/go.mod
  • components/ambient-control-plane/internal/config/config.go
  • components/ambient-control-plane/internal/reconciler/kube_reconciler.go
  • components/ambient-control-plane/internal/tokenserver/handler.go
  • components/ambient-control-plane/internal/tokenserver/server.go
  • components/runners/ambient-runner/ambient_runner/_grpc_client.py
  • docs/internal/design/control-plane.spec.md

Comment on lines +76 to +77
CPTokenListenAddr: envOrDefault("CP_TOKEN_LISTEN_ADDR", ":8080"),
CPTokenURL: os.Getenv("CP_TOKEN_URL"),
Contributor


⚠️ Potential issue | 🟠 Major

Make CP_TOKEN_URL required for kube session pods.

Line 77 leaves the new auth path optional, but the reconciler now injects only AMBIENT_CP_TOKEN_URL, and the runner falls back to BOT_TOKEN only when that value is empty. Since this PR removes BOT_TOKEN injection, omitting CP_TOKEN_URL leaves provisioned runner/MCP pods with no credential source and they fail auth on startup/reconnect.

Suggested fix
 	switch cfg.PlatformMode {
 	case "standard", "mpp":
 	default:
 		return nil, fmt.Errorf("unknown PLATFORM_MODE %q: must be one of standard, mpp", cfg.PlatformMode)
 	}
+
+	if cfg.Mode == "kube" && cfg.CPTokenURL == "" {
+		for _, name := range cfg.Reconcilers {
+			if name == "kube" {
+				return nil, fmt.Errorf("CP_TOKEN_URL must be set when the kube reconciler is enabled")
+			}
+		}
+	}
 
 	return cfg, nil
 }

Comment on lines +56 to +58
if !isRunnerSA(username) {
h.logger.Warn().Str("username", username).Msg("token request: username does not match runner SA pattern")
http.Error(w, "forbidden", http.StatusForbidden)
Contributor


⚠️ Potential issue | 🔴 Critical

Name-pattern auth is too weak for /token.

Lines 56-58 and 99-108 mint the API token for any authenticated service account whose username matches system:serviceaccount:*:session-*-sa. TokenReview proves the caller owns some service account, but not that it is one created by this control plane. Any workload able to run under or create a matching SA in any namespace can obtain the API token.

Please bind issuance to CP-managed session SAs before returning a token, e.g. by resolving the namespace/name from username and verifying the ServiceAccount is one the reconciler created.

Also applies to: 99-108


… runner pods

Session namespace NetworkPolicies (provisioned by the MPP tenant operator)
block cross-namespace ingress by default. The api-server pod (in the
CP runtime namespace) could not reach runner pod port 8001, causing
`acpctl session events` to receive a 502 from the ingress.

Add ensureAPIServerNetworkPolicy() called from ensureNamespaceExists():
creates a NetworkPolicy named allow-ambient-api-server in each session
namespace that allows TCP:8001 ingress from pods in the CP runtime
namespace (matched by kubernetes.io/metadata.name label). Idempotent
— skips creation if the policy already exists.

Also add NetworkPolicyGVR, GetNetworkPolicy, and CreateNetworkPolicy
to KubeClient to support the new operation.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@markturansky
Contributor Author

Follow-up thoughts

After diving deeper into this change, I want to highlight what makes this PR particularly elegant from a systems design perspective:

The core insight: Token propagation delays aren't a bug to be optimized away — they're inherent to the Secret-push model. By flipping to synchronous pull, you've eliminated an entire class of failure (race conditions) rather than trying to tune timeouts and retry logic around it.

However, there's one critical gap that could bite in production: the runner has no retry logic when calling the CP token endpoint (_grpc_client.py:832-841). If the CP is mid-rolling-update during a runner reconnect, the fetch fails immediately and the runner enters an error loop.

Since the entire value prop here is reliability (eliminating races), it feels inconsistent to introduce a new single-point-of-failure without retry. A simple exponential backoff (3 attempts, 1s/2s/4s) would make this bulletproof for transient CP unavailability during deployments.
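The suggested safety net is small; a sketch (function and parameter names here are hypothetical, not the runner's actual code — note that 3 attempts means two sleeps, 1s then 2s):

```python
import time

def fetch_with_backoff(fetch, attempts=3, base_delay=1.0):
    """Call fetch(), retrying on failure with exponential backoff (1s, 2s, ...)."""
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the last error
            time.sleep(base_delay * (2 ** attempt))
```

Worst case this adds 3s of startup latency, which is negligible next to a runner stuck in an error loop during a CP rolling update.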

Minor question: The MCP sidecar gets AMBIENT_CP_TOKEN_URL injected but I don't see MCP sidecar code changes in this PR. Does it already support this pattern, or is that a follow-up PR?

Everything else looks production-ready. The TokenReview validation, PKCE challenge, graceful shutdown — all the hard parts are done correctly. Just needs that retry safety net.

…1213

main.go: run informer and token server concurrently in separate goroutines
and select on both error channels — a token server crash now immediately
fails runKubeMode instead of being silently swallowed after inf.Run returns.

tokenserver/handler.go: remove dead expires_at field from tokenResponse;
OIDCTokenProvider does not expose expiry metadata and the runner never
read it.

_grpc_client.py:
- Validate AMBIENT_CP_TOKEN_URL scheme (http/https only, no credentials)
  before sending the pod SA token, preventing exfiltration via misconfigured
  or malicious URLs (file://, ftp://, http://user:pass@...).
- Add exponential backoff retry (3 attempts, 1s/2s backoff) around
  _fetch_token_from_cp() to handle transient CP unavailability at runner
  startup or gRPC reconnect.
- Catch urllib.error.HTTPError separately and include the response body in
  the RuntimeError message for debuggability.
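The scheme/credential validation described above can be sketched as follows (the helper name mirrors the description; the exact runner implementation may differ):

```python
from urllib.parse import urlsplit

def validate_cp_token_url(url: str) -> None:
    """Reject URLs that could leak the pod SA token: only bare http(s) allowed."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme for CP token URL: {parts.scheme!r}")
    if parts.username is not None or parts.password is not None:
        raise ValueError("credentials embedded in CP token URL are not allowed")
```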

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai (bot, Contributor) left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
components/ambient-control-plane/internal/tokenserver/handler.go (1)

55-61: ⚠️ Potential issue | 🔴 Critical

Authorize against a CP-managed ServiceAccount, not just a name pattern.

Lines 55-61 only check system:serviceaccount:*:session-*-sa before minting the API token. TokenReview proves the caller owns some ServiceAccount, but any workload that can run as or create a matching SA in any namespace can pass this gate and get a token. Parse the namespace/name from username and verify the ServiceAccount is one the reconciler created (for example via a label or owner reference) before calling Token().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/internal/tokenserver/handler.go` around
lines 55 - 61, The handler currently only validates the username pattern via
isRunnerSA(username) before minting an API token; instead parse the
ServiceAccount namespace/name from username (the TokenReview-provided string),
then verify that this ServiceAccount is managed by the control-plane reconciler
(e.g., check the SA's labels or ownerReferences via the k8s client) before
calling h.tokenProvider.Token(r.Context()); update the logic in the handler
around isRunnerSA, username parsing, and the pre-Token() check to reject
requests where the parsed ServiceAccount is not owned/labeled by the reconciler.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/runners/ambient-runner/ambient_runner/_grpc_client.py`:
- Around line 71-75: The code uses urllib.request.urlopen(req, ...) which honors
*_PROXY env vars and can leak the ServiceAccount bearer token (sa_token) through
proxies; replace that call with a proxyless opener created via
urllib.request.build_opener(urllib.request.ProxyHandler({})) and call
opener.open(req, timeout=_CP_TOKEN_FETCH_TIMEOUT) so the cp_token_url request
(where sa_token is set in headers) bypasses any HTTP proxies. Locate the request
construction and the urlopen call around cp_token_url / sa_token /
_CP_TOKEN_FETCH_TIMEOUT in _grpc_client.py and use the proxyless opener only for
this internal auth fetch.

---

Duplicate comments:
In `@components/ambient-control-plane/internal/tokenserver/handler.go`:
- Around line 55-61: The handler currently only validates the username pattern
via isRunnerSA(username) before minting an API token; instead parse the
ServiceAccount namespace/name from username (the TokenReview-provided string),
then verify that this ServiceAccount is managed by the control-plane reconciler
(e.g., check the SA's labels or ownerReferences via the k8s client) before
calling h.tokenProvider.Token(r.Context()); update the logic in the handler
around isRunnerSA, username parsing, and the pre-Token() check to reject
requests where the parsed ServiceAccount is not owned/labeled by the reconciler.


📥 Commits

Reviewing files that changed from the base of the PR and between 8833cce and 42092fe.

📒 Files selected for processing (3)
  • components/ambient-control-plane/cmd/ambient-control-plane/main.go
  • components/ambient-control-plane/internal/tokenserver/handler.go
  • components/runners/ambient-runner/ambient_runner/_grpc_client.py

Comment on lines +71 to +75
req = urllib.request.Request(
    cp_token_url,
    headers={"Authorization": f"Bearer {sa_token}"},
)
with urllib.request.urlopen(req, timeout=_CP_TOKEN_FETCH_TIMEOUT) as resp:

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed against ambient-code/platform:

    cat -n components/runners/ambient-runner/ambient_runner/_grpc_client.py | head -100
    rg "ProxyHandler|NO_PROXY|_PROXY" components/runners/ambient-runner/ --no-heading
    git ls-files "components/runners/ambient-runner/**/*.py" | head -20
    # Check if there are any other proxy handling mechanisms in the file
    cat -n components/runners/ambient-runner/ambient_runner/_grpc_client.py | sed -n '1,150p'
    # Check for any proxy configuration or environment handling in the entire runner
    rg "http_proxy|https_proxy|NO_PROXY|ProxyHandler" components/runners/ambient-runner/ -i
    # Look for any existing documentation or tests related to proxy handling
    find components/runners/ambient-runner -name "*.py" -type f -exec grep -l "proxy\|Proxy" {} \;
    # Check the files that mention proxy to see context
    cat -n components/runners/ambient-runner/ambient_runner/endpoints/run.py | grep -C 5 -i proxy
    cat -n components/runners/ambient-runner/ambient_runner/platform/auth.py | grep -C 5 -i proxy
    # Check how AMBIENT_CP_TOKEN_URL is used/documented
    rg "AMBIENT_CP_TOKEN_URL" components/runners/ambient-runner/ -B 2 -A 2
    rg "cp_token_url" components/runners/ambient-runner/ -B 3 -A 3
    # Check if there are any tests or examples showing typical usage
    find components/runners/ambient-runner -name "test*.py" -o -name "*_test.py" | head -5
    # Check the full initialization code to see how cp_token_url gets passed
    cat -n components/runners/ambient-runner/ambient_runner/_grpc_client.py | sed -n '136,180p'
    # Look for environment variable usage or configuration
    rg "os.environ|os.getenv" components/runners/ambient-runner/ambient_runner/_grpc_client.py -B 2 -A 2


Prevent ServiceAccount token leakage through HTTP proxies.

Line 75 uses urllib.request.urlopen(), which honors *_PROXY environment variables by default. If the runner pod inherits proxy configuration without NO_PROXY covering the control-plane service, the ServiceAccount bearer token (line 73) gets routed through that proxy on every token refresh, exposing the credential. Use a proxyless opener for this internal auth request.

Suggested fix
+_CP_TOKEN_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))
+
 def _fetch_token_from_cp(cp_token_url: str) -> str:
     """Fetch a fresh API token from the CP /token endpoint using the pod SA token.
     ...
     """
     _validate_cp_token_url(cp_token_url)
     ...
-            with urllib.request.urlopen(req, timeout=_CP_TOKEN_FETCH_TIMEOUT) as resp:
+            with _CP_TOKEN_OPENER.open(req, timeout=_CP_TOKEN_FETCH_TIMEOUT) as resp:
                 body = json.loads(resp.read())
🧰 Tools
🪛 Ruff (0.15.9)

[error] 71-74: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


[error] 75-75: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/runners/ambient-runner/ambient_runner/_grpc_client.py` around
lines 71 - 75, The code uses urllib.request.urlopen(req, ...) which honors
*_PROXY env vars and can leak the ServiceAccount bearer token (sa_token) through
proxies; replace that call with a proxyless opener created via
urllib.request.build_opener(urllib.request.ProxyHandler({})) and call
opener.open(req, timeout=_CP_TOKEN_FETCH_TIMEOUT) so the cp_token_url request
(where sa_token is set in headers) bypasses any HTTP proxies. Locate the request
construction and the urlopen call around cp_token_url / sa_token /
_CP_TOKEN_FETCH_TIMEOUT in _grpc_client.py and use the proxyless opener only for
this internal auth fetch.

@markturansky markturansky merged commit 00c1a24 into alpha Apr 4, 2026
44 checks passed
@markturansky markturansky deleted the feat/RHOAIENG-56711-cp-token-endpoint branch April 4, 2026 23:33
markturansky added a commit that referenced this pull request Apr 5, 2026
…CP_TOKEN_URL (#1214)

## Summary

Follow-up to #1213. The CP token endpoint was running but unreachable
because:

1. **No Service** — port 8080 (token server) had no ClusterIP Service,
so runner pods had no DNS target and the NetworkPolicy peer had no
stable reference
2. **Wrong `CP_RUNTIME_NAMESPACE`** — defaulted to
`ambient-code--runtime-int` but the actual deployed namespace is
`ambient-code--ambient-s0`, so `ensureAPIServerNetworkPolicy()` was
creating a NetworkPolicy that matched the wrong namespace selector —
causing `acpctl session events` to still 502

## Changes

- `ambient-control-plane-svc.yaml` — new ClusterIP Service exposing port
8080 on the CP pod
- `ambient-control-plane.yaml` — inject `CP_RUNTIME_NAMESPACE` via
downward API (`metadata.namespace`) so the NetworkPolicy peer label
matches the actual runtime namespace; set `CP_TOKEN_URL` to the FQDN of
the new Service

## Test plan

- [ ] Deploy to int spoke
- [ ] Verify `oc get svc ambient-control-plane -n
ambient-code--ambient-s0` exists with port 8080
- [ ] Start new session; verify `allow-ambient-api-server` NetworkPolicy
in session namespace uses correct namespace selector
(`ambient-code--ambient-s0`)
- [ ] `acpctl session events <id>` streams without 502

🤖 Generated with [Claude Code](https://claude.ai/code)


## Summary by CodeRabbit

* **New Features**
* Added ambient control plane service with token authentication
capability
* Configured control plane to integrate with token service endpoint for
runtime authentication


Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
Co-authored-by: Claude <noreply@anthropic.com>
