
feat(control-plane): CP /token endpoint for runner gRPC auth RHOAIENG-56711#1213

Merged
markturansky merged 5 commits into alpha from feat/RHOAIENG-56711-cp-token-endpoint
Apr 4, 2026

Conversation

@markturansky
Contributor

@markturansky markturansky commented Apr 4, 2026

Summary

  • Eliminates the async BOT_TOKEN Secret push/refresh loop — replaces it with a synchronous pull from a new CP HTTP endpoint
  • New internal/tokenserver package — GET /token validates the caller's K8s SA token via TokenReview, checks the system:serviceaccount:*:session-*-sa pattern, and returns a fresh API token via OIDCTokenProvider
  • Runner updated — _fetch_token_from_cp() reads the pod SA token from the standard K8s mount and calls CP /token on startup and on every gRPC reconnect; falls back to the BOT_TOKEN env var for local dev when AMBIENT_CP_TOKEN_URL is unset
  • kube_reconciler cleaned up — removes ensureSecret, StartTokenRefreshLoop, refreshRunnerToken, and refreshAllRunningTokens; sets automountServiceAccountToken: true; injects AMBIENT_CP_TOKEN_URL instead of the BOT_TOKEN secret ref
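The runner-side flow described above can be sketched roughly as follows. This is a minimal sketch, not the actual _grpc_client.py code: the helper name and JSON shape follow the PR description, and the SA token path is the standard Kubernetes mount.

```python
import json
import os
import urllib.request

# Standard Kubernetes ServiceAccount token mount.
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"

def fetch_api_token() -> str:
    """Pull a fresh API token from the CP, or fall back to BOT_TOKEN for local dev."""
    cp_token_url = os.environ.get("AMBIENT_CP_TOKEN_URL")
    if not cp_token_url:
        # AMBIENT_CP_TOKEN_URL unset: local dev path, use the static BOT_TOKEN.
        return os.environ["BOT_TOKEN"]
    with open(SA_TOKEN_PATH) as f:
        sa_token = f.read().strip()
    req = urllib.request.Request(
        cp_token_url, headers={"Authorization": f"Bearer {sa_token}"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())["token"]
```

Calling this at startup and on every gRPC reconnect is what replaces the Secret push/refresh loop.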

Design rationale

The CP is the component rotating the secret — if it is down for /token, it is down for secret cycling too, so the pull model adds no new failure mode. Synchronous pull eliminates the 3-way race (CP ticker → kubelet propagation → runner read) and removes a class of stale-token failures.

Spec: docs/internal/design/control-plane.spec.md

Test plan

  • Deploy CP with CP_TOKEN_LISTEN_ADDR=:8080 and CP_TOKEN_URL=http://<cp-svc>:8080/token
  • Start a session; confirm runner pod has automountServiceAccountToken: true and AMBIENT_CP_TOKEN_URL env set
  • Confirm runner fetches token on startup (log: [GRPC CLIENT] Fetched fresh API token from CP token endpoint)
  • Force gRPC reconnect; confirm fresh token fetched via CP endpoint
  • Verify GET /healthz on token server returns 200
  • Confirm non-runner SA token is rejected (403)
  • Local dev: unset AMBIENT_CP_TOKEN_URL, set BOT_TOKEN; confirm fallback path works
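The 403 rejection in the test plan hinges on the runner-SA username pattern. A Python mirror of that check (hypothetical; the real check is the Go isRunnerSA() in internal/tokenserver) could look like:

```python
import fnmatch

def is_runner_sa(username: str) -> bool:
    """TokenReview usernames look like system:serviceaccount:<namespace>:<name>;
    only names matching session-*-sa pass. Mirrors the CP-side pattern check."""
    parts = username.split(":")
    if len(parts) != 4 or parts[:2] != ["system", "serviceaccount"]:
        return False
    return fnmatch.fnmatch(parts[3], "session-*-sa")
```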

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Control plane exposes a new /token endpoint that issues API tokens on demand.
    • Runners now obtain tokens from the control plane (AMBIENT_CP_TOKEN_URL) using their ServiceAccount tokens; BOT_TOKEN secret is no longer used.
    • Control plane can optionally ensure a NetworkPolicy to allow API-server access between namespaces.
  • Configuration

    • Added CP token settings (CP_TOKEN_LISTEN_ADDR, CP_TOKEN_URL) to control plane config.
  • Documentation

    • Design doc updated to describe the CP token endpoint, runner auth flow, and security implications.

Ambient Code Bot and others added 3 commits April 4, 2026 13:03
…wal RHOAIENG-56711

Documents the synchronous token fetch endpoint on the CP as the
replacement for async Secret-write renewal. Covers the problem
(three-way async race against OIDC TTL), solution (GET /token with
K8s SA TokenReview auth), bootstrap vs renewal split, new
internal/tokenserver packages, runner reconnect changes, and
AMBIENT_CP_TOKEN_URL env var injection.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
CP creates the runner pod, so it is always up before the runner's first
token request. Retaining a Secret write loop adds complexity and a second
failure mode with identical blast radius. SA token automount flipped to
true; BOT_TOKEN removed from env vars table and bootstrap split removed.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…h RHOAIENG-56711

Eliminates the async BOT_TOKEN Secret push+refresh loop in favour of a
synchronous pull model:

- Add internal/tokenserver package: HTTP server on :8080 with GET /token
  handler that validates the caller's K8s SA token via TokenReview, checks
  the runner SA pattern (system:serviceaccount:*:session-*-sa), then mints
  and returns a fresh API token via the existing OIDCTokenProvider.
- Wire token server goroutine into runKubeMode() alongside the informer.
- Add CPTokenListenAddr / CPTokenURL fields to ControlPlaneConfig and
  KubeReconcilerConfig; read from CP_TOKEN_LISTEN_ADDR / CP_TOKEN_URL env.
- kube_reconciler: remove ensureSecret, StartTokenRefreshLoop,
  refreshRunnerToken, refreshAllRunningTokens; replace BOT_TOKEN secret
  env injection with AMBIENT_CP_TOKEN_URL; set automountServiceAccountToken
  true on SA and pod so the K8s-mounted SA token is available at the
  standard path.
- runner _grpc_client.py: add _fetch_token_from_cp() that reads the pod SA
  token and calls the CP /token endpoint; from_env() calls it when
  AMBIENT_CP_TOKEN_URL is set, falls back to BOT_TOKEN env var for local
  dev; reconnect() refreshes via CP endpoint on every gRPC reconnect.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented Apr 4, 2026

📝 Walkthrough

Adds a control-plane HTTP token server (GET /token) that validates Kubernetes ServiceAccount bearer tokens via TokenReview and mints API tokens. Removes per-session secret-based token injection/refresh; runners use their mounted SA token to synchronously request API tokens from the CP endpoint.

Changes

  • Token Server (components/ambient-control-plane/internal/tokenserver/server.go, components/ambient-control-plane/internal/tokenserver/handler.go): New HTTP server and handler exposing /token and /healthz. Validates SA bearer tokens with TokenReview, enforces runner SA naming, and mints/returns API tokens via TokenProvider. Includes lifecycle (start/shutdown) and a health endpoint.
  • Control Plane Main (components/ambient-control-plane/cmd/ambient-control-plane/main.go): Builds the Kubernetes REST config, starts the token server in its own goroutine, surfaces token-server errors via channels, and removes prior per-session token-refresh special-casing.
  • Reconciler / Kube (components/ambient-control-plane/internal/reconciler/kube_reconciler.go): Removed per-session Secret creation, token volumes/mounts, and token-refresh loops. Added CPTokenURL to the reconciler config; injects AMBIENT_CP_TOKEN_URL into runner and MCP sidecar env. Enabled automountServiceAccountToken and added optional runtime-namespace NetworkPolicy reconciliation.
  • Config (components/ambient-control-plane/internal/config/config.go): Added CPTokenListenAddr (default :8080) and CPTokenURL fields populated from CP_TOKEN_LISTEN_ADDR and CP_TOKEN_URL env vars.
  • KubeClient (components/ambient-control-plane/internal/kubeclient/kubeclient.go): Added NetworkPolicyGVR and methods to Get/Create NetworkPolicy via the dynamic client to support network-policy reconciliation.
  • Dependencies (components/ambient-control-plane/go.mod): Added k8s.io/api v0.34.0 and related indirect OpenAPI/spec dependencies.
  • Runner gRPC Client, Python (components/runners/ambient-runner/ambient_runner/_grpc_client.py): Replaced the secret-file token read with a CP exchange: read the pod SA token, call the CP AMBIENT_CP_TOKEN_URL to obtain an API token, add retries and validation, and update reconnect/from_env flows to use the CP endpoint when configured.
  • Docs (docs/internal/design/control-plane.spec.md): Documented the CP token endpoint design, runner SA token flow, and the shift from secret-based async renewal to synchronous CP token polling.

Sequence Diagram

sequenceDiagram
    participant Runner as Runner Pod
    participant K8s as Kubernetes API
    participant CP as CP Token Server
    participant TP as TokenProvider

    Runner->>Runner: Read SA token (mounted)
    Runner->>CP: GET /token\nAuthorization: Bearer {SA_token}
    CP->>K8s: TokenReview(SA_token)
    K8s-->>CP: {authenticated, user: runner-sa}
    CP->>CP: Validate runner SA name
    CP->>TP: Token(ctx)
    TP-->>CP: {api_token, expires_at}
    CP-->>Runner: {"token": api_token}
    Runner->>Runner: Use api_token for gRPC auth

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 2 warnings)

  • Security And Secret Handling — ❌ Error: ServiceAccount token proxy leakage vulnerability unresolved: urllib.request.urlopen() at line 75 in _grpc_client.py still honors *_PROXY environment variables, exposing the pod SA token to interception. Resolution: implement a proxyless opener using urllib.request.build_opener(urllib.request.ProxyHandler({})) and replace urlopen() calls with _CP_TOKEN_OPENER.open() for token endpoint requests.
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 26.32%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Kubernetes Resource Safety — ⚠️ Warning: pod resources lack CPU/memory limits and security contexts, and NetworkPolicy objects lack OwnerReferences. Resolution: add resource limits/requests to containers, set a pod SecurityContext with RunAsNonRoot and ReadOnlyRootFilesystem, and set OwnerReferences on NetworkPolicy objects.
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped — CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: title follows the Conventional Commits format (feat(control-plane): ...), includes the JIRA key RHOAIENG-56711, and accurately describes the main changeset: introducing a new CP /token endpoint for runner authentication.
  • Performance And Algorithmic Complexity — ✅ Passed: the PR introduces no performance regressions or algorithmic complexity issues. All K8s API operations are properly scoped, token fetch retries are bounded to 3 attempts with exponential backoff, there are no N+1 patterns, and goroutines are correctly managed.
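The failed security check asks for a proxyless opener so the pod SA token never transits an egress proxy configured via *_PROXY environment variables. A minimal sketch of that suggestion (names are illustrative, not the PR's actual code):

```python
import urllib.request

# Opener built with an empty ProxyHandler: it ignores HTTP_PROXY/HTTPS_PROXY/
# ALL_PROXY, so the bearer token only ever travels directly to the CP endpoint.
_CP_TOKEN_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))

def fetch_direct(url: str, sa_token: str, timeout: float = 10.0) -> bytes:
    """Fetch the token endpoint without consulting any proxy settings."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {sa_token}"})
    with _CP_TOKEN_OPENER.open(req, timeout=timeout) as resp:
        return resp.read()
```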

Contributor Author

@markturansky markturansky left a comment


Summary

This is a well-architected solution to a real production issue. The shift from async Secret-push to synchronous token-pull eliminates the 3-way race (CP write → kubelet propagate → runner read) that was causing UNAUTHENTICATED loops. The implementation is clean, removes significant complexity (~200 LOC deleted), and uses K8s-native authentication properly.

Strengths

  • Clear problem statement: the async race between OIDC token TTL, Secret propagation delay, and runner read timing is well-documented and motivated
  • Significant cleanup: removes ensureSecret, StartTokenRefreshLoop, refreshRunnerToken, and refreshAllRunningTokens — the entire refresh mechanism is eliminated
  • Security: the K8s TokenReview API is the correct approach for SA token verification
  • Developer experience: fallback to the BOT_TOKEN env var preserves the local dev workflow
  • Documentation: excellent spec in control-plane.spec.md explaining rationale and design decisions
  • Production ready: proper timeouts (TokenReview: 10s, HTTP server read/write: 10s), health endpoint, graceful shutdown

Issues & Questions

1. Goroutine error handling (main.go:180-191)

errCh := make(chan error, 1)
go func() {
    if err := startTokenServer(ctx, cfg, tokenProvider); err != nil {
        errCh <- err
    }
}()

infErr := inf.Run(ctx)
select {
case tsErr := <-errCh:
    return fmt.Errorf("token server: %w", tsErr)
default:
}

Issue: This only catches startup errors. If the token server starts successfully but crashes later (e.g., K8s client fails after 5 minutes), that error is silently lost. The select runs once after inf.Run() returns, but by then the token server goroutine may have exited long ago.

Suggestion: Either:

  • Use an error group pattern (golang.org/x/sync/errgroup) to properly coordinate goroutine lifetimes
  • Or accept that token server crashes are non-fatal and just log them (since runners will retry)

2. MCP sidecar compatibility (kube_reconciler.go:453)

envVar("AMBIENT_CP_TOKEN_URL", r.cfg.CPTokenURL),

The MCP sidecar now receives AMBIENT_CP_TOKEN_URL instead of BOT_TOKEN, but I don't see corresponding changes to the MCP sidecar implementation. Does the MCP sidecar code support fetching tokens from the CP endpoint, or does it still expect AMBIENT_TOKEN env var to contain the actual token?

Suggestion: Verify MCP sidecar compatibility or document that this is a follow-up PR.

3. No retry logic in runner (_grpc_client.py:15-42)

def _fetch_token_from_cp(cp_token_url: str) -> str:
    ...
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = json.loads(resp.read())

If the CP token endpoint is temporarily unavailable (e.g., during a rolling update), the runner will crash on startup or fail on reconnect with no retry. This makes deployments brittle.

Suggestion: Add exponential backoff retry (e.g., 3 attempts with 1s/2s/4s delays) for transient network issues.
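A minimal sketch of the suggested backoff, wrapping whatever fetch callable the runner uses. The attempt count and delays here are this review's suggestion, not shipped behavior:

```python
import time
import urllib.error

def fetch_with_retry(fetch, attempts: int = 3, base_delay: float = 1.0):
    """Call fetch() up to `attempts` times, sleeping base_delay * 2**n between
    tries (1s, 2s, ... with the defaults). Re-raises the last error."""
    for attempt in range(attempts):
        try:
            return fetch()
        except (urllib.error.URLError, TimeoutError):
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

This keeps transient CP unavailability (e.g. a rolling update) from crashing the runner on startup or reconnect.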

4. Unused token expiration field (tokenserver/handler.go:597)

type tokenResponse struct {
    Token     string `json:"token"`
    ExpiresAt string `json:"expires_at,omitempty"`
}

The response includes expires_at, but it is always empty and the runner never reads it. If populated, it could enable proactive token refresh before expiry.

Suggestion: Either populate expires_at from OIDCTokenProvider metadata, or remove the field if not needed.

5. Missing test coverage

No unit tests for:

  • internal/tokenserver package (TokenReview validation, SA pattern matching, error cases)
  • Python _fetch_token_from_cp() function
  • Integration test for token server → runner auth flow

The test plan in the PR description is entirely manual.

Suggestion: Add at minimum:

  • Unit tests for isRunnerSA() with various username patterns
  • Unit test for handleToken() with mocked TokenReview responses
  • Python unit test for token fetch with mocked HTTP responses
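For the Python side, a mocked-HTTP unit test along the suggested lines might look like this (fetch_token stands in for _fetch_token_from_cp; the JSON shape follows the PR description):

```python
import json
import unittest
from unittest import mock
import urllib.request

def fetch_token(url: str) -> str:
    """Stand-in for _fetch_token_from_cp: GET url, parse {"token": ...}."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read())["token"]

class FetchTokenTest(unittest.TestCase):
    def test_parses_token_from_json_body(self):
        # Fake response object usable as a context manager.
        fake = mock.MagicMock()
        fake.read.return_value = json.dumps({"token": "abc123"}).encode()
        fake.__enter__.return_value = fake
        with mock.patch("urllib.request.urlopen", return_value=fake):
            self.assertEqual(fetch_token("http://cp:8080/token"), "abc123")
```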

6. HTTP error handling in Python (_grpc_client.py:34)

with urllib.request.urlopen(req, timeout=10) as resp:
    body = json.loads(resp.read())

This will raise urllib.error.HTTPError for 4xx/5xx responses, but the exception message won't include the response body (which likely contains the CP's error message like "forbidden" or "unauthorized").

Suggestion:

try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = json.loads(resp.read())
except urllib.error.HTTPError as e:
    error_body = e.read().decode('utf-8', errors='ignore')
    raise RuntimeError(f"CP token endpoint returned {e.code}: {error_body}") from e

7. Hardcoded token fetch timeout

urllib.request.urlopen(req, timeout=10) uses a 10s timeout, but this isn't configurable. In slow clusters or during CP restarts, this might be too aggressive.

Suggestion: Make timeout configurable via env var (e.g., AMBIENT_CP_TOKEN_TIMEOUT_SECS) with 10s default.
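A sketch of the suggested env override (AMBIENT_CP_TOKEN_TIMEOUT_SECS is this review's proposed name, not an existing variable):

```python
import os

def token_fetch_timeout(default: float = 10.0) -> float:
    """Read AMBIENT_CP_TOKEN_TIMEOUT_SECS, falling back to the 10s default
    on an unset, non-numeric, or non-positive value."""
    raw = os.environ.get("AMBIENT_CP_TOKEN_TIMEOUT_SECS", "")
    try:
        value = float(raw)
    except ValueError:
        return default
    return value if value > 0 else default
```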

Minor observations

  • go.mod growth: Adds k8s.io/api and ~10 transitive deps. Necessary but worth noting for dependency audits.
  • Documentation accuracy: The spec says "Status: 🔲 planned — RHOAIENG-56711" but should be updated to "✅ implemented" when merged.
  • Service pattern brittleness: isRunnerSA() assumes session-*-sa naming. If SA naming changes, update this pattern or make it configurable.
  • Token server port: Hardcoded :8080 conflicts with many default ports. Configurable via CP_TOKEN_LISTEN_ADDR but worth documenting the port choice.

Recommendation

Approve with minor fixes for issues #3 (retry logic) and #6 (HTTP error messages). Issues #1, #2, and #5 can be follow-up PRs if time-constrained.

This PR significantly improves production reliability by eliminating a real race condition. The architecture is sound and the implementation is clean. Great work! 🎉


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/ambient-control-plane/cmd/ambient-control-plane/main.go`:
- Around line 185-198: runKubeMode currently starts startTokenServer in a
goroutine and then calls inf.Run(ctx) but never observes errCh while inf.Run is
running, so a token server failure is dropped; change runKubeMode to propagate
token server errors by selecting between the token error channel and the
informer run (or cancelling the context) — for example, start startTokenServer
as you do but then either (a) run inf.Run in a goroutine and select on errCh and
a result channel from inf.Run, or (b) after starting both goroutines, use a
select that waits for tsErr from errCh and returns fmt.Errorf("token server:
%w", tsErr) (or cancels ctx) if received, and otherwise returns infErr when
inf.Run completes; reference runKubeMode, startTokenServer, errCh, and inf.Run
to locate the change.

In `@components/ambient-control-plane/internal/config/config.go`:
- Around line 76-77: The CPTokenURL field (CPTokenURL) is currently optional
(os.Getenv) which leaves pods without a credential source; change the config to
require AMBIENT_CP_TOKEN_URL by replacing the optional read with a required
lookup and failing fast when empty: use the existing envOrDefault helper or add
an envMust helper to read "CP_TOKEN_URL" (or "AMBIENT_CP_TOKEN_URL" if that's
the reconciler-injected name) and return an error from the config
constructor/newConfig function when the value is empty so the process exits
early; update any callers of the config constructor to handle the error
accordingly.

In `@components/ambient-control-plane/internal/tokenserver/handler.go`:
- Around line 56-58: The current check in the token handler (isRunnerSA(...) and
the branches that log and return forbidden) is too weak; instead resolve the
namespace/name from username, fetch the ServiceAccount via the Kubernetes client
available on the handler (e.g., h.kubeClient or the handler method that
processes the request), and verify this SA is a CP-managed session SA before
minting the token. Concretely: in the token handling code paths that call
isRunnerSA (and the separate block around lines 99-108), parse username to get
namespace and name, use the Kubernetes API to GET the ServiceAccount, and
validate it was created/owned by the control plane (check a reconciler-set label
or ownerReference) before proceeding to mint; if the SA is missing or not
CP-managed, return http.Error(w, "forbidden", http.StatusForbidden) and log via
h.logger.

In `@components/runners/ambient-runner/ambient_runner/_grpc_client.py`:
- Around line 3-6: Validate the AMBIENT_CP_TOKEN_URL before calling
urllib.request.urlopen: parse the AMBIENT_CP_TOKEN_URL variable (use
urllib.parse.urlparse) and ensure the scheme is a safe allowed value (e.g.,
"https") and host is non-empty; if the scheme is not allowed, log an error via
logging and skip/abort sending the Kubernetes SA token instead of calling
urllib.request.urlopen. Update the code paths that reference
AMBIENT_CP_TOKEN_URL and urllib.request.urlopen to perform this check and fail
closed to prevent token exfiltration.
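A fail-closed validation along these lines might look like the following (a sketch of the suggested check, not the shipped code):

```python
from urllib.parse import urlparse

def validate_cp_token_url(url: str) -> bool:
    """Fail closed: only plain http/https URLs with a host and no embedded
    credentials may ever receive the pod SA token."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # rejects file://, ftp://, etc.
    if not parsed.hostname:
        return False
    if parsed.username or parsed.password:
        return False  # rejects http://user:pass@host exfiltration tricks
    return True
```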

📥 Commits

Reviewing files that changed from the base of the PR and between f7ef7d7 and 660c0f7.

⛔ Files ignored due to path filters (1)
  • components/ambient-control-plane/go.sum is excluded by !**/*.sum, !**/go.sum
📒 Files selected for processing (8)
  • components/ambient-control-plane/cmd/ambient-control-plane/main.go
  • components/ambient-control-plane/go.mod
  • components/ambient-control-plane/internal/config/config.go
  • components/ambient-control-plane/internal/reconciler/kube_reconciler.go
  • components/ambient-control-plane/internal/tokenserver/handler.go
  • components/ambient-control-plane/internal/tokenserver/server.go
  • components/runners/ambient-runner/ambient_runner/_grpc_client.py
  • docs/internal/design/control-plane.spec.md

Comment on lines +76 to +77
CPTokenListenAddr: envOrDefault("CP_TOKEN_LISTEN_ADDR", ":8080"),
CPTokenURL: os.Getenv("CP_TOKEN_URL"),
Contributor


⚠️ Potential issue | 🟠 Major

Make CP_TOKEN_URL required for kube session pods.

Line 77 leaves the new auth path optional, but the reconciler now injects only AMBIENT_CP_TOKEN_URL, and the runner falls back to BOT_TOKEN only when that value is empty. Since this PR removes BOT_TOKEN injection, omitting CP_TOKEN_URL leaves provisioned runner/MCP pods with no credential source and they fail auth on startup/reconnect.

Suggested fix
 	switch cfg.PlatformMode {
 	case "standard", "mpp":
 	default:
 		return nil, fmt.Errorf("unknown PLATFORM_MODE %q: must be one of standard, mpp", cfg.PlatformMode)
 	}
+
+	if cfg.Mode == "kube" && cfg.CPTokenURL == "" {
+		for _, name := range cfg.Reconcilers {
+			if name == "kube" {
+				return nil, fmt.Errorf("CP_TOKEN_URL must be set when the kube reconciler is enabled")
+			}
+		}
+	}
 
 	return cfg, nil
 }

Comment on lines +56 to +58
if !isRunnerSA(username) {
h.logger.Warn().Str("username", username).Msg("token request: username does not match runner SA pattern")
http.Error(w, "forbidden", http.StatusForbidden)
Contributor


⚠️ Potential issue | 🔴 Critical

Name-pattern auth is too weak for /token.

Lines 56-58 and 99-108 mint the API token for any authenticated service account whose username matches system:serviceaccount:*:session-*-sa. TokenReview proves the caller owns some service account, but not that it is one created by this control plane. Any workload able to run under or create a matching SA in any namespace can obtain the API token.

Please bind issuance to CP-managed session SAs before returning a token, e.g. by resolving the namespace/name from username and verifying the ServiceAccount is one the reconciler created.

Also applies to: 99-108


… runner pods

Session namespace NetworkPolicies (provisioned by the MPP tenant operator)
block cross-namespace ingress by default. The api-server pod (in the
CP runtime namespace) could not reach runner pod port 8001, causing
`acpctl session events` to receive a 502 from the ingress.

Add ensureAPIServerNetworkPolicy() called from ensureNamespaceExists():
creates a NetworkPolicy named allow-ambient-api-server in each session
namespace that allows TCP:8001 ingress from pods in the CP runtime
namespace (matched by kubernetes.io/metadata.name label). Idempotent
— skips creation if the policy already exists.

Also add NetworkPolicyGVR, GetNetworkPolicy, and CreateNetworkPolicy
to KubeClient to support the new operation.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@markturansky
Contributor Author

Follow-up thoughts

After diving deeper into this change, I want to highlight what makes this PR particularly elegant from a systems design perspective:

The core insight: Token propagation delays aren't a bug to be optimized away — they're inherent to the Secret-push model. By flipping to synchronous pull, you've eliminated an entire class of failure (race conditions) rather than trying to tune timeouts and retry logic around it.

However, there's one critical gap that could bite in production: the runner has no retry logic when calling the CP token endpoint (_grpc_client.py:832-841). If the CP is mid-rolling-update during a runner reconnect, the fetch fails immediately and the runner enters an error loop.

Since the entire value prop here is reliability (eliminating races), it feels inconsistent to introduce a new single-point-of-failure without retry. A simple exponential backoff (3 attempts, 1s/2s/4s) would make this bulletproof for transient CP unavailability during deployments.
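The suggested safety net is small; a sketch (function and parameter names here are hypothetical, not the runner's actual code — note that 3 attempts means two sleeps, 1s then 2s):

```python
import time

def fetch_with_backoff(fetch, attempts=3, base_delay=1.0):
    """Call fetch(), retrying on failure with exponential backoff (1s, 2s, ...)."""
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the last error
            time.sleep(base_delay * (2 ** attempt))
```

Worst case this adds 3s of startup latency, which is negligible next to a runner stuck in an error loop during a CP rolling update.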

Minor question: The MCP sidecar gets AMBIENT_CP_TOKEN_URL injected but I don't see MCP sidecar code changes in this PR. Does it already support this pattern, or is that a follow-up PR?

Everything else looks production-ready. The TokenReview validation, PKCE challenge, graceful shutdown — all the hard parts are done correctly. Just needs that retry safety net.

…1213

main.go: run informer and token server concurrently in separate goroutines
and select on both error channels — a token server crash now immediately
fails runKubeMode instead of being silently swallowed after inf.Run returns.

tokenserver/handler.go: remove dead expires_at field from tokenResponse;
OIDCTokenProvider does not expose expiry metadata and the runner never
read it.

_grpc_client.py:
- Validate AMBIENT_CP_TOKEN_URL scheme (http/https only, no credentials)
  before sending the pod SA token, preventing exfiltration via misconfigured
  or malicious URLs (file://, ftp://, http://user:pass@...).
- Add exponential backoff retry (3 attempts, 1s/2s backoff) around
  _fetch_token_from_cp() to handle transient CP unavailability at runner
  startup or gRPC reconnect.
- Catch urllib.error.HTTPError separately and include the response body in
  the RuntimeError message for debuggability.
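The scheme/credential validation described above can be sketched as follows (the helper name mirrors the description; the exact runner implementation may differ):

```python
from urllib.parse import urlsplit

def validate_cp_token_url(url: str) -> None:
    """Reject URLs that could leak the pod SA token: only bare http(s) allowed."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme for CP token URL: {parts.scheme!r}")
    if parts.username is not None or parts.password is not None:
        raise ValueError("credentials embedded in CP token URL are not allowed")
```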

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai (bot, Contributor) left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
components/ambient-control-plane/internal/tokenserver/handler.go (1)

55-61: ⚠️ Potential issue | 🔴 Critical

Authorize against a CP-managed ServiceAccount, not just a name pattern.

Lines 55-61 only check system:serviceaccount:*:session-*-sa before minting the API token. TokenReview proves the caller owns some ServiceAccount, but any workload that can run as or create a matching SA in any namespace can pass this gate and get a token. Parse the namespace/name from username and verify the ServiceAccount is one the reconciler created (for example via a label or owner reference) before calling Token().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/internal/tokenserver/handler.go` around
lines 55 - 61, The handler currently only validates the username pattern via
isRunnerSA(username) before minting an API token; instead parse the
ServiceAccount namespace/name from username (the TokenReview-provided string),
then verify that this ServiceAccount is managed by the control-plane reconciler
(e.g., check the SA's labels or ownerReferences via the k8s client) before
calling h.tokenProvider.Token(r.Context()); update the logic in the handler
around isRunnerSA, username parsing, and the pre-Token() check to reject
requests where the parsed ServiceAccount is not owned/labeled by the reconciler.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/runners/ambient-runner/ambient_runner/_grpc_client.py`:
- Around line 71-75: The code uses urllib.request.urlopen(req, ...) which honors
*_PROXY env vars and can leak the ServiceAccount bearer token (sa_token) through
proxies; replace that call with a proxyless opener created via
urllib.request.build_opener(urllib.request.ProxyHandler({})) and call
opener.open(req, timeout=_CP_TOKEN_FETCH_TIMEOUT) so the cp_token_url request
(where sa_token is set in headers) bypasses any HTTP proxies. Locate the request
construction and the urlopen call around cp_token_url / sa_token /
_CP_TOKEN_FETCH_TIMEOUT in _grpc_client.py and use the proxyless opener only for
this internal auth fetch.

---

Duplicate comments:
In `@components/ambient-control-plane/internal/tokenserver/handler.go`:
- Around line 55-61: The handler currently only validates the username pattern
via isRunnerSA(username) before minting an API token; instead parse the
ServiceAccount namespace/name from username (the TokenReview-provided string),
then verify that this ServiceAccount is managed by the control-plane reconciler
(e.g., check the SA's labels or ownerReferences via the k8s client) before
calling h.tokenProvider.Token(r.Context()); update the logic in the handler
around isRunnerSA, username parsing, and the pre-Token() check to reject
requests where the parsed ServiceAccount is not owned/labeled by the reconciler.


📥 Commits

Reviewing files that changed from the base of the PR and between 8833cce and 42092fe.

📒 Files selected for processing (3)
  • components/ambient-control-plane/cmd/ambient-control-plane/main.go
  • components/ambient-control-plane/internal/tokenserver/handler.go
  • components/runners/ambient-runner/ambient_runner/_grpc_client.py

Comment on lines +71 to +75
req = urllib.request.Request(
    cp_token_url,
    headers={"Authorization": f"Bearer {sa_token}"},
)
with urllib.request.urlopen(req, timeout=_CP_TOKEN_FETCH_TIMEOUT) as resp:

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed against ambient-code/platform:

    cat -n components/runners/ambient-runner/ambient_runner/_grpc_client.py | head -100
    rg "ProxyHandler|NO_PROXY|_PROXY" components/runners/ambient-runner/ --no-heading
    git ls-files "components/runners/ambient-runner/**/*.py" | head -20
    # Check if there are any other proxy handling mechanisms in the file
    cat -n components/runners/ambient-runner/ambient_runner/_grpc_client.py | sed -n '1,150p'
    # Check for any proxy configuration or environment handling in the entire runner
    rg "http_proxy|https_proxy|NO_PROXY|ProxyHandler" components/runners/ambient-runner/ -i
    # Look for any existing documentation or tests related to proxy handling
    find components/runners/ambient-runner -name "*.py" -type f -exec grep -l "proxy\|Proxy" {} \;
    # Check the files that mention proxy to see context
    cat -n components/runners/ambient-runner/ambient_runner/endpoints/run.py | grep -C 5 -i proxy
    cat -n components/runners/ambient-runner/ambient_runner/platform/auth.py | grep -C 5 -i proxy
    # Check how AMBIENT_CP_TOKEN_URL is used/documented
    rg "AMBIENT_CP_TOKEN_URL" components/runners/ambient-runner/ -B 2 -A 2
    rg "cp_token_url" components/runners/ambient-runner/ -B 3 -A 3
    # Check if there are any tests or examples showing typical usage
    find components/runners/ambient-runner -name "test*.py" -o -name "*_test.py" | head -5
    # Check the full initialization code to see how cp_token_url gets passed
    cat -n components/runners/ambient-runner/ambient_runner/_grpc_client.py | sed -n '136,180p'
    # Look for environment variable usage or configuration
    rg "os.environ|os.getenv" components/runners/ambient-runner/ambient_runner/_grpc_client.py -B 2 -A 2


Prevent ServiceAccount token leakage through HTTP proxies.

Line 75 uses urllib.request.urlopen(), which honors *_PROXY environment variables by default. If the runner pod inherits proxy configuration without NO_PROXY covering the control-plane service, the ServiceAccount bearer token (line 73) gets routed through that proxy on every token refresh, exposing the credential. Use a proxyless opener for this internal auth request.

Suggested fix
+_CP_TOKEN_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))
+
 def _fetch_token_from_cp(cp_token_url: str) -> str:
     """Fetch a fresh API token from the CP /token endpoint using the pod SA token.
     ...
     """
     _validate_cp_token_url(cp_token_url)
     ...
-            with urllib.request.urlopen(req, timeout=_CP_TOKEN_FETCH_TIMEOUT) as resp:
+            with _CP_TOKEN_OPENER.open(req, timeout=_CP_TOKEN_FETCH_TIMEOUT) as resp:
                 body = json.loads(resp.read())
🧰 Tools
🪛 Ruff (0.15.9)

[error] 71-74: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


[error] 75-75: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/runners/ambient-runner/ambient_runner/_grpc_client.py` around
lines 71 - 75, The code uses urllib.request.urlopen(req, ...) which honors
*_PROXY env vars and can leak the ServiceAccount bearer token (sa_token) through
proxies; replace that call with a proxyless opener created via
urllib.request.build_opener(urllib.request.ProxyHandler({})) and call
opener.open(req, timeout=_CP_TOKEN_FETCH_TIMEOUT) so the cp_token_url request
(where sa_token is set in headers) bypasses any HTTP proxies. Locate the request
construction and the urlopen call around cp_token_url / sa_token /
_CP_TOKEN_FETCH_TIMEOUT in _grpc_client.py and use the proxyless opener only for
this internal auth fetch.

@markturansky markturansky merged commit 00c1a24 into alpha Apr 4, 2026
44 checks passed
@markturansky markturansky deleted the feat/RHOAIENG-56711-cp-token-endpoint branch April 4, 2026 23:33
markturansky added a commit that referenced this pull request Apr 5, 2026
…CP_TOKEN_URL (#1214)

## Summary

Follow-up to #1213. The CP token endpoint was running but unreachable
because:

1. **No Service** — port 8080 (token server) had no ClusterIP Service,
so runner pods had no DNS target and the NetworkPolicy peer had no
stable reference
2. **Wrong `CP_RUNTIME_NAMESPACE`** — defaulted to
`ambient-code--runtime-int` but the actual deployed namespace is
`ambient-code--ambient-s0`, so `ensureAPIServerNetworkPolicy()` was
creating a NetworkPolicy that matched the wrong namespace selector —
causing `acpctl session events` to still 502

## Changes

- `ambient-control-plane-svc.yaml` — new ClusterIP Service exposing port
8080 on the CP pod
- `ambient-control-plane.yaml` — inject `CP_RUNTIME_NAMESPACE` via
downward API (`metadata.namespace`) so the NetworkPolicy peer label
matches the actual runtime namespace; set `CP_TOKEN_URL` to the FQDN of
the new Service

## Test plan

- [ ] Deploy to int spoke
- [ ] Verify `oc get svc ambient-control-plane -n
ambient-code--ambient-s0` exists with port 8080
- [ ] Start new session; verify `allow-ambient-api-server` NetworkPolicy
in session namespace uses correct namespace selector
(`ambient-code--ambient-s0`)
- [ ] `acpctl session events <id>` streams without 502

🤖 Generated with [Claude Code](https://claude.ai/code)


## Summary by CodeRabbit

* **New Features**
* Added ambient control plane service with token authentication
capability
* Configured control plane to integrate with token service endpoint for
runtime authentication


Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
Co-authored-by: Claude <noreply@anthropic.com>
