feat(control-plane): CP /token endpoint for runner gRPC auth RHOAIENG-56711#1213
Conversation
…wal RHOAIENG-56711 Documents the synchronous token fetch endpoint on the CP as the replacement for async Secret-write renewal. Covers the problem (three-way async race against OIDC TTL), solution (GET /token with K8s SA TokenReview auth), bootstrap vs renewal split, new internal/tokenserver packages, runner reconnect changes, and AMBIENT_CP_TOKEN_URL env var injection. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
CP creates the runner pod, so it is always up before the runner's first token request. Retaining a Secret write loop adds complexity and a second failure mode with identical blast radius. SA token automount flipped to true; BOT_TOKEN removed from env vars table and bootstrap split removed. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
…h RHOAIENG-56711 Eliminates the async BOT_TOKEN Secret push+refresh loop in favour of a synchronous pull model:
- Add internal/tokenserver package: HTTP server on :8080 with GET /token handler that validates the caller's K8s SA token via TokenReview, checks the runner SA pattern (system:serviceaccount:*:session-*-sa), then mints and returns a fresh API token via the existing OIDCTokenProvider.
- Wire token server goroutine into runKubeMode() alongside the informer.
- Add CPTokenListenAddr / CPTokenURL fields to ControlPlaneConfig and KubeReconcilerConfig; read from CP_TOKEN_LISTEN_ADDR / CP_TOKEN_URL env.
- kube_reconciler: remove ensureSecret, StartTokenRefreshLoop, refreshRunnerToken, refreshAllRunningTokens; replace BOT_TOKEN secret env injection with AMBIENT_CP_TOKEN_URL; set automountServiceAccountToken true on SA and pod so the K8s-mounted SA token is available at the standard path.
- runner _grpc_client.py: add _fetch_token_from_cp() that reads the pod SA token and calls the CP /token endpoint; from_env() calls it when AMBIENT_CP_TOKEN_URL is set, falls back to BOT_TOKEN env var for local dev; reconnect() refreshes via CP endpoint on every gRPC reconnect.

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
📝 Walkthrough
Adds a control-plane HTTP token server (GET /token) that validates Kubernetes ServiceAccount bearer tokens via TokenReview and mints API tokens. Removes per-session secret-based token injection/refresh; runners use their mounted SA token to synchronously request API tokens from the CP endpoint.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Runner as Runner Pod
    participant K8s as Kubernetes API
    participant CP as CP Token Server
    participant TP as TokenProvider
    Runner->>Runner: Read SA token (mounted)
    Runner->>CP: GET /token\nAuthorization: Bearer {SA_token}
    CP->>K8s: TokenReview(SA_token)
    K8s-->>CP: {authenticated, user: runner-sa}
    CP->>CP: Validate runner SA name
    CP->>TP: Token(ctx)
    TP-->>CP: {api_token, expires_at}
    CP-->>Runner: {"token": api_token}
    Runner->>Runner: Use api_token for gRPC auth
```
Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 2 warnings)
✅ Passed checks (3 passed)
markturansky
left a comment
Summary
This is a well-architected solution to a real production issue. The shift from async Secret-push to synchronous token-pull eliminates the 3-way race (CP write → kubelet propagate → runner read) that was causing UNAUTHENTICATED loops. The implementation is clean, removes significant complexity (~200 LOC deleted), and uses K8s-native authentication properly.
Strengths
✅ Clear problem statement: The async race between OIDC token TTL, Secret propagation delay, and runner read timing is well-documented and motivated
✅ Significant cleanup: Removes ensureSecret, StartTokenRefreshLoop, refreshRunnerToken, and refreshAllRunningTokens — entire refresh mechanism eliminated
✅ Security: K8s TokenReview API validation is the correct approach for SA token verification
✅ Developer experience: Fallback to BOT_TOKEN env var preserves local dev workflow
✅ Documentation: Excellent spec in control-plane.spec.md explaining rationale and design decisions
✅ Production ready: Proper timeouts (TokenReview: 10s, HTTP server read/write: 10s), health endpoint, graceful shutdown
Issues & Questions
1. Goroutine error handling (main.go:180-191)

```go
errCh := make(chan error, 1)
go func() {
    if err := startTokenServer(ctx, cfg, tokenProvider); err != nil {
        errCh <- err
    }
}()
infErr := inf.Run(ctx)
select {
case tsErr := <-errCh:
    return fmt.Errorf("token server: %w", tsErr)
default:
}
```

Issue: This only catches startup errors. If the token server starts successfully but crashes later (e.g., K8s client fails after 5 minutes), that error is silently lost. The select runs once after inf.Run() returns, but by then the token server goroutine may have exited long ago.
Suggestion: Either:
- Use an error group pattern (`golang.org/x/sync/errgroup`) to properly coordinate goroutine lifetimes
- Or accept that token server crashes are non-fatal and just log them (since runners will retry)
2. MCP sidecar compatibility (kube_reconciler.go:453)

```go
envVar("AMBIENT_CP_TOKEN_URL", r.cfg.CPTokenURL),
```

The MCP sidecar now receives AMBIENT_CP_TOKEN_URL instead of BOT_TOKEN, but I don't see corresponding changes to the MCP sidecar implementation. Does the MCP sidecar code support fetching tokens from the CP endpoint, or does it still expect the AMBIENT_TOKEN env var to contain the actual token?
Suggestion: Verify MCP sidecar compatibility or document that this is a follow-up PR.
3. No retry logic in runner (_grpc_client.py:15-42)

```python
def _fetch_token_from_cp(cp_token_url: str) -> str:
    ...
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = json.loads(resp.read())
```

If the CP token endpoint is temporarily unavailable (e.g., during a rolling update), the runner will crash on startup or fail on reconnect with no retry. This makes deployments brittle.
Suggestion: Add exponential backoff retry (e.g., 3 attempts with 1s/2s/4s delays) for transient network issues.
4. Unused token expiration field (tokenserver/handler.go:597)

```go
type tokenResponse struct {
    Token     string `json:"token"`
    ExpiresAt string `json:"expires_at,omitempty"`
}
```

The response includes expires_at but it's always empty and the runner doesn't use it. This could enable proactive token refresh before expiry.
Suggestion: Either populate expires_at from OIDCTokenProvider metadata, or remove the field if not needed.
5. Missing test coverage
No unit tests for:
- `internal/tokenserver` package (TokenReview validation, SA pattern matching, error cases)
- Python `_fetch_token_from_cp()` function
- Integration test for token server → runner auth flow
The test plan in the PR description is entirely manual.
Suggestion: Add at minimum:
- Unit tests for `isRunnerSA()` with various username patterns
- Unit test for `handleToken()` with mocked TokenReview responses
- Python unit test for token fetch with mocked HTTP responses
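For the last item, a sketch of what such a test could look like. The `_fetch_token_from_cp` stand-in below only mirrors the shape described in this PR (the real function also reads the mounted SA token and validates the URL), so treat it as illustrative:

```python
import json
import unittest
from unittest import mock
import urllib.request


def _fetch_token_from_cp(cp_token_url: str) -> str:
    # Stand-in mirroring the PR's described flow (hypothetical simplification).
    req = urllib.request.Request(cp_token_url, headers={"Authorization": "Bearer sa-token"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = json.loads(resp.read())
    return body["token"]


class FetchTokenTest(unittest.TestCase):
    def test_parses_token_from_response(self):
        # Fake response object usable as a context manager returning JSON bytes.
        fake = mock.MagicMock()
        fake.read.return_value = json.dumps({"token": "api-token-123"}).encode()
        fake.__enter__.return_value = fake
        with mock.patch("urllib.request.urlopen", return_value=fake):
            self.assertEqual(_fetch_token_from_cp("http://cp:8080/token"), "api-token-123")
```

Patching `urllib.request.urlopen` keeps the test hermetic; error-path cases (4xx via `urllib.error.HTTPError`, malformed JSON) would follow the same pattern.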
6. HTTP error handling in Python (_grpc_client.py:34)

```python
with urllib.request.urlopen(req, timeout=10) as resp:
    body = json.loads(resp.read())
```

This will raise urllib.error.HTTPError for 4xx/5xx responses, but the exception message won't include the response body (which likely contains the CP's error message like "forbidden" or "unauthorized").
Suggestion:

```python
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = json.loads(resp.read())
except urllib.error.HTTPError as e:
    error_body = e.read().decode('utf-8', errors='ignore')
    raise RuntimeError(f"CP token endpoint returned {e.code}: {error_body}") from e
```

7. Hardcoded token fetch timeout
urllib.request.urlopen(req, timeout=10) uses a 10s timeout, but this isn't configurable. In slow clusters or during CP restarts, this might be too aggressive.
Suggestion: Make timeout configurable via env var (e.g., AMBIENT_CP_TOKEN_TIMEOUT_SECS) with 10s default.
Minor observations
- go.mod growth: Adds `k8s.io/api` and ~10 transitive deps. Necessary but worth noting for dependency audits.
- Documentation accuracy: The spec says "Status: 🔲 planned — RHOAIENG-56711" but should be updated to "✅ implemented" when merged.
- Service pattern brittleness: `isRunnerSA()` assumes `session-*-sa` naming. If SA naming changes, update this pattern or make it configurable.
- Token server port: Hardcoded `:8080` conflicts with many default ports. Configurable via `CP_TOKEN_LISTEN_ADDR` but worth documenting the port choice.
Recommendation
Approve with minor fixes for issues #3 (retry logic) and #6 (HTTP error messages). Issues #1, #2, and #5 can be follow-up PRs if time-constrained.
This PR significantly improves production reliability by eliminating a real race condition. The architecture is sound and the implementation is clean. Great work! 🎉
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/ambient-control-plane/cmd/ambient-control-plane/main.go`:
- Around line 185-198: runKubeMode currently starts startTokenServer in a
goroutine and then calls inf.Run(ctx) but never observes errCh while inf.Run is
running, so a token server failure is dropped; change runKubeMode to propagate
token server errors by selecting between the token error channel and the
informer run (or cancelling the context) — for example, start startTokenServer
as you do but then either (a) run inf.Run in a goroutine and select on errCh and
a result channel from inf.Run, or (b) after starting both goroutines, use a
select that waits for tsErr from errCh and returns fmt.Errorf("token server:
%w", tsErr) (or cancels ctx) if received, and otherwise returns infErr when
inf.Run completes; reference runKubeMode, startTokenServer, errCh, and inf.Run
to locate the change.
In `@components/ambient-control-plane/internal/config/config.go`:
- Around line 76-77: The CPTokenURL field (CPTokenURL) is currently optional
(os.Getenv) which leaves pods without a credential source; change the config to
require AMBIENT_CP_TOKEN_URL by replacing the optional read with a required
lookup and failing fast when empty: use the existing envOrDefault helper or add
an envMust helper to read "CP_TOKEN_URL" (or "AMBIENT_CP_TOKEN_URL" if that's
the reconciler-injected name) and return an error from the config
constructor/newConfig function when the value is empty so the process exits
early; update any callers of the config constructor to handle the error
accordingly.
In `@components/ambient-control-plane/internal/tokenserver/handler.go`:
- Around line 56-58: The current check in the token handler (isRunnerSA(...) and
the branches that log and return forbidden) is too weak; instead resolve the
namespace/name from username, fetch the ServiceAccount via the Kubernetes client
available on the handler (e.g., h.kubeClient or the handler method that
processes the request), and verify this SA is a CP-managed session SA before
minting the token. Concretely: in the token handling code paths that call
isRunnerSA (and the separate block around lines 99-108), parse username to get
namespace and name, use the Kubernetes API to GET the ServiceAccount, and
validate it was created/owned by the control plane (check a reconciler-set label
or ownerReference) before proceeding to mint; if the SA is missing or not
CP-managed, return http.Error(w, "forbidden", http.StatusForbidden) and log via
h.logger.
In `@components/runners/ambient-runner/ambient_runner/_grpc_client.py`:
- Around line 3-6: Validate the AMBIENT_CP_TOKEN_URL before calling
urllib.request.urlopen: parse the AMBIENT_CP_TOKEN_URL variable (use
urllib.parse.urlparse) and ensure the scheme is a safe allowed value (e.g.,
"https") and host is non-empty; if the scheme is not allowed, log an error via
logging and skip/abort sending the Kubernetes SA token instead of calling
urllib.request.urlopen. Update the code paths that reference
AMBIENT_CP_TOKEN_URL and urllib.request.urlopen to perform this check and fail
closed to prevent token exfiltration.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8377f68e-8fe2-4a78-92c0-9ec27cfb444d
⛔ Files ignored due to path filters (1)
components/ambient-control-plane/go.sum is excluded by !**/*.sum, !**/go.sum
📒 Files selected for processing (8)
- components/ambient-control-plane/cmd/ambient-control-plane/main.go
- components/ambient-control-plane/go.mod
- components/ambient-control-plane/internal/config/config.go
- components/ambient-control-plane/internal/reconciler/kube_reconciler.go
- components/ambient-control-plane/internal/tokenserver/handler.go
- components/ambient-control-plane/internal/tokenserver/server.go
- components/runners/ambient-runner/ambient_runner/_grpc_client.py
- docs/internal/design/control-plane.spec.md
```go
CPTokenListenAddr: envOrDefault("CP_TOKEN_LISTEN_ADDR", ":8080"),
CPTokenURL:        os.Getenv("CP_TOKEN_URL"),
```
Make CP_TOKEN_URL required for kube session pods.
Line 77 leaves the new auth path optional, but the reconciler now injects only AMBIENT_CP_TOKEN_URL, and the runner falls back to BOT_TOKEN only when that value is empty. Since this PR removes BOT_TOKEN injection, omitting CP_TOKEN_URL leaves provisioned runner/MCP pods with no credential source and they fail auth on startup/reconnect.
Suggested fix

```diff
 switch cfg.PlatformMode {
 case "standard", "mpp":
 default:
 	return nil, fmt.Errorf("unknown PLATFORM_MODE %q: must be one of standard, mpp", cfg.PlatformMode)
 }
+
+	if cfg.Mode == "kube" && cfg.CPTokenURL == "" {
+		for _, name := range cfg.Reconcilers {
+			if name == "kube" {
+				return nil, fmt.Errorf("CP_TOKEN_URL must be set when the kube reconciler is enabled")
+			}
+		}
+	}
 return cfg, nil
 }
```
```go
if !isRunnerSA(username) {
	h.logger.Warn().Str("username", username).Msg("token request: username does not match runner SA pattern")
	http.Error(w, "forbidden", http.StatusForbidden)
```
Name-pattern auth is too weak for /token.
Lines 56-58 and 99-108 mint the API token for any authenticated service account whose username matches system:serviceaccount:*:session-*-sa. TokenReview proves the caller owns some service account, but not that it is one created by this control plane. Any workload able to run under or create a matching SA in any namespace can obtain the API token.
Please bind issuance to CP-managed session SAs before returning a token, e.g. by resolving the namespace/name from username and verifying the ServiceAccount is one the reconciler created.
Also applies to: 99-108
… runner pods Session namespace NetworkPolicies (provisioned by the MPP tenant operator) block cross-namespace ingress by default. The api-server pod (in the CP runtime namespace) could not reach runner pod port 8001, causing `acpctl session events` to receive a 502 from the ingress. Add ensureAPIServerNetworkPolicy() called from ensureNamespaceExists(): creates a NetworkPolicy named allow-ambient-api-server in each session namespace that allows TCP:8001 ingress from pods in the CP runtime namespace (matched by kubernetes.io/metadata.name label). Idempotent — skips creation if the policy already exists. Also add NetworkPolicyGVR, GetNetworkPolicy, and CreateNetworkPolicy to KubeClient to support the new operation. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Follow-up thoughts

After diving deeper into this change, I want to highlight what makes this PR particularly elegant from a systems design perspective. The core insight: Token propagation delays aren't a bug to be optimized away — they're inherent to the Secret-push model. By flipping to synchronous pull, you've eliminated an entire class of failure (race conditions) rather than trying to tune timeouts and retry logic around it.

However, there's one critical gap that could bite in production: the runner has no retry logic when calling the CP token endpoint. Since the entire value prop here is reliability (eliminating races), it feels inconsistent to introduce a new single-point-of-failure without retry. A simple exponential backoff (3 attempts, 1s/2s/4s) would make this bulletproof for transient CP unavailability during deployments.

Minor question: The MCP sidecar gets

Everything else looks production-ready. The TokenReview validation, PKCE challenge, graceful shutdown — all the hard parts are done correctly. Just needs that retry safety net.
…1213
main.go: run informer and token server concurrently in separate goroutines and select on both error channels — a token server crash now immediately fails runKubeMode instead of being silently swallowed after inf.Run returns.
tokenserver/handler.go: remove dead expires_at field from tokenResponse; OIDCTokenProvider does not expose expiry metadata and the runner never read it.
_grpc_client.py:
- Validate AMBIENT_CP_TOKEN_URL scheme (http/https only, no credentials) before sending the pod SA token, preventing exfiltration via misconfigured or malicious URLs (file://, ftp://, http://user:pass@...).
- Add exponential backoff retry (attempts: 1, backoff: 1s/2s) around _fetch_token_from_cp() to handle transient CP unavailability at runner startup or gRPC reconnect.
- Catch urllib.error.HTTPError separately and include the response body in the RuntimeError message for debuggability.

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/ambient-control-plane/internal/tokenserver/handler.go (1)

55-61: ⚠️ Potential issue | 🔴 Critical — Authorize against a CP-managed ServiceAccount, not just a name pattern.
Lines 55-61 only check `system:serviceaccount:*:session-*-sa` before minting the API token. TokenReview proves the caller owns some ServiceAccount, but any workload that can run as or create a matching SA in any namespace can pass this gate and get a token. Parse the namespace/name from `username` and verify the ServiceAccount is one the reconciler created (for example via a label or owner reference) before calling `Token()`.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/runners/ambient-runner/ambient_runner/_grpc_client.py`:
- Around line 71-75: The code uses urllib.request.urlopen(req, ...) which honors
*_PROXY env vars and can leak the ServiceAccount bearer token (sa_token) through
proxies; replace that call with a proxyless opener created via
urllib.request.build_opener(urllib.request.ProxyHandler({})) and call
opener.open(req, timeout=_CP_TOKEN_FETCH_TIMEOUT) so the cp_token_url request
(where sa_token is set in headers) bypasses any HTTP proxies. Locate the request
construction and the urlopen call around cp_token_url / sa_token /
_CP_TOKEN_FETCH_TIMEOUT in _grpc_client.py and use the proxyless opener only for
this internal auth fetch.
---
Duplicate comments:
In `@components/ambient-control-plane/internal/tokenserver/handler.go`:
- Around line 55-61: The handler currently only validates the username pattern
via isRunnerSA(username) before minting an API token; instead parse the
ServiceAccount namespace/name from username (the TokenReview-provided string),
then verify that this ServiceAccount is managed by the control-plane reconciler
(e.g., check the SA's labels or ownerReferences via the k8s client) before
calling h.tokenProvider.Token(r.Context()); update the logic in the handler
around isRunnerSA, username parsing, and the pre-Token() check to reject
requests where the parsed ServiceAccount is not owned/labeled by the reconciler.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 70b1fd11-4b30-422d-84d7-4d1a74d16dd7
📒 Files selected for processing (3)
- components/ambient-control-plane/cmd/ambient-control-plane/main.go
- components/ambient-control-plane/internal/tokenserver/handler.go
- components/runners/ambient-runner/ambient_runner/_grpc_client.py
```python
req = urllib.request.Request(
    cp_token_url,
    headers={"Authorization": f"Bearer {sa_token}"},
)
with urllib.request.urlopen(req, timeout=_CP_TOKEN_FETCH_TIMEOUT) as resp:
```
🧩 Analysis chain
🏁 Script executed:
cat -n components/runners/ambient-runner/ambient_runner/_grpc_client.py | head -100
Repository: ambient-code/platform
Length of output: 4134
🏁 Script executed:
rg "ProxyHandler|NO_PROXY|_PROXY" components/runners/ambient-runner/ --no-headingRepository: ambient-code/platform
Length of output: 47
🏁 Script executed:
git ls-files "components/runners/ambient-runner/**/*.py" | head -20
Repository: ambient-code/platform
Length of output: 1326
🏁 Script executed:
# Check if there are any other proxy handling mechanisms in the file
cat -n components/runners/ambient-runner/ambient_runner/_grpc_client.py | sed -n '1,150p'
Repository: ambient-code/platform
Length of output: 6209
🏁 Script executed:
# Check for any proxy configuration or environment handling in the entire runner
rg "http_proxy|https_proxy|NO_PROXY|ProxyHandler" components/runners/ambient-runner/ -iRepository: ambient-code/platform
Length of output: 47
🏁 Script executed:
# Look for any existing documentation or tests related to proxy handling
find components/runners/ambient-runner -name "*.py" -type f -exec grep -l "proxy\|Proxy" {} \;
Repository: ambient-code/platform
Length of output: 337
🏁 Script executed:
# Check the files that mention proxy to see context
cat -n components/runners/ambient-runner/ambient_runner/endpoints/run.py | grep -C 5 -i proxy
Repository: ambient-code/platform
Length of output: 733
🏁 Script executed:
cat -n components/runners/ambient-runner/ambient_runner/platform/auth.py | grep -C 5 -i proxy
Repository: ambient-code/platform
Length of output: 783
🏁 Script executed:
# Check how AMBIENT_CP_TOKEN_URL is used/documented
rg "AMBIENT_CP_TOKEN_URL" components/runners/ambient-runner/ -B 2 -A 2Repository: ambient-code/platform
Length of output: 573
🏁 Script executed:
# Check how the cp_token_url is typically set and what value it might have
rg "cp_token_url\|AMBIENT_CP_TOKEN_URL" components/runners/ambient-runner/ -B 2 -A 2Repository: ambient-code/platform
Length of output: 47
🏁 Script executed:
# Check if there are any tests or examples showing typical usage
find components/runners/ambient-runner -name "test*.py" -o -name "*_test.py" | head -5Repository: ambient-code/platform
Length of output: 379
🏁 Script executed:
# Search more broadly for cp_token_url usage
rg "cp_token_url" components/runners/ambient-runner/ -B 3 -A 3Repository: ambient-code/platform
Length of output: 6945
🏁 Script executed:
# Check the full initialization code to see how cp_token_url gets passed
cat -n components/runners/ambient-runner/ambient_runner/_grpc_client.py | sed -n '136,180p'
Repository: ambient-code/platform
Length of output: 2030
🏁 Script executed:
# Look for environment variable usage or configuration
rg "os.environ\|os.getenv" components/runners/ambient-runner/ambient_runner/_grpc_client.py -B 2 -A 2Repository: ambient-code/platform
Length of output: 47
**Prevent ServiceAccount token leakage through HTTP proxies.**

Line 75 uses `urllib.request.urlopen()`, which honors `*_PROXY` environment variables by default. If the runner pod inherits proxy configuration without `NO_PROXY` covering the control-plane service, the ServiceAccount bearer token (line 73) is routed through that proxy on every token refresh, exposing the credential. Use a proxyless opener for this internal auth request.
Suggested fix

```diff
+_CP_TOKEN_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))
+
 def _fetch_token_from_cp(cp_token_url: str) -> str:
     """Fetch a fresh API token from the CP /token endpoint using the pod SA token.
     ...
     """
     _validate_cp_token_url(cp_token_url)
     ...
-    with urllib.request.urlopen(req, timeout=_CP_TOKEN_FETCH_TIMEOUT) as resp:
+    with _CP_TOKEN_OPENER.open(req, timeout=_CP_TOKEN_FETCH_TIMEOUT) as resp:
         body = json.loads(resp.read())
```

🧰 Tools
🪛 Ruff (0.15.9)
[error] 71-74: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
[error] 75-75: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/runners/ambient-runner/ambient_runner/_grpc_client.py` around
lines 71 - 75, The code uses urllib.request.urlopen(req, ...) which honors
*_PROXY env vars and can leak the ServiceAccount bearer token (sa_token) through
proxies; replace that call with a proxyless opener created via
urllib.request.build_opener(urllib.request.ProxyHandler({})) and call
opener.open(req, timeout=_CP_TOKEN_FETCH_TIMEOUT) so the cp_token_url request
(where sa_token is set in headers) bypasses any HTTP proxies. Locate the request
construction and the urlopen call around cp_token_url / sa_token /
_CP_TOKEN_FETCH_TIMEOUT in _grpc_client.py and use the proxyless opener only for
this internal auth fetch.
…CP_TOKEN_URL (#1214)

## Summary

Follow-up to #1213. The CP token endpoint was running but unreachable because:

1. **No Service** — port 8080 (token server) had no ClusterIP Service, so runner pods had no DNS target and the NetworkPolicy peer had no stable reference
2. **Wrong `CP_RUNTIME_NAMESPACE`** — defaulted to `ambient-code--runtime-int` but the actual deployed namespace is `ambient-code--ambient-s0`, so `ensureAPIServerNetworkPolicy()` was creating a NetworkPolicy that matched the wrong namespace selector — causing `acpctl session events` to still 502

## Changes

- `ambient-control-plane-svc.yaml` — new ClusterIP Service exposing port 8080 on the CP pod
- `ambient-control-plane.yaml` — inject `CP_RUNTIME_NAMESPACE` via the downward API (`metadata.namespace`) so the NetworkPolicy peer label matches the actual runtime namespace; set `CP_TOKEN_URL` to the FQDN of the new Service

## Test plan

- [ ] Deploy to int spoke
- [ ] Verify `oc get svc ambient-control-plane -n ambient-code--ambient-s0` exists with port 8080
- [ ] Start a new session; verify the `allow-ambient-api-server` NetworkPolicy in the session namespace uses the correct namespace selector (`ambient-code--ambient-s0`)
- [ ] `acpctl session events <id>` streams without 502

🤖 Generated with [Claude Code](https://claude.ai/code)

## Summary by CodeRabbit

- **New Features**
  - Added ambient control plane service with token authentication capability
  - Configured control plane to integrate with token service endpoint for runtime authentication

Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
Co-authored-by: Claude <noreply@anthropic.com>
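A minimal sketch of the two manifest changes listed above — an illustration, not the actual files: the pod label, Service name, and selector are assumptions from the PR text.

```yaml
# ambient-control-plane-svc.yaml — ClusterIP Service for the token server port
apiVersion: v1
kind: Service
metadata:
  name: ambient-control-plane
spec:
  type: ClusterIP
  selector:
    app: ambient-control-plane   # assumed pod label
  ports:
    - name: token
      port: 8080
      targetPort: 8080
---
# ambient-control-plane.yaml (env fragment) — downward-API namespace + token URL
env:
  - name: CP_RUNTIME_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace   # resolves to the actual deployed namespace
  - name: CP_TOKEN_URL
    value: "http://ambient-control-plane.ambient-code--ambient-s0.svc.cluster.local:8080/token"
```

The downward-API reference is what removes the hardcoded-namespace failure mode: whatever namespace the CP is deployed into becomes the value `ensureAPIServerNetworkPolicy()` matches against.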
## Summary

- `internal/tokenserver` package — `GET /token` validates the caller's K8s SA token via TokenReview, checks the `system:serviceaccount:*:session-*-sa` pattern, and returns a fresh API token via `OIDCTokenProvider`
- `_fetch_token_from_cp()` reads the pod SA token from the standard K8s mount and calls CP `/token` on startup and every gRPC reconnect; falls back to the `BOT_TOKEN` env var for local dev when `AMBIENT_CP_TOKEN_URL` is unset
- `kube_reconciler` cleaned up — removes `ensureSecret`, `StartTokenRefreshLoop`, `refreshRunnerToken`, `refreshAllRunningTokens`; sets `automountServiceAccountToken: true`; injects `AMBIENT_CP_TOKEN_URL` instead of the `BOT_TOKEN` secret ref

## Design rationale
CP is the thing rotating the secret — if it's down for `/token` it's down for cycling too. Synchronous pull eliminates the 3-way race (CP ticker → kubelet propagation → runner read) and removes a class of stale-token failures.

Spec: `docs/internal/design/control-plane.spec.md`

## Test plan
- [ ] CP starts with `CP_TOKEN_LISTEN_ADDR=:8080` and `CP_TOKEN_URL=http://<cp-svc>:8080/token`
- [ ] Runner pod has `automountServiceAccountToken: true` and the `AMBIENT_CP_TOKEN_URL` env set
- [ ] Runner logs `[GRPC CLIENT] Fetched fresh API token from CP token endpoint`
- [ ] `GET /healthz` on the token server returns 200
- [ ] Unset `AMBIENT_CP_TOKEN_URL`, set `BOT_TOKEN`; confirm fallback path works

🤖 Generated with Claude Code
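The fallback behaviour exercised by the last test-plan item can be sketched as follows. This is a hypothetical helper, not the real `from_env()` in `_grpc_client.py`; the CP fetch function is injected as a parameter so the sketch stays self-contained.

```python
import os


def resolve_api_token(fetch_from_cp):
    """Pick the token source as described above: prefer the CP /token
    endpoint when AMBIENT_CP_TOKEN_URL is set, else fall back to the
    BOT_TOKEN env var for local development."""
    cp_url = os.environ.get("AMBIENT_CP_TOKEN_URL")
    if cp_url:
        # In-cluster path: synchronous pull from the CP token endpoint.
        return fetch_from_cp(cp_url)
    token = os.environ.get("BOT_TOKEN")
    if not token:
        raise RuntimeError(
            "set AMBIENT_CP_TOKEN_URL (in-cluster) or BOT_TOKEN (local dev)"
        )
    return token
```

Injecting the fetch function also makes the selection logic trivially testable without a running CP.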
## Summary by CodeRabbit

- New Features
- Configuration
- Documentation
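To make the `GET /token` authorization described in the summary concrete: after TokenReview succeeds, the server only serves callers whose username matches the runner-SA pattern. A hypothetical Python sketch of that check — the real implementation lives in the Go `internal/tokenserver` package, and the function name here is illustrative.

```python
import fnmatch


def is_runner_sa(username: str) -> bool:
    """True only for TokenReview usernames of the form
    system:serviceaccount:<namespace>:session-<id>-sa."""
    parts = username.split(":")
    if len(parts) != 4 or parts[0] != "system" or parts[1] != "serviceaccount":
        return False
    # Match the SA-name segment alone, so "*" cannot swallow extra ":" segments.
    return fnmatch.fnmatchcase(parts[3], "session-*-sa")
```

Splitting on `:` before globbing matters: matching the whole username against `system:serviceaccount:*:session-*-sa` would let `*` span colon boundaries and accept malformed usernames.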