fix(api-server): authorize runner OIDC service account in WatchSessionMessages#1183
Conversation
Caution: Review failed. Pull request was closed or merged during review.

📝 Walkthrough

This PR introduces session-based service account authentication, improves credential patch change detection to avoid unnecessary updates, and adds background token refresh for running SDK sessions. Authorization gates skip ownership validation when requests originate from configured service accounts. Pod volumes now mount credential tokens, and periodic background loops refresh them.
Sequence Diagrams

```mermaid
sequenceDiagram
    participant Timer as 10-min Ticker
    participant Reconciler as SimpleKubeReconciler
    participant KubeClient as KubeClient
    participant API as Kubernetes API
    participant Secret as Session Secret
    Timer->>Reconciler: tick
    Reconciler->>Reconciler: refreshAllRunningTokens()
    Reconciler->>KubeClient: ListSessions(Running phase)
    KubeClient->>API: List SDK Sessions
    API-->>KubeClient: session list
    KubeClient-->>Reconciler: sessions
    loop for each running session
        Reconciler->>Reconciler: compute namespace
        Reconciler->>Reconciler: refreshRunnerToken(sessionID)
        Reconciler->>KubeClient: GetSecret(by sessionID name)
        KubeClient->>API: Get Secret
        API-->>Secret: retrieved
        Secret-->>KubeClient: secret data
        KubeClient-->>Reconciler: secret obj
        Reconciler->>Reconciler: resolve fresh token
        Reconciler->>KubeClient: UpdateSecret(api-token)
        KubeClient->>API: Update Secret
        API-->>KubeClient: updated
    end
```

```mermaid
sequenceDiagram
    participant Source as Event Source
    participant Dispatcher as dispatchEvent()
    participant Handler as Handler[i]
    participant Retry as scheduleRetry()
    participant Queue as Retry Queue
    participant Timer as time.NewTimer()
    Source->>Dispatcher: ResourceEvent
    Dispatcher->>Dispatcher: iterate handlers by index
    rect rgba(200, 100, 100, 0.5)
        Dispatcher->>Handler: invoke handler[i]
        Handler-->>Dispatcher: error
    end
    Dispatcher->>Retry: ctx, event, handlerIndex, attempt, err
    Retry->>Retry: compute exponential backoff + cap
    Retry->>Queue: enqueue retryEvent{event, handlerIndex, attempt, fireAt}
    Retry->>Retry: log warning with handler index
    Queue->>Timer: wait until fireAt
    Timer-->>Queue: timer fires
    Queue->>Dispatcher: dispatchHandler(ctx, event, handlerIndex, attempt)
    Dispatcher->>Handler: invoke handler[handlerIndex] only
    Handler-->>Dispatcher: result
```
Important: Pre-merge checks failed. Please resolve all errors before merging; addressing warnings is optional.
❌ Failed checks (1 error, 2 warnings)
✅ Passed checks (3 passed)
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/ambient-cli/cmd/acpctl/apply/cmd.go`:
- Around line 25-31: The Credential overlay fields provider, token, url, and
email are being dropped during kustomize handling because strategicMerge() does
not carry those keys through, so acpctl apply -k can reconcile an incomplete
Credential; update strategicMerge() (and the code paths that call it) so that
when the resource Kind is "Credential" it copies/merges provider, token, url,
and email from the overlay into the resulting merged map/Unstructured before
applyCredential() runs (or alternatively make applyCredential() read from the
merged Unstructured returned by strategicMerge() rather than the original),
ensuring the overlay values are preserved and used during apply.
- Around line 290-319: buildCredentialPatch currently marks url, email, token,
labels and annotations as changed whenever present instead of comparing them to
existing values; update buildCredentialPatch (and use
sdktypes.NewCredentialPatchBuilder) to only call
patch.Url/Email/Token/Labels/Annotations and set changed=true when the new value
differs from the corresponding existing field (compare token after os.ExpandEnv
to existing.Token, compare URL and Email strings, and compare labels/annotations
by converting with marshalStringMap and using a map-equality check such as
reflect.DeepEqual or a helper equality function).
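The change-detection fix the comment asks for can be sketched as a standalone comparison step. This is a minimal illustration under assumed shapes: `credentialFields`, `diffCredential`, and the field set are invented here, not the real `sdktypes` API; only the comparison strategy (env-expand the token, deep-compare maps, flag a field only when it differs) follows the review.

```go
package main

import (
	"fmt"
	"os"
	"reflect"
)

// credentialFields holds the comparable fields of a Credential. This is an
// illustrative shape, not the real sdktypes structs.
type credentialFields struct {
	URL, Email, Token string
	Labels            map[string]string
}

// diffCredential returns the set of fields that actually differ, so a patch
// builder can skip unchanged ones and repeated applies become no-ops. The
// token is env-expanded before comparison, as the review suggests.
func diffCredential(existing, desired credentialFields) map[string]bool {
	changed := map[string]bool{}
	if desired.URL != "" && desired.URL != existing.URL {
		changed["url"] = true
	}
	if desired.Email != "" && desired.Email != existing.Email {
		changed["email"] = true
	}
	if tok := os.ExpandEnv(desired.Token); tok != "" && tok != existing.Token {
		changed["token"] = true
	}
	if desired.Labels != nil && !reflect.DeepEqual(desired.Labels, existing.Labels) {
		changed["labels"] = true
	}
	return changed
}

func main() {
	ex := credentialFields{URL: "https://ghe.example", Token: "tok1", Labels: map[string]string{"env": "demo"}}
	// Re-applying identical values should report nothing changed.
	fmt.Println(len(diffCredential(ex, ex)))
	// Only the differing field is flagged.
	fmt.Println(diffCredential(ex, credentialFields{URL: "https://other.example"}))
}
```

A real patch builder would then call the corresponding setter only for keys present in the returned map.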
In `@components/ambient-cli/demo-github.sh`:
- Around line 281-287: The current credential creation passes the GitHub PAT on
the command line via the `--token "${GITHUB_TOKEN_VALUE}"` argument (in the
`CRED_JSON=$( "$ACPCTL" credential create ... )` block), which exposes the
secret in process arguments; change the `credential create` invocation to avoid
`--token` and instead provide the PAT via a secure channel such as stdin (pipe
the token into the `"$ACPCTL" credential create` command) or use the
manifest/env-expansion mechanism supported by `credential create` (i.e.,
reference an env/manifest variable rather than passing `GITHUB_TOKEN_VALUE` on
argv) so the secret is not visible in process args or logs.
- Around line 302-333: The script currently looks up and binds the wrong role
name 'credential:reader'; change the JSON lookup to find
'credential:token-reader' instead of 'credential:reader' (update the python3
snippet that sets READER_ROLE_ID to search for 'credential:token-reader' and
consider renaming READER_ROLE_ID to TOKEN_READER_ROLE_ID for clarity), and when
creating the role-binding via the "$ACPCTL" create role-binding command, use
that token-reader role ID variable (the same variable used in the dim output and
passed to --role-id) so the binding grants credential:token-reader rather than
credential:reader.
In `@components/ambient-control-plane/internal/informer/informer.go`:
- Around line 88-92: The retry logic currently requeues the entire ResourceEvent
(type retryEvent) on any handler error which causes exponential fan-out; modify
retryEvent to include a handler identifier (e.g., failedHandlerID string or
handlerIndex int) and change the dispatch/enqueue logic (the code that writes to
retryCh and the event processing loop) so only the single failing handler is
retried instead of the whole event; update handler registration/dispatch code to
pass and consume that handler identifier (and track per-handler attempt counts
in retryEvent.attempt) so retries target a specific handler and do not duplicate
dispatches to all handlers.
- Around line 157-171: The retryLoop in Informer reads a single entry from
inf.retryCh and sleeps for that entry's fireAt, which blocks newer, earlier-fire
entries behind long waits; change the loop to always determine the earliest
pending fireAt before sleeping (e.g., drain or peek the channel into a
min-heap/priority queue keyed by re.fireAt, or maintain a separate priority
structure) and only sleep until that earliest timestamp, then pop and call
inf.dispatchEvent with the popped retry's event and attempt; ensure reads from
inf.retryCh insert into the priority queue and that ctx.Done cancels the sleep
so newer short-backoff retries are not starved by older long-backoff ones.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ee4f0025-0d42-4cfb-b8f3-f6dd57bd7d3b
📒 Files selected for processing (11)
- components/ambient-api-server/plugins/sessions/grpc_handler.go
- components/ambient-api-server/plugins/sessions/plugin.go
- components/ambient-cli/cmd/acpctl/agent/cmd.go
- components/ambient-cli/cmd/acpctl/apply/cmd.go
- components/ambient-cli/cmd/acpctl/delete/cmd.go
- components/ambient-cli/cmd/acpctl/describe/cmd.go
- components/ambient-cli/cmd/acpctl/get/cmd.go
- components/ambient-cli/demo-github.sh
- components/ambient-control-plane/internal/informer/informer.go
- components/ambient-control-plane/internal/reconciler/kube_reconciler.go
- components/ambient-control-plane/internal/reconciler/project_reconciler.go
💤 Files with no reviewable changes (1)
- components/ambient-control-plane/internal/reconciler/kube_reconciler.go
```go
func (inf *Informer) retryLoop(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case re := <-inf.retryCh:
			wait := time.Until(re.fireAt)
			if wait > 0 {
				select {
				case <-time.After(wait):
				case <-ctx.Done():
					return
				}
			}
			inf.dispatchEvent(ctx, re.event, re.attempt)
		}
	}
}
```
The retry worker can starve fresh retries behind older long-backoff entries.
This loop sleeps on the first dequeued retry instead of the earliest fireAt. If a 30s retry is read before a newer 2s retry, the 2s retry sits in retryCh until the 30s sleep finishes. That violates the configured backoff timing and delays retries across all resources far longer than intended.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/ambient-control-plane/internal/informer/informer.go` around lines
157 - 171, The retryLoop in Informer reads a single entry from inf.retryCh and
sleeps for that entry's fireAt, which blocks newer, earlier-fire entries behind
long waits; change the loop to always determine the earliest pending fireAt
before sleeping (e.g., drain or peek the channel into a min-heap/priority queue
keyed by re.fireAt, or maintain a separate priority structure) and only sleep
until that earliest timestamp, then pop and call inf.dispatchEvent with the
popped retry's event and attempt; ensure reads from inf.retryCh insert into the
priority queue and that ctx.Done cancels the sleep so newer short-backoff
retries are not starved by older long-backoff ones.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`:
- Around line 920-940: The refreshAllRunningTokens function currently only lists
page 1 with Size 500 so it misses sessions on later pages; change it to paginate
through all pages by looping with a page counter (e.g., page := 1; for { ...
page++ }) calling sdk.Sessions().List(ctx, &types.ListOptions{Page: page, Size:
500, Search: "phase = 'Running'"}), process each list.Items as before, and break
the loop when len(list.Items) == 0 or len(list.Items) < 500 (or when a
total/pages field indicates no more pages); ensure you continue on per-page List
errors by logging and either retrying or breaking as appropriate so
refreshRunnerToken(session.ID) is attempted for every running session across all
pages.
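The pagination loop described above can be sketched without the SDK. `listPage` below is a hypothetical in-memory shim standing in for `sdk.Sessions().List` with a Running-phase filter; only the loop shape (increment the page, stop on a short or empty page, process every item) reflects the suggested fix.

```go
package main

import "fmt"

// listPage returns at most size items from the 1-based page p; it stands in
// for sdk.Sessions().List with a Running-phase filter (hypothetical shim).
func listPage(all []string, p, size int) []string {
	lo := (p - 1) * size
	if lo >= len(all) {
		return nil
	}
	hi := lo + size
	if hi > len(all) {
		hi = len(all)
	}
	return all[lo:hi]
}

// forEachRunning pages through every running session, stopping on a short
// or empty page, instead of issuing a single fixed-size List call.
func forEachRunning(all []string, size int, fn func(id string)) {
	for p := 1; ; p++ {
		items := listPage(all, p, size)
		for _, id := range items {
			fn(id) // e.g. attempt refreshRunnerToken(id) here
		}
		if len(items) < size {
			break
		}
	}
}

func main() {
	var all []string
	for i := 0; i < 250; i++ {
		all = append(all, fmt.Sprintf("session-%d", i))
	}
	n := 0
	forEachRunning(all, 100, func(string) { n++ })
	fmt.Println(n) // visits all 250 sessions across three pages
}
```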
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e501c092-8ef5-42db-9418-190f720be46f
📒 Files selected for processing (3)
- components/ambient-control-plane/cmd/ambient-control-plane/main.go
- components/ambient-control-plane/internal/kubeclient/kubeclient.go
- components/ambient-control-plane/internal/reconciler/kube_reconciler.go
informer:
- Retry per failing handler (not whole event) to prevent exponential
fan-out when multiple handlers are registered for a resource
- Use time.NewTimer per entry instead of blocking sleep on dequeued item,
eliminating retry starvation when long-backoff entries precede short ones
- Add handler index to retryEvent; introduce dispatchHandler and
scheduleRetry helpers
control-plane:
- Token refresh loop now paginates all running sessions (100/page)
instead of capping at a single page of 500
cli/apply:
- strategicMerge now propagates Credential fields (provider, token,
url, email) so acpctl apply -k overlays work correctly
- buildCredentialPatch compares each field against the existing value
before marking changed, making repeated applies idempotent
demo-github.sh:
- Pass GitHub PAT via apply -f manifest with env-var expansion
($DEMO_GITHUB_PAT) rather than --token argv to avoid exposure in
ps/crash dumps; fetch credential ID with get credentials post-apply
- Look up credential:token-reader role (not credential:reader) so the
session can fetch tokens via GET /credentials/{id}/token
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/ambient-control-plane/internal/informer/informer.go (1)
158-175: ⚠️ Potential issue | 🟠 Major: Retry timing is still FIFO, not deadline-based.
Lines 163-172 sleep on the first dequeued item. If a long-backoff retry is read before a newer short-backoff retry, the short retry waits behind it; once `retryCh` backs up, the send on line 220 blocks fresh event handling too. Pick the earliest `fireAt` before sleeping.
Also applies to: 219-220
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/ambient-control-plane/internal/informer/informer.go` around lines 158 - 175, The retryLoop currently sleeps on the first received retry (from inf.retryCh) which makes timing FIFO instead of deadline-based; change retryLoop to, after the initial receive of re from inf.retryCh, perform non-blocking reads of inf.retryCh to collect any queued retries, compute the earliest fireAt among them, and use that earliest fireAt to decide the wait; push the remaining retries into a local min-heap or reinsert them into the retry queue (or replace inf.retryCh with a priority queue stored on Informer) so dispatchHandler still receives retries in deadline order and senders to inf.retryCh do not block. Ensure the logic references Informer.retryLoop, inf.retryCh, re.fireAt and uses inf.dispatchHandler to process the selected retry.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/ambient-control-plane/internal/informer/informer.go`:
- Around line 158-175: The retryLoop is replaying stale events directly via
inf.dispatchHandler (function Informer.retryLoop), bypassing ordered dispatch
and current-cache checks; change retry handling to validate the resource is
still current or re-queue the retry into the ordered dispatcher instead of
calling dispatchHandler directly: in Informer.retryLoop, when pulling from
retryCh (re), either fetch the current object/revision from the informer
cache/store and compare its UID/resourceVersion (or equivalent) against re.event
metadata and only call dispatchHandler if it matches, or push re back into the
same channel/queue consumed by dispatchLoop so ordering and current-state checks
happen there (use identifiers like retryCh, dispatchCh/dispatchLoop,
dispatchHandler, and the informer cache/get method to locate where to implement
the gate).
---
Duplicate comments:
In `@components/ambient-control-plane/internal/informer/informer.go`:
- Around line 158-175: The retryLoop currently sleeps on the first received
retry (from inf.retryCh) which makes timing FIFO instead of deadline-based;
change retryLoop to, after the initial receive of re from inf.retryCh, perform
non-blocking reads of inf.retryCh to collect any queued retries, compute the
earliest fireAt among them, and use that earliest fireAt to decide the wait;
push the remaining retries into a local min-heap or reinsert them into the retry
queue (or replace inf.retryCh with a priority queue stored on Informer) so
dispatchHandler still receives retries in deadline order and senders to
inf.retryCh do not block. Ensure the logic references Informer.retryLoop,
inf.retryCh, re.fireAt and uses inf.dispatchHandler to process the selected
retry.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 380b4aaa-7657-4c80-92b4-b6ea3326de4e
📒 Files selected for processing (4)
- components/ambient-cli/cmd/acpctl/apply/cmd.go
- components/ambient-cli/demo-github.sh
- components/ambient-control-plane/internal/informer/informer.go
- components/ambient-control-plane/internal/reconciler/kube_reconciler.go
🚧 Files skipped from review as they are similar to previous changes (1)
- components/ambient-control-plane/internal/reconciler/kube_reconciler.go
```go
func (inf *Informer) retryLoop(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case re := <-inf.retryCh:
			wait := time.Until(re.fireAt)
			if wait > 0 {
				timer := time.NewTimer(wait)
				select {
				case <-timer.C:
				case <-ctx.Done():
					timer.Stop()
					return
				}
			}
			inf.dispatchHandler(ctx, re.event, re.handlerIndex, re.attempt)
		}
	}
}
```
Do not replay stale events out of order.
Line 174 runs the queued snapshot from retryLoop, outside the ordered dispatchLoop, with no check that the resource is still current. An older failed add/modify can therefore run after a newer delete/update and reapply stale state; for example, components/ambient-control-plane/internal/reconciler/kube_reconciler.go:122-175 provisions from event.Object.Session without re-fetching the session, so a deleted session can be reprovisioned on retry. Gate retries on the current cached revision/object, or funnel them back through the ordered dispatcher.
Also applies to: 204-220
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/ambient-control-plane/internal/informer/informer.go` around lines
158 - 175, The retryLoop is replaying stale events directly via
inf.dispatchHandler (function Informer.retryLoop), bypassing ordered dispatch
and current-cache checks; change retry handling to validate the resource is
still current or re-queue the retry into the ordered dispatcher instead of
calling dispatchHandler directly: in Informer.retryLoop, when pulling from
retryCh (re), either fetch the current object/revision from the informer
cache/store and compare its UID/resourceVersion (or equivalent) against re.event
metadata and only call dispatchHandler if it matches, or push re back into the
same channel/queue consumed by dispatchLoop so ordering and current-state checks
happen there (use identifiers like retryCh, dispatchCh/dispatchLoop,
dispatchHandler, and the informer cache/get method to locate where to implement
the gate).
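The staleness gate the comment describes can be sketched as a pure function. The `cached` type and map-based cache are invented stand-ins for the informer's store; only the check itself (replay a queued snapshot only if the resource still exists with the same UID and revision) follows the review.

```go
package main

import "fmt"

// cached stands in for the informer cache's view of a resource; field names
// mirror Kubernetes object metadata but the types here are invented.
type cached struct {
	uid             string
	resourceVersion string
}

// isCurrent gates a retry: replay the queued snapshot only if the resource
// still exists with the same UID and revision, so a deleted or since-updated
// session is never reprovisioned from stale state.
func isCurrent(cache map[string]cached, key, uid, rv string) bool {
	cur, ok := cache[key]
	if !ok {
		return false // deleted since the event was queued
	}
	return cur.uid == uid && cur.resourceVersion == rv
}

func main() {
	cache := map[string]cached{"ns/session-a": {uid: "u1", resourceVersion: "5"}}
	fmt.Println(isCurrent(cache, "ns/session-a", "u1", "5")) // still current: safe to replay
	fmt.Println(isCurrent(cache, "ns/session-a", "u1", "4")) // stale snapshot: drop
	fmt.Println(isCurrent(cache, "ns/session-b", "u2", "1")) // deleted: drop
}
```

The alternative the review offers, funneling the retry back through the ordered dispatch channel, achieves the same effect by letting the dispatcher's normal current-state handling apply.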
Review Queue Status
Action needed: Resolve merge conflicts (rebase on main)
…nMessages

Runner's BOT_TOKEN (OIDC JWT with preferred_username=GRPC_SERVICE_ACCOUNT) was failing the ownership check because AMBIENT_API_TOKEN is not set, making IsServiceCaller always false. Read the GRPC_SERVICE_ACCOUNT env var at startup and bypass ownership enforcement when the authenticated username matches.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…resh loop

OIDC client-credentials JWTs expire in minutes. The runner's BOT_TOKEN was injected only as a frozen env var, so reconnect-on-UNAUTHENTICATED re-read the same stale value in a tight loop.

- Mount the session-<id>-creds Secret as a file volume at /var/run/secrets/ambient/bot-token so kubelet propagates updates and the runner's file read on reconnect picks up fresh tokens
- Add UpdateSecret to KubeClient
- Add StartTokenRefreshLoop to SimpleKubeReconciler: every 10 minutes, list running sessions and update each creds secret with a fresh OIDC token
- Wire the refresh loop into main.go runKubeMode for kube reconcilers

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Force-pushed from f3fd99c to 85c7215
Summary

- The runner authenticates with an OIDC JWT carrying `preferred_username=service-account-ocm-ams-service`
- `AMBIENT_API_TOKEN` is not set on the api-server deployment → `IsServiceCaller` is always false → `PERMISSION_DENIED: not authorized to watch this session`
- Read the `GRPC_SERVICE_ACCOUNT` env var at startup into a package-level var; bypass ownership enforcement when the authenticated username matches that value

Test plan

- Run `demo-github.sh` — runner pod should connect to the gRPC stream and execute the task without PERMISSION_DENIED
- Session reaches the `Completed` phase

🤖 Generated with Claude Code
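The authorization bypass described in the summary can be sketched as follows. The env var name and the denial message come from this PR; the `authorizeWatch` function, its signature, and the surrounding handler shape are invented for illustration and do not reflect the actual gRPC handler code.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// grpcServiceAccount mirrors the PR's package-level var, read once at
// startup from GRPC_SERVICE_ACCOUNT. The handler shape around it is
// invented for illustration.
var grpcServiceAccount = os.Getenv("GRPC_SERVICE_ACCOUNT")

// authorizeWatch enforces session ownership, except when the authenticated
// username matches the configured runner service account.
func authorizeWatch(username, sessionOwner string) error {
	if grpcServiceAccount != "" && username == grpcServiceAccount {
		return nil // trusted runner service account: skip ownership check
	}
	if username != sessionOwner {
		return errors.New("PERMISSION_DENIED: not authorized to watch this session")
	}
	return nil
}

func main() {
	grpcServiceAccount = "service-account-ocm-ams-service"
	fmt.Println(authorizeWatch("service-account-ocm-ams-service", "alice") == nil) // bypass
	fmt.Println(authorizeWatch("alice", "alice") == nil)                           // owner
	fmt.Println(authorizeWatch("bob", "alice") == nil)                             // denied
}
```

Note the empty-string guard: if GRPC_SERVICE_ACCOUNT is unset, no anonymous caller can match it, so the bypass stays closed by default.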