Commit 896aaa1
fix(runner,operator): improve session backup and resume reliability (#1057)
## Summary
Three issues caused data loss and broken session resume when the runner
container OOMed:
- **Repo data lost on OOM**: `backup_git_repos()` only ran on SIGTERM.
A runner OOM never sends SIGTERM to state-sync, so no repo backup ran.
It now runs periodically, roughly every 5 min.
- **Session resume broken**: CLI session ID was only persisted after
turn completion. OOM during first turn → no session ID in S3 →
`--resume` never passed on restart → agent lost all context. Now
persists immediately on CLI init.
- **Operator didn't detect OOM**: Pod watch predicate only checked pod
phase, which stays `Running` when one container dies (sidecar still
alive). OOMKilled pods sat indefinitely. Now triggers on container
termination too.
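The session-resume fix can be illustrated with a minimal Python sketch. It is not the code from `session.py` or `bridge.py`: the shape of the init message, the `thread_id` key, and the helper names `make_session_id_persister`, `handle_cli_message`, and `resume_args` are assumptions made for illustration. Only `on_session_id`, `--resume`, and the `claude_session_ids.json` file name come from this commit.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def make_session_id_persister(state_file: Path, thread_id: str) -> Callable[[str], None]:
    """Return a callback that writes the CLI session ID to disk as soon as it is known."""

    def on_session_id(session_id: str) -> None:
        ids: Dict[str, str] = {}
        if state_file.exists():
            ids = json.loads(state_file.read_text())
        ids[thread_id] = session_id
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a truncated JSON file behind.
        fd, tmp = tempfile.mkstemp(dir=str(state_file.parent), suffix=".tmp")
        with os.fdopen(fd, "w") as fh:
            json.dump(ids, fh)
        os.replace(tmp, state_file)

    return on_session_id


def handle_cli_message(message: dict, on_session_id: Callable[[str], None]) -> None:
    """Persist on the init message instead of waiting for turn completion."""
    # Assumed message shape; the real CLI message format may differ.
    if message.get("type") == "system" and message.get("subtype") == "init":
        session_id = message.get("session_id")
        if session_id:
            on_session_id(session_id)


def resume_args(state_file: Path, thread_id: str) -> List[str]:
    """Build the resume flag only when a persisted session ID exists."""
    if not state_file.exists():
        return []
    session_id = json.loads(state_file.read_text()).get(thread_id)
    return ["--resume", session_id] if session_id else []
```

Because the ID is written when the init message arrives, an OOM later in the first turn still leaves a file that state-sync can upload, and the restarted runner can pass `--resume`.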
## Changes
| File | Change |
|------|--------|
| `state-sync/sync.sh` | Periodic `backup_git_repos` every Nth sync cycle (configurable `REPO_BACKUP_INTERVAL`, default 5) |
| `session.py` | `on_session_id` callback persists session ID to disk immediately on CLI init message |
| `bridge.py` | Resume failure detection + removed redundant post-turn persist |
| `agenticsession_controller.go` | Pod watch predicate triggers on container termination, not just pod phase |
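The "every Nth sync cycle" cadence in the `state-sync/sync.sh` row can be sketched as follows. The real change is in a shell script; this Python version only illustrates the counting logic. The names `run_sync_cycles`, `sync_once`, and `interval_from_env` are hypothetical, and the fallback to the default on invalid input is an assumption, not confirmed behavior of `sync.sh`.

```python
import os
from typing import Callable


def interval_from_env(default: int = 5) -> int:
    """Read REPO_BACKUP_INTERVAL, falling back to the default on bad input."""
    raw = os.environ.get("REPO_BACKUP_INTERVAL", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def run_sync_cycles(
    cycles: int,
    sync_once: Callable[[], None],
    backup_git_repos: Callable[[], None],
    repo_backup_interval: int = 5,
) -> None:
    """Run a fixed number of sync cycles, backing up repos on every Nth one."""
    for cycle in range(1, cycles + 1):
        sync_once()
        # The backup no longer depends on receiving SIGTERM:
        # it fires whenever the cycle counter hits a multiple of N.
        if cycle % repo_backup_interval == 0:
            backup_git_repos()
```

With the default interval of 5, twelve cycles would trigger the backup twice, on cycles 5 and 10.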
## Root cause investigation
Diagnosed from live OOMKilled pods in `bug-bash-mturley` namespace on
UAT:
- Confirmed `ambient-code-runner` container OOMKilled (exit 137, 8Gi
limit)
- State-sync kept running but never backed up repos (only on SIGTERM)
- Session JSONL was valid (237 lines, all valid JSON) and `--resume`
works fine on sessions with dangling tool calls (verified via SDK)
- The actual cause: `claude_session_ids.json` never existed in S3
because the runner OOMed during its first turn, before
`_persist_session_ids()` was called
- Operator pod watch predicate filtered out the event because pod phase
stayed `Running`
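The last finding, that a phase-only predicate misses a single container dying, can be sketched in Python over plain dictionaries shaped like Kubernetes Pod objects. The actual fix is Go code in `agenticsession_controller.go`; the function names `terminated_containers` and `should_reconcile` are hypothetical, and the rule "fire when a container newly enters the terminated state" is an assumed reading of "triggers on container termination".

```python
from typing import Any, Dict, Set

Pod = Dict[str, Any]


def terminated_containers(pod: Pod) -> Set[str]:
    """Names of containers whose current state is `terminated`."""
    names: Set[str] = set()
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for cs in statuses:
        if (cs.get("state") or {}).get("terminated") is not None:
            names.add(cs["name"])
    return names


def should_reconcile(old: Pod, new: Pod) -> bool:
    """Update predicate: react to phase changes and to new container terminations."""
    old_phase = old.get("status", {}).get("phase")
    new_phase = new.get("status", {}).get("phase")
    if old_phase != new_phase:
        return True
    # Phase can stay Running while one container is OOMKilled and the
    # sidecar is still alive, so compare container states as well.
    return bool(terminated_containers(new) - terminated_containers(old))
```

For an update where `ambient-code-runner` moves from `running` to `terminated` (reason `OOMKilled`, exit code 137) while the sidecar keeps running, the phase comparison alone would report no change, but the container-state comparison returns `True`.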
## Test plan
- [x] Runner tests pass (488 passed, 11 skipped)
- [x] Operator builds clean (`go vet`, `go build`)
- [ ] Deploy to kind cluster and verify periodic repo backup runs
- [ ] Simulate runner OOM and verify operator detects container
termination
- [ ] Verify session resume works after OOM recovery
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 7158005 commit 896aaa1
File tree
4 files changed, +66 -8 lines changed
- components
  - operator/internal/controller
  - runners
    - ambient-runner/ambient_runner/bridges/claude
    - state-sync
Lines changed: 20 additions & 2 deletions
Lines changed: 15 additions & 4 deletions
Lines changed: 22 additions & 2 deletions