Summary
ServiceCtl::swap_llm_model runs systemctl daemon-reload and systemctl restart <unit> via .output().await? but never checks output.status.success() — unlike every other state-changing method in service_ctl.rs. If the restart is denied (polkit policy, masked unit, bad override), the governor's mode-driven LLM model swap silently no-ops while reporting success, so the heavier model keeps running and the device can OOM.
Steps to reproduce
- Run genie-governor on the device with mode transitions enabled (Day / NightA / NightB / Media).
- Arrange for the LLM unit restart to fail, e.g. the governor process lacks polkit rights to systemctl restart, the unit is masked, or the systemd override the function just wrote is rejected on reload.
- Trigger a mode transition that swaps the model, e.g. Day/NightA -> NightB (the memory-relief transition to a smaller model), Media -> *, or NightB -> Day (governor.rs:204-223).
- Observe: swap_llm_model returns Ok(()). The governor logs only the pre-action "swapping LLM model" info line. The model never actually changes.
Expected behavior
swap_llm_model should check status.success() on both the daemon-reload and restart commands, log the captured stderr on failure, and return Err(...) — matching start, docker_start, and enable_zram in the same file. A failed model swap during a memory-pressure transition must be observable (logged / surfaced), not reported as success.
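A minimal sketch of the checked pattern described above, under stated assumptions: it uses the synchronous std::process::Command so it compiles without an async runtime (the real function awaits its commands), it returns io::Error rather than whatever error type service_ctl.rs uses, and the helper names check_status, run_checked, and reload_and_restart are hypothetical.

```rust
use std::io;
use std::process::{Command, Output};

/// Hypothetical helper: Ok(()) when the command exited with status 0,
/// otherwise an error carrying the exit status and the captured stderr.
fn check_status(what: &str, output: &Output) -> io::Result<()> {
    if output.status.success() {
        return Ok(());
    }
    let stderr = String::from_utf8_lossy(&output.stderr);
    Err(io::Error::new(
        io::ErrorKind::Other,
        format!("{} failed ({}): {}", what, output.status, stderr.trim()),
    ))
}

/// Hypothetical helper: run a command, capture its output, and fail on a
/// non-zero exit instead of discarding the status.
fn run_checked(program: &str, args: &[&str]) -> io::Result<()> {
    // `?` here only covers spawn failures; the exit status is checked below.
    let output = Command::new(program).args(args).output()?;
    check_status(&format!("{} {}", program, args.join(" ")), &output)
}

/// Hypothetical shape of the tail of swap_llm_model: both commands are
/// checked, and the first failure is returned to the caller.
#[allow(dead_code)]
fn reload_and_restart(unit: &str) -> io::Result<()> {
    run_checked("systemctl", &["daemon-reload"])?;
    run_checked("systemctl", &["restart", unit])
}
```

With this shape, a denied restart surfaces as an Err whose message includes systemctl's stderr, which the caller can then log.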
Actual behavior
// crates/genie-governor/src/service_ctl.rs:109-120
// Reload systemd and restart the LLM service.
Command::new("systemctl")
    .args(["daemon-reload"])
    .output()
    .await?; // <- exit status discarded
Command::new("systemctl")
    .args(["restart", &unit])
    .output()
    .await?; // <- exit status discarded
Ok(()) // <- always reports success
.output().await? only propagates an error if the process fails to spawn. A non-zero systemctl exit (permission denied, masked unit, reload error) is swallowed. Compare with the sibling methods, which all branch on output.status.success():
- start (service_ctl.rs:19-23): checks status, logs stderr, bails.
- docker_start (service_ctl.rs:81-85): checks status, logs stderr, bails.
- enable_zram (service_ctl.rs:136-139): checks status, logs stderr.
swap_llm_model is the only state-changing method that ignores it. The impact is worst on the -> NightB memory-relief swap: if it silently fails, the larger daytime model stays resident on the 8 GB Orin overnight, defeating the governor's purpose and risking OOM. (Callers in governor.rs:207/215/222 additionally discard the result with let _ = ...await, so even after this fix the return value should be logged at the call site — but the function must first be capable of reporting failure.)
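The distinction between a spawn failure and a non-zero exit can be shown in isolation. The sketch below is illustrative only: it uses the synchronous std::process::Command as a stand-in for the async command in service_ctl.rs, it assumes sh is on PATH, and the probe function is hypothetical.

```rust
use std::process::Command;

/// Hypothetical probe: returns (spawned, exited_ok).
/// `output()` yields Err only when the process could not be started;
/// a process that ran and exited non-zero still yields Ok(Output).
fn probe(program: &str, args: &[&str]) -> (bool, bool) {
    match Command::new(program).args(args).output() {
        // The process ran; whether it succeeded is a separate question.
        Ok(out) => (true, out.status.success()),
        // The process never started (e.g. binary not found).
        Err(_) => (false, false),
    }
}
```

A command that runs and exits with status 1 comes back as (true, false): the `?` operator alone would treat that case as success.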
Hardware
Jetson Orin Nano Super 8 GB
JetPack / L4T version
No response
GenieClaw version / commit
main — crates/genie-governor/src/service_ctl.rs:91-121
Relevant logs
# All you ever see is the pre-action line; no error even when the restart was denied:
INFO genie_governor::service_ctl: swapping LLM model unit=genie-ai-runtime.service model=/opt/geniepod/models/<nightb-model>
# (no failure log, no error returned — model never actually swapped)
Additional context
- Searched existing issues for governor / swap_llm_model / daemon-reload / model swap / service_ctl: the matches are config-resolution / default-model / deploy-pipeline topics (runtime: resolve llm service from config #40, Default to genie-ai-runtime v1.0.0 (replaces stale #33) #52, Build + install pipeline for genie-ai-runtime v1.0.0 (ship genie-ai-runtime.service alongside genie-llm.service) #54, GENIEPOD_AI_RUNTIME_CONTEXT=8192 destabilizes the stack under steady-state load (chat UI lag + swap engagement) #107, …); none describe the dropped exit status. Related but distinct from #107 (context size destabilizes the stack).
- Suggested fix: capture output, branch on status.success(), log stderr, and bail! on failure for both commands, then log the result at the governor.rs call sites instead of let _ =.
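For the call-site half of the fix, one possible shape is sketched below. It is an assumption, not the code in governor.rs: report_swap is a hypothetical helper, and eprintln! stands in for whatever logging macro the governor actually uses.

```rust
use std::fmt::Display;

/// Hypothetical helper: report the outcome of a model swap instead of
/// discarding it with `let _ =`. Returns true when the swap succeeded.
fn report_swap<E: Display>(transition: &str, result: Result<(), E>) -> bool {
    match result {
        Ok(()) => true,
        Err(e) => {
            // Stand-in for the governor's real logging call.
            eprintln!(
                "LLM model swap failed transition={} error={}",
                transition, e
            );
            false
        }
    }
}
```

The boolean return lets the caller decide whether to retry or stay in the previous mode; that policy is outside the scope of this issue.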