[bug] governor: swap_llm_model ignores systemctl exit status — memory-relief model swap silently no-ops (OOM risk) #148

@greatjourney589

Description

Summary

ServiceCtl::swap_llm_model runs systemctl daemon-reload and systemctl restart <unit> via .output().await? but never checks output.status.success() — unlike every other state-changing method in service_ctl.rs. If the restart is denied (polkit policy, masked unit, bad override), the governor's mode-driven LLM model swap silently no-ops while reporting success, so the heavier model keeps running and the device can OOM.

Steps to reproduce

  1. Run genie-governor on the device with mode transitions enabled (Day / NightA / NightB / Media).
  2. Arrange for the LLM unit restart to fail — e.g. the governor process lacks polkit rights to systemctl restart, the unit is masked, or the systemd override the function just wrote is rejected on reload.
  3. Trigger a mode transition that swaps the model, e.g. Day/NightA -> NightB (the memory-relief transition to a smaller model), Media -> *, or NightB -> Day (governor.rs:204-223).
  4. Observe: swap_llm_model returns Ok(()). The governor logs only the pre-action "swapping LLM model" info line. The model never actually changes.

Expected behavior

swap_llm_model should check status.success() on both the daemon-reload and restart commands, log the captured stderr on failure, and return Err(...) — matching start, docker_start, and enable_zram in the same file. A failed model swap during a memory-pressure transition must be observable (logged / surfaced), not reported as success.
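A minimal sketch of the expected shape of the fix, not the actual patch. It assumes a synchronous `std::process::Command` so the example is self-contained, whereas the real method uses the async `.output().await?` form. The helper names `ensure_success`, `run_checked`, and `reload_and_restart` are hypothetical, as is the choice of `io::Error` as the error type.

```rust
use std::io;
use std::process::{Command, Output};

/// Turn a non-zero exit status into an error that carries the captured stderr.
fn ensure_success(what: &str, output: &Output) -> io::Result<()> {
    if output.status.success() {
        return Ok(());
    }
    let stderr = String::from_utf8_lossy(&output.stderr);
    Err(io::Error::new(
        io::ErrorKind::Other,
        format!("{what} failed ({}): {}", output.status, stderr.trim()),
    ))
}

/// Hypothetical helper: run a command and fail on spawn errors and on
/// non-zero exits alike.
fn run_checked(program: &str, args: &[&str]) -> io::Result<()> {
    let output = Command::new(program).args(args).output()?;
    ensure_success(&format!("{program} {}", args.join(" ")), &output)
}

/// Sketch of the reload + restart tail of swap_llm_model: either step
/// failing aborts the swap and surfaces the reason to the caller.
fn reload_and_restart(unit: &str) -> io::Result<()> {
    run_checked("systemctl", &["daemon-reload"])?;
    run_checked("systemctl", &["restart", unit])?;
    Ok(())
}
```

With this shape, a denied restart would reach the caller as an `Err` whose message includes systemctl's stderr, instead of an unconditional `Ok(())`.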

Actual behavior

// crates/genie-governor/src/service_ctl.rs:109-120
// Reload systemd and restart the LLM service.
Command::new("systemctl")
    .args(["daemon-reload"])
    .output()
    .await?;          // <- exit status discarded

Command::new("systemctl")
    .args(["restart", &unit])
    .output()
    .await?;          // <- exit status discarded

Ok(())                // <- always reports success

.output().await? propagates only I/O errors, such as a failure to spawn the process. A non-zero systemctl exit (permission denied, masked unit, reload error) still yields Ok(output), so it is swallowed. Compare with the sibling methods, which all branch on output.status.success():

  • start (service_ctl.rs:19-23) — checks status, logs stderr, bails.
  • docker_start (service_ctl.rs:81-85) — checks status, logs stderr, bails.
  • enable_zram (service_ctl.rs:136-139) — checks status, logs stderr.
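The difference between a spawn failure and a non-zero exit can be seen in a small standalone probe. This is an illustrative sketch only: it uses the synchronous `std::process::Command` and a hypothetical `probe` helper running `sh -c`, not the governor's own code.

```rust
use std::process::Command;

/// Returns (spawned_ok, exited_zero) for a shell script.
/// `output()` yields Ok(..) whenever the process could be run,
/// regardless of the exit code it finished with.
fn probe(script: &str) -> (bool, bool) {
    match Command::new("sh").args(["-c", script]).output() {
        Ok(out) => (true, out.status.success()),
        Err(_) => (false, false),
    }
}
```

Here `probe("exit 1")` takes the `Ok` arm even though the command failed, which is why a `?` on `.output()` alone cannot detect a rejected `systemctl restart`.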

swap_llm_model is the only state-changing method that ignores it. The impact is worst on the -> NightB memory-relief swap: if it silently fails, the larger daytime model stays resident on the 8 GB Orin overnight, defeating the governor's purpose and risking OOM. (Callers in governor.rs:207/215/222 additionally discard the result with let _ = ...await, so even after this fix the return value should be logged at the call site — but the function must first be capable of reporting failure.)
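For the call-site half, one possible shape is sketched below. The helper `report_swap` and its log format are hypothetical, and `eprintln!` stands in for whatever logging macro the governor actually uses.

```rust
use std::fmt::Display;

/// Hypothetical call-site helper replacing `let _ = ...await`:
/// logs a failed swap and tells the caller whether it succeeded.
fn report_swap<E: Display>(transition: &str, result: Result<(), E>) -> bool {
    match result {
        Ok(()) => true,
        Err(err) => {
            // Stand-in for the project's real error-level logging.
            eprintln!("ERROR LLM model swap failed transition={transition} error={err}");
            false
        }
    }
}
```

Returning a `bool` is only one option. The caller could equally retry, or stay in the previous mode, once the failure is visible.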

Hardware

Jetson Orin Nano Super 8 GB

JetPack / L4T version

No response

GenieClaw version / commit

main (crates/genie-governor/src/service_ctl.rs:91-121)

Relevant logs

# All you ever see is the pre-action line; no error even when the restart was denied:
INFO genie_governor::service_ctl: swapping LLM model unit=genie-ai-runtime.service model=/opt/geniepod/models/<nightb-model>
# (no failure log, no error returned — model never actually swapped)

Additional context
