runtime: stabilize Jetson startup context #106
ai-hpc
left a comment
Closes #107 with option A, plus three structurally-right improvements that make the whole stack more robust at the new tighter context. Going in.
1. Context drop 8192 → 4096 with --int8-kv (deploy/systemd/genie-ai-runtime.service). Honest comment update on the unit explains why we're stepping back ("Keep the default at 4k because 8k can still fail to start on memory-fragmented full-stack restarts") rather than just silently changing the number. README narrative updated to match — the prior "Even 8192 context can already be tight" wording explicitly becomes "GenieClaw defaults to a 4096-token runtime context on this class of device because larger contexts can be too tight across full-stack restarts". Doc and config now agree.
2. start_all.sh waits for LLM /health before starting memory-heavy services — new wait_for_http_health helper polls curl -fsS --max-time 2 "$url" up to 180 times (3 min budget) before letting whisper / core / Home Assistant start. Right shape: the existing Before= systemd ordering from PR #76 makes systemd start them in order, but doesn't wait for the LLM unit to actually be ready to serve — it just waits for the unit to be active, which fires the moment ExecStart returns (and the LLM is still loading its model into iGPU memory at that point). The health-gate closes that gap, so the rest of the stack starts only after the runtime is actually answering. [services.llm].url is read from config with a fallback to http://127.0.0.1:8080/health if absent.
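The real helper is a shell function built around curl, so the following is only a model of its retry loop, written in Rust to match the crate code discussed below. The name wait_for_health, the injected probe closure, and the delay parameter are hypothetical; the review states only the attempt cap (180) and the overall 3-minute budget.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Poll `probe` until it reports healthy or `max_attempts` is exhausted.
/// Returns Ok(attempt number that succeeded) or Err(total attempts made).
/// `probe` stands in for `curl -fsS --max-time 2 "$url"` succeeding.
fn wait_for_health<F: FnMut() -> bool>(
    mut probe: F,
    max_attempts: u32,
    delay: Duration,
) -> Result<u32, u32> {
    for attempt in 1..=max_attempts {
        if probe() {
            return Ok(attempt);
        }
        // Do not sleep after the final failed attempt.
        if attempt < max_attempts {
            sleep(delay);
        }
    }
    Err(max_attempts)
}
```

With 180 attempts and a one-second delay (the delay is an assumption inferred from the 3-minute budget), dependent services start only once the probe has actually succeeded, not merely once the unit is reported active.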
3. Compaction tuned for the 4k runtime (crates/genie-core/src/llm/openai_compat.rs). Three meaningful changes:
- GENIE_RUNTIME_MAX_BODY_BYTES lowered 24KB → 4KB (and overhead 768 → 512). The 24KB threshold from PR #74 (later expanded by PR #87) was sized for the 8192-token runtime; at 4096 tokens it would overrun.
- Compaction shape changed from "pass system messages through verbatim + retain N older user/assistant pairs" to structured rebuild: compact_genie_runtime_system emits a minimal system prompt with a generated tool-list (compact_genie_runtime_tool_lines), a generated rules block (compact_genie_runtime_rules), and a household-context tail (compact_household_context, capped at 900 bytes via truncate_utf8). The tool list is filtered to only the tools actually referenced in the source system prompt, so a chat-only deploy doesn't carry home_control text it can't act on. The rules block similarly adapts to what's available ("Home control is unavailable; say Home Assistant is not connected if asked" when HA is off).
- The hardcoded GENIE_RUNTIME_COMPACT_SYSTEM blurb gets replaced by a richer prefix that explicitly tells the LLM the tool-call JSON contract ({"tool":"tool_name","arguments":{}}, no markdown). That eliminates a class of "model emits prose mentioning a tool name instead of a structured call" failures.
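A minimal sketch of the filtering idea described above, not the code from the PR. The function names compact_tool_lines and compact_rules, the KNOWN_TOOLS table, and the tool descriptions are all hypothetical, and detecting "HA is off" by the absence of home_control in the source prompt is an assumption made only for illustration.

```rust
/// Hypothetical catalogue of tool names and one-line descriptions.
const KNOWN_TOOLS: &[(&str, &str)] = &[
    ("memory_recall", "recall stored facts about the user"),
    ("home_control", "control Home Assistant devices"),
    ("web_search", "search the web"),
];

/// Keep only the tools whose names appear in the original system prompt.
fn compact_tool_lines(source_system_prompt: &str) -> Vec<String> {
    KNOWN_TOOLS
        .iter()
        .filter(|(name, _)| source_system_prompt.contains(*name))
        .map(|(name, desc)| format!("- {name}: {desc}"))
        .collect()
}

/// Build a rules block that adapts to which tools are available.
fn compact_rules(source_system_prompt: &str) -> Vec<&'static str> {
    let mut rules =
        vec![r#"Reply with tool calls as bare JSON: {"tool":"tool_name","arguments":{}}"#];
    // Assumption: a prompt that never mentions home_control means HA is off.
    if !source_system_prompt.contains("home_control") {
        rules.push("Home control is unavailable; say Home Assistant is not connected if asked");
    }
    rules
}
```

Under these assumptions, a chat-only prompt yields no home_control tool line and gains the "not connected" rule, which is the behavior the review describes.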
truncate_utf8 is a careful helper — drops to the previous char boundary if max_bytes lands mid-codepoint, then trim_ends. CJK / accented-char household contexts won't get sliced into invalid UTF-8.
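The behavior described for truncate_utf8 can be sketched with the standard library's str::is_char_boundary. This is a reconstruction from the description, not the PR's code; the signature, and trimming even when no truncation occurs, are assumptions.

```rust
/// Return a prefix of `s` that is at most `max_bytes` long, never splitting
/// a UTF-8 code point, with trailing whitespace removed.
fn truncate_utf8(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s.trim_end();
    }
    let mut end = max_bytes;
    // Step back to the previous char boundary; index 0 is always a boundary,
    // so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    s[..end].trim_end()
}
```

For example, truncating "日本語" (three bytes per character) to 4 bytes backs off to byte 3 and yields "日"; it neither panics nor produces invalid UTF-8.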
Test updates correctly reflect the new shape — genie_runtime_profile_compacts_runtime_prompt_under_4k_budget (renamed from _under_expanded_budget) asserts prepared.compacted == true instead of false, asserts the new "GeniePod Home" + "memory_recall" + "What is my name?" survive while "tool manifest tool manifest" (the noise the compaction now drops) doesn't. The previous test pinned the old "do not compact under 24KB" behavior; the new test pins the new "compact under 4KB" behavior. Right reversal.
4. Identity-recall fast-path in crates/genie-core/src/tools/quick.rs. New memory_recall_query recognizes "what is my name", "do you remember my name", "who am i", "what do you remember about X", "search memory for X", and similar patterns, then dispatches to memory_recall without going through the LLM at all. Per the PR body, returns "Your name is Jared" in 21 ms on Jetson — bypasses an entire LLM round-trip for the most common identity questions. Two new unit tests pin both the name-form ("what is my name" → memory_recall { query: "name" }) and the search-form ("search memory for Jared" → memory_recall { query: "jared" }). The old does_not_route_memory_search_to_web test is correctly deleted since the new behavior actively does route it (just to memory_recall, not web_search).
Knock-on: this fix at the routing layer is also a robust answer to issue #85's "the LLM hallucinated 'Your name is GeniePod'" failure. Even on a runtime that doesn't see the tool manifest correctly, the identity question never reaches the LLM in the first place.
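A rough sketch of how such a fast-path matcher could look, assuming simple lowercase normalization and fixed phrase tables. Only the two mappings pinned by the unit tests ("what is my name" → "name", "search memory for Jared" → "jared") come from the review; the signature, the handling of the other phrases, and the punctuation stripping are assumptions.

```rust
/// Return the memory_recall query for an identity or memory question,
/// or None when the input should go through the normal LLM path.
fn memory_recall_query(input: &str) -> Option<String> {
    // Normalize: trim, drop trailing punctuation, lowercase.
    let text = input
        .trim()
        .trim_end_matches(|c: char| c == '?' || c == '.' || c == '!')
        .trim()
        .to_lowercase();

    // Assumption: every name-style question maps to the query "name".
    const NAME_FORMS: &[&str] = &["what is my name", "do you remember my name", "who am i"];
    if NAME_FORMS.contains(&text.as_str()) {
        return Some("name".to_string());
    }

    // Search-style questions: the remainder after the prefix is the query.
    const SEARCH_PREFIXES: &[&str] = &["what do you remember about ", "search memory for "];
    for prefix in SEARCH_PREFIXES {
        if let Some(rest) = text.strip_prefix(*prefix) {
            let query = rest.trim();
            if !query.is_empty() {
                return Some(query.to_string());
            }
        }
    }
    None
}
```

Because the match happens before any prompt is built, a hit can be dispatched straight to memory_recall, which is why the identity question never depends on the LLM seeing the tool manifest.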
tool_dispatch_test.rs extended to pin the new start_all.sh health-wait shape (read_llm_url, wait_for_http_health, Configured LLM health) and the new context default (GENIEPOD_AI_RUNTIME_CONTEXT=4096). Both invariants are now CI-enforced.
End-to-end on Jetson per the PR body: runtime started with Context: 4096 tokens, KV cache: 288 MB, FREE: 1104 MB (vs the 1616 MB the 8k cold-boot measurement promised — but that benchmark was misleading for the steady-state case anyway, exactly the point of #107). Chat returns normally with streaming and done. Identity recall fires memory_recall in 21 ms.
All 8 CI checks green on 18bad34 (fmt, clippy, test, aarch64 cross-compile, --no-default-features, shellcheck, ruff, PR body checklist). Going in.
Merged.
Fixes #107
Summary
- start_all.sh: wait for the configured LLM health endpoint before starting memory-heavy services
- memory_recall

Testing
- cargo fmt --check
- cargo test -p genie-core tools::quick::tests
- cargo test -p genie-core llm::openai_compat::tests
- cargo test -p genie-core --test tool_dispatch_test
- -c 4096 --int8-kv, Context: 4096 tokens, KV cache: 288 MB, FREE: 1104 MB
- done
- "What is my name?" returned "Your name is Jared" with tool: memory_recall in 21 ms

Real Behavior Proof