runtime: stabilize Jetson startup context by bittoby · Pull Request #106 · GeniePod/genie-claw

bittoby · 2026-05-18T20:12:07Z

Fixes #107

Summary

lower the default genie-ai-runtime context to 4096 with INT8 KV as the default service configuration
make start_all.sh wait for the configured LLM health endpoint before starting memory-heavy services
compact Genie AI Runtime requests to a small Jetson-friendly prompt while preserving tool rules and household context
route identity/memory recall questions directly to memory_recall

Testing

cargo fmt --check
cargo test -p genie-core tools::quick::tests
cargo test -p genie-core llm::openai_compat::tests
cargo test -p genie-core --test tool_dispatch_test
Jetson deploy + hard restart: runtime started with -c 4096 --int8-kv, Context: 4096 tokens, KV cache: 288 MB, FREE: 1104 MB
Jetson port 3000 chat: normal chat returned, streaming completed with token events and done
Jetson memory recall: What is my name? returned Your name is Jared with tool: memory_recall in 21 ms

Real Behavior Proof

I have built and run the affected code locally.
I have verified the change end-to-end on Jetson hardware.

ai-hpc

Closes #107 with option A, plus three structurally-right improvements that make the whole stack more robust at the new tighter context. Going in.

1. Context drop 8192 → 4096 with --int8-kv (deploy/systemd/genie-ai-runtime.service). Honest comment update on the unit explains why we're stepping back ("Keep the default at 4k because 8k can still fail to start on memory-fragmented full-stack restarts") rather than just silently changing the number. README narrative updated to match — the prior "Even 8192 context can already be tight" wording explicitly becomes "GenieClaw defaults to a 4096-token runtime context on this class of device because larger contexts can be too tight across full-stack restarts". Doc and config now agree.

2. start_all.sh waits for LLM /health before starting memory-heavy services. The new wait_for_http_health helper polls curl -fsS --max-time 2 "$url" up to 180 times (3 min budget) before letting whisper / core / Home Assistant start. Right shape: the existing Before= systemd ordering from PR #76 makes systemd start them in order, but doesn't wait for the LLM unit to actually be ready to serve. It only waits for the unit to become active, which for a service without readiness notification happens as soon as the ExecStart process is launched (and the LLM is still loading its model into iGPU memory at that point). The health-gate closes that gap, so the rest of the stack starts only after the runtime is actually answering. [services.llm].url is read from config with a fallback to http://127.0.0.1:8080/health if absent.
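The health gate itself is a shell function in start_all.sh; the Rust sketch below only illustrates the polling shape described above (bounded attempts, fixed interval, stop on first success). The function name wait_for_health and the closure-based probe are assumptions for illustration, not code from this PR; in the real script the probe is the curl call.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Polls `probe` until it reports healthy or `max_attempts` is exhausted.
/// Returns the 1-based attempt number on which the probe first succeeded,
/// or `None` if every attempt failed.
///
/// Hypothetical helper: `probe` stands in for the script's
/// `curl -fsS --max-time 2 "$url"` check.
fn wait_for_health<F>(mut probe: F, max_attempts: u32, interval: Duration) -> Option<u32>
where
    F: FnMut() -> bool,
{
    for attempt in 1..=max_attempts {
        if probe() {
            return Some(attempt);
        }
        // No point sleeping after the final failed attempt.
        if attempt < max_attempts {
            sleep(interval);
        }
    }
    None
}
```

With max_attempts = 180 and a one-second interval this gives roughly the 3-minute budget mentioned above; the caller decides whether a None result aborts the stack start or only logs a warning.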

3. Compaction tuned for the 4k runtime (crates/genie-core/src/llm/openai_compat.rs). Three meaningful changes:

GENIE_RUNTIME_MAX_BODY_BYTES lowered 24KB → 4KB (and overhead 768 → 512). The 24KB threshold from PR #74 (later expanded by PR #87) was sized for the 8192-token runtime; at 4096 tokens it would overrun.
Compaction shape changed from "pass system messages through verbatim + retain N older user/assistant pairs" to a structured rebuild: compact_genie_runtime_system emits a minimal system prompt with a generated tool-list (compact_genie_runtime_tool_lines), a generated rules block (compact_genie_runtime_rules), and a household-context tail (compact_household_context, capped at 900 bytes via truncate_utf8). The tool list is filtered to only the tools actually referenced in the source system prompt, so a chat-only deploy doesn't carry home_control text it can't act on. The rules block similarly adapts to what's available: when HA is off, it emits a rule along the lines of "Home control is unavailable; say Home Assistant is not connected if asked".
The hardcoded GENIE_RUNTIME_COMPACT_SYSTEM blurb gets replaced by a richer prefix that explicitly tells the LLM the tool-call JSON contract ({"tool":"tool_name","arguments":{}}, no markdown). That eliminates a class of "model emits prose mentioning a tool name instead of a structured call" failures.
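The tool-list filtering described above can be sketched as follows. This is a minimal illustration, not the PR's compact_genie_runtime_tool_lines: the KNOWN_TOOLS catalog, its descriptions, and the function name compact_tool_lines are all hypothetical, and the real code may detect tool references differently.

```rust
/// Hypothetical tool catalog: (tool name, one-line description).
/// The descriptions are placeholders, not the project's actual wording.
const KNOWN_TOOLS: &[(&str, &str)] = &[
    ("memory_recall", "look up stored facts about the household"),
    ("home_control", "control Home Assistant devices"),
    ("web_search", "search the web"),
];

/// Emits one compact line per tool, keeping only tools whose name
/// appears somewhere in the original (uncompacted) system prompt.
fn compact_tool_lines(source_system_prompt: &str) -> Vec<String> {
    KNOWN_TOOLS
        .iter()
        .filter(|(name, _)| source_system_prompt.contains(*name))
        .map(|(name, desc)| format!("- {name}: {desc}"))
        .collect()
}
```

A chat-only deploy whose source prompt never mentions home_control would get no home_control line, which is the behavior the review calls out.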

truncate_utf8 is a careful helper — drops to the previous char boundary if max_bytes lands mid-codepoint, then trim_ends. CJK / accented-char household contexts won't get sliced into invalid UTF-8.
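A helper with the behavior described here might look like the sketch below. It follows the description (back off to the previous char boundary, then trim trailing whitespace) but is written from that description alone, so the PR's actual signature and edge-case handling may differ.

```rust
/// Returns the longest prefix of `text` that fits in `max_bytes` bytes
/// without splitting a UTF-8 code point, with trailing whitespace removed.
/// Sketch only: reconstructed from the review text, not copied from the PR.
fn truncate_utf8(text: &str, max_bytes: usize) -> &str {
    if text.len() <= max_bytes {
        return text.trim_end();
    }
    let mut end = max_bytes;
    // Index 0 is always a char boundary, so this loop terminates.
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    text[..end].trim_end()
}
```

Slicing a &str at a non-boundary index panics in Rust, so the boundary walk is what keeps a 900-byte cap safe for multi-byte household names.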

Test updates correctly reflect the new shape — genie_runtime_profile_compacts_runtime_prompt_under_4k_budget (renamed from _under_expanded_budget) asserts prepared.compacted == true instead of false, asserts the new "GeniePod Home" + "memory_recall" + "What is my name?" survive while "tool manifest tool manifest" (the noise the compaction now drops) doesn't. The previous test pinned the old "do not compact under 24KB" behavior; the new test pins the new "compact under 4KB" behavior. Right reversal.

4. Identity-recall fast-path in crates/genie-core/src/tools/quick.rs. New memory_recall_query recognizes "what is my name", "do you remember my name", "who am i", "what do you remember about X", "search memory for X", and similar patterns, then dispatches to memory_recall without going through the LLM at all. Per the PR body, returns "Your name is Jared" in 21 ms on Jetson — bypasses an entire LLM round-trip for the most common identity questions. Two new unit tests pin both the name-form ("what is my name" → memory_recall { query: "name" }) and the search-form ("search memory for Jared" → memory_recall { query: "jared" }). The old does_not_route_memory_search_to_web test is correctly deleted since the new behavior actively does route it (just to memory_recall, not web_search).
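As a rough illustration of this fast path, the sketch below matches a few of the quoted phrasings and returns the query string to hand to memory_recall. The pattern lists and normalization are assumptions chosen to reproduce the two pinned cases ("what is my name" gives "name", "search memory for Jared" gives "jared"); the PR's memory_recall_query likely covers more forms and may return a richer type.

```rust
/// Returns the memory_recall query for identity/memory questions,
/// or `None` if the input should go through the normal LLM path.
/// Sketch only: the phrase lists below are illustrative, not exhaustive.
fn memory_recall_query(input: &str) -> Option<String> {
    // Normalize: lowercase and strip trailing sentence punctuation.
    let lowered = input.trim().to_lowercase();
    let text = lowered
        .trim_end_matches(|c: char| matches!(c, '?' | '.' | '!'))
        .trim();

    // Name-form questions all map to the fixed query "name".
    const NAME_FORMS: &[&str] = &["what is my name", "do you remember my name", "who am i"];
    if NAME_FORMS.iter().any(|form| *form == text) {
        return Some("name".to_string());
    }

    // Search-form questions carry their own query after a known prefix.
    const SEARCH_PREFIXES: &[&str] = &["what do you remember about ", "search memory for "];
    for prefix in SEARCH_PREFIXES {
        if let Some(rest) = text.strip_prefix(*prefix) {
            let rest = rest.trim();
            if !rest.is_empty() {
                return Some(rest.to_string());
            }
        }
    }
    None
}
```

Because matching happens before any model call, a hit costs only string comparisons plus the memory lookup, which is consistent with the 21 ms figure reported from the Jetson.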

Knock-on: this fix at the routing layer is also a robust answer to issue #85's "the LLM hallucinated 'Your name is GeniePod'" failure. Even on a runtime that doesn't see the tool manifest correctly, the identity question never reaches the LLM in the first place.

tool_dispatch_test.rs extended to pin the new start_all.sh health-wait shape (read_llm_url, wait_for_http_health, Configured LLM health) and the new context default (GENIEPOD_AI_RUNTIME_CONTEXT=4096). Both invariants are now CI-enforced.

End-to-end on Jetson per the PR body: runtime started with Context: 4096 tokens, KV cache: 288 MB, FREE: 1104 MB (vs the 1616 MB the 8k cold-boot measurement promised — but that benchmark was misleading for the steady-state case anyway, exactly the point of #107). Chat returns normally with streaming and done. Identity recall fires memory_recall in 21 ms.

All 8 CI checks green on 18bad34 (fmt, clippy, test, aarch64 cross-compile, --no-default-features, shellcheck, ruff, PR body checklist). Going in.

ai-hpc · 2026-05-18T20:25:49Z

Merged at 73b838bdff0427039daa618983a1a2773302931d.
Thanks to @bittoby!

bittoby added 3 commits May 18, 2026 14:00

fix(runtime): stabilize Jetson runtime startup

b5895e8

fix(runtime): compact core prompt for Jetson runtime

d78a1d2

fix(memory): route identity recall directly

18bad34

ai-hpc approved these changes May 18, 2026


ai-hpc merged commit 73b838b into GeniePod:main May 18, 2026
9 checks passed

ai-hpc merged 3 commits into GeniePod:main from bittoby:fix/runtime-default-context-4096

2 participants
