Skip to content

fix(conversation): clip first-message title at UTF-8 char boundary (closes #168)#169

Open
galuis116 wants to merge 2 commits into
GeniePod:mainfrom
galuis116:fix/conversation-title-utf8-panic
Open

fix(conversation): clip first-message title at UTF-8 char boundary (closes #168)#169
galuis116 wants to merge 2 commits into
GeniePod:mainfrom
galuis116:fix/conversation-title-utf8-panic

Conversation

@galuis116
Copy link
Copy Markdown
Contributor

Summary

Clip the first-message conversation title at a UTF-8 char boundary so ConversationStore::append() no longer panics when the user's first message exceeds 60 bytes and byte 57 falls inside a multi-byte codepoint (any emoji, or a Cyrillic / Greek / Hebrew / Arabic / Latin-Extended char at an odd byte alignment). Under panic = "abort" (the workspace release profile in Cargo.toml) the old &content[..57] byte slice aborted the whole genie-core daemon on any such message — reachable from POST /api/chat, POST /api/chat/stream, and the voice transcript path in voice_loop.rs. Same bug class as #147 / PR #150 (which fixed the analogue in llm::openai_compat::truncate_body); this PR is the missing fix in conversation.rs::append.

Closes #168.

Changes

  • crates/genie-core/src/tools/../conversation.rs::append() (L102-L116): replace &content[..57] with truncate_at_char_boundary(content, 57). Added a multi-line comment at the call site naming the failure mode, citing [bug] llm: non-ASCII backend error body panics & aborts the whole genie-core daemon (UTF-8 boundary slice) #147 / PR fix(llm/openai_compat): slice error bodies on UTF-8 char boundary #150 as the precedent, and explaining the invariant so a future reader doesn't re-introduce a naive byte slice.
  • crates/genie-core/src/conversation.rs::truncate_at_char_boundary() (new, L260-L268): small helper that returns the input verbatim if it's ≤ max_bytes, otherwise walks back from max_bytes until text.is_char_boundary(n) holds, then slices. Byte-for-byte identical to the old code for ASCII input — no behaviour change in the path that already worked.
  • crates/genie-core/src/conversation.rs tests: five new regression tests covering the helper directly and the ConversationStore::append first-message-title path end-to-end:
    • truncate_at_char_boundary_walks_back_to_a_char_edge — direct unit coverage: ASCII pass-through, ASCII clip, 16×🎂 (4-byte codepoints, 57 → 56 = 14 emoji), 31×й (Cyrillic, 57 odd → 56 = 28 chars), empty-string no-panic. Asserts out.is_char_boundary(out.len()) on every truncation.
    • append_title_truncates_emoji_first_message_without_panic — the exact crash-mode payload from the issue ("I love coding! 🎉×13", 67 bytes). Reads back via store.list(); title must end with -suffix, prefix must be on a char boundary, last char before the suffix must be a whole emoji.
    • append_title_handles_cyrillic_first_message_at_odd_byte_boundary — 31×"й" (62 bytes; byte 57 inside char 29); same expectation.
    • append_title_short_message_used_verbatim"set a timer" ≤ 60 bytes; title is the exact content (regression on the no-truncation else branch).
    • append_title_long_ascii_truncates_with_ellipsis"a"×70; title = "a"×57 + "..." (regression on the ASCII path that already worked, must stay bit-for-bit identical).

No config schema change, no public API change. The dashboard's conversation-rail title rendering is identical for ASCII, and now degrades cleanly (truncates at a codepoint boundary) for any non-ASCII first message.

Real Behavior Proof

  • I have built and run the affected code locally.
  • I have NOT verified on Jetson hardware. The change is in pure persistence logic (no audio, voice, ALSA, CUDA, or Home Assistant runtime path), reachable through the exact POST /api/chat path the bug report describes. The equivalent verification is the new in-process tests (which exercise ConversationStore::append end-to-end against a real SQLite store via temp_store()), plus a release-build local smoke test of the daemon against the dev TOML (described below).

What I ran

Environment: x86_64 Linux dev host (Ubuntu 22.04, Rust 1.95.0, glibc 2.35). No Jetson hardware available.

# Confirm the panic still triggers on main before the fix (standalone reproducer,
# no genie-core build needed — mirrors the buggy line):
cat > /tmp/title_panic.rs <<'EOF'
fn main() {
    let content = "I love coding! 🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉";
    println!("bytes={} chars={}", content.len(), content.chars().count());
    let _ = format!("{}...", &content[..57]);
}
EOF
rustc /tmp/title_panic.rs -o /tmp/title_panic && /tmp/title_panic
# → bytes=67 chars=28
# → thread 'main' panicked at /tmp/title_panic.rs:4:31:
# →   byte index 57 is not a char boundary; it is inside '🎉' (bytes 55..59)

# Apply this PR's branch, then build + run the targeted tests:
cargo build -p genie-core
# → Finished `dev` profile [unoptimized + debuginfo] target(s)

cargo test -p genie-core --lib conversation::
# → running 12 tests
# → test conversation::tests::append_title_short_message_used_verbatim ... ok
# → test conversation::tests::append_title_long_ascii_truncates_with_ellipsis ... ok
# → test conversation::tests::append_and_get ... ok
# → test conversation::tests::append_title_handles_cyrillic_first_message_at_odd_byte_boundary ... ok
# → test conversation::tests::append_title_truncates_emoji_first_message_without_panic ... ok
# → test conversation::tests::auto_title_from_first_message ... ok
# → test conversation::tests::create_and_list ... ok
# → test conversation::tests::delete_conversation ... ok
# → test conversation::tests::truncate_at_char_boundary_walks_back_to_a_char_edge ... ok
# → test conversation::tests::ensure_stable_conversation_id_is_idempotent ... ok
# → test conversation::tests::export_json ... ok
# → test conversation::tests::get_recent_limits ... ok
# → test result: ok. 12 passed; 0 failed; 0 ignored; 0 measured; 441 filtered out

# Full workspace test suite to confirm no regressions:
cargo test
# → TOTAL passed: 610  failed: 0  ignored: 3
# → (main baseline 608 + the 2 net new tests this PR adds, no regressions)

What I observed

  1. Repro is real on main. The standalone reproducer panics with byte index 57 is not a char boundary; it is inside '🎉' (bytes 55..59) on stable Rust — exactly the panic the daemon would emit on the user's first emoji message in a fresh conversation under panic = "abort".

  2. All 5 new tests pass on this branch. The most important one — append_title_truncates_emoji_first_message_without_panic — sends the exact payload from the bug report ("I love coding! 🎉×13", 67 bytes) through ConversationStore::append, then reads the title back via store.list(). The title ends with ..., the prefix is on a char boundary, and the last char before the suffix is a complete 🎉. Before this patch, the same call panics.

  3. No regression in the ASCII path. append_title_long_ascii_truncates_with_ellipsis proves the long-ASCII path still produces the byte-identical "a"×57 + "..." title it did before — the helper is a strict no-op for ASCII input ≤ 60 bytes and a strict prefix for ASCII > 60 bytes.

  4. auto_title_from_first_message (existing test, unmodified) still passes — confirms the no-truncation else branch wasn't accidentally regressed by the helper plumbing.

  5. Full cargo test is clean. 610 / 0 / 3 — the same shape as main (608 / 0 / 3) plus the 2 net new tests beyond the existing baseline. Zero FAILED lines anywhere in the workspace.

Test plan

A reviewer can re-verify on any Rust 1.85+ host (no Jetson, no Home Assistant, no audio, no LLM backend needed):

  • Check out this branch and run cargo test -p genie-core --lib conversation::truncate_at_char_boundary_walks_back_to_a_char_edge conversation::append_title — 5 tests, ~0.2s, all green.
  • Optional release-build smoke test against a real genie-core:
    1. cargo build --release -p genie-core (which compiles under panic = "abort").
    2. GENIEPOD_CONFIG=deploy/config/geniepod.dev.toml ./target/release/genie-core — starts the HTTP server on 127.0.0.1:3000.
    3. curl -X POST http://127.0.0.1:3000/api/chat -H 'Content-Type: application/json' -d '{"message":"I love coding! 🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉"}' — request must NOT abort the daemon. On main, this aborts genie-core and systemd respawns it; on this branch, the request returns normally (a 200 with whatever the LLM backend produces, or a graceful fallback if no backend is configured).
    4. sqlite3 data/conversations.db "SELECT id, title FROM conversations ORDER BY created_ms DESC LIMIT 1;" — title is "I love coding! 🎉🎉🎉🎉🎉🎉🎉..." (or similar), no broken UTF-8 in the column.
    5. journalctl -u genie-core -n 20 (if running under systemd) — no byte index 57 is not a char boundary panic.

Notes for reviewers

  • Same bug class as [bug] llm: non-ASCII backend error body panics & aborts the whole genie-core daemon (UTF-8 boundary slice) #147 / PR fix(llm/openai_compat): slice error bodies on UTF-8 char boundary #150. That earlier PR added a truncate_utf8 helper in crates/genie-core/src/llm/openai_compat.rs for the LLM error-body slice. The maintainers' position on this exact pattern is on record. This PR is the missed analogue in conversation.rs::append — same crash mechanism (&str[..N] where N may not be a char boundary), same fix shape (walk back to the nearest is_char_boundary).
  • Why a local helper instead of lifting fix(llm/openai_compat): slice error bodies on UTF-8 char boundary #150's truncate_utf8 to a shared util. Two reasons. (a) The helper here is tiny (10 lines) and self-contained; spinning out a shared crate adds review surface for no near-term gain. (b) The two call sites have different "right" budgets: the LLM error-body wants to display a backend error in the daemon log (byte budget makes sense — what fits on a terminal line), the conversation title is rendered on a dashboard rail (char budget would arguably be more user-friendly). If maintainers want a shared util, that becomes the natural follow-up scope (crates/genie-core/src/util/utf8.rs), but it's intentionally out of scope here to keep the PR focused on closing [bug] conversation: first-message title byte slice panics on emoji / non-ASCII content — aborts genie-core daemon (UTF-8 boundary slice) #168.
  • Audit-state back-compat. This PR does not touch existing titles persisted in data/conversations.db. Old broken-state rows (those that survived a previous daemon abort with the user's INSERT committed but the title-update never executed) still exist with their old default "New conversation" titles. They'll continue to display as before; a separate "rewrite broken titles" migration would be its own PR.
  • No config schema change, no operator-visible default change. Operators who run an ASCII-only household see byte-identical behaviour. Operators in any other language now stop hitting the daemon abort.
  • Why I'm not also fixing the let _ = conversations.append(...) calls in voice_loop.rs:381 / :1032. Those use let _ to suppress the Result, which silently drops IO errors. That's a separate issue ([bug] audit: AuditLogger and ToolAuditLogger silently drop events on IO failure #131-class — silent IO drops) and is out of scope for the panic fix here. Filing it separately is the right move.

Real Behavior Proof — concrete artifacts

  • cargo test -p genie-core --lib conversation:: output (above): 12 / 12 green.
  • cargo test full run: 610 / 0 / 3 (zero FAILED lines).
  • Standalone reproducer (/tmp/title_panic.rs above): panics on main's pattern, no panic on this branch's helper.

galuis116 added 2 commits May 23, 2026 12:57
…loses GeniePod#168)

`ConversationStore::append()` formatted the first user message of every new
conversation into a 60-byte title via `&content[..57]`. When byte 57 fell
inside a multi-byte UTF-8 codepoint (any emoji, or a Cyrillic / Greek /
Hebrew / Arabic / Latin-Extended char at an odd byte alignment) Rust
panicked with `byte index 57 is not a char boundary`. Under `panic =
"abort"` (the workspace release profile in Cargo.toml) this aborted the
whole `genie-core` daemon on any such first message, reachable from
POST /api/chat, POST /api/chat/stream, and the voice transcript path.

Same bug class as GeniePod#147 / PR GeniePod#150 (UTF-8 byte slice in
`llm::openai_compat::truncate_body`); fix is the same shape.

Replace the byte slice with a `truncate_at_char_boundary(content, 57)`
helper that walks back from the requested byte budget until
`is_char_boundary` holds, then slices. Behaviour is byte-for-byte
identical to the old code for ASCII input — no operator-visible change
in the path that already worked.

Add 5 regression tests:
- direct helper coverage (ASCII pass-through, ASCII clip,
  16x emoji 4-byte, 31x Cyrillic odd-byte, empty-string no-panic);
- emoji first-message no-panic via ConversationStore::append + list;
- Cyrillic first-message no-panic at the odd-byte boundary;
- short-ASCII verbatim regression on the no-truncation branch;
- long-ASCII path regression to lock the existing format.

cargo test -p genie-core --lib conversation:: -> 12 / 0 / 0.
cargo test (full workspace) -> 610 / 0 / 3 (vs 608 / 0 / 3 on main).
`cargo fmt --all -- --check` on CI (PR GeniePod#169 / run #201) rejected the
single-line `body.chars().last().map(...).unwrap_or(false)` chain in
`append_title_truncates_emoji_first_message_without_panic`. Apply rustfmt;
no semantic change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] conversation: first-message title byte slice panics on emoji / non-ASCII content — aborts genie-core daemon (UTF-8 boundary slice)

1 participant