fix(conversation): clip first-message title at UTF-8 char boundary (closes #168)#169
Open
galuis116 wants to merge 2 commits into
Open
fix(conversation): clip first-message title at UTF-8 char boundary (closes #168)#169galuis116 wants to merge 2 commits into
galuis116 wants to merge 2 commits into
Conversation
…loses GeniePod#168) `ConversationStore::append()` formatted the first user message of every new conversation into a 60-byte title via `&content[..57]`. When byte 57 fell inside a multi-byte UTF-8 codepoint (any emoji, or a Cyrillic / Greek / Hebrew / Arabic / Latin-Extended char at an odd byte alignment) Rust panicked with `byte index 57 is not a char boundary`. Under `panic = "abort"` (the workspace release profile in Cargo.toml) this aborted the whole `genie-core` daemon on any such first message, reachable from POST /api/chat, POST /api/chat/stream, and the voice transcript path. Same bug class as GeniePod#147 / PR GeniePod#150 (UTF-8 byte slice in `llm::openai_compat::truncate_body`); fix is the same shape. Replace the byte slice with a `truncate_at_char_boundary(content, 57)` helper that walks back from the requested byte budget until `is_char_boundary` holds, then slices. Behaviour is byte-for-byte identical to the old code for ASCII input — no operator-visible change in the path that already worked. Add 5 regression tests: - direct helper coverage (ASCII pass-through, ASCII clip, 16x emoji 4-byte, 31x Cyrillic odd-byte, empty-string no-panic); - emoji first-message no-panic via ConversationStore::append + list; - Cyrillic first-message no-panic at the odd-byte boundary; - short-ASCII verbatim regression on the no-truncation branch; - long-ASCII path regression to lock the existing format. cargo test -p genie-core --lib conversation:: -> 12 / 0 / 0. cargo test (full workspace) -> 610 / 0 / 3 (vs 608 / 0 / 3 on main).
`cargo fmt --all -- --check` on CI (PR GeniePod#169 / run #201) rejected the single-line `body.chars().last().map(...).unwrap_or(false)` chain in `append_title_truncates_emoji_first_message_without_panic`. Apply rustfmt; no semantic change.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Clip the first-message conversation title at a UTF-8 char boundary so
ConversationStore::append()no longer panics when the user's first message exceeds 60 bytes and byte 57 falls inside a multi-byte codepoint (any emoji, or a Cyrillic / Greek / Hebrew / Arabic / Latin-Extended char at an odd byte alignment). Underpanic = "abort"(the workspace release profile in Cargo.toml) the old&content[..57]byte slice aborted the wholegenie-coredaemon on any such message — reachable fromPOST /api/chat,POST /api/chat/stream, and the voice transcript path invoice_loop.rs. Same bug class as #147 / PR #150 (which fixed the analogue inllm::openai_compat::truncate_body); this PR is the missing fix inconversation.rs::append.Closes #168.
Changes
crates/genie-core/src/tools/../conversation.rs::append()(L102-L116): replace&content[..57]withtruncate_at_char_boundary(content, 57). Added a multi-line comment at the call site naming the failure mode, citing [bug] llm: non-ASCII backend error body panics & aborts the whole genie-core daemon (UTF-8 boundary slice) #147 / PR fix(llm/openai_compat): slice error bodies on UTF-8 char boundary #150 as the precedent, and explaining the invariant so a future reader doesn't re-introduce a naive byte slice.crates/genie-core/src/conversation.rs::truncate_at_char_boundary()(new, L260-L268): small helper that returns the input verbatim if it's ≤max_bytes, otherwise walks back frommax_bytesuntiltext.is_char_boundary(n)holds, then slices. Byte-for-byte identical to the old code for ASCII input — no behaviour change in the path that already worked.crates/genie-core/src/conversation.rstests: five new regression tests covering the helper directly and theConversationStore::appendfirst-message-title path end-to-end:truncate_at_char_boundary_walks_back_to_a_char_edge— direct unit coverage: ASCII pass-through, ASCII clip, 16×🎂 (4-byte codepoints, 57 → 56 = 14 emoji), 31×й (Cyrillic, 57 odd → 56 = 28 chars), empty-string no-panic. Assertsout.is_char_boundary(out.len())on every truncation.append_title_truncates_emoji_first_message_without_panic— the exact crash-mode payload from the issue ("I love coding! 🎉×13", 67 bytes). Reads back viastore.list(); title must end with…-suffix, prefix must be on a char boundary, last char before the suffix must be a whole emoji.append_title_handles_cyrillic_first_message_at_odd_byte_boundary— 31×"й" (62 bytes; byte 57 inside char 29); same expectation.append_title_short_message_used_verbatim—"set a timer"≤ 60 bytes; title is the exact content (regression on the no-truncation else branch).append_title_long_ascii_truncates_with_ellipsis—"a"×70; title ="a"×57 + "..."(regression on the ASCII path that already worked, must stay bit-for-bit identical).No config schema change, no public API change. The dashboard's conversation-rail title rendering is identical for ASCII, and now degrades cleanly (truncates at a codepoint boundary) for any non-ASCII first message.
Real Behavior Proof
POST /api/chatpath the bug report describes. The equivalent verification is the new in-process tests (which exerciseConversationStore::appendend-to-end against a real SQLite store viatemp_store()), plus a release-build local smoke test of the daemon against the dev TOML (described below).What I ran
Environment: x86_64 Linux dev host (Ubuntu 22.04, Rust 1.95.0, glibc 2.35). No Jetson hardware available.
What I observed
Repro is real on
main. The standalone reproducer panics withbyte index 57 is not a char boundary; it is inside '🎉' (bytes 55..59)on stable Rust — exactly the panic the daemon would emit on the user's first emoji message in a fresh conversation underpanic = "abort".All 5 new tests pass on this branch. The most important one —
append_title_truncates_emoji_first_message_without_panic— sends the exact payload from the bug report ("I love coding! 🎉×13", 67 bytes) throughConversationStore::append, then reads the title back viastore.list(). The title ends with..., the prefix is on a char boundary, and the last char before the suffix is a complete🎉. Before this patch, the same call panics.No regression in the ASCII path.
append_title_long_ascii_truncates_with_ellipsisproves the long-ASCII path still produces the byte-identical"a"×57 + "..."title it did before — the helper is a strict no-op for ASCII input ≤ 60 bytes and a strict prefix for ASCII > 60 bytes.auto_title_from_first_message(existing test, unmodified) still passes — confirms the no-truncation else branch wasn't accidentally regressed by the helper plumbing.Full
cargo testis clean. 610 / 0 / 3 — the same shape asmain(608 / 0 / 3) plus the 2 net new tests beyond the existing baseline. ZeroFAILEDlines anywhere in the workspace.Test plan
A reviewer can re-verify on any Rust 1.85+ host (no Jetson, no Home Assistant, no audio, no LLM backend needed):
cargo test -p genie-core --lib conversation::truncate_at_char_boundary_walks_back_to_a_char_edge conversation::append_title— 5 tests, ~0.2s, all green.genie-core:cargo build --release -p genie-core(which compiles underpanic = "abort").GENIEPOD_CONFIG=deploy/config/geniepod.dev.toml ./target/release/genie-core— starts the HTTP server on127.0.0.1:3000.curl -X POST http://127.0.0.1:3000/api/chat -H 'Content-Type: application/json' -d '{"message":"I love coding! 🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉"}'— request must NOT abort the daemon. Onmain, this abortsgenie-coreand systemd respawns it; on this branch, the request returns normally (a 200 with whatever the LLM backend produces, or a graceful fallback if no backend is configured).sqlite3 data/conversations.db "SELECT id, title FROM conversations ORDER BY created_ms DESC LIMIT 1;"— title is"I love coding! 🎉🎉🎉🎉🎉🎉🎉..."(or similar), no broken UTF-8 in the column.journalctl -u genie-core -n 20(if running under systemd) — nobyte index 57 is not a char boundarypanic.Notes for reviewers
truncate_utf8helper incrates/genie-core/src/llm/openai_compat.rsfor the LLM error-body slice. The maintainers' position on this exact pattern is on record. This PR is the missed analogue inconversation.rs::append— same crash mechanism (&str[..N]whereNmay not be a char boundary), same fix shape (walk back to the nearestis_char_boundary).truncate_utf8to a shared util. Two reasons. (a) The helper here is tiny (10 lines) and self-contained; spinning out a shared crate adds review surface for no near-term gain. (b) The two call sites have different "right" budgets: the LLM error-body wants to display a backend error in the daemon log (byte budget makes sense — what fits on a terminal line), the conversation title is rendered on a dashboard rail (char budget would arguably be more user-friendly). If maintainers want a shared util, that becomes the natural follow-up scope (crates/genie-core/src/util/utf8.rs), but it's intentionally out of scope here to keep the PR focused on closing [bug] conversation: first-message title byte slice panics on emoji / non-ASCII content — aborts genie-core daemon (UTF-8 boundary slice) #168.data/conversations.db. Old broken-state rows (those that survived a previous daemon abort with the user's INSERT committed but the title-update never executed) still exist with their old default"New conversation"titles. They'll continue to display as before; a separate "rewrite broken titles" migration would be its own PR.let _ = conversations.append(...)calls invoice_loop.rs:381/:1032. Those uselet _to suppress theResult, which silently drops IO errors. That's a separate issue ([bug] audit: AuditLogger and ToolAuditLogger silently drop events on IO failure #131-class — silent IO drops) and is out of scope for the panic fix here. Filing it separately is the right move.Real Behavior Proof — concrete artifacts
cargo test -p genie-core --lib conversation::output (above): 12 / 12 green.cargo testfull run: 610 / 0 / 3 (zeroFAILEDlines)./tmp/title_panic.rsabove): panics onmain's pattern, no panic on this branch's helper.