fix(rewrite): correct multibyte UTF-8 compound rewrites (#383)#395
Merged
Conversation
…UTF-8 (#383) rable's AST spans are character indices (the lexer scans a `Vec<char>`), but `compound_segments`/`flatten_segments` sliced the byte-indexed source string by raw span values. For compound commands containing multibyte UTF-8 (e.g. `echo "✓ valid" || …; … | grep x`) the `&&`/`||`/`;` separators were sliced at the wrong byte offsets, emitting invalid shell (`| ||`, a dropped `; `) that failed to parse — corrupting every hook-driven rewrite that mixed an emoji/checkmark echo with a pipe. Add a `char_to_byte` helper (mirroring rable's own `Node::source_text` conversion) and route the two separator-slice sites through it. Regression tests at three layers: AST round-trip + exact separators (new `bash_ast_multibyte_tests`, split out to stay under the 700-line file limit), in-process rewrite (`tests_compound`), and the real binary an LLM hook invokes (`cli_rewrite`). NOTE: depends on the rable span-end fix (mpecan/rable#72). The byte/char conversion here is only correct once rable reports consistent character spans; the `rable` dependency bump to the released, fixed version is a follow-up commit. Until then the new tests fail against rable 0.1.15. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…383) rable 0.2.1 reports AST spans as consistent character indices (the char-pos + byte-len span-end bug was fixed in mpecan/rable#72). This is what makes the preceding `char_to_byte` separator-slice fix correct, so the multibyte regression suite now passes. Also absorbs the rable 0.2.0 API surface changes (WordSpan relocation, tightened lexer API) — tokf only uses `rable::parse` and `rable::ast`, which remain compatible. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
Filter Verification ReportChanged FiltersNo filter files changed in this PR. All Filters Summary✅ 143/143 test cases passed across 51 filters Generated by |
Merged
mpecan
pushed a commit
that referenced
this pull request
Jun 20, 2026
🤖 I have created a release *beep* *boop* --- <details><summary>e2e-tests: 0.1.37</summary> ### Dependencies </details> <details><summary>tokf-common: 0.2.49</summary> ## [0.2.49](tokf-common-v0.2.48...tokf-common-v0.2.49) (2026-06-20) ### Miscellaneous * **tokf-common:** Synchronize workspace versions </details> <details><summary>tokf-filter: 0.2.49</summary> ## [0.2.49](tokf-filter-v0.2.48...tokf-filter-v0.2.49) (2026-06-20) ### Miscellaneous * **tokf-filter:** Synchronize workspace versions ### Dependencies * The following workspace dependencies were updated * dependencies * tokf-common bumped from 0.2.48 to 0.2.49 </details> <details><summary>tokf-hook-types: 0.2.49</summary> ## [0.2.49](tokf-hook-types-v0.2.48...tokf-hook-types-v0.2.49) (2026-06-20) ### Miscellaneous * **tokf-hook-types:** Synchronize workspace versions </details> <details><summary>tokf-server: 0.2.49</summary> ## [0.2.49](tokf-server-v0.2.48...tokf-server-v0.2.49) (2026-06-20) ### Miscellaneous * **tokf-server:** Synchronize workspace versions ### Dependencies * The following workspace dependencies were updated * dependencies * tokf-common bumped from 0.2.48 to 0.2.49 * tokf-filter bumped from 0.2.48 to 0.2.49 </details> <details><summary>catalog-types: 0.2.49</summary> ## [0.2.49](catalog-types-v0.2.48...catalog-types-v0.2.49) (2026-06-20) ### Miscellaneous * **catalog-types:** Synchronize workspace versions </details> <details><summary>tokf: 0.2.49</summary> ## [0.2.49](tokf-v0.2.48...tokf-v0.2.49) (2026-06-20) ### Bug Fixes * **rewrite:** correct multibyte UTF-8 compound rewrites ([#383](#383)) ([#395](#395)) ([3a822d0](3a822d0)) ### Dependencies * The following workspace dependencies were updated * dependencies * tokf-common bumped from 0.2.48 to 0.2.49 * tokf-filter bumped from 0.2.48 to 0.2.49 * tokf-hook-types bumped from 0.2.48 to 0.2.49 </details> --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: repository-butler[bot] <166800726+repository-butler[bot]@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #383
Summary
Compound shell commands that mixed a multibyte UTF-8 character (
✓,✗, emoji, CJK) in one segment with a chain operator (&&/||/;/newline) were rewritten into invalid shell — spurious|/;tokens were spliced at the wrong byte offsets, so the command failed to parse. This broke everytokf hook-driven rewrite that mixed an emoji/checkmarkechowith a filterable pipe.Before / after
Root cause
Two layers:
Vec<char>), but span ends were computed aspos + value.len()— a char position plus a byte length. For multibyte input the end overshot, so evenNode::source_textreturned the wrong slice for non-final tokens. Fixed in fix(lexer): compute span ends by character count, not byte length rable#72, released as rable 0.2.1.compound_segments/flatten_segmentssliced the byte-indexed source string by raw (character) span values. Now routed through achar_to_bytehelper that mirrors rable's own conversion.The issue's suspected culprit (
rules.rs:75, the{rest}slice) was a red herring — regex offsets are byte offsets by design.Changes
char_to_bytehelper + the two separator-slice sites inrewrite/bash_ast.rs.rable0.1.15 → 0.2.1 (absorbs the 0.2.0 API surface changes; tokf only usesrable::parse+rable::ast).Tests
Three layers of regression coverage — all previously failing, now green:
bash_ast_multibyte_tests.rs(split out to stay under the 700-line file limit).tests_compound.rs.cli_rewrite.rs.cargo test -p tokf(989 lib + 59 CLI), clippy, and fmt all clean.🤖 Generated with Claude Code