fix(rewrite): correct multibyte UTF-8 compound rewrites (#383) by mpecan · Pull Request #395 · mpecan/tokf

mpecan · 2026-06-20T08:18:35Z

Closes #383

Summary

Compound shell commands that mixed a multibyte UTF-8 character (✓, ✗, emoji, CJK) in one segment with a chain operator (&&/||/;/newline) were rewritten into invalid shell — spurious |/; tokens were spliced at the wrong byte offsets, so the command failed to parse. This broke every tokf hook-driven rewrite that mixed an emoji/checkmark echo with a filterable pipe.

Before / after

in:   echo "✓ valid" || echo "✗ bad"; ls ~/.claude/skills/ | grep tokf
was:  echo "✓ valid" | |echo "✗ bad";tokf run --baseline-pipe 'grep tokf' ls ~/.claude/skills/   ← invalid
now:  echo "✓ valid" || echo "✗ bad"; tokf run --baseline-pipe 'grep tokf' ls ~/.claude/skills/   ← valid

Root cause

Two layers:

Upstream (rable): AST spans are character indices (the lexer scans a Vec<char>), but span ends were computed as pos + value.len() — a char position plus a byte length. For multibyte input the end overshot, so even Node::source_text returned the wrong slice for non-final tokens. Fixed in fix(lexer): compute span ends by character count, not byte length rable#72, released as rable 0.2.1.
tokf: compound_segments/flatten_segments sliced the byte-indexed source string by raw (character) span values. Now routed through a char_to_byte helper that mirrors rable's own conversion.

The issue's suspected culprit (rules.rs:75, the {rest} slice) was a red herring — regex offsets are byte offsets by design.

Changes

char_to_byte helper + the two separator-slice sites in rewrite/bash_ast.rs.
Bump rable 0.1.15 → 0.2.1 (absorbs the 0.2.0 API surface changes; tokf only uses rable::parse + rable::ast).

Tests

Three layers of regression coverage — all previously failing, now green:

AST round-trip + exact separators — new bash_ast_multibyte_tests.rs (split out to stay under the 700-line file limit).
In-process rewrite — tests_compound.rs.
Real binary an LLM hook invokes — cli_rewrite.rs.

cargo test -p tokf (989 lib + 59 CLI), clippy, and fmt all clean.

🤖 Generated with Claude Code

…UTF-8 (#383) rable's AST spans are character indices (the lexer scans a `Vec<char>`), but `compound_segments`/`flatten_segments` sliced the byte-indexed source string by raw span values. For compound commands containing multibyte UTF-8 (e.g. `echo "✓ valid" || …; … | grep x`) the `&&`/`||`/`;` separators were sliced at the wrong byte offsets, emitting invalid shell (`| ||`, a dropped `; `) that failed to parse — corrupting every hook-driven rewrite that mixed an emoji/checkmark echo with a pipe. Add a `char_to_byte` helper (mirroring rable's own `Node::source_text` conversion) and route the two separator-slice sites through it. Regression tests at three layers: AST round-trip + exact separators (new `bash_ast_multibyte_tests`, split out to stay under the 700-line file limit), in-process rewrite (`tests_compound`), and the real binary an LLM hook invokes (`cli_rewrite`). NOTE: depends on the rable span-end fix (mpecan/rable#72). The byte/char conversion here is only correct once rable reports consistent character spans; the `rable` dependency bump to the released, fixed version is a follow-up commit. Until then the new tests fail against rable 0.1.15. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…383) rable 0.2.1 reports AST spans as consistent character indices (the char-pos + byte-len span-end bug was fixed in mpecan/rable#72). This is what makes the preceding `char_to_byte` separator-slice fix correct, so the multibyte regression suite now passes. Also absorbs the rable 0.2.0 API surface changes (WordSpan relocation, tightened lexer API) — tokf only uses `rable::parse` and `rable::ast`, which remain compatible. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

repository-butler · 2026-06-20T08:25:36Z

Filter Verification Report

Changed Filters

No filter files changed in this PR.

All Filters Summary

✅ 143/143 test cases passed across 51 filters

_{Generated by tokf verify}

🤖 I have created a release *beep* *boop* --- <details><summary>e2e-tests: 0.1.37</summary> ### Dependencies </details> <details><summary>tokf-common: 0.2.49</summary> ## [0.2.49](tokf-common-v0.2.48...tokf-common-v0.2.49) (2026-06-20) ### Miscellaneous * **tokf-common:** Synchronize workspace versions </details> <details><summary>tokf-filter: 0.2.49</summary> ## [0.2.49](tokf-filter-v0.2.48...tokf-filter-v0.2.49) (2026-06-20) ### Miscellaneous * **tokf-filter:** Synchronize workspace versions ### Dependencies * The following workspace dependencies were updated * dependencies * tokf-common bumped from 0.2.48 to 0.2.49 </details> <details><summary>tokf-hook-types: 0.2.49</summary> ## [0.2.49](tokf-hook-types-v0.2.48...tokf-hook-types-v0.2.49) (2026-06-20) ### Miscellaneous * **tokf-hook-types:** Synchronize workspace versions </details> <details><summary>tokf-server: 0.2.49</summary> ## [0.2.49](tokf-server-v0.2.48...tokf-server-v0.2.49) (2026-06-20) ### Miscellaneous * **tokf-server:** Synchronize workspace versions ### Dependencies * The following workspace dependencies were updated * dependencies * tokf-common bumped from 0.2.48 to 0.2.49 * tokf-filter bumped from 0.2.48 to 0.2.49 </details> <details><summary>catalog-types: 0.2.49</summary> ## [0.2.49](catalog-types-v0.2.48...catalog-types-v0.2.49) (2026-06-20) ### Miscellaneous * **catalog-types:** Synchronize workspace versions </details> <details><summary>tokf: 0.2.49</summary> ## [0.2.49](tokf-v0.2.48...tokf-v0.2.49) (2026-06-20) ### Bug Fixes * **rewrite:** correct multibyte UTF-8 compound rewrites ([#383](#383)) ([#395](#395)) ([3a822d0](3a822d0)) ### Dependencies * The following workspace dependencies were updated * dependencies * tokf-common bumped from 0.2.48 to 0.2.49 * tokf-filter bumped from 0.2.48 to 0.2.49 * tokf-hook-types bumped from 0.2.48 to 0.2.49 </details> --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: repository-butler[bot] <166800726+repository-butler[bot]@users.noreply.github.com>

mpecan and others added 2 commits June 20, 2026 10:09

mpecan merged commit 3a822d0 into main Jun 20, 2026
5 checks passed

mpecan deleted the fix/383-multibyte-rewrite-separators branch June 20, 2026 09:20

repository-butler Bot mentioned this pull request Jun 20, 2026

chore: release main #396

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(rewrite): correct multibyte UTF-8 compound rewrites (#383)#395

fix(rewrite): correct multibyte UTF-8 compound rewrites (#383)#395
mpecan merged 2 commits into
mainfrom
fix/383-multibyte-rewrite-separators

mpecan commented Jun 20, 2026

Uh oh!

repository-butler Bot commented Jun 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mpecan commented Jun 20, 2026

Summary

Before / after

Root cause

Changes

Tests

Uh oh!

repository-butler Bot commented Jun 20, 2026

Filter Verification Report

Changed Filters

All Filters Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant