Skip to content

fix(rewrite): correct multibyte UTF-8 compound rewrites (#383)#395

Merged
mpecan merged 2 commits into
mainfrom
fix/383-multibyte-rewrite-separators
Jun 20, 2026
Merged

fix(rewrite): correct multibyte UTF-8 compound rewrites (#383)#395
mpecan merged 2 commits into
mainfrom
fix/383-multibyte-rewrite-separators

Conversation

@mpecan

@mpecan mpecan commented Jun 20, 2026

Copy link
Copy Markdown
Owner

Closes #383

Summary

Compound shell commands that mixed a multibyte UTF-8 character (, , emoji, CJK) in one segment with a chain operator (&&/||/;/newline) were rewritten into invalid shell — spurious |/; tokens were spliced at the wrong byte offsets, so the command failed to parse. This broke every tokf hook-driven rewrite that mixed an emoji/checkmark echo with a filterable pipe.

Before / after

in:   echo "✓ valid" || echo "✗ bad"; ls ~/.claude/skills/ | grep tokf
was:  echo "✓ valid" | |echo "✗ bad";tokf run --baseline-pipe 'grep tokf' ls ~/.claude/skills/   ← invalid
now:  echo "✓ valid" || echo "✗ bad"; tokf run --baseline-pipe 'grep tokf' ls ~/.claude/skills/   ← valid

Root cause

Two layers:

  1. Upstream (rable): AST spans are character indices (the lexer scans a Vec<char>), but span ends were computed as pos + value.len() — a char position plus a byte length. For multibyte input the end overshot, so even Node::source_text returned the wrong slice for non-final tokens. Fixed in fix(lexer): compute span ends by character count, not byte length rable#72, released as rable 0.2.1.
  2. tokf: compound_segments/flatten_segments sliced the byte-indexed source string by raw (character) span values. Now routed through a char_to_byte helper that mirrors rable's own conversion.

The issue's suspected culprit (rules.rs:75, the {rest} slice) was a red herring — regex offsets are byte offsets by design.

Changes

  • char_to_byte helper + the two separator-slice sites in rewrite/bash_ast.rs.
  • Bump rable 0.1.15 → 0.2.1 (absorbs the 0.2.0 API surface changes; tokf only uses rable::parse + rable::ast).

Tests

Three layers of regression coverage — all previously failing, now green:

  • AST round-trip + exact separators — new bash_ast_multibyte_tests.rs (split out to stay under the 700-line file limit).
  • In-process rewritetests_compound.rs.
  • Real binary an LLM hook invokescli_rewrite.rs.

cargo test -p tokf (989 lib + 59 CLI), clippy, and fmt all clean.

🤖 Generated with Claude Code

mpecan and others added 2 commits June 20, 2026 10:09
…UTF-8 (#383)

rable's AST spans are character indices (the lexer scans a `Vec<char>`),
but `compound_segments`/`flatten_segments` sliced the byte-indexed source
string by raw span values. For compound commands containing multibyte
UTF-8 (e.g. `echo "✓ valid" || …; … | grep x`) the `&&`/`||`/`;`
separators were sliced at the wrong byte offsets, emitting invalid shell
(`| ||`, a dropped `; `) that failed to parse — corrupting every
hook-driven rewrite that mixed an emoji/checkmark echo with a pipe.

Add a `char_to_byte` helper (mirroring rable's own `Node::source_text`
conversion) and route the two separator-slice sites through it.

Regression tests at three layers: AST round-trip + exact separators
(new `bash_ast_multibyte_tests`, split out to stay under the 700-line
file limit), in-process rewrite (`tests_compound`), and the real binary
an LLM hook invokes (`cli_rewrite`).

NOTE: depends on the rable span-end fix (mpecan/rable#72). The byte/char
conversion here is only correct once rable reports consistent character
spans; the `rable` dependency bump to the released, fixed version is a
follow-up commit. Until then the new tests fail against rable 0.1.15.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…383)

rable 0.2.1 reports AST spans as consistent character indices (the
char-pos + byte-len span-end bug was fixed in mpecan/rable#72). This is
what makes the preceding `char_to_byte` separator-slice fix correct, so
the multibyte regression suite now passes.

Also absorbs the rable 0.2.0 API surface changes (WordSpan relocation,
tightened lexer API) — tokf only uses `rable::parse` and `rable::ast`,
which remain compatible.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@repository-butler

Copy link
Copy Markdown
Contributor

Filter Verification Report

Changed Filters

No filter files changed in this PR.

All Filters Summary

✅ 143/143 test cases passed across 51 filters


Generated by tokf verify

@mpecan mpecan merged commit 3a822d0 into main Jun 20, 2026
5 checks passed
@mpecan mpecan deleted the fix/383-multibyte-rewrite-separators branch June 20, 2026 09:20
@repository-butler repository-butler Bot mentioned this pull request Jun 20, 2026
mpecan pushed a commit that referenced this pull request Jun 20, 2026
🤖 I have created a release *beep* *boop*
---


<details><summary>e2e-tests: 0.1.37</summary>

### Dependencies


</details>

<details><summary>tokf-common: 0.2.49</summary>

##
[0.2.49](tokf-common-v0.2.48...tokf-common-v0.2.49)
(2026-06-20)


### Miscellaneous

* **tokf-common:** Synchronize workspace versions
</details>

<details><summary>tokf-filter: 0.2.49</summary>

##
[0.2.49](tokf-filter-v0.2.48...tokf-filter-v0.2.49)
(2026-06-20)


### Miscellaneous

* **tokf-filter:** Synchronize workspace versions


### Dependencies

* The following workspace dependencies were updated
  * dependencies
    * tokf-common bumped from 0.2.48 to 0.2.49
</details>

<details><summary>tokf-hook-types: 0.2.49</summary>

##
[0.2.49](tokf-hook-types-v0.2.48...tokf-hook-types-v0.2.49)
(2026-06-20)


### Miscellaneous

* **tokf-hook-types:** Synchronize workspace versions
</details>

<details><summary>tokf-server: 0.2.49</summary>

##
[0.2.49](tokf-server-v0.2.48...tokf-server-v0.2.49)
(2026-06-20)


### Miscellaneous

* **tokf-server:** Synchronize workspace versions


### Dependencies

* The following workspace dependencies were updated
  * dependencies
    * tokf-common bumped from 0.2.48 to 0.2.49
    * tokf-filter bumped from 0.2.48 to 0.2.49
</details>

<details><summary>catalog-types: 0.2.49</summary>

##
[0.2.49](catalog-types-v0.2.48...catalog-types-v0.2.49)
(2026-06-20)


### Miscellaneous

* **catalog-types:** Synchronize workspace versions
</details>

<details><summary>tokf: 0.2.49</summary>

##
[0.2.49](tokf-v0.2.48...tokf-v0.2.49)
(2026-06-20)


### Bug Fixes

* **rewrite:** correct multibyte UTF-8 compound rewrites
([#383](#383))
([#395](#395))
([3a822d0](3a822d0))


### Dependencies

* The following workspace dependencies were updated
  * dependencies
    * tokf-common bumped from 0.2.48 to 0.2.49
    * tokf-filter bumped from 0.2.48 to 0.2.49
    * tokf-hook-types bumped from 0.2.48 to 0.2.49
</details>

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: repository-butler[bot] <166800726+repository-butler[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rewrite: multibyte UTF-8 + piped command produces invalid shell (spurious pipe / semicolon)

1 participant