Unicode safety: cursor, backspace, and text_diff are UTF-16 code-unit based

## TL;DR

Canopy's text editing layer implicitly assumes that "one character = one UTF-16 code unit." This holds for ASCII-only input, but breaks down for non-BMP characters (emoji, some CJK extension blocks), combining marks, and emoji ZWJ sequences. To make the editor usable for real-world text, we need grapheme boundary computation based on UAX #29.

This issue tracks the editor-layer fix. A related but independent issue exists at the CRDT layer — see [Related Issues](#related-issues).

## Status (2026-05-09)

**PR #239 (Step 1) is open.** The empirical probe added by that PR found the failure modes are bimodal, not uniform — refining the original framing in this issue:

| Category | Inputs | Failure mode | Test style |
|---|---|---|---|
| BMP combining marks | NFD `"e\u{0301}"` (decomposed `é`) | **Wrong output, no abort** — backspace deletes the combining mark only; `compute_edit` reports a 1-code-unit delete | `inspect` pinned to current value |
| Surrogate-pair inputs | Emoji, ZWJ family, regional indicator (flag) | **Hard abort** in eg-walker via `String::sub` mid-surrogate-pair | `test "panic ..."` prefix |

Implications for the rest of this issue:

1. The "lone surrogate **in the resulting document**" scenario described in problem sites 1 and 2 below is **not currently reachable** from `insert` / `backspace` alone — the abort path short-circuits before the editor produces malformed state. The lone-surrogate-via-concurrent-merge scenario is a CRDT-layer problem tracked at [eg-walker#31][egw31] (fix landing in [eg-walker PR #35][egw35]).
2. The editor layer still needs grapheme-aware logic, because once the eg-walker abort is replaced with a typed rejection, malformed editor inputs will surface as `TextError` instead of crashes. Step 2 of this issue (moji-based fixes) prevents the editor from sending malformed content to the CRDT in the first place.
3. `text_diff::find_common_prefix` (problem site 3) is a different surface: it operates on `String` arguments without going through eg-walker, so it does silently produce malformed `Edit` payloads today. It does not benefit from the eg-walker fix.

[egw31]: https://github.com/dowdiness/event-graph-walker/issues/31
[egw35]: https://github.com/dowdiness/event-graph-walker/pull/35

## Problem Sites

Each of the following is a concrete code citation, not speculation. All of them exist on current `main`.

### 1. `SyncEditor::insert` can land the cursor in the middle of a surrogate pair

`editor/sync_editor_text.mbt:11`:

```moonbit
pub fn[T : Eq] SyncEditor::insert(self, text : String) -> Unit raise {
  ...
  self.cursor = self.cursor + text.length()
}
```

`text.length()` counts UTF-16 code units. Inserting `"😀"` (U+1F600) advances the cursor by 2, which looks right on the surface, but the following scenarios break:

- If CRDT merge or undo delivers partial state where the cursor ends up at offset 1 of a `"😀"`-containing document, the cursor sits **between** the high and low surrogate.
- Any subsequent backspace (see problem 2) will then delete exactly one surrogate, leaving a lone surrogate in the document.
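
As a rough illustration of the offsets involved, the sketch below uses TypeScript, whose strings are indexed by UTF-16 code unit just like the `String` values discussed here. It is not Canopy code, and `isInsideSurrogatePair` is a hypothetical helper introduced only for this example.

```typescript
// True when `offset` falls between a high and a low surrogate in `text`.
// Hypothetical helper for illustration; it is not part of Canopy or moji.
function isInsideSurrogatePair(text: string, offset: number): boolean {
  if (offset <= 0 || offset >= text.length) return false;
  const before = text.charCodeAt(offset - 1);
  const after = text.charCodeAt(offset);
  const isHigh = before >= 0xd800 && before <= 0xdbff;
  const isLow = after >= 0xdc00 && after <= 0xdfff;
  return isHigh && isLow;
}

const grinning = "\uD83D\uDE00"; // "😀" (U+1F600) written as its surrogate pair
const sampleDoc = grinning + "b";

console.log(grinning.length); // 2: two code units for one visible character
console.log(isInsideSurrogatePair(sampleDoc, 0)); // false: start of the document
console.log(isInsideSurrogatePair(sampleDoc, 1)); // true: between U+D83D and U+DE00
console.log(isInsideSurrogatePair(sampleDoc, 2)); // false: between the emoji and "b"
```

A check of this shape could serve as a cheap debug assertion on `self.cursor`, although it only detects surrogate splits and says nothing about combining marks or ZWJ sequences.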

### 2. `SyncEditor::backspace` always deletes exactly one code unit

`editor/sync_editor_text.mbt:37-59`:

```moonbit
pub fn[T : Eq] SyncEditor::backspace(self) -> Bool {
  if self.cursor > 0 {
    self.cursor = self.cursor - 1
    try self.doc.delete(@text.Pos::at(self.cursor)) catch { ... }
    ...
  }
}
```

Pressing backspace after `"😀"` (two code units) removes only the low surrogate (U+DE00), leaving the high surrogate (U+D83D) as a lone code unit. The resulting `String` contains ill-formed UTF-16, which will misbehave in any downstream operation (serialization, display, diff).

Additional cases that break the same way:

- `"café"` in NFD form (`c`, `a`, `f`, `e`, U+0301): one backspace removes only the combining acute U+0301. The visible result is `cafe`: the accent disappears but the base letter stays, whereas the user expected the entire `é` grapheme to go.
- `"👨‍👩‍👦"` (eight code units, a ZWJ sequence): one backspace removes only the tail of the last surrogate pair, leaving a malformed ZWJ sequence that may render as `"👨‍👩‍"` with a dangling ZWJ or as garbage.
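
The string-level effect of a one-code-unit backspace can be reproduced in isolation. The TypeScript sketch below models only the string manipulation, under the assumption that nothing else intervenes; in Canopy today the surrogate-pair cases abort in eg-walker before reaching this state (see Status). `naiveBackspace` and `hasLoneSurrogate` are hypothetical names.

```typescript
// Model of the current behavior: remove exactly one UTF-16 code unit.
function naiveBackspace(text: string, cursor: number): string {
  if (cursor <= 0) return text;
  return text.slice(0, cursor - 1) + text.slice(cursor);
}

// Scan for a surrogate that is not part of a well-formed high+low pair.
function hasLoneSurrogate(text: string): boolean {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // well-formed pair: skip the low surrogate too
        continue;
      }
      return true; // high surrogate without a following low surrogate
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) return true; // stray low surrogate
  }
  return false;
}

const afterEmoji = naiveBackspace("a\uD83D\uDE00", 3); // "a😀"
const afterNfd = naiveBackspace("cafe\u0301", 5); // "café" in NFD form
const familyZwj = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC66"; // "👨‍👩‍👦"
const afterFamily = naiveBackspace(familyZwj, familyZwj.length);

console.log(hasLoneSurrogate(afterEmoji)); // true: U+D83D is left behind
console.log(afterNfd === "cafe"); // true: only the accent was removed
console.log(hasLoneSurrogate(afterFamily)); // true: the last pair was split
```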

### 3. `text_diff.find_common_prefix` can split a surrogate pair

`editor/text_diff.mbt:166`:

```moonbit
fn find_common_prefix(s1 : String, s2 : String) -> Int {
  let mut i = 0
  ...
  while i < min_len && s1[i] == s2[i] {
    i = i + 1
  }
  i
}
```

`s1[i] == s2[i]` compares UTF-16 code units. Consider `s1 = "😀A"` and `s2 = "😁B"`: they share the same high surrogate U+D83D, so this function returns `i == 1`. Combined with `find_common_suffix_after_prefix`, the generated `Edit` cuts the diff **inside a surrogate pair**, producing `Edit` payloads that contain lone surrogates.

This affects `SyncEditor::set_text` → `compute_text_change`, which is the path used by external bridges (e.g., ProseMirror), so external text coming in via the bridge can produce malformed edits.
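
A minimal sketch of the prefix problem, again in TypeScript because its strings share the UTF-16 code-unit indexing. `naiveCommonPrefix` mirrors the loop quoted above; `surrogateSafeCommonPrefix` is a hypothetical variant that only avoids splitting surrogate pairs and assumes well-formed input. It is not grapheme-safe, so it would not by itself satisfy the checklist item for `text_diff`.

```typescript
// Mirrors the code-unit comparison loop in find_common_prefix.
function naiveCommonPrefix(s1: string, s2: string): number {
  const minLen = Math.min(s1.length, s2.length);
  let i = 0;
  while (i < minLen && s1.charCodeAt(i) === s2.charCodeAt(i)) {
    i++;
  }
  return i;
}

// Hypothetical variant: if the shared prefix ends right after a high
// surrogate, back off by one so the cut lands before the whole pair.
function surrogateSafeCommonPrefix(s1: string, s2: string): number {
  let i = naiveCommonPrefix(s1, s2);
  if (i > 0) {
    const last = s1.charCodeAt(i - 1);
    if (last >= 0xd800 && last <= 0xdbff) i--;
  }
  return i;
}

const left = "\uD83D\uDE00A"; // "😀A"
const right = "\uD83D\uDE01B"; // "😁B"

console.log(naiveCommonPrefix(left, right)); // 1: cut lands inside the pair
console.log(surrogateSafeCommonPrefix(left, right)); // 0: cut lands before the emoji
```

The same back-off idea would apply in mirror image to the suffix scan, which should not start its match on a low surrogate whose high half differs.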

### 4. Test suite has no non-ASCII input

None of `editor/sync_editor_*_test.mbt`, `editor/sync_editor_text_wbtest.mbt`, or `editor/text_diff_test.mbt` contains emoji, CJK, combining marks, or ZWJ sequences. (`ephemeral_wbtest.mbt` has a single Japanese string for serialization testing only.) Consequently, **none of the bugs above is caught by the existing tests**: they are latent and will surface the first time a user types an emoji, a non-BMP CJK character, or a combining mark.

### 5. API documentation does not specify the unit of `position`

`docs/development/API_REFERENCE.md` documents `move_cursor(position : Int)` and `get_cursor() -> Int` without stating what `Int` counts (code units, code points, or graphemes). This is an unspecified public API contract.

## Why fix this now

1. **Unavoidable for real-world editor use.** Japanese input, emoji, and combining marks are table stakes for a modern editor.
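2. **The bugs exist today, just hidden.** The lambda calculus demo is ASCII-only, so nothing exposes them. Plain BMP text such as `"こんにちは"` still occupies one code unit per character, but the moment a user types `"🎉"`, `"𠮷"`, or a decomposed `"é"`, the editor either aborts or edits only part of a grapheme.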
3. **CRDT merge and undo testing will multiply the surface area.** Nailing the boundary semantics now prevents compound bugs from emerging when we add merge-heavy test cases.
4. **Public API stability.** Bridges (ProseMirror, LSP-like adapters) need to know what unit `position` is. Fixing this early avoids breaking external consumers later.

## Direction of the fix

### Unit policy

Distinguish three units explicitly in documentation and type design:

| Unit | Used by | Type suggestion |
|---|---|---|
| **Code unit offset** | CRDT internals, low-level edits | `Int` (unchanged) |
| **Code point offset** | Transitional / bridge layer | Not needed right now |
| **Grapheme offset** | User-facing ops (cursor, selection, backspace) | Same `Int`, semantically tagged; or a new `GraphemeOffset` opaque type |
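
To make the three units concrete, the sketch below counts each of them for a few of the sample inputs used in this issue. It is written in TypeScript and uses `Intl.Segmenter` (ECMA-402) as a stand-in for a UAX #29 segmenter; that substitution, and the helper names, are assumptions for illustration and say nothing about moji's eventual API.

```typescript
// Assumption: Intl.Segmenter stands in for a UAX #29 grapheme segmenter.
// The cast keeps the sketch compiling under older TypeScript `lib` settings.
const GraphemeSegmenterCtor = (Intl as any).Segmenter;
const graphemeSegmenter = new GraphemeSegmenterCtor("en", { granularity: "grapheme" });

// Code points: a well-formed surrogate pair counts once.
function countCodePoints(text: string): number {
  let count = 0;
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    const isPair =
      unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    if (isPair) i++;
    count++;
  }
  return count;
}

// Graphemes: hop from one segment start to the next.
function countGraphemes(text: string): number {
  const segments = graphemeSegmenter.segment(text);
  let count = 0;
  let i = 0;
  while (i < text.length) {
    const seg = segments.containing(i);
    i = seg.index + seg.segment.length;
    count++;
  }
  return count;
}

const unitSamples: string[] = [
  "e\u0301", // NFD "é": 2 code units, 2 code points, 1 grapheme
  "\uD83D\uDE00", // "😀": 2 code units, 1 code point, 1 grapheme
  "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC66", // "👨‍👩‍👦": 8, 5, 1
  "\uD83C\uDDEF\uD83C\uDDF5", // "🇯🇵": 4 code units, 2 code points, 1 grapheme
];

for (let k = 0; k < unitSamples.length; k++) {
  const s = unitSamples[k];
  console.log(s.length, countCodePoints(s), countGraphemes(s));
}
```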

### API sketch

Pull UAX #29 grapheme segmentation from an external library (tentatively **moji**, see [Related Issues](#related-issues)) and apply the following edits:

```moonbit
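// editor/sync_editor_text.mbt
pub fn[T : Eq] SyncEditor::backspace(self) -> Bool {
  if self.cursor > 0 {
    let text = self.doc.text()
    // Step back to the start of the grapheme that ends at the cursor.
    let new_cursor = @moji_segment.prev_grapheme_boundary(text[:], self.cursor)
    // Number of UTF-16 code units that grapheme occupies.
    let delete_count = self.cursor - new_cursor
    self.cursor = new_cursor
    // Each delete shifts the remaining units left, so the position stays fixed.
    for _ in 0..<delete_count {
      self.doc.delete(@text.Pos::at(self.cursor))
    }
    ...
  }
}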
```
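
The intended behavior of the backspace sketch can be tried out on plain strings. The TypeScript version below is a sketch under stated assumptions: `Intl.Segmenter` (ECMA-402) stands in for moji, the document is a plain string rather than a CRDT, and the cursor is assumed to already sit on a grapheme boundary. `prevGraphemeBoundary` and `graphemeBackspace` are hypothetical names.

```typescript
// Assumption: Intl.Segmenter stands in for the planned moji segmenter.
// The cast keeps the sketch compiling under older TypeScript `lib` settings.
const SegmenterCtor = (Intl as any).Segmenter;
const segmenter = new SegmenterCtor("en", { granularity: "grapheme" });

// Start of the grapheme that contains the code unit just before `offset`.
function prevGraphemeBoundary(text: string, offset: number): number {
  if (offset <= 0) return 0;
  const clamped = Math.min(offset, text.length);
  return segmenter.segment(text).containing(clamped - 1).index;
}

// Delete the whole grapheme before the cursor and move the cursor with it.
function graphemeBackspace(
  text: string,
  cursor: number
): { text: string; cursor: number } {
  const newCursor = prevGraphemeBoundary(text, cursor);
  return {
    text: text.slice(0, newCursor) + text.slice(cursor),
    cursor: newCursor,
  };
}

const r1 = graphemeBackspace("a\uD83D\uDE00", 3); // "a😀"
console.log(r1.text === "a", r1.cursor); // true 1

const r2 = graphemeBackspace("cafe\u0301", 5); // NFD "café"
console.log(r2.text === "caf", r2.cursor); // true 3

const family = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC66"; // "👨‍👩‍👦"
const r3 = graphemeBackspace(family, family.length);
console.log(r3.text === "", r3.cursor); // true 0
```

Unlike the MoonBit sketch, this model removes the grapheme in a single slice; whether the CRDT should receive one ranged delete or several per-unit deletes is a separate design question that this example does not settle.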

Cursor clamping after insert:

```moonbit
pub fn[T : Eq] SyncEditor::insert(self, text : String) -> Unit raise {
  ...
  let raw_cursor = self.cursor + text.length()
  let doc_text = self.doc.text()
  self.cursor = @moji_segment.clamp_to_grapheme_boundary(doc_text[:], raw_cursor)
}
```
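
One possible reading of `clamp_to_grapheme_boundary` is "round down to the nearest boundary at or before the offset". The TypeScript sketch below implements that reading with `Intl.Segmenter` (ECMA-402) as a stand-in segmenter. Both the rounding direction and the function name `clampToGraphemeBoundary` are assumptions; the planned moji API does not yet specify them.

```typescript
// Assumption: Intl.Segmenter stands in for the planned moji segmenter.
// The cast keeps the sketch compiling under older TypeScript `lib` settings.
const BoundarySegmenterCtor = (Intl as any).Segmenter;
const boundarySegmenter = new BoundarySegmenterCtor("en", { granularity: "grapheme" });

// Round `offset` down to the start of the grapheme that contains it.
// Offsets outside the text are pinned to 0 or text.length.
function clampToGraphemeBoundary(text: string, offset: number): number {
  if (offset <= 0) return 0;
  if (offset >= text.length) return text.length;
  return boundarySegmenter.segment(text).containing(offset).index;
}

const emojiDoc = "\uD83D\uDE00b"; // "😀b"
console.log(clampToGraphemeBoundary(emojiDoc, 1)); // 0: offset 1 is inside the emoji
console.log(clampToGraphemeBoundary(emojiDoc, 2)); // 2: already on a boundary

// Inserting "e" directly in front of an existing U+0301 merges the two
// into one grapheme, so a raw cursor of 1 is no longer on a boundary.
const mergedDoc = "e\u0301x";
console.log(clampToGraphemeBoundary(mergedDoc, 1)); // 0 under round-down
```

The last case shows why the direction matters: rounding down places the cursor before the text the user just typed, while rounding up would place it after the merged grapheme. The choice should be pinned by a test once moji's behavior is known.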

New public methods for arrow-key navigation:

```moonbit
pub fn SyncEditor::move_cursor_left_grapheme(self) -> Unit
pub fn SyncEditor::move_cursor_right_grapheme(self) -> Unit
pub fn SyncEditor::move_cursor_left_word(self) -> Unit    // UAX #29 word boundary
pub fn SyncEditor::move_cursor_right_word(self) -> Unit
```

### Staged implementation

**Step 1: Add failing tests that expose the current behavior.**

Independent of the moji work, commit tests that document the broken behavior:

```moonbit
test "backspace after emoji should delete the whole emoji" {
  let editor = SyncEditor::new(...)
  editor.set_text("a😀")
  editor.move_cursor(3)   // after the emoji
  editor.backspace()
  inspect(editor.get_text(), content="a")  // currently fails: surrogate-pair input aborts in eg-walker (see Status)
  inspect(editor.get_cursor(), content="1")
}

test "cursor stays on a grapheme boundary after insert" {
  let editor = SyncEditor::new(...)
  editor.insert("😀")
  inspect(editor.get_cursor(), content="2")
}

test "text_diff does not split a surrogate pair" {
  let edit = compute_edit("😀A", "😁B")
  // Assert the Edit payload contains no lone surrogate
  ...
}
```

Marking them with an xfail convention makes progress visible as fixes land.

**Step 2: Depend on moji and fix the call sites.**

moji is an in-progress MoonBit UAX #29 implementation. Once its API stabilizes, vendor or depend on it.

Planned moji API surface:

```moonbit
@moji/segment::prev_grapheme_boundary(s : StringView, offset : Int) -> Int
@moji/segment::next_grapheme_boundary(s : StringView, offset : Int) -> Int
@moji/segment::grapheme_offsets(s : StringView) -> Iter[Int]
@moji/segment::graphemes(s : StringView) -> Iter[StringView]
@moji/segment::clamp_to_grapheme_boundary(s : StringView, offset : Int) -> Int
```

**Step 3: Document the unit of `position` in API_REFERENCE.md.**

State explicitly that `SyncEditor::move_cursor` and `get_cursor` operate on grapheme-aligned positions: the `Int` remains a UTF-16 code-unit offset internally, but after clamping it always sits on a grapheme boundary. Reserve the option of introducing a `GraphemeOffset` opaque type later.

**Step 4: Audit the ProseMirror bridge.**

ProseMirror uses UTF-16 code unit offsets internally. Confirm where the conversion needs to happen and whether the bridge needs its own adapter.

## Out of scope for this issue

- **CRDT-layer surrogate splitting.** Tracked separately — see [Related Issues](#related-issues).
- **Unicode normalization (NFC/NFD).** Distinct issue: `"café"` can be encoded two different ways and they are distinct CRDT strings today.
- **Bidi (UAX #9).** Not planned.
- **IME composition events.** Handled by the frontend (ProseMirror).
- **Full Unicode case mapping** (`to_lower` / `to_upper` for non-ASCII). Separate issue.

## Related issues

- **[eg-walker]** `Document::insert` aborts uncatchably on non-BMP input via `String::sub` mid-surrogate; concurrent merge can also split surrogate pairs across CRDT operations — [dowdiness/event-graph-walker#31](https://github.com/dowdiness/event-graph-walker/issues/31). Fix in [PR #35](https://github.com/dowdiness/event-graph-walker/pull/35) (per-codepoint insert + Op content validation). This is a CRDT-layer issue that cannot be fixed from Canopy alone; it sits below the editor layer covered by this issue.
- **[moji]** WIP MoonBit UAX #29 library that this issue depends on — `dowdiness/moji#???` (or tag/link once the repository exists).

## References

- [UAX #29: Unicode Text Segmentation](https://www.unicode.org/reports/tr29/)
- [Rust unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) — reference implementation
- [cometkim/unicode-segmenter](https://github.com/cometkim/unicode-segmenter) — JavaScript reference
- [moonbit-community/unicodewidth.mbt](https://github.com/moonbit-community/unicodewidth.mbt) — precedent for UCD table generation in MoonBit

## Checklist

- [ ] Add xfail non-ASCII boundary tests for `insert`, `backspace`, `set_text`, `text_diff`
- [x] Annotate `position` unit in `docs/development/API_REFERENCE.md` (#241)
- [ ] Switch `SyncEditor::backspace` to grapheme boundary once moji is available
- [ ] Clamp post-insert cursor in `SyncEditor::insert` to grapheme boundary
- [ ] Make `text_diff::find_common_prefix` / `find_common_suffix_after_prefix` grapheme-safe
- [ ] Add `move_cursor_left_grapheme` / `_right_grapheme`
- [x] Audit ProseMirror bridge position conversion (#242)

## For implementers

### Current status

- moji (UAX #29 library) is still in progress and has no usable release yet. Step 2 and related Checklist items are **blocked** until moji is released.
- Step 1 (add failing tests) can begin immediately without moji.
- Step 3 (API_REFERENCE.md annotation) can begin immediately.

### Suggested PR breakdown

- **PR 1**: Step 1 — add tests covering non-ASCII inputs in `editor/sync_editor_text_wbtest.mbt` and `editor/text_diff_test.mbt`. Do not modify implementation. The tests demonstrate the broken behavior, split into two styles per the [Status](#status-2026-05-09) section: `inspect` pinning for BMP combining marks; `panic` prefix for surrogate-pair inputs that abort in eg-walker. **→ Open as #239.**
- **PR 2**: Step 3 — update `docs/development/API_REFERENCE.md`.
- **PR 3+**: Step 2 and 4 — blocked on moji.

### Pre-work for PR 1

1. Confirm MoonBit's mechanism for marking tests as expected failures. If none exists, the test file should be commented with `// EXPECTED TO FAIL — see issue #<this>` and use `inspect` with the current (broken) output, with a comment explaining what the *correct* output should be. This makes the test both documenting and not-red.
2. Enumerate all call sites that need coverage. At minimum:
   - `SyncEditor::insert`
   - `SyncEditor::backspace`
   - `SyncEditor::delete`
   - `SyncEditor::move_cursor`
   - `text_diff::compute_edit`
3. Sample inputs to use across tests:
   - Plain BMP single: `"あ"`, `"中"`
   - Non-BMP (surrogate pair): `"😀"`, `"𠮷"`
   - Combining mark: `"e\u{0301}"` vs precomposed `"é"`
   - ZWJ sequence: `"👨‍👩‍👦"`
   - Regional indicator pair (flag): `"🇯🇵"`

### Constraints

- Do not modify CRDT internals (`event-graph-walker/` submodule). That is covered by [dowdiness/event-graph-walker#31](https://github.com/dowdiness/event-graph-walker/issues/31).
- Do not add dependencies on experimental packages without explicit approval.
- Run `moon check && moon test` before submitting; new tests expected to fail should be explicitly marked.



