Unicode safety: cursor, backspace, and text_diff are UTF-16 code-unit based #216

@dowdiness

TL;DR

Canopy's text editing layer implicitly assumes that "one character = one UTF-16 code unit." This holds for ASCII-only input, but breaks down for non-BMP characters (emoji, some CJK extension blocks), combining marks, and emoji ZWJ sequences. To make the editor usable for real-world text, we need grapheme boundary computation based on UAX #29.

This issue tracks the editor-layer fix. A related but independent issue exists at the CRDT layer — see Related Issues.
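To make the "one character = one UTF-16 code unit" assumption concrete, here is a minimal sketch in TypeScript, chosen only because its strings are also UTF-16 code-unit sequences. The helper `countCodePoints` is hypothetical and not part of Canopy. It shows that code-unit counts, code-point counts, and user-perceived characters diverge for the three input classes named above.

```typescript
// Count Unicode code points by pairing surrogates manually.
function countCodePoints(s: string): number {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    if (isHigh && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // consume the low surrogate of this pair
      }
    }
    count++;
  }
  return count;
}

const grinning = "\uD83D\uDE00"; // 😀 U+1F600, non-BMP
const eAcuteNfd = "e\u0301"; // é as e + combining acute (NFD)
const family = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC66"; // 👨‍👩‍👦 ZWJ sequence

// Each input is ONE user-perceived character (grapheme cluster).
console.log(grinning.length, countCodePoints(grinning)); // 2 1
console.log(eAcuteNfd.length, countCodePoints(eAcuteNfd)); // 2 2
console.log(family.length, countCodePoints(family)); // 8 5
```

Neither count matches the single grapheme the user sees, which is why code-point iteration alone is not enough and UAX #29 segmentation is needed.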

Status (2026-05-09)

PR #239 (Step 1) is open. The empirical probe added by that PR found the failure modes are bimodal, not uniform — refining the original framing in this issue:

| Category | Inputs | Failure mode | Test style |
| --- | --- | --- | --- |
| BMP combining marks | NFD "e\u{0301}" (decomposed é) | Wrong output, no abort: backspace deletes the combining mark only; compute_edit reports a 1-code-unit delete | inspect pinned to current value |
| Surrogate-pair inputs | Emoji, ZWJ family, regional indicator (flag) | Hard abort in eg-walker via String::sub mid-surrogate-pair | test "panic ..." prefix |

Implications for the rest of this issue:

  1. The "lone surrogate in the resulting document" scenario described in problem sites 1 and 2 below is not currently reachable from insert / backspace alone — the abort path short-circuits before the editor produces malformed state. The lone-surrogate-via-concurrent-merge scenario is a CRDT-layer problem tracked at eg-walker#31 (fix landing in eg-walker PR #35).
  2. The editor layer still needs grapheme-aware logic, because once the eg-walker abort is replaced with a typed rejection, malformed editor inputs will surface as TextError instead of crashes. Step 2 of this issue (moji-based fixes) prevents the editor from sending malformed content to the CRDT in the first place.
  3. text_diff::find_common_prefix (problem site 3) is a different surface: it operates on String arguments without going through eg-walker, so it does silently produce malformed Edit payloads today. It does not benefit from the eg-walker fix.
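The bimodal split in the Status table can be expressed as a simple rule: an input that contains any surrogate code unit lands in the abort category, and everything else in the pinned-output category. The sketch below is TypeScript (UTF-16 strings, like the editor layer); `containsSurrogate` and `testStyleFor` are hypothetical names for illustration, not functions from PR #239.

```typescript
type TestStyle = "panic" | "inspect";

// True if the string contains any UTF-16 surrogate code unit (U+D800..U+DFFF).
function containsSurrogate(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const u = s.charCodeAt(i);
    if (u >= 0xd800 && u <= 0xdfff) return true;
  }
  return false;
}

// Hypothetical helper: pick the Step 1 test style for a sample input,
// following the two categories in the Status table.
function testStyleFor(input: string): TestStyle {
  return containsSurrogate(input) ? "panic" : "inspect";
}

console.log(testStyleFor("e\u0301")); // BMP combining mark
console.log(testStyleFor("\uD83D\uDE00")); // 😀, surrogate pair
console.log(testStyleFor("\uD83C\uDDEF\uD83C\uDDF5")); // 🇯🇵, two surrogate pairs
```

This only mirrors the categories observed by the probe; it says nothing about whether a pinned inspect value is correct.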

Problem Sites

Each of the following is a concrete code citation, not speculation. All of them exist on current main.

1. SyncEditor::insert can land the cursor in the middle of a surrogate pair

editor/sync_editor_text.mbt:11:

pub fn[T : Eq] SyncEditor::insert(self, text : String) -> Unit raise {
  ...
  self.cursor = self.cursor + text.length()
}

text.length() counts UTF-16 code units. Inserting "😀" (U+1F600) advances the cursor by 2, which looks right on the surface, but the following scenarios break:

  • If CRDT merge or undo delivers partial state where the cursor ends up at offset 1 of a "😀"-containing document, the cursor sits between the high and low surrogate.
  • Any subsequent backspace (see problem 2) will then delete exactly one surrogate, leaving a lone surrogate in the document.

2. SyncEditor::backspace always deletes exactly one code unit

editor/sync_editor_text.mbt:37-59:

pub fn[T : Eq] SyncEditor::backspace(self) -> Bool {
  if self.cursor > 0 {
    self.cursor = self.cursor - 1
    try self.doc.delete(@text.Pos::at(self.cursor)) catch { ... }
    ...
  }
}

Pressing backspace after "😀" (two code units) removes only the low surrogate (U+DE00), leaving the high surrogate (U+D83D) as a lone code unit. The resulting String contains ill-formed UTF-16, which will misbehave in any downstream operation (serialization, display, diff).

Additional cases that break the same way:

  • "café" in NFD form (c, a, f, e, U+0301): one backspace removes only the combining acute U+0301. The visible result is cafe, with the accent gone but the e still present, whereas the user expected the entire é grapheme to go, leaving caf.
  • "👨‍👩‍👦" (eight code units, a ZWJ sequence): one backspace removes only the low surrogate of the final 👦, leaving a ZWJ followed by a lone high surrogate. The malformed sequence may render as "👨‍👩‍" with a dangling ZWJ or as garbage.
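The string-level effect of a one-code-unit backspace can be modeled outside the editor. The following TypeScript sketch is a standalone model under stated assumptions, not Canopy code: `naiveBackspace` and `isWellFormed` are hypothetical helpers, and the model ignores the eg-walker abort described in the Status section.

```typescript
function isHighSurrogate(u: number): boolean {
  return u >= 0xd800 && u <= 0xdbff;
}

function isLowSurrogate(u: number): boolean {
  return u >= 0xdc00 && u <= 0xdfff;
}

// True if every surrogate in the string is part of a high+low pair.
function isWellFormed(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const u = s.charCodeAt(i);
    if (isHighSurrogate(u)) {
      if (i + 1 >= s.length || !isLowSurrogate(s.charCodeAt(i + 1))) return false;
      i++; // skip the paired low surrogate
    } else if (isLowSurrogate(u)) {
      return false; // low surrogate without a preceding high surrogate
    }
  }
  return true;
}

// Model of the current behavior: always remove exactly one code unit.
function naiveBackspace(text: string, cursor: number): { text: string; cursor: number } {
  if (cursor <= 0) return { text, cursor };
  return { text: text.slice(0, cursor - 1) + text.slice(cursor), cursor: cursor - 1 };
}

const afterEmoji = naiveBackspace("a\uD83D\uDE00", 3); // "a😀", cursor after the emoji
console.log(isWellFormed(afterEmoji.text)); // false: a lone U+D83D remains

const afterNfd = naiveBackspace("cafe\u0301", 5); // NFD "café"
console.log(afterNfd.text); // "cafe": well-formed, but only the accent was removed
```

The two calls correspond to the two failure classes: ill-formed UTF-16 for surrogate pairs, and well-formed but unexpected output for combining marks.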

3. text_diff.find_common_prefix can split a surrogate pair

editor/text_diff.mbt:166:

fn find_common_prefix(s1 : String, s2 : String) -> Int {
  let mut i = 0
  ...
  while i < min_len && s1[i] == s2[i] {
    i = i + 1
  }
  i
}

s1[i] == s2[i] compares UTF-16 code units. Consider s1 = "😀A" and s2 = "😁B": they share the same high surrogate U+D83D, so this function returns i == 1. Combined with find_common_suffix_after_prefix, the generated Edit cuts the diff inside a surrogate pair, producing Edit payloads that contain lone surrogates.

This affects SyncEditor::set_text → compute_text_change, which is the path used by external bridges (e.g., ProseMirror), so external text coming in via the bridge can produce malformed edits.
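The "😀A" versus "😁B" case can be reproduced with a small TypeScript model (UTF-16 strings, like MoonBit's String). Both function names are hypothetical. The second variant is only a surrogate-safe stopgap, assuming well-formed input; it does not handle combining marks or ZWJ sequences, which still need UAX #29 segmentation.

```typescript
// Model of the current behavior: compare code unit by code unit.
function naiveCommonPrefix(a: string, b: string): number {
  const minLen = Math.min(a.length, b.length);
  let i = 0;
  while (i < minLen && a.charCodeAt(i) === b.charCodeAt(i)) {
    i++;
  }
  return i;
}

// Surrogate-safe variant: if the shared prefix ends right after a high
// surrogate, the boundary splits a pair, so back off by one code unit.
function surrogateSafeCommonPrefix(a: string, b: string): number {
  let i = naiveCommonPrefix(a, b);
  if (i > 0) {
    const last = a.charCodeAt(i - 1);
    if (last >= 0xd800 && last <= 0xdbff) {
      i--;
    }
  }
  return i;
}

const s1 = "\uD83D\uDE00A"; // "😀A"
const s2 = "\uD83D\uDE01B"; // "😁B"
console.log(naiveCommonPrefix(s1, s2)); // 1: cuts between U+D83D and the low surrogate
console.log(surrogateSafeCommonPrefix(s1, s2)); // 0
```

A matching adjustment would be needed on the suffix side, where the boundary must not fall immediately before a low surrogate.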

4. Test suite has no non-ASCII input

None of editor/sync_editor_*_test.mbt, editor/sync_editor_text_wbtest.mbt, or editor/text_diff_test.mbt contains emoji, CJK, combining marks, or ZWJ sequences. (ephemeral_wbtest.mbt has a single Japanese string for serialization testing only.) Consequently, none of the bugs above is caught by the existing tests; they are latent and will surface the first time a user types affected non-ASCII input.

5. API documentation does not specify the unit of position

docs/development/API_REFERENCE.md documents move_cursor(position : Int) and get_cursor() -> Int without stating what Int counts (code units, code points, or graphemes). This is an unspecified public API contract.

Why fix this now

  1. Unavoidable for real-world editor use. Japanese input, emoji, and combining marks are table stakes for a modern editor.
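  2. The bugs exist today, just hidden. The lambda calculus demo is ASCII-only, so nothing exposes them. Plain BMP text such as "こんにちは" happens to work because each character is a single code unit, but the moment a user types "🎉" or a decomposed "é", the editor aborts or produces wrong output (see Status).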
  3. CRDT merge and undo testing will multiply the surface area. Nailing the boundary semantics now prevents compound bugs from emerging when we add merge-heavy test cases.
  4. Public API stability. Bridges (ProseMirror, LSP-like adapters) need to know what unit position is. Fixing this early avoids breaking external consumers later.

Direction of the fix

Unit policy

Distinguish three units explicitly in documentation and type design:

| Unit | Used by | Type suggestion |
| --- | --- | --- |
| Code unit offset | CRDT internals, low-level edits | Int (unchanged) |
| Code point offset | Transitional / bridge layer | Not needed right now |
| Grapheme offset | User-facing ops (cursor, selection, backspace) | Same Int, semantically tagged; or a new GraphemeOffset opaque type |

API sketch

Pull UAX #29 grapheme segmentation from an external library (tentatively moji, see Related Issues) and apply the following edits:

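// editor/sync_editor_text.mbt
pub fn[T : Eq] SyncEditor::backspace(self) -> Bool {
  if self.cursor > 0 {
    let text = self.doc.text()
    // Code-unit offset where the grapheme cluster ending at the cursor begins.
    let new_cursor = @moji_segment.prev_grapheme_boundary(text[:], self.cursor)
    let delete_count = self.cursor - new_cursor
    self.cursor = new_cursor
    // Delete the whole cluster one code unit at a time. The position stays
    // fixed because each delete shifts the remaining units left.
    for _ in 0..<delete_count {
      // delete can raise, as in the current implementation above.
      try self.doc.delete(@text.Pos::at(self.cursor)) catch { ... }
    }
    ...
  }
}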
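The intended behavior of the sketch above can be checked against a standalone model. The TypeScript below injects the boundary function so the deletion logic is independent of the segmenter. `prevBoundaryStandIn` is a deliberately limited stand-in that knows only surrogate pairs and the U+0300..U+036F combining block; it is not UAX #29 and not moji's API, and all names here are hypothetical.

```typescript
type BoundaryFn = (text: string, offset: number) => number;

interface EditorState {
  text: string;
  cursor: number; // UTF-16 code-unit offset
}

function isHighSurrogate(u: number): boolean {
  return u >= 0xd800 && u <= 0xdbff;
}

function isLowSurrogate(u: number): boolean {
  return u >= 0xdc00 && u <= 0xdfff;
}

// Combining Diacritical Marks block only; real segmentation covers far more.
function isCombiningMark(u: number): boolean {
  return u >= 0x0300 && u <= 0x036f;
}

// Limited stand-in for a prev_grapheme_boundary-style function.
const prevBoundaryStandIn: BoundaryFn = (text, offset) => {
  if (offset <= 0) return 0;
  let i = offset - 1;
  // Walk back over trailing combining marks to reach the base character.
  while (i > 0 && isCombiningMark(text.charCodeAt(i))) i--;
  // If the base is the low half of a surrogate pair, include the high half.
  if (i > 0 && isLowSurrogate(text.charCodeAt(i)) && isHighSurrogate(text.charCodeAt(i - 1))) {
    i--;
  }
  return i;
};

// Delete everything between the previous boundary and the cursor.
function backspace(state: EditorState, prevBoundary: BoundaryFn): EditorState {
  if (state.cursor <= 0) return state;
  const start = prevBoundary(state.text, state.cursor);
  return {
    text: state.text.slice(0, start) + state.text.slice(state.cursor),
    cursor: start,
  };
}

const r1 = backspace({ text: "a\uD83D\uDE00", cursor: 3 }, prevBoundaryStandIn);
console.log(r1.text, r1.cursor); // a 1

const r2 = backspace({ text: "cafe\u0301", cursor: 5 }, prevBoundaryStandIn);
console.log(r2.text, r2.cursor); // caf 3
```

Injecting the boundary function also keeps the deletion logic testable before a real segmenter is available; ZWJ sequences and flags would need the full UAX #29 rules.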

Cursor clamping after insert:

pub fn[T : Eq] SyncEditor::insert(self, text : String) -> Unit raise {
  ...
  let raw_cursor = self.cursor + text.length()
  let doc_text = self.doc.text()
  self.cursor = @moji_segment.clamp_to_grapheme_boundary(doc_text[:], raw_cursor)
}

New public methods for arrow-key navigation:

pub fn SyncEditor::move_cursor_left_grapheme(self) -> Unit
pub fn SyncEditor::move_cursor_right_grapheme(self) -> Unit
pub fn SyncEditor::move_cursor_left_word(self) -> Unit    // UAX #29 word boundary
pub fn SyncEditor::move_cursor_right_word(self) -> Unit

Staged implementation

Step 1: Add failing tests that expose the current behavior.

Independent of the moji work, commit tests that document the broken behavior:

test "backspace after emoji should delete the whole emoji" {
  let editor = SyncEditor::new(...)
  editor.set_text("a😀")
  editor.move_cursor(3)   // after the emoji
  editor.backspace()
  inspect(editor.get_text(), content="a")  // currently fails
  inspect(editor.get_cursor(), content="1")
}

test "cursor stays on a grapheme boundary after insert" {
  let editor = SyncEditor::new(...)
  editor.insert("😀")
  inspect(editor.get_cursor(), content="2")
}

test "text_diff does not split a surrogate pair" {
  let edit = compute_edit("😀A", "😁B")
  // Assert the Edit payload contains no lone surrogate
  ...
}

Marking them with an xfail convention makes progress visible as fixes land.

Step 2: Depend on moji and fix the call sites.

moji is an in-progress MoonBit UAX #29 implementation. Once its API stabilizes, vendor or depend on it.

Planned moji API surface:

@moji/segment::prev_grapheme_boundary(s : StringView, offset : Int) -> Int
@moji/segment::next_grapheme_boundary(s : StringView, offset : Int) -> Int
@moji/segment::grapheme_offsets(s : StringView) -> Iter[Int]
@moji/segment::graphemes(s : StringView) -> Iter[StringView]
@moji/segment::clamp_to_grapheme_boundary(s : StringView, offset : Int) -> Int

Step 3: Document the unit of position in API_REFERENCE.md.

State explicitly that SyncEditor::move_cursor and get_cursor expose positions that always fall on grapheme boundaries, even though the integer itself is still a code-unit offset after clamping. Reserve the option of introducing a GraphemeOffset opaque type later.

Step 4: Audit the ProseMirror bridge.

ProseMirror uses UTF-16 code unit offsets internally. Confirm where the conversion needs to happen and whether the bridge needs its own adapter.

Out of scope for this issue

  • CRDT-layer surrogate splitting. Tracked separately — see Related Issues.
  • Unicode normalization (NFC/NFD). Distinct issue: "café" can be encoded two different ways and they are distinct CRDT strings today.
  • Bidi (UAX #9). Not planned.
  • IME composition events. Handled by the frontend (ProseMirror).
  • Full Unicode case mapping (to_lower / to_upper for non-ASCII). Separate issue.

Related issues

  • [eg-walker] Document::insert aborts uncatchably on non-BMP input via String::sub mid-surrogate; concurrent merge can also split surrogate pairs across CRDT operations — dowdiness/event-graph-walker#31. Fix in PR #35 (per-codepoint insert + Op content validation). This is a CRDT-layer issue that cannot be fixed from Canopy alone; it sits below the editor layer covered by this issue.
  • [moji] WIP MoonBit UAX #29 library that this issue depends on: dowdiness/moji#??? (or tag/link once the repository exists).

References

Checklist

  • Add xfail non-ASCII boundary tests for insert, backspace, set_text, text_diff
  • Annotate position unit in docs/development/API_REFERENCE.md (docs(api): annotate position units (#216 step 3) #241)
  • Switch SyncEditor::backspace to grapheme boundary once moji is available
  • Clamp post-insert cursor in SyncEditor::insert to grapheme boundary
  • Make text_diff::find_common_prefix / find_common_suffix_after_prefix grapheme-safe
  • Add move_cursor_left_grapheme / _right_grapheme
  • Audit ProseMirror bridge position conversion (docs(plans): #216 step 4 — bridge position-unit audit #242)

For implementers

Current status

Suggested PR breakdown

  • PR 1: Step 1 — add tests covering non-ASCII inputs in editor/sync_editor_text_wbtest.mbt and editor/text_diff_test.mbt. Do not modify implementation. The tests demonstrate the broken behavior, split into two styles per the Status section: inspect pinning for BMP combining marks; panic prefix for surrogate-pair inputs that abort in eg-walker. → Open as test(editor): pin current non-ASCII broken behavior (#216 step 1) #239.
  • PR 2: Step 3 — update docs/development/API_REFERENCE.md.
  • PR 3+: Step 2 and 4 — blocked on moji.

Pre-work for PR 1

  1. Confirm MoonBit's mechanism for marking tests as expected failures. If none exists, the test file should be commented with // EXPECTED TO FAIL — see issue #<this> and use inspect with the current (broken) output, with a comment explaining what the correct output should be. This makes the test both documenting and not-red.
  2. Enumerate all call sites that need coverage. At minimum:
    • SyncEditor::insert
    • SyncEditor::backspace
    • SyncEditor::delete
    • SyncEditor::move_cursor
    • text_diff::compute_edit
  3. Sample inputs to use across tests:
    • Plain BMP single: "あ", "中"
    • Non-BMP (surrogate pair): "😀", "𠮷"
    • Combining mark: "e\u{0301}" vs precomposed "é"
    • ZWJ sequence: "👨‍👩‍👦"
    • Regional indicator pair (flag): "🇯🇵"

Constraints

  • Do not modify CRDT internals (event-graph-walker/ submodule). That is covered by dowdiness/event-graph-walker#31.
  • Do not add dependencies on experimental packages without explicit approval.
  • Run moon check && moon test before submitting; new tests expected to fail should be explicitly marked.
