Canopy's text editing layer implicitly assumes that "one character = one UTF-16 code unit." This holds for ASCII-only input, but breaks down for non-BMP characters (emoji, some CJK extension blocks), combining marks, and emoji ZWJ sequences. To make the editor usable for real-world text, we need grapheme boundary computation based on UAX #29.
This issue tracks the editor-layer fix. A related but independent issue exists at the CRDT layer — see Related Issues.
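To make the three units concrete, here is a small sketch that counts the same string in code units, code points, and grapheme clusters. It is written in TypeScript only because JavaScript strings, like the strings discussed here, are indexed by UTF-16 code unit; it is not Canopy code. `Intl.Segmenter` stands in for a UAX #29 library, and `unitCounts` is a hypothetical helper name.

```typescript
// Illustration only: TypeScript, not Canopy's MoonBit code.
// Minimal local typing so the sketch does not depend on the ES2022 Intl typings.
type GraphemeSegmenter = {
  segment(s: string): Iterable<{ segment: string; index: number }>;
};
const segmenter: GraphemeSegmenter = new (Intl as any).Segmenter(undefined, {
  granularity: "grapheme",
});

// Hypothetical helper: measure one string in the three units.
function unitCounts(s: string): { codeUnits: number; codePoints: number; graphemes: number } {
  return {
    codeUnits: s.length,                                 // UTF-16 code units
    codePoints: Array.from(s).length,                    // string iteration yields code points
    graphemes: Array.from(segmenter.segment(s)).length,  // extended grapheme clusters
  };
}

const samples: string[] = [
  "a",
  "\u{1F600}",                                // one non-BMP code point (emoji)
  "e\u0301",                                  // NFD form: base letter + combining acute
  "\u{1F468}\u200D\u{1F469}\u200D\u{1F466}",  // ZWJ family sequence
  "\u{1F1EF}\u{1F1F5}",                       // two regional indicators (flag)
];
for (const s of samples) {
  console.log(JSON.stringify(s), unitCounts(s));
}
```

Only the first sample has the same count in all three units, which is the assumption the editor layer currently relies on.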
Status (2026-05-09)
PR #239 (Step 1) is open. The empirical probe added by that PR found the failure modes are bimodal, not uniform — refining the original framing in this issue:
| Category | Inputs | Failure mode | Test style |
|---|---|---|---|
| BMP combining marks | NFD "e\u{0301}" (decomposed é) | Wrong output, no abort: backspace deletes the combining mark only; compute_edit reports a 1-code-unit delete | inspect pinned to current value |
| Surrogate-pair inputs | Emoji, ZWJ family, regional indicator (flag) | Hard abort in eg-walker via String::sub mid-surrogate-pair | test "panic ..." prefix |
Implications for the rest of this issue:
The "lone surrogate in the resulting document" scenario described in problem sites 1 and 2 below is not currently reachable from insert / backspace alone — the abort path short-circuits before the editor produces malformed state. The lone-surrogate-via-concurrent-merge scenario is a CRDT-layer problem tracked at eg-walker#31 (fix landing in eg-walker PR #35).
The editor layer still needs grapheme-aware logic, because once the eg-walker abort is replaced with a typed rejection, malformed editor inputs will surface as TextError instead of crashes. Step 2 of this issue (moji-based fixes) prevents the editor from sending malformed content to the CRDT in the first place.
text_diff::find_common_prefix (problem site 3) is a different surface: it operates on String arguments without going through eg-walker, so it does silently produce malformed Edit payloads today. It does not benefit from the eg-walker fix.
Problem Sites
Each of the following is a concrete code citation, not speculation. All of them exist on current main.
1. SyncEditor::insert can land the cursor in the middle of a surrogate pair
editor/sync_editor_text.mbt:11: text.length() counts UTF-16 code units. Inserting "😀" (U+1F600) advances the cursor by 2, which looks right on the surface, but the following scenarios break:
If CRDT merge or undo delivers partial state where the cursor ends up at offset 1 of a "😀"-containing document, the cursor sits between the high and low surrogate.
Any subsequent backspace (see problem 2) will then delete exactly one surrogate, leaving a lone surrogate in the document.
2. SyncEditor::backspace always deletes exactly one code unit
editor/sync_editor_text.mbt:37-59: Pressing backspace after "😀" (two code units) removes only the low surrogate (U+DE00), leaving the high surrogate (U+D83D) as a lone code unit. The resulting String contains ill-formed UTF-16, which will misbehave in any downstream operation (serialization, display, diff).
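The difference between a one-code-unit delete and a grapheme-aware delete can be sketched as follows. This is TypeScript for illustration (JavaScript strings are also UTF-16), not the MoonBit implementation; `naiveBackspace` and `graphemeBackspace` are hypothetical names, `Intl.Segmenter` stands in for a UAX #29 library, and the grapheme-aware version assumes the cursor already sits on a grapheme boundary.

```typescript
// Illustration only: models the behaviour on a plain (text, cursor) pair.
type EditorState = { text: string; cursor: number };

// Minimal local typing so the sketch does not depend on the ES2022 Intl typings.
type GraphemeSegmenter = {
  segment(s: string): Iterable<{ segment: string; index: number }>;
};
const segmenter: GraphemeSegmenter = new (Intl as any).Segmenter(undefined, {
  granularity: "grapheme",
});

// Removes exactly one UTF-16 code unit before the cursor.
function naiveBackspace(text: string, cursor: number): EditorState {
  if (cursor <= 0) return { text, cursor: 0 };
  return { text: text.slice(0, cursor - 1) + text.slice(cursor), cursor: cursor - 1 };
}

// Removes the whole grapheme cluster that ends at the cursor.
// Assumes the cursor is on a grapheme boundary.
function graphemeBackspace(text: string, cursor: number): EditorState {
  if (cursor <= 0) return { text, cursor: 0 };
  let start = 0;
  for (const { index } of segmenter.segment(text)) {
    if (index >= cursor) break;
    start = index; // last cluster start strictly before the cursor
  }
  return { text: text.slice(0, start) + text.slice(cursor), cursor: start };
}

const before = "a\u{1F600}"; // "a" followed by one emoji, 3 code units
const naive = naiveBackspace(before, 3);
const aware = graphemeBackspace(before, 3);
console.log(JSON.stringify(naive.text)); // "a\ud83d": a lone high surrogate remains
console.log(JSON.stringify(aware.text)); // "a"
```

A surrogate-only fix would repair the emoji case but not combining marks or ZWJ sequences, which is why the sketch segments by grapheme cluster rather than by code point.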
Additional cases that break the same way:
"café" in NFD form (c, a, f, e, U+0301): one backspace removes only the combining acute U+0301. The visible result is cafe: the accent disappears but the base letter e remains, whereas the user expected the entire é grapheme to go.
"👨‍👩‍👦" (U+1F468 U+200D U+1F469 U+200D U+1F466, a ZWJ sequence of eight code units): one backspace removes only the low surrogate of the last pair, leaving a malformed ZWJ sequence that may render as "👨👩" with a dangling ZWJ or as garbage.
3. text_diff.find_common_prefix can split a surrogate pair
editor/text_diff.mbt:166: s1[i] == s2[i] compares UTF-16 code units. Consider s1 = "😀A" and s2 = "😁B": they share the same high surrogate U+D83D, so this function returns i == 1. Combined with find_common_suffix_after_prefix, the generated Edit cuts the diff inside a surrogate pair, producing Edit payloads that contain lone surrogates.
This affects SyncEditor::set_text → compute_text_change, which is the path used by external bridges (e.g., ProseMirror), so external text coming in via the bridge can produce malformed edits.
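The prefix split can be reproduced outside the editor. The sketch below is TypeScript for illustration, not the code in text_diff.mbt: `naiveCommonPrefix` models a code-unit comparison, and `surrogateSafeCommonPrefix` is a hypothetical variant that backs off by one unit when the boundary lands between a high and a low surrogate. It only guarantees well-formed UTF-16; it is not grapheme-safe, so two strings that differ only in a combining mark would still share the base letter as a prefix.

```typescript
// Illustration only: both function names are hypothetical.
function isHighSurrogate(u: number): boolean {
  return u >= 0xd800 && u <= 0xdbff;
}
function isLowSurrogate(u: number): boolean {
  return u >= 0xdc00 && u <= 0xdfff;
}

// Length of the common prefix, compared one UTF-16 code unit at a time.
function naiveCommonPrefix(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  let i = 0;
  while (i < n && a.charCodeAt(i) === b.charCodeAt(i)) i++;
  return i;
}

// Same scan, then step back one unit if the boundary would split a surrogate pair.
function surrogateSafeCommonPrefix(a: string, b: string): number {
  let i = naiveCommonPrefix(a, b);
  if (i > 0 && isHighSurrogate(a.charCodeAt(i - 1))) {
    const lowFollowsInA = i < a.length && isLowSurrogate(a.charCodeAt(i));
    const lowFollowsInB = i < b.length && isLowSurrogate(b.charCodeAt(i));
    if (lowFollowsInA || lowFollowsInB) i--;
  }
  return i;
}

const s1 = "\u{1F600}A"; // U+1F600 is encoded as D83D DE00
const s2 = "\u{1F601}B"; // U+1F601 is encoded as D83D DE01
console.log(naiveCommonPrefix(s1, s2));         // 1: the shared high surrogate
console.log(surrogateSafeCommonPrefix(s1, s2)); // 0
```

A common-suffix scan needs the mirror-image check: step forward one unit if the suffix would begin on a low surrogate whose high surrogate is excluded.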
4. Test suite has no non-ASCII input
None of editor/sync_editor_*_test.mbt, editor/sync_editor_text_wbtest.mbt, or editor/text_diff_test.mbt contains emoji, CJK, combining marks, or ZWJ sequences. (ephemeral_wbtest.mbt has a single Japanese string for serialization testing only.) Consequently, none of the bugs above is caught by the existing tests; they are latent and will surface the first time a user types non-ASCII.
5. API documentation does not specify the unit of position
docs/development/API_REFERENCE.md documents move_cursor(position : Int) and get_cursor() -> Int without stating what Int counts (code units, code points, or graphemes). This is an unspecified public API contract.
Why fix this now
Unavoidable for real-world editor use. Japanese input, emoji, and combining marks are table stakes for a modern editor.
The bugs exist today, just hidden. The lambda calculus demo is ASCII-only, so nothing exposes them. The moment a user types "こんにちは" or "🎉", the editor starts producing malformed state.
CRDT merge and undo testing will multiply the surface area. Nailing the boundary semantics now prevents compound bugs from emerging when we add merge-heavy test cases.
Public API stability. Bridges (ProseMirror, LSP-like adapters) need to know what unit position is. Fixing this early avoids breaking external consumers later.
Direction of the fix
Unit policy
Distinguish three units explicitly in documentation and type design:
| Unit | Used by | Type suggestion |
|---|---|---|
| Code unit offset | CRDT internals, low-level edits | Int (unchanged) |
| Code point offset | Transitional / bridge layer | Not needed right now |
| Grapheme offset | User-facing ops (cursor, selection, backspace) | Same Int, semantically tagged; or a new GraphemeOffset opaque type |
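One way to picture the "semantically tagged" option is a pair of branded offset types with explicit conversions between them. The sketch below uses TypeScript branded types as an analogy for an opaque type; only the name GraphemeOffset comes from the table, while `CodeUnitOffset`, `graphemeToCodeUnit`, and `codeUnitToGrapheme` are hypothetical. `Intl.Segmenter` stands in for a UAX #29 library, and offsets are assumed to be non-negative.

```typescript
// Illustration only: branded numbers approximate opaque offset types.
type CodeUnitOffset = number & { readonly __unit: "code-unit" };
type GraphemeOffset = number & { readonly __unit: "grapheme" };

// Minimal local typing so the sketch does not depend on the ES2022 Intl typings.
type GraphemeSegmenter = {
  segment(s: string): Iterable<{ segment: string; index: number }>;
};
const segmenter: GraphemeSegmenter = new (Intl as any).Segmenter(undefined, {
  granularity: "grapheme",
});

// Code-unit index at which the g-th grapheme cluster starts.
// Offsets at or past the end clamp to text.length.
function graphemeToCodeUnit(text: string, g: GraphemeOffset): CodeUnitOffset {
  let count = 0;
  for (const { index } of segmenter.segment(text)) {
    if (count === g) return index as CodeUnitOffset;
    count++;
  }
  return text.length as CodeUnitOffset;
}

// Number of clusters that start strictly before code-unit offset u.
// An offset inside a cluster therefore rounds up to the end of that cluster.
function codeUnitToGrapheme(text: string, u: CodeUnitOffset): GraphemeOffset {
  let count = 0;
  for (const { index } of segmenter.segment(text)) {
    if (index >= u) break;
    count++;
  }
  return count as GraphemeOffset;
}

const doc = "a\u{1F600}b"; // clusters start at code units 0, 1 and 3
console.log(graphemeToCodeUnit(doc, 2 as GraphemeOffset)); // 3
console.log(codeUnitToGrapheme(doc, 3 as CodeUnitOffset)); // 2
```

With distinct types, passing a grapheme offset where a code-unit offset is expected becomes a compile-time error instead of a silent off-by-one.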
API sketch
Pull UAX #29 grapheme segmentation from an external library (tentatively moji, see Related Issues) and apply the following edits:
New public methods for arrow-key navigation:

    pub fn SyncEditor::move_cursor_left_grapheme(self) -> Unit
    pub fn SyncEditor::move_cursor_right_grapheme(self) -> Unit
    pub fn SyncEditor::move_cursor_left_word(self) -> Unit // UAX #29 word boundary
    pub fn SyncEditor::move_cursor_right_word(self) -> Unit
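The intended behaviour of the two grapheme navigation methods can be sketched on a plain string and a code-unit cursor. This is TypeScript for illustration rather than the MoonBit methods themselves: the function names mirror the sketch above but operate on (text, cursor) instead of an editor, `graphemeBoundaries` is a hypothetical helper, and `Intl.Segmenter` stands in for moji.

```typescript
// Illustration only: free functions over (text, cursor), not SyncEditor methods.
// Minimal local typing so the sketch does not depend on the ES2022 Intl typings.
type GraphemeSegmenter = {
  segment(s: string): Iterable<{ segment: string; index: number }>;
};
const segmenter: GraphemeSegmenter = new (Intl as any).Segmenter(undefined, {
  granularity: "grapheme",
});

// All grapheme boundaries as code-unit offsets, including 0 and text.length.
function graphemeBoundaries(text: string): number[] {
  const boundaries: number[] = [];
  for (const { index } of segmenter.segment(text)) boundaries.push(index);
  boundaries.push(text.length);
  return boundaries;
}

// Nearest boundary strictly before the cursor (0 if there is none).
function moveCursorLeftGrapheme(text: string, cursor: number): number {
  let previous = 0;
  for (const b of graphemeBoundaries(text)) {
    if (b >= cursor) break;
    previous = b;
  }
  return previous;
}

// Nearest boundary strictly after the cursor (text.length if there is none).
function moveCursorRightGrapheme(text: string, cursor: number): number {
  for (const b of graphemeBoundaries(text)) {
    if (b > cursor) return b;
  }
  return text.length;
}

const line = "a\u{1F600}b"; // boundaries at 0, 1, 3, 4
console.log(moveCursorRightGrapheme(line, 1)); // 3: skips both surrogates
console.log(moveCursorLeftGrapheme(line, 3));  // 1
```

Because both functions search for the nearest boundary, a cursor that somehow sits inside a cluster moves back onto a boundary in a single keypress.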
Staged implementation
Step 1: Add failing tests that expose the current behavior.
Independent of the moji work, commit tests that document the broken behavior:
    test "backspace after emoji should delete the whole emoji" {
      let editor = SyncEditor::new(...)
      editor.set_text("a😀")
      editor.move_cursor(3) // after the emoji
      editor.backspace()
      inspect(editor.get_text(), content="a") // currently fails
      inspect(editor.get_cursor(), content="1")
    }

    test "cursor stays on a grapheme boundary after insert" {
      let editor = SyncEditor::new(...)
      editor.insert("😀")
      inspect(editor.get_cursor(), content="2")
    }

    test "text_diff does not split a surrogate pair" {
      let edit = compute_edit("😀A", "😁B")
      // Assert the Edit payload contains no lone surrogate
      ...
    }
Marking them with an xfail convention makes progress visible as fixes land.
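The "no lone surrogate" assertion in the third test needs a well-formedness check on the inserted text. A minimal sketch of such a check, in TypeScript for illustration (`firstLoneSurrogate` is a hypothetical helper name, and a MoonBit version would scan code units the same way):

```typescript
// Illustration only: returns the index of the first unpaired surrogate,
// or -1 if the string is well-formed UTF-16.
function firstLoneSurrogate(s: string): number {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh) {
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // valid pair: skip the low surrogate
        continue;
      }
      return i; // high surrogate with no low surrogate after it
    }
    if (isLow) return i; // low surrogate with no high surrogate before it
  }
  return -1;
}

console.log(firstLoneSurrogate("a\u{1F600}")); // -1: well-formed
console.log(firstLoneSurrogate("a\uD83D"));    // 1: lone high surrogate
```

Returning the index rather than a boolean makes a failing test report where the payload was cut.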
Step 2: Depend on moji and fix the call sites.
moji is an in-progress MoonBit UAX #29 implementation. Once its API stabilizes, vendor or depend on it.
Step 3: Document the unit of position in API_REFERENCE.md.
State explicitly that SyncEditor::move_cursor and get_cursor use grapheme offsets (externally), even though internally the integer is still a code-unit offset post-clamp. Reserve the option of introducing a GraphemeOffset opaque type later.
Step 4: Audit the ProseMirror bridge.
ProseMirror uses UTF-16 code unit offsets internally. Confirm where the conversion needs to happen and whether the bridge needs its own adapter.
Out of scope for this issue
CRDT-layer surrogate splitting. Tracked separately — see Related Issues.
Unicode normalization (NFC/NFD). Distinct issue: "café" can be encoded two different ways and they are distinct CRDT strings today.
IME composition events. Handled by the frontend (ProseMirror).
Full Unicode case mapping (to_lower / to_upper for non-ASCII). Separate issue.
Related issues
[eg-walker] Document::insert aborts uncatchably on non-BMP input via String::sub mid-surrogate; concurrent merge can also split surrogate pairs across CRDT operations: dowdiness/event-graph-walker#31. Fix in PR #35 (per-codepoint insert + Op content validation). This is a CRDT-layer issue that cannot be fixed from Canopy alone; it sits below the editor layer covered by this issue.
dowdiness/moji#??? (or tag/link once the repository exists).
Checklist
Add failing tests for insert, backspace, set_text, text_diff
Document the position unit in docs/development/API_REFERENCE.md (docs(api): annotate position units (#216 step 3) #241)
Snap SyncEditor::backspace to grapheme boundary once moji is available
Clamp SyncEditor::insert to grapheme boundary
Make text_diff::find_common_prefix / find_common_suffix_after_prefix grapheme-safe
Add move_cursor_left_grapheme / _right_grapheme
For implementers
Current status
Step 1 (add failing tests) can begin immediately without moji.
Step 3 (API_REFERENCE.md annotation) can begin immediately.
Suggested PR breakdown
PR 1 (Step 1): add tests covering non-ASCII inputs in editor/sync_editor_text_wbtest.mbt and editor/text_diff_test.mbt. Do not modify implementation. The tests demonstrate the broken behavior, split into two styles per the Status section: inspect pinning for BMP combining marks; panic prefix for surrogate-pair inputs that abort in eg-walker. → Open as test(editor): pin current non-ASCII broken behavior (#216 step 1) #239.
A separate PR (Step 3): annotate the position unit in docs/development/API_REFERENCE.md.
Pre-work for PR 1
Confirm MoonBit's mechanism for marking tests as expected failures. If none exists, the test file should be commented with // EXPECTED TO FAIL — see issue #<this> and use inspect with the current (broken) output, with a comment explaining what the correct output should be. This keeps the tests self-documenting without turning the suite red.
Enumerate all call sites that need coverage. At minimum:
SyncEditor::insert
SyncEditor::backspace
SyncEditor::delete
SyncEditor::move_cursor
text_diff::compute_edit
Enumerate the inputs to cover. At minimum:
"あ", "中" (BMP, non-ASCII)
"😀", "𠮷" (non-BMP, encoded as surrogate pairs)
"e\u{0301}" vs precomposed "é" (combining mark)
"👨‍👩‍👦" (ZWJ sequence)
"🇯🇵" (regional indicator pair)
Constraints
Do not modify eg-walker (the event-graph-walker/ submodule). That is covered by dowdiness/event-graph-walker#31.
Run moon check && moon test before submitting; new tests expected to fail should be explicitly marked.