Goal
Add a new in-repo MoonBit library at lib/moji/ that implements UAX #29 grapheme-cluster and word-segmentation queries. moji unblocks the (moji-blocked) items in TODO.md §16 Unicode Text Correctness and is canopy's path to closing the bimodal non-ASCII failure surface tracked in #216.
Context
#216 step 1 pinned the failure modes (#239), step 3 documented the public position-units contract (#241), and step 4 audited the bridges (#242). Step 2 is the actual fix, which requires UAX #29 awareness in the editor layer. We don't have that today — String[Int] indexing is UTF-16, not grapheme.
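The API spec and the full per-call-site derivation live at:
- docs/plans/2026-05-10-moji-api-spec.md: the API specification (clean cut)
- docs/plans/2026-05-10-moji-api-derivation.md: full derivation with audit trail across review passes
Both docs were derived from the call sites in editor/sync_editor_text.mbt, editor/text_diff.mbt, lang/markdown/edits/compute_markdown_edit.mbt, and examples/ideal/web/src/bridge.ts.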
Initial location: lib/moji/
moji starts as a canopy-internal workspace member (alongside lib/text-change, lib/btree, lib/zipper). Path-deps from canopy editor packages, valtio, and loom will pick it up via the same mechanism lib/text-change uses.
If moji proves general-purpose enough to extract to a standalone repo later, the workspace-member shape makes the move mechanical (move directory + update path-deps).
API surface (preferred shape)
Four required functions:
pub fn prev_grapheme_boundary(text : String, pos : Int) -> Int // at-or-before
pub fn next_grapheme_boundary(text : String, pos : Int) -> Int // at-or-after
pub fn prev_word_boundary(text : String, pos : Int) -> Int // at-or-before
pub fn next_word_boundary(text : String, pos : Int) -> Int // at-or-after
All pos arguments are UTF-16 code-unit offsets (matching MoonBit's String[Int] indexing). The functions MUST accept positions inside surrogate pairs and inside multi-codepoint clusters without aborting — canopy uses the pos ± 1 pattern for strict-step cursor movement.
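The pos ± 1 pattern is the reason mid-cluster offsets have to be tolerated. A minimal sketch, assuming the at-or-before / at-or-after semantics noted on the signatures above. To stay self-contained it does not call moji: prev_in and next_in are hypothetical stand-ins that scan a hand-written boundary array, and cursor_left / cursor_right are illustrative names, not canopy's actual functions. The example array [0, 1, 5, 6] is the grapheme-boundary set of "a" + U+1F44D U+1F3FD + "b" in UTF-16 code units.

```moonbit
// Hypothetical stand-ins for moji's point queries. They scan a precomputed,
// ascending, non-empty boundary array instead of segmenting real text.

// Largest boundary <= pos, or 0 when pos is before every boundary.
fn prev_in(boundaries : Array[Int], pos : Int) -> Int {
  let mut best = 0
  let mut i = 0
  while i < boundaries.length() {
    if boundaries[i] <= pos {
      best = boundaries[i]
    }
    i = i + 1
  }
  best
}

// Smallest boundary >= pos, clamped to the last boundary.
fn next_in(boundaries : Array[Int], pos : Int) -> Int {
  let mut i = 0
  while i < boundaries.length() {
    if boundaries[i] >= pos {
      return boundaries[i]
    }
    i = i + 1
  }
  boundaries[boundaries.length() - 1]
}

// Strict-step cursor movement: move one code unit off the current boundary,
// then snap. The intermediate offset can land inside a surrogate pair or
// inside a multi-codepoint cluster, so the query must not abort there.
fn cursor_left(boundaries : Array[Int], pos : Int) -> Int {
  prev_in(boundaries, pos - 1)
}

fn cursor_right(boundaries : Array[Int], pos : Int) -> Int {
  next_in(boundaries, pos + 1)
}
```

With boundaries [0, 1, 5, 6], cursor_left(boundaries, 5) queries offset 4 (the low surrogate of U+1F3FD) and lands on 1; cursor_right(boundaries, 1) queries offset 2 (the low surrogate of U+1F44D) and lands on 5.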
Two ergonomics helpers (required if cheap):
pub fn is_grapheme_boundary(text : String, pos : Int) -> Bool
pub fn grapheme_clusters(text : String) -> Iter[(Int, Int)]
Optional:
pub fn grapheme_clusters_reverse(text : String) -> Iter[(Int, Int)]
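For orientation, both helpers can be read as views over the same boundary set. The sketch below is illustrative only: it assumes each (Int, Int) pair is a half-open (start, end) range in UTF-16 code units, which this issue does not pin down, and it works on a plain boundary array rather than on moji itself. The function names are hypothetical.

```moonbit
// Assumption: each pair is a half-open (start, end) range in UTF-16 code
// units. `boundaries` is ascending, starts at 0 and ends at the text length.
fn clusters_from_boundaries(boundaries : Array[Int]) -> Array[(Int, Int)] {
  let clusters : Array[(Int, Int)] = []
  let mut i = 0
  while i + 1 < boundaries.length() {
    // Consecutive boundaries delimit exactly one cluster.
    clusters.push((boundaries[i], boundaries[i + 1]))
    i = i + 1
  }
  clusters
}

// Boundary test as plain membership in the boundary array.
fn is_boundary_in(boundaries : Array[Int], pos : Int) -> Bool {
  let mut i = 0
  while i < boundaries.length() {
    if boundaries[i] == pos {
      return true
    }
    i = i + 1
  }
  false
}
```

For the boundary array [0, 1, 5, 6] this yields the ranges (0, 1), (1, 5), (5, 6); offset 4 is not a boundary, offset 5 is.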
Fallback shape
If accepting arbitrary UTF-16 offsets is too invasive (a pure UAX #29 segmenter typically operates on well-formed codepoint streams), an alternative shape works:
pub fn grapheme_boundaries(text : String) -> Array[Int]
pub fn word_boundaries(text : String) -> Array[Int]
Canopy then binary-searches the boundary array for point queries. See spec §2.1 for the trade.
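The canopy-side lookup over the fallback shape could look roughly like this. It is a sketch under assumptions: the array is taken to be ascending and non-empty, with 0 first and the text length last (the spec may define it differently), and upper_bound, boundary_at_or_before and boundary_at_or_after are hypothetical names, not existing canopy or moji functions.

```moonbit
// Index of the first element strictly greater than `pos`.
fn upper_bound(boundaries : Array[Int], pos : Int) -> Int {
  let mut lo = 0
  let mut hi = boundaries.length()
  while lo < hi {
    let mid = lo + (hi - lo) / 2
    if boundaries[mid] <= pos {
      lo = mid + 1
    } else {
      hi = mid
    }
  }
  lo
}

// Largest boundary <= pos; clamps to the first boundary when pos is before it.
fn boundary_at_or_before(boundaries : Array[Int], pos : Int) -> Int {
  let i = upper_bound(boundaries, pos)
  if i == 0 {
    boundaries[0]
  } else {
    boundaries[i - 1]
  }
}

// Smallest boundary >= pos; clamps to the last boundary when pos is past it.
fn boundary_at_or_after(boundaries : Array[Int], pos : Int) -> Int {
  // For integers, the first element >= pos is the first element > pos - 1.
  let i = upper_bound(boundaries, pos - 1)
  if i == boundaries.length() {
    boundaries[boundaries.length() - 1]
  } else {
    boundaries[i]
  }
}
```

Each point query is O(log n) in the number of boundaries, but the array has to be rebuilt (or patched) whenever the text changes, which is the cost side of this shape.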
Out of scope
Explicitly NOT in moji's scope: normalization, bidi, casing, display width, line/sentence boundaries, script detection, collation, well-formedness validation, JS bindings, CRDT position conversion. See spec §3.
Test vectors
Canopy will exercise the integration with fixtures covering ASCII, BMP combining marks, non-BMP single, surrogate pairs, ZWJ family, skin-tone modifiers, variation selectors, RGI ZWJ profession emoji, regional indicators, Hangul (jamo + precomposed), Indic virama conjuncts, CRLF, and degenerate edge cases (empty string, leading combining mark). Full table in spec §4.1.
moji's own test suite presumably exercises UAX #29 reference data (GraphemeBreakTest.txt, WordBreakTest.txt); the fixtures above are for the canopy-side integration check.
Open questions for moji
These don't block scaffolding the library but should be settled before canopy integrates:
- Offset-tolerance contract — preferred shape (point queries that accept any UTF-16 offset) vs. fallback (boundary arrays + canopy-side binary search). Implementation cost should drive the choice.
- Unicode version — pin a version moji tracks; canopy will pin the same in tests (especially for Indic conjuncts, which moved between versions).
- Grapheme reverse iterator (grapheme_clusters_reverse, the seventh function above): provide it if it is natural in the implementation; otherwise canopy materialises a forward array.
Open questions canopy will resolve at integration
These are canopy-side decisions surfaced for awareness — moji's API is unaffected:
- Cursor unit-storage (UTF-16 vs item-space vs grapheme-ordinal — see spec §6.1)
- Snap-direction defaults per call site (spec §6.2 / direction-ambiguity summary)
- Word-navigation policy on top of raw UAX boundaries (spec §6.3)
Suggested phasing
1. Scaffold lib/moji/ with moon.pkg.json, moon.mod.json, workspace registration.
2. Implement grapheme-boundary segmentation + tests against GraphemeBreakTest.txt (or a canonical subset).
3. Implement word-boundary segmentation + tests.
4. Choose contract shape (preferred vs fallback per the open question above) and implement the public API.
5. Canopy integration PR wires moji into editor/sync_editor_text.mbt, editor/text_diff.mbt, lib/text-change/text_change.mbt, lang/markdown/edits/compute_markdown_edit.mbt, and the FFI variant naming. Closes the (moji-blocked) items in TODO §16.
Steps 1–4 are moji-internal. Step 5 is the canopy-side integration that depends on this issue.
References