Add lib/moji — UAX #29 grapheme/word segmentation library (unblocks #216 step 2) #250

@dowdiness

Goal

Add a new in-repo MoonBit library at lib/moji/ that implements UAX #29 grapheme-cluster and word-segmentation queries. moji unblocks the (moji-blocked) items in TODO.md §16 Unicode Text Correctness and is canopy's path to closing the bimodal non-ASCII failure surface tracked in #216.

Context

#216 step 1 pinned the failure modes (#239), step 3 documented the public position-units contract (#241), and step 4 audited the bridges (#242). Step 2 is the actual fix, which requires UAX #29 awareness in the editor layer. We don't have that today: String[Int] indexing addresses UTF-16 code units, not grapheme clusters.

The full per-call-site derivation lives at:

Both docs were derived from the call sites in editor/sync_editor_text.mbt, editor/text_diff.mbt, lang/markdown/edits/compute_markdown_edit.mbt, and examples/ideal/web/src/bridge.ts.

Initial location: lib/moji/

moji starts as a canopy-internal workspace member (alongside lib/text-change, lib/btree, lib/zipper). Path-deps from canopy editor packages, valtio, and loom will pick it up via the same mechanism lib/text-change uses.

If moji proves general-purpose enough to extract to a standalone repo later, the workspace-member shape makes the move mechanical (move directory + update path-deps).

API surface (preferred shape)

Four required functions:

pub fn prev_grapheme_boundary(text : String, pos : Int) -> Int  // at-or-before
pub fn next_grapheme_boundary(text : String, pos : Int) -> Int  // at-or-after
pub fn prev_word_boundary(text : String, pos : Int) -> Int       // at-or-before
pub fn next_word_boundary(text : String, pos : Int) -> Int       // at-or-after

All pos arguments are UTF-16 code-unit offsets (matching MoonBit's String[Int] indexing). The functions MUST accept positions inside surrogate pairs and inside multi-codepoint clusters without aborting — canopy uses the pos ± 1 pattern for strict-step cursor movement.
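The offset-tolerance requirement can be illustrated with a minimal sketch. It is written in TypeScript only because JavaScript strings use the same UTF-16 code-unit indexing; moji itself would be MoonBit. The function names are hypothetical, and the sketch handles code-point boundaries (surrogate pairs) only, not UAX #29 clusters. It shows the shape of the contract: any offset is accepted, out-of-range input is clamped, and strict-step movement is built from pos ± 1.

```typescript
// Hypothetical stand-in for moji's point queries. It snaps to code-point
// boundaries only; a real implementation would snap to UAX #29 boundaries.

function isHighSurrogate(unit: number): boolean {
  return unit >= 0xd800 && unit <= 0xdbff;
}

function isLowSurrogate(unit: number): boolean {
  return unit >= 0xdc00 && unit <= 0xdfff;
}

// True when `pos` sits between the two halves of a well-formed surrogate pair.
function splitsSurrogatePair(text: string, pos: number): boolean {
  return (
    pos > 0 &&
    pos < text.length &&
    isHighSurrogate(text.charCodeAt(pos - 1)) &&
    isLowSurrogate(text.charCodeAt(pos))
  );
}

// Clamp any integer offset into [0, text.length] instead of aborting.
function clampOffset(text: string, pos: number): number {
  return Math.min(Math.max(pos, 0), text.length);
}

// At-or-before: returns `pos` itself when it is already a boundary.
function prevCodePointBoundary(text: string, pos: number): number {
  const p = clampOffset(text, pos);
  return splitsSurrogatePair(text, p) ? p - 1 : p;
}

// At-or-after: returns `pos` itself when it is already a boundary.
function nextCodePointBoundary(text: string, pos: number): number {
  const p = clampOffset(text, pos);
  return splitsSurrogatePair(text, p) ? p + 1 : p;
}

// Strict-step cursor movement via the pos ± 1 pattern. The intermediate
// offset may land inside a pair, which is why tolerance is required.
function stepLeft(text: string, pos: number): number {
  return prevCodePointBoundary(text, pos - 1);
}

function stepRight(text: string, pos: number): number {
  return nextCodePointBoundary(text, pos + 1);
}
```

With the text "a\uD83D\uDE00b" (a, one non-BMP character, b), stepping right from offset 1 queries offset 2, which is inside the surrogate pair, and the at-or-after query resolves it to 3.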

Two ergonomic helpers (required if cheap):

pub fn is_grapheme_boundary(text : String, pos : Int) -> Bool
pub fn grapheme_clusters(text : String) -> Iter[(Int, Int)]

Optional:

pub fn grapheme_clusters_reverse(text : String) -> Iter[(Int, Int)]

Fallback shape

If accepting arbitrary UTF-16 offsets is too invasive (a pure UAX #29 segmenter typically operates on well-formed codepoint streams), an alternative shape works:

pub fn grapheme_boundaries(text : String) -> Array[Int]
pub fn word_boundaries(text : String) -> Array[Int]

Canopy then binary-searches the boundary array for point queries. See spec §2.1 for the trade-off.
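A minimal sketch of the canopy-side lookup under the fallback shape, in TypeScript for illustration (JavaScript strings share the UTF-16 offsets; the real code would be MoonBit). It assumes the boundary array is sorted ascending and, for non-empty text, contains 0 and the text length. All function names are hypothetical.

```typescript
// First index i with boundaries[i] >= pos, or boundaries.length if none.
function lowerBound(boundaries: number[], pos: number): number {
  let lo = 0;
  let hi = boundaries.length; // search window is [lo, hi)
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (boundaries[mid] < pos) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return lo;
}

// Exact-match test, the array-based analogue of is_grapheme_boundary.
function isBoundaryIn(boundaries: number[], pos: number): boolean {
  const i = lowerBound(boundaries, pos);
  return i < boundaries.length && boundaries[i] === pos;
}

// At-or-before query. Falls back to 0 when no boundary precedes `pos`.
function prevBoundaryIn(boundaries: number[], pos: number): number {
  const i = lowerBound(boundaries, pos);
  if (i < boundaries.length && boundaries[i] === pos) return pos;
  return i > 0 ? boundaries[i - 1] : 0;
}

// At-or-after query. Falls back to `textLength` when no boundary follows `pos`.
function nextBoundaryIn(
  boundaries: number[],
  pos: number,
  textLength: number
): number {
  const i = lowerBound(boundaries, pos);
  return i < boundaries.length ? boundaries[i] : textLength;
}
```

For "ae\u0301\uD83D\uDE00" (a, e with a combining acute, one emoji) the grapheme boundaries are [0, 1, 3, 5], so a query at offset 2 resolves to 1 going backward and 3 going forward. Each lookup is O(log n), but the array has to be rebuilt or patched whenever the text changes.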

Out of scope

Explicitly NOT in moji's scope: normalization, bidi, casing, display width, line/sentence boundaries, script detection, collation, well-formedness validation, JS bindings, CRDT position conversion. See spec §3.

Test vectors

Canopy will exercise the integration with fixtures covering ASCII, BMP combining marks, non-BMP single, surrogate pairs, ZWJ family, skin-tone modifiers, variation selectors, RGI ZWJ profession emoji, regional indicators, Hangul (jamo + precomposed), Indic virama conjuncts, CRLF, and degenerate edge cases (empty string, leading combining mark). Full table in spec §4.1.

moji's own test suite is expected to exercise the UAX #29 reference data (GraphemeBreakTest.txt, WordBreakTest.txt); the fixtures above are for the canopy-side integration check.
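As a rough sketch of how the reference data could be turned into fixtures: each data line in GraphemeBreakTest.txt and WordBreakTest.txt alternates break markers (÷ for a break, × for no break) with hexadecimal code points, followed by a # comment. The TypeScript below, with hypothetical names, converts one such line into a string plus the expected boundaries expressed as UTF-16 offsets, which is the unit the moji API uses.

```typescript
interface BreakTestCase {
  text: string; // the test string, encoded as UTF-16
  boundaries: number[]; // expected break positions as UTF-16 offsets
}

const BREAK = "\u00F7"; // the ÷ marker: a boundary is expected here
const NO_BREAK = "\u00D7"; // the × marker: no boundary here

// Encode one code point as one or two UTF-16 code units.
function encodeUtf16(codePoint: number): string {
  if (codePoint < 0x10000) return String.fromCharCode(codePoint);
  const v = codePoint - 0x10000;
  return String.fromCharCode(0xd800 + (v >> 10), 0xdc00 + (v & 0x3ff));
}

// Parse one line of a UAX #29 break test file.
// Returns null for blank or comment-only lines.
function parseBreakTestLine(line: string): BreakTestCase | null {
  const data = line.split("#")[0].trim();
  if (data === "") return null;

  let text = "";
  const boundaries: number[] = [];
  for (const token of data.split(/\s+/)) {
    if (token === BREAK) {
      boundaries.push(text.length); // offset in UTF-16 code units
    } else if (token === NO_BREAK) {
      // No boundary at this position; nothing to record.
    } else {
      const codePoint = parseInt(token, 16);
      if (isNaN(codePoint)) throw new Error("bad token: " + token);
      text += encodeUtf16(codePoint);
    }
  }
  return { text, boundaries };
}
```

Because the test files list code points, a non-BMP entry such as 1F1E6 contributes two code units, so the offsets recorded here differ from code-point indices for any line containing astral characters.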

Open questions for moji

These don't block scaffolding the library but should be settled before canopy integrates:

  1. Offset-tolerance contract — preferred shape (point queries that accept any UTF-16 offset) vs. fallback (boundary arrays + canopy-side binary search). Implementation cost should drive the choice.
  2. Unicode version — pin a version moji tracks; canopy will pin the same in tests (especially for Indic conjuncts, whose clustering rules changed between Unicode versions).
  3. Grapheme reverse iterator (grapheme_clusters_reverse, the seventh function listed above) — provide if natural in the implementation, otherwise canopy materialises a forward array.
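If the reverse iterator is not provided, the canopy-side workaround from question 3 might look like the following TypeScript sketch. It assumes the Iter[(Int, Int)] pairs are half-open (start, end) UTF-16 ranges, which the signatures above do not state, and it starts from a materialised forward boundary array. The names are hypothetical.

```typescript
type ClusterRange = [number, number]; // assumed half-open: [start, end)

// Forward cluster ranges from a sorted boundary array such as [0, 1, 3, 5].
function clusterRanges(boundaries: number[]): ClusterRange[] {
  const out: ClusterRange[] = [];
  for (let i = 0; i + 1 < boundaries.length; i++) {
    out.push([boundaries[i], boundaries[i + 1]]);
  }
  return out;
}

// The same ranges, last cluster first, walking the array from the end.
function clusterRangesReverse(boundaries: number[]): ClusterRange[] {
  const out: ClusterRange[] = [];
  for (let i = boundaries.length - 1; i > 0; i--) {
    out.push([boundaries[i - 1], boundaries[i]]);
  }
  return out;
}
```

This costs a full forward pass and O(n) memory before the first reverse result is available, which is the price of not having a native reverse iterator.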

Open questions canopy will resolve at integration

These are canopy-side decisions surfaced for awareness — moji's API is unaffected:

  • Cursor unit-storage (UTF-16 vs item-space vs grapheme-ordinal — see spec §6.1)
  • Snap-direction defaults per call site (spec §6.2 / direction-ambiguity summary)
  • Word-navigation policy on top of raw UAX boundaries (spec §6.3)

Suggested phasing

  1. Scaffold lib/moji/ with moon.pkg.json, moon.mod.json, workspace registration.
  2. Implement grapheme-boundary segmentation + tests against GraphemeBreakTest.txt (or a canonical subset).
  3. Implement word-boundary segmentation + tests.
  4. Choose contract shape (preferred vs fallback per the open question above) and implement the public API.
  5. Canopy integration PR wires moji into editor/sync_editor_text.mbt, editor/text_diff.mbt, lib/text-change/text_change.mbt, lang/markdown/edits/compute_markdown_edit.mbt, and the FFI variant naming. Closes the (moji-blocked) items in TODO §16.

Steps 1–4 are moji-internal. Step 5 is the canopy-side integration that depends on this issue.

References
