
Specify behavior for APIs that take natural-language input as opposed to producing natural-language output #13

Open
hsivonen opened this issue Sep 11, 2023 · 6 comments


@hsivonen
Member

hsivonen commented Sep 11, 2023

Intl.Collator and Intl.Segmenter differ from the other APIs by taking natural-language input as opposed to producing natural-language output. Implementations already have a root behavior for these, but ECMA-402 currently lacks an explicit way to request it.

At least for the collator and the grapheme mode of the segmenter, it's clear what the generic behavior should be.

Collator

For the collator, it should be the CLDR root collation with unified ideographs ordered either by block and then by code point (ICU4C/ICU4X implicithan data, used by Chrome and Firefox) or by radical-stroke (ICU4C/ICU4X unihan data, used by Safari).

Note that the two options are indistinguishable unless the comparison is decided by comparing two ideographs from different blocks. It's unfortunate to have two alternative behaviors, but I don't think it's realistic to expect browsers that internalize the binary size of the root collation to carry the larger full radical-stroke data, and I also don't think it's realistic to ask browsers that delegate to the system ICU to get the system ICU to switch to the in-principle-less-correct order.
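One can probe which of the two orders a given engine ships with a pair of ideographs from different blocks. This is a diagnostic sketch, not something to rely on; the sign of the result is exactly the implementation-dependent point discussed above:

```javascript
// U+9FA0 龠 sits near the end of the URO block in the BMP but is
// radical 214; U+20000 𠀀 is the first character of Extension B but
// is radical 1. Under the "implicithan" order (block, then code
// point), 龠 sorts first; under radical-stroke ("unihan") order,
// 𠀀 sorts first.
const collator = new Intl.Collator("und");
const result = collator.compare("\u9FA0", "\u{20000}");
console.log(result < 0 ? "implicithan-style order" : "unihan-style order");
```

The two characters always compare unequal; only the sign varies between the two data variants.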

As for why the CLDR root collation instead of raw DUCET: existing implementations use a library (ICU4C), or might migrate to a library (ICU4X), that is built around the CLDR root collation and doesn't provide raw DUCET.
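A quick illustration of the difference between the root collation and a language-specific tailoring, assuming an ICU-based engine. `en` is used here as a stand-in because it currently carries no tailoring of its own (the caveat with that practice is discussed later in this thread):

```javascript
// In the CLDR root collation, "ä" sorts as a variant of "a";
// Swedish tailors "ä" to sort after "z".
const rootish = new Intl.Collator("en"); // en is untailored, so ≈ root
const swedish = new Intl.Collator("sv");

console.log(rootish.compare("ä", "z")); // negative: ä sorts near a
console.log(swedish.compare("ä", "z")); // positive: ä sorts after z
```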

If the implementation chooses to carry the eor and emoji tailorings of the root, these should be reachable via the generic language tag: und-u-co-eor and und-u-co-emoji. (Rationale: it's bad to teach Web developers that e.g. en-u-co-emoji works when it only works incidentally, because the root collation happens to be valid for English. E.g. fi-u-co-emoji does not work, as implementations today do not support combining a language-specific tailoring with a language-independent root tailoring. As long as the emoji order isn't folded into the root itself (and, unfortunately, there exists non-Web-relevant ICU4C/ICU4J API surface that argues against such folding), it's better to teach that you get the emoji tailoring via und or zxx than via languages, such as en or fr, for which the root happens to be valid without tailoring.)

Grapheme segmenter

For the grapheme mode of the segmenter, it's pretty obvious that the generic behavior should be segmentation to UAX 29 extended grapheme clusters without tailorings.
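A sketch of what that untailored behavior looks like in practice; per UAX #29 extended grapheme clusters, a base character plus a combining mark is one cluster, and this should hold regardless of the requested locale:

```javascript
// "a" + U+0301 (combining acute) forms a single extended grapheme
// cluster, so the string segments into two clusters, not three
// code points.
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
const clusters = [...seg.segment("a\u0301b")].map(s => s.segment);
console.log(clusters); // ["á", "b"]
```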

Word/Sentence/Line segmenters

For the other segmenter modes, it seems to me that vaguely-specified behavior would be more useful and more implementable in the context of existing libraries than truly well-specified behavior. As I understand it, ICU4X (as its main mode) and ICU4C (as its catch-all mode) implement generic rules (which either already are, or could be, well specified) for languages that use spaces to separate words. For well-known languages that do not use spaces, they dispatch on script to either dictionary-based breaking (Chinese and Japanese combined into one dictionary) or machine-learning models (Khmer, Lao, Myanmar, and Thai), such that all of these are in effect in one segmenter object.

It seems to me that the useful way to specify the generic mode would be to say that the non-grapheme segmenters shall comply with UAX 29 and UAX 14, potentially with tailorings, but shall not enable tailorings that the implementation doesn't enable for all languages.

That is, e.g. the semicolon having sentence-ending question-mark semantics for Greek, or the Finnish/Swedish word-break suppression for an in-word colon, shall not be enabled in the generic mode, since they aren't enabled when requesting e.g. English. However, the not-well-specified behaviors that are dispatched on script, cater to languages that don't use spaces, and don't interfere with space-using languages would still be enabled. That is, the Chinese+Japanese dictionary and the learned models for Khmer, Lao, Myanmar, and Thai should not be turned off in the generic mode, even though their contents aren't given by a spec. (However, if the segmenter chooses to enable the Greek semicolon behavior by guessing from the script of the surrounding letters instead of deciding from the requested language, or chooses to enable Finnish/Swedish-compatible colon unbreakability for all languages, then the generic mode should retain those behaviors.)
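The script-based dispatch described above can be observed in ICU-based engines: even when English is the requested language, Japanese text still gets dictionary-based word breaking. A sketch (the exact Japanese breaks are dictionary-dependent, so only their concatenation is checked here):

```javascript
const seg = new Intl.Segmenter("en", { granularity: "word" });

// Space-separated text: the well-specified rules apply.
const words = [...seg.segment("hello world")]
  .filter(s => s.isWordLike)
  .map(s => s.segment);
console.log(words); // ["hello", "world"]

// Japanese text: dictionary-based breaking stays in effect even
// though "en" was requested. The specific segments depend on the
// dictionary, but they always concatenate back to the input.
const jaSegments = [...seg.segment("日本語の形態素")].map(s => s.segment);
console.log(jaSegments);
```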

@hsivonen
Member Author

@aethanyc @makotokato Does the formulation about segmenters look OK to you?

@graphemecluster

I expect ideographs to always be sorted in radical-stroke order. Normal users know nothing about blocks, and they find it weird to see a character with dense strokes (say, near the end of the URO in the BMP) immediately preceding a simple one with few strokes (say, at the beginning of Plane 2). Browsers already store a large amount of data from CLDR; storing a few more arrays should not be a problem. Plus, as mentioned, characters are already sorted in radical-stroke order within each block, so the job is merely to merge a few sorted arrays into one. Hence, we can always use algorithms to optimize the radical-stroke data, or else a single array with <100K elements shouldn't be a burden.

@hsivonen
Member Author

I expect ideographs to always be sorted in radical-stroke order.

Android and Chrome decided to ship the other order described above, so Web authors cannot, in practice, rely on the radical-stroke order holding across Unicode blocks.

I wrote about the two alternative orders above, because I don't want to put fiction in a spec. The current situation, which isn't great in principle but works in practice, is that Chrome and Safari differ on this point, so it's fair to inform other implementations of the situation so that they can make an informed choice given that situation.

Normal users know nothing about blocks

The default collation orders for the CJK locales don't use radical-stroke, so radical-stroke isn't the order normally facing CJK-locale users.

Browsers already store a large amount of data from CLDR

That there is already a lot of CLDR data isn't a persuasive argument for carrying even more—the user-facing benefit of a byte of CLDR data varies a lot depending on what data it is.

Anyway, to change the state of things, you'd need to convince Chrome to carry the ICU4C "unihan" root data variant instead of carrying the "implicithan" variant (on all operating systems).

@noinkling

The ability to (explicitly) use the CLDR root collation for something resembling a language-agnostic sort order is something I would find welcome and have inquired about before (with no response): tc39/ecma402#549

Eventually I discovered that en can be used as a surrogate, since it currently doesn't have any tailorings, but it took some digging, and that could obviously change in the future.

@srl295
Member

srl295 commented Jan 29, 2024

@noinkling und is what I would expect you to use there.

@noinkling

@noinkling und is what I would expect you to use there.

Right, the point is it doesn't work, at least in any implementation I've tried. und always resolves to something else.
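The claim is easy to check in any given engine: per ECMA-402's ResolveLocale, an unsupported requested locale falls back to the default locale, and in the implementations tried here und is not in the supported set. A diagnostic sketch (the exact resolved tag varies by environment):

```javascript
// In engines where "und" is not an available locale, it resolves to
// the environment's default locale (e.g. "en-US"), not to "und" or
// to an explicit root, so it cannot be used to request root behavior.
const resolved = new Intl.Collator("und").resolvedOptions().locale;
console.log(resolved);
```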
