Specify behavior for APIs that take natural-language input as opposed to producing natural-language output #13
@aethanyc @makotokato Does the formulation about segmenters look OK to you?
I expect ideographs to always be sorted in radical-stroke order. Normal users know nothing about blocks, and it looks weird to see a character with dense strokes (say, near the end of the URO in the BMP) sorting just before a simple one with few strokes (say, at the beginning of Plane 2). Browsers already store a large amount of data from CLDR, so storing a few more arrays should not be a problem. Plus, as mentioned, characters are already sorted in radical-stroke order within each block, so the job is merely to merge a few sorted arrays into one. Hence, we can always use algorithms to compress the radical-stroke data, and even failing that, a single array with <100K elements shouldn't be a burden.
Android and Chrome decided to ship the other order described above, so Web authors cannot, in practice, rely on the radical-stroke order holding across Unicode blocks. I wrote about the two alternative orders above, because I don't want to put fiction in a spec. The current situation, which isn't great in principle but works in practice, is that Chrome and Safari differ on this point, so it's fair to inform other implementations of the situation so that they can make an informed choice given that situation.
The default collation orders for the CJK locales don't use radical-stroke, so radical-stroke isn't the order normally facing CJK-locale users.
That there is already a lot of CLDR data isn't a persuasive argument for carrying even more: the user-facing benefit of a byte of CLDR data varies a lot depending on what data it is. Anyway, to change the state of things, you'd need to convince Chrome to carry the ICU4C "unihan" root data variant instead of carrying the "implicithan" variant (on all operating systems).
The ability to (explicitly) use the CLDR root collation for something resembling a language-agnostic sort order is something I would find welcome and have inquired about before (with no response): tc39/ecma402#549. Eventually I discovered that …
@noinkling |
Right, the point is it doesn't work, at least in any implementation I've tried.
`Intl.Collator` and `Intl.Segmenter` differ from the other APIs by taking natural-language input as opposed to producing natural-language output. These already have a root in implementations, but ECMA-402 currently lacks an explicit way to invoke it. At least for the collator and the grapheme mode of the segmenter, it's clear what the generic behavior should be.
Collator
For the collator, it should be the CLDR root collation with unified ideographs ordered either by block and then by code point (ICU4C/ICU4X implicithan data, used by Chrome and Firefox) or by radical-stroke (ICU4C/ICU4X unihan data, used by Safari).
Note that the two options are indistinguishable unless the comparison is decided by comparing two ideographs that are from different blocks. It's unfortunate to have two alternative behaviors, but I don't think it's realistic to expect browsers that internalize the binary size of the root collation to carry the larger full radical-stroke order data, and I also think it isn't realistic to ask browsers that delegate to system ICU to get system ICU to switch to the in-principle-less-correct order.
As for why the CLDR root collation instead of raw DUCET: existing implementations use a library (ICU4C), or might migrate to a library (ICU4X), that is built around the CLDR root collation and doesn't provide raw DUCET.
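As an illustrative (not normative) probe of the difference described above, one can compare an ideograph from the very end of the original URO with one from the start of Extension B: 龥 (U+9FA5) has the last radical (龠, #214), while 𠀀 (U+20000) opens Extension B under radical 一 (#1). Block-then-code-point order puts the URO character first; radical-stroke order puts the Plane 2 character first. The variable name is mine, and, per the comments above, `und` may simply resolve to the default locale in current implementations:

```javascript
// Probe which unified-ideograph order this implementation ships.
// implicithan (Chrome, Firefox): blocks in order, so U+9FA5 < U+20000.
// unihan (Safari): radical-stroke, so 𠀀 (radical #1) < 龥 (radical #214).
const rootish = new Intl.Collator("und"); // "und" may fall back to the default locale
const r = rootish.compare("\u9FA5", "\u{20000}");
console.log(r < 0 ? "block-then-code-point-like order" : "radical-stroke-like order");
```

Either way the comparison is decided at the primary level, so the result is nonzero; only its sign differs between the two data variants.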
If the implementation chooses to carry the `eor` and `emoji` tailorings of the root, these should be reachable via the generic language tag: `und-u-co-eor` and `und-u-co-emoji`. (Rationale: It's bad to teach Web developers that e.g. `en-u-co-emoji` works when its working is incidental to the root collation being valid for English. E.g. `fi-u-co-emoji` does not work, as implementations today do not support combining a language-specific tailoring with a language-independent root tailoring. As long as the `emoji` order isn't folded into the root itself (and, unfortunately, there exists non-Web-relevant ICU4C/ICU4J API surface that argues against such folding), it's better to teach that you get the `emoji` tailoring via `und` or `zxx` than via the languages, like `en` or `fr`, for which the root happens to be valid without tailoring.)
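The tagging argued for here can be sketched as follows; `emojiCollator` is an illustrative name, and an implementation that doesn't carry the tailoring falls back and reports `"default"`:

```javascript
// Request the emoji tailoring via the generic tag, not via a specific
// language. resolvedOptions().collation reveals whether it took effect:
// "emoji" if the tailoring is carried, "default" otherwise.
const emojiCollator = new Intl.Collator("und-u-co-emoji");
console.log(emojiCollator.resolvedOptions().collation);
```

The same check works for `und-u-co-eor`; inspecting `resolvedOptions()` is the only way to tell silent fallback from support.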
Grapheme segmenter
For the grapheme mode of the segmenter, it's pretty obvious that the generic behavior should be segmentation to UAX 29 extended grapheme clusters without tailorings.
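A quick sketch of that generic behavior (variable names are mine; `und` may resolve to the default locale, which is harmless here since the grapheme mode is untailored):

```javascript
// UAX 29 extended grapheme clusters, no language-specific tailoring.
const graphemes = new Intl.Segmenter("und", { granularity: "grapheme" });

// Base letter plus combining acute (U+0301): one cluster, not two code points.
const accented = [...graphemes.segment("e\u0301")];
console.log(accented.length); // 1

// ZWJ-joined emoji family sequence: one cluster spanning five code points.
const family = [...graphemes.segment("\u{1F468}\u200D\u{1F469}\u200D\u{1F467}")];
console.log(family.length); // 1
```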
Word/Sentence/Line segmenters
For the other segmenter modes, it seems to me that vaguely-specified behavior would be more useful, and more implementable in the context of existing libraries, than truly well-specified behavior. As I understand it, ICU4X (as its main mode) and ICU4C (as its catch-all mode) implement generic rules (that either already are or could be well-specified) for languages that use spaces to separate words, and, for well-known languages that do not use spaces, dispatch based on the script to either dictionary-based breaking (Chinese and Japanese combined into one dictionary) or machine-learning models (Khmer, Lao, Myanmar, and Thai), such that all of these are in effect in one segmenter object.
It seems to me that the useful way to specify the generic mode would be to say that the non-grapheme segmenters shall comply with UAX 29 and UAX 14, potentially with tailorings, but shall not enable tailorings that the implementation doesn't enable for all languages.
That is, e.g. the semicolon having sentence-ending question mark semantics for Greek, or the Finnish/Swedish word-break suppression for an in-word colon, shall not be enabled in the generic mode, since they aren't enabled when requesting e.g. English; but the not-well-specified behaviors that are dispatched on script, cater to languages that don't use spaces, and don't interfere with other space-using languages would still be enabled. That is, the Chinese+Japanese dictionary and the learned models for Khmer, Lao, Myanmar, and Thai should not be turned off in the generic mode even though their contents aren't given by a spec. (However, if the segmenter chooses to enable the Greek semicolon behavior by guessing from the script of the surrounding letters instead of deciding from the requested language, or chooses to enable Finnish/Swedish-compatible colon unbreakability for all languages, then the generic mode should retain those behaviors.)
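The one-segmenter-object dispatch described above can be observed with a sketch like the following (names are mine; the Thai boundaries come from a learned model, so only their existence, not their positions, is asserted):

```javascript
// One generic word segmenter: space-separated text breaks by rule,
// while scripts like Thai are routed to a dictionary/model in
// ICU-backed implementations, all within the same object.
const words = new Intl.Segmenter("und", { granularity: "word" });

// Space-separated input: well-specified UAX 29 behavior.
const latin = [...words.segment("root collation")].filter(s => s.isWordLike);
console.log(latin.map(s => s.segment)); // ["root", "collation"]

// "ภาษาไทย" has no spaces; boundaries are model-derived and
// implementation-defined, so their exact positions aren't asserted.
const thai = [...words.segment("\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22")]
  .filter(s => s.isWordLike);
console.log(thai.length >= 1); // true
```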