to_lowercase only uses unconditional parts of unicode.org's special-casing

https://github.com/rust-lang/rust/blob/f9157f5b869fdb14308eaf6778d01ee3d0e1268a/src/libcore/unicode/unicode.py#L168-169

Since #25800, to_lowercase uses unicode.org's SpecialCasing.txt.
However, it only follows unconditional rules from this file. One "interesting" case is:
1. The main UnicodeData.txt file says that the lowercase of 'İ' (0130, Latin capital letter I with dot above) should be 'i' (0069, good-old boring ASCII Latin small letter i).
2. SpecialCasing.txt adds an unconditional rule that 'İ' (0130) should in fact be lowercased to 'i̇' (0069 Latin small letter i + 0307 combining dot above).
3. SpecialCasing.txt then adds a rule for tr (Turkish) and az (Azerbaijani) where 'İ' (0130) should now be lowercased to just 'i' (0069 Latin small letter i). There are other related tr/az rules, so that dotted i's map to each other and dotless i's map to each other.
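
To make (2) concrete, here is a minimal sketch that prints out what the standard library does today. The helper name `lowercase_code_points` is made up for illustration; only `str::to_lowercase` is real API.

```rust
/// Hypothetical helper: lowercase `s` with the standard library and
/// return the resulting code points, so combining marks stay visible.
fn lowercase_code_points(s: &str) -> Vec<u32> {
    s.to_lowercase().chars().map(|c| c as u32).collect()
}
```

With rule (2) applied, `lowercase_code_points("\u{130}")` yields `[0x69, 0x307]` (two code points), whereas the UnicodeData.txt mapping from (1) alone would give just `[0x69]`.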

I think that (2) only makes sense when accompanied by (3): They are in the same file, touching the same character; (3) for tr/az and (2) for other languages.
But because only unconditional rules are handled, we end up with a hybrid: we apply a rule that was meant for non-tr/az languages (as a contrast to the tr/az rule), while ignoring the default language-independent mapping from UnicodeData.txt.

I realize that it's quite a corner case, and open to interpretation.
Also, SpecialCasing.txt contains other useful unconditional rules that are worth having, so it would be unfortunate to lose those.
And (2) does have the advantage of making lowercasing reversible, though that's not a goal of Unicode AFAIK.

So in the end, I'm not sure if and how this should be fixed, other than by implementing conditions, which would require handling languages.

A compromise would be to ignore this one rule, by hard-coding an exception.
This restriction could be made less arbitrary, by saying: Unconditional rules are only accepted for characters that do **not** also have conditional rules.
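
As a rough sketch of that filtering rule (written in Rust for illustration, not the actual generator script), assuming the semicolon-separated SpecialCasing.txt layout `code; lower; title; upper; [conditions;] # comment`. The names `Rule`, `parse_line` and `unconditional_lowercase` are hypothetical:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One parsed entry; only the fields needed for lowercasing are kept.
struct Rule {
    code: u32,
    lower: Vec<u32>,
    conditional: bool,
}

/// Parse one data line, returning None for comments, blanks and malformed lines.
fn parse_line(line: &str) -> Option<Rule> {
    // Drop the trailing "# comment" part, if any.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let fields: Vec<&str> = data.split(';').map(str::trim).collect();
    if fields.len() < 4 {
        return None;
    }
    let code = u32::from_str_radix(fields[0], 16).ok()?;
    let lower = fields[1]
        .split_whitespace()
        .map(|hex| u32::from_str_radix(hex, 16))
        .collect::<Result<Vec<_>, _>>()
        .ok()?;
    // A non-empty fifth field is a condition list (e.g. "tr", "az").
    let conditional = fields.get(4).map_or(false, |c| !c.is_empty());
    Some(Rule { code, lower, conditional })
}

/// Keep unconditional lowercase rules only for code points that have
/// no conditional rule anywhere in the file.
fn unconditional_lowercase(text: &str) -> BTreeMap<u32, Vec<u32>> {
    let rules: Vec<Rule> = text.lines().filter_map(parse_line).collect();
    let has_conditional: BTreeSet<u32> = rules
        .iter()
        .filter(|r| r.conditional)
        .map(|r| r.code)
        .collect();
    rules
        .into_iter()
        .filter(|r| !r.conditional && !has_conditional.contains(&r.code))
        .map(|r| (r.code, r.lower))
        .collect()
}
```

Under this sketch, 0130 is dropped from the special-casing table because it also has tr/az rules, so it would fall back to the UnicodeData.txt mapping from (1). Characters with only unconditional rules are kept.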

I understand if this won't be fixed, I at least wanted to bring attention to this case.

to_lowercase only uses unconditional parts of unicode.org's special-casing #51362