
to_lowercase only uses unconditional parts of unicode.org's special-casing #51362

Open
@squelart

https://github.com/rust-lang/rust/blob/f9157f5b869fdb14308eaf6778d01ee3d0e1268a/src/libcore/unicode/unicode.py#L168-169

Since #25800, to_lowercase uses unicode.org's SpecialCasing.txt.
However, it only follows unconditional rules from this file. One "interesting" case is:

  1. The main UnicodeData.txt file says that the lowercase of 'İ' (0130, Latin capital letter I with dot above) should be 'i' (0069, good-old boring ASCII Latin small letter i).
  2. SpecialCasing.txt adds an unconditional rule that 'İ' (0130) should in fact be lowercased to 'i̇' (0069 Latin small letter i + 0307 combining dot above).
  3. SpecialCasing.txt then adds a rule for tr (Turkish) and az (Azerbaijani) where 'İ' (0130) should now be lowercased to just 'i' (0069 Latin small letter i). There are other related rules, so that the dotted i's map to each other and the non-dotted i's map to each other too.
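
To make the effect of (2) concrete, here is a minimal sketch of what the standard library returns today for U+0130. The helper name `lowercase_code_points` is made up for this illustration; only `char::to_lowercase` is the real API.

```rust
/// Hypothetical helper: collect the code points that `char::to_lowercase`
/// yields for `c`.
fn lowercase_code_points(c: char) -> Vec<u32> {
    c.to_lowercase().map(|lc| lc as u32).collect()
}

fn main() {
    // U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
    let lowered = lowercase_code_points('\u{0130}');

    // The unconditional SpecialCasing.txt rule (2) wins: the result is
    // U+0069 followed by U+0307, not the single U+0069 from UnicodeData.txt.
    assert_eq!(lowered, vec![0x0069, 0x0307]);

    for cp in &lowered {
        println!("U+{:04X}", cp);
    }
}
```

A plain ASCII 'I' still lowercases to the single code point U+0069, so only the dotted capital is affected.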

I think that (2) only makes sense when accompanied by (3): They are in the same file, touching the same character; (3) for tr/az and (2) for other languages.
But because only unconditional rules are handled, we end up with a hybrid: rule (2), which was intended for non-tr/az languages as a counterpart to the tr/az rule, is applied everywhere, while the default language-independent mapping from UnicodeData.txt is ignored.

I realize that it's quite a corner case, and open to interpretation.
Also, SpecialCasing.txt contains other useful unconditional rules that are worth having, so it would be unfortunate to lose those.
And (2) does have the advantage of making lowercasing reversible, though that is not a goal of Unicode AFAIK.

So in the end, I'm not sure if and how this should be fixed, other than by implementing conditions, which would require handling languages.

A compromise would be to ignore this one rule, by hard-coding an exception.
This restriction could be made less arbitrary by saying: unconditional rules are only accepted for characters that do not also have conditional rules.
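
As a rough sketch of that filtering rule (not the actual generator, which is the Python script linked above), the idea could look like this in Rust. The function names and the `SAMPLE` excerpt are assumptions for illustration; the excerpt is written in the `code; lower; title; upper; (conditions;) # comment` layout of SpecialCasing.txt.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical excerpt in the SpecialCasing.txt layout.
const SAMPLE: &str = "\
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
";

/// Parse a space-separated list of hex code points, e.g. "0069 0307".
fn parse_code_points(field: &str) -> Vec<u32> {
    field
        .split_whitespace()
        .filter_map(|hex| u32::from_str_radix(hex, 16).ok())
        .collect()
}

/// Hypothetical helper: keep the lowercase mapping of an unconditional rule
/// only if the same character has no conditional rule anywhere in the input.
fn unconditional_lowercase_rules(special_casing: &str) -> BTreeMap<u32, Vec<u32>> {
    let mut unconditional = BTreeMap::new();
    let mut has_conditional = BTreeSet::new();

    for line in special_casing.lines() {
        // Drop the trailing comment, then split into ';'-separated fields.
        let data = line.split('#').next().unwrap_or("");
        let fields: Vec<&str> = data.split(';').map(str::trim).collect();
        if fields.len() < 4 {
            continue; // blank line or comment-only line
        }
        let code = match u32::from_str_radix(fields[0], 16) {
            Ok(code) => code,
            Err(_) => continue,
        };
        // A non-empty fifth field is a condition list (e.g. "tr", "az").
        let conditional = fields.get(4).map_or(false, |c| !c.is_empty());
        if conditional {
            has_conditional.insert(code);
        } else {
            unconditional.insert(code, parse_code_points(fields[1]));
        }
    }

    // Characters that also carry conditional rules are left to the
    // UnicodeData.txt default instead.
    unconditional.retain(|code, _| !has_conditional.contains(code));
    unconditional
}

fn main() {
    let rules = unconditional_lowercase_rules(SAMPLE);
    // U+0130 is dropped because it also has tr/az rules; U+00DF is kept.
    assert!(!rules.contains_key(&0x0130));
    assert!(rules.contains_key(&0x00DF));
}
```

With this filter, 'İ' would fall back to the UnicodeData.txt mapping (plain 'i'), while the unconditional rules for characters without language-specific variants would still apply.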

I understand if this won't be fixed, I at least wanted to bring attention to this case.

Metadata

Labels: A-Unicode, C-discussion, T-libs-api