Bizarre behavior of string.StartsWith #72770

Closed
zvrba opened this issue Jul 25, 2022 · 9 comments
Labels
area-System.Globalization needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration

zvrba commented Jul 25, 2022

Description

Now, I'm aware of https://docs.microsoft.com/en-us/dotnet/standard/base-types/string-comparison-net-5-plus

Please see the screenshot ("Immediate window" in VS debugger) and comments below.

[Screenshot: StartsWithDebug]

Reproduction Steps

Set the locale to Norwegian Bokmål (NOB). "aa".StartsWith("a") returns false, which might be explained by the breaking change I linked to above. However, "aa".StartsWith("å") returns false as well.

Expected behavior

At least "aa".StartsWith("å") should then return true as "å" is "linguistically the same" as "aa". Otherwise, you tell me. The observed behavior totally breaks the expectation of a "string being a sequence of characters". It almost makes me want to replace all string types with List<char>.

Actual behavior

Please see the screenshots. Totally crazy, I spent two hours diagnosing the issue.

Regression?

No response

Known Workarounds

Explicitly use StringComparison.Ordinal. Alternatively, set the program's culture to invariant, like this: System.Globalization.CultureInfo.CurrentCulture = System.Globalization.CultureInfo.InvariantCulture;
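
The two workarounds above could look roughly like the sketch below. It assumes the culture in question is "nb-NO" and sets it explicitly to imitate the reported environment; the variable names are illustrative only. The culture-sensitive result is printed without a stated expectation, because it depends on the culture data and on whether the runtime uses ICU or NLS.

```csharp
using System;
using System.Globalization;

// Assumption: "nb-NO" stands in for the reporter's Norwegian Bokmål locale.
CultureInfo.CurrentCulture = CultureInfo.GetCultureInfo("nb-NO");

// Default overload: culture-sensitive, so the result depends on the
// current culture and on the globalization backend (ICU or NLS).
bool cultureSensitive = "aa".StartsWith("a");
Console.WriteLine($"current culture: {cultureSensitive}");

// Workaround 1: compare UTF-16 code units and ignore the culture.
bool ordinal = "aa".StartsWith("a", StringComparison.Ordinal);
Console.WriteLine($"ordinal: {ordinal}");

// Workaround 2: switch the current culture to the invariant culture,
// which affects every culture-sensitive call made afterwards on this thread.
CultureInfo.CurrentCulture = CultureInfo.InvariantCulture;
bool invariant = "aa".StartsWith("a");
Console.WriteLine($"invariant culture: {invariant}");
```

Passing StringComparison.Ordinal is local to one call, whereas changing CurrentCulture also changes formatting and parsing behavior elsewhere, so the first workaround is the narrower of the two.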

Configuration

Windows 11, .NET 6.0.5, x64. Mixed locale: English as display language, several keyboard layouts installed (ENG and NOB).

Other information

No response

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jul 25, 2022
ghost commented Jul 25, 2022

Tagging subscribers to this area: @dotnet/area-system-globalization
See info in area-owners.md if you want to be subscribed.


@GalaxiaGuy

A simple repro in dotnetfiddle: https://dotnetfiddle.net/Y3jLQJ

tarekgh (Member) commented Jul 25, 2022

@zvrba thanks for your report.

Under the Unicode collation rules for the Norwegian culture, a is not a prefix of aa. If you disagree with this behavior, you may log a ticket with ICU at https://unicode-org.atlassian.net/jira/software/c/projects/ICU/issues/.

For the å case, you are right: it should be a prefix of aa. Since .NET switched to ICU, you need to supply a compare option to make this work. You can do the following:

    // Requires: using System; using System.Globalization;
    CultureInfo ci = CultureInfo.GetCultureInfo("nb-NO");
    Console.WriteLine(ci.CompareInfo.IsPrefix("aa", "å", CompareOptions.IgnoreNonSpace));

This should make everything work fine. Let me know if you have any more questions, I can help you with them.

@tarekgh tarekgh removed the untriaged New issue has not been triaged by the area owner label Jul 25, 2022
@tarekgh tarekgh added this to the Future milestone Jul 25, 2022
@tarekgh tarekgh added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jul 25, 2022
ghost commented Jul 25, 2022

This issue has been marked needs-author-action and may be missing some important information.

zvrba (Author) commented Jul 26, 2022

> Let me know if you have any more questions, I can help you with them.

Hi. Thanks for the reply. I do have a question: I want string to behave as a sequence of "characters". What should I do? As an example, what should a program running under, say Korean culture, do to process French text, without knowing that the text is "French"? Two "visually same" strings should behave "sanely" wrt == , StartsWith and such, on char-by-char basis. I do not care about sort order, as long as it's consistent.

Also, I'm questioning the decision that StartsWith should use collation. I do not expect sorting rules (collation) to have effect when a method that works on partial strings is invoked.

Obviously, I'm not a Unicode expert and really do not want to become one. The program I'm writing has to process Unicode strings, but the processing should be neutral wrt the user's OS locale. As another example, a person running the program under a German locale should be able to "sanely" search (wrt ==, StartsWith, Contains, etc.) for French names entered by a French person under a French locale [1]. Data is exchanged through an NVARCHAR field in the database. What to do?

[1] Now, how does a German enter French characters under German locale/culture into the search box? Copy-paste!

EDIT: Another inconsistency. Look

"aax".StartsWith("a")
false
"aax".Contains("a")
true
"aax".IndexOf("a")
-1
"aax"[0] == 'a'
true

No matter how hard I try, I cannot make sense of this. (Yes, I know, there is an explanation. But the rather involved explanation does not match the programmer's expectations about how these methods should behave wrt each other. When IndexOf returns -1, Contains should return false as well, no? When "aax"[0] == 'a' returns true, StartsWith("a") should as well.)
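
The mismatch is consistent with the documented defaults: Contains(string) is ordinal, StartsWith(string) and IndexOf(string) use the current culture, and the char indexer compares raw UTF-16 code units. A minimal sketch, with illustrative variable names, of how passing StringComparison.Ordinal to every call makes the four checks agree regardless of the current culture:

```csharp
using System;

string text = "aax";

// Every call names its comparison explicitly, so none of them
// consults the current culture.
bool startsWith = text.StartsWith("a", StringComparison.Ordinal);
bool contains = text.Contains("a", StringComparison.Ordinal);
int index = text.IndexOf("a", StringComparison.Ordinal);
bool firstChar = text[0] == 'a';

Console.WriteLine($"{startsWith} {contains} {index} {firstChar}");
// prints "True True 0 True"
```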

@ghost ghost added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Jul 26, 2022
@GalaxiaGuy

To treat the string like an array of chars, use overloads with StringComparison.Ordinal:

"aa".StartsWith("a", StringComparison.Ordinal)

To treat strings in a way that is reasonably logical for English, use StringComparison.InvariantCulture.

"aa".StartsWith("a", StringComparison.InvariantCulture)
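
One caveat with ordinal comparison is that it works on UTF-16 code units, so two strings that look identical but use different Unicode compositions (precomposed é versus e plus a combining accent) do not match. A possible sketch, not something proposed in this thread, is to normalize both sides to the same form before comparing ordinally. StartsWithNormalized is a hypothetical helper name, and the sketch ignores edge cases such as a prefix that ends in the middle of a combining sequence.

```csharp
using System;
using System.Text;

// Hypothetical helper: bring both strings to Unicode normalization form C,
// then compare code units without consulting any culture.
static bool StartsWithNormalized(string text, string prefix) =>
    text.Normalize(NormalizationForm.FormC)
        .StartsWith(prefix.Normalize(NormalizationForm.FormC), StringComparison.Ordinal);

string precomposed = "\u00E9cole";  // "école" with a precomposed é (U+00E9)
string decomposedPrefix = "e\u0301"; // "é" as e + combining acute accent (U+0301)

// Plain ordinal comparison sees different code units and reports no match.
bool rawOrdinal = precomposed.StartsWith(decomposedPrefix, StringComparison.Ordinal);

// After normalization both sides begin with U+00E9, so the prefix matches.
bool normalized = StartsWithNormalized(precomposed, decomposedPrefix);

Console.WriteLine($"raw ordinal: {rawOrdinal}, normalized: {normalized}");
// prints "raw ordinal: False, normalized: True"
```

Normalizing once when data is stored (for example before it is written to the database) would avoid repeating the work on every comparison.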

Joe4evr (Contributor) commented Jul 26, 2022

> When IndexOf returns -1, Contains should return false as well, no? When "aax"[0] == 'a' returns true, StartsWith("a") should as well.

Sadly, that's not really the case. Here's a bit explained by Jon Skeet about how IndexOf can be problematic (and much of that extends to all string methods).

Really, the best advice when it comes to .NET string manipulation is and has always been: never rely on a method's default behavior. Always supply a comparison type at your call site (even if the supplied comparison matches that method's default), so that you're clear and consistent and don't get surprising behavior like this. You can enable the code analysis rules CA1305 and CA1304 to help you catch those call sites and improve your code quality.

skyoxZ (Contributor) commented Jul 26, 2022

I would suggest that string.StartsWith(string) (and some other string methods) use StringComparison.Ordinal by default. These are the most basic APIs, but their actual behavior is now very complicated, especially for a beginner.

tarekgh (Member) commented Jul 26, 2022

@skyoxZ please have a look at dotnet/designs#207 for more info.
