Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving default string globalization experience #207

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

GrabYourPitchforks
Copy link
Member

Draft design document for how we can flatten the learning curve for using string, removing globalization concepts for developers who needn't be exposed to them, and widening the pit of success for people calling string-based APIs.

This is a draft design document and does not represent a final plan or committed work.

See also:


To solve this problem for the .NET ecosystem at large, we propose to take the following actions on `string`.

1. Obsolete (as warning) all linguistic-by-default methods on `string` with a new __SYSLIBxxxx__ identifier.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really like to see 1. be Create fixers to migrate call sites and 2. to be Obsolete (as warning)

Having a fixer is critical IMO and the most important part considering how widespread the existing methods are.


## A note on `StringComparison` and `StringComparer`

One notable aspect of this proposal is the complete restructuring of the `StringComparison` and `StringComparer` types. It is important to note that all members of these types will be preserved. The proposal here is to hide all existing fields / static properties and to create new "simpler" names _Normal_ and _IgnoreCase_. Both of these additions alias their ordinal siblings.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be it is worth mentioning whoever want to create a linguistic StringComparer, can call the extension method CompareInfo.GetStringComparer. This will be explicit as it will be created from the CompareInfo which show the intend of using the linguistic functionality.

- Ordinal = 4,
- OrdinalIgnoreCase = 5,
+ Normal = 4, // aliases Ordinal
+ IgnoreCase = 5, // aliases OrdinalIgnoreCase
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the replacement for CurrentCulture and InvariantCulture?

How will this impact cases like CompareOptions.NumericOrdering: dotnet/runtime#13979 (comment)?

How will this impact cases like RemoveStart and RemoveEnd: dotnet/runtime#14386

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To use CompareInfo instead for the operation. e.g. CultureInfo.CurrentCulture.CompareInfo.Compare(...). I think we need to revisit dotnet/runtime#13979 or at least will make NumericOrdering is cultural independent.


In reply to: 615181346 [](ancestors = 615181346)

Copy link
Member

@stephentoub stephentoub Apr 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we would make this so much harder when you do want CurrentCulture{IgnoreCase}. The concern expressed in this doc is around the defaults and the inconsistency with defaults we currently expose. But with StringComparison/StringComparer, a developer is explicitly specifying which they want. I see only downside to changing this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind with @stephentoub. Having Normal and IgnoreCase present on StringComparer seems sufficient as it maps to the methods. Using the enum/comparer is already the advanced mode, so hiding the other concepts feels a bit like we're trying too hard to be helpful.


The second characteristic requires a bit more explanation. Popular programming languages like Java, JavaScript, Go, Python, C/C++, and others treat string APIs as ordinal. .NET is unique in that we build globalization concepts directly into many basic `string` APIs. Unfortunately, evidence shows that .NET's behavior in this regard is largely unexpected and confusing to our developer audience.

[Java 16's `String` class](https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/lang/String.html) (which is the closest to .NET's `string` class) is a good example of this principle. Per their own documentation:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good that we're looking to learn from other ecosystems. Is there anything we can learn from the other ecosystems you mention above, beyond your note that they separate string from globalization? (Go, Rust etc)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, absolutely. Most other ecosystems also enforce opinionated behaviors regarding UTF-8 string well-formedness. But that continues to be a point of contention for us. ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an important thing to keep in mind about scenarios like Rust is that they design most of their types around the principle of "zero cost abstraction": https://doc.rust-lang.org/book/ch00-00-introduction.html#people-who-value-speed-and-stability

Rust is for people who crave speed and stability in a language. By speed, we mean the speed of the programs that you can create with Rust and the speed at which Rust lets you write them. The Rust compiler’s checks ensure stability through feature additions and refactoring. This is in contrast to the brittle legacy code in languages without these checks, which developers are often afraid to modify. By striving for zero-cost abstractions, higher-level features that compile to lower-level code as fast as code written manually, Rust endeavors to make safe code be fast code as well.

So while things like str are always sliced to valid UTF-8 (unless using one of the unchecked slicing methods), they ensure the relevant other components are exposed to allow working with and dealing with "raw parts", that is the bytes rather than runes (to put it into .NET terms) so power users have the ability to do things themselves when and as appropriate.

Likewise, I don't believe rust has a case insensitive comparer in std today. You need to pull it in through an external crate (basically nupkg) and specify it as the comparer to use in your match expressions (in this way, its more like IComparer, but where the implementation is a struct and done via generic specialization so its "zero cost" and inlining/etc happen).

1. Mark all linguistic-related APIs as `[EditorBrowsable(EditorBrowsableState.Never)]`, but do not obsolete them. (Exception: do not mark APIs if they're in the _System.Globalization_ namespace.)
2. Create analyzers to detect `Comparer<string>.Default`, `new SortedSet<string>()`, and similar code patterns that implicitly rely on the current culture.

### List of API obsoletions and additions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to release an updated version of netstandard or are the developers targetting multiple versions going to be faced with warnings everywhere?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the challenge here is we are going to have a new APIs which will be in .NET 6.0 and up. I don't think we can add these to the netstandard.

CC @terrajobst

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These wouldn't be in standard. If you're multi-targeting, I believe per the way this proposal is written, the only methods that would be marked [Obsolete] are methods that the code analyzers CA1307 and CA1310 would have flagged anyway as likely incorrect usage. A significant number of methods would be marked EB-never but wouldn't produce warnings.

@Happypig375
Copy link
Member

Happypig375 commented Apr 17, 2021

Can't we have e.g. CompareToNormal as opposed to CompareToString to align with members in StringComparison and StringComparer?

@Happypig375
Copy link
Member

Happypig375 commented Apr 17, 2021

With the naming convention above, we can omit repeating String in the method names and have xxx for char/StringComparison overloads, xxxNormal and xxxIgnoreCase instead of xxx for char/StringComparison overloads, xxxIgnoreCase for char overloads, xxxString, and xxxStringIgnoreCase.

@Happypig375
Copy link
Member

Maybe we can even provide char overloads for xxxNormal for uniformity.

@Happypig375
Copy link
Member

Also ContainsIgnoreCase is missing.

@tarekgh
Copy link
Member

tarekgh commented Apr 17, 2021

Can't we have e.g. CompareToNormal as opposed to CompareToString to align with members in StringComparison and StringComparer?

The issue here is if you are developer new to .NET, would the name CompareToNormal would be better than CompareToString? The term Normal is the best we can find to fit in StringComparison but we welcome any better names if there is any.

- public static int Compare(System.String? strA, System.String? strB, bool ignoreCase) { throw null; }
- public static int Compare(System.String? strA, System.String? strB, bool ignoreCase, System.Globalization.CultureInfo? culture) { throw null; }
- public static int Compare(System.String? strA, System.String? strB, System.Globalization.CultureInfo? culture, System.Globalization.CompareOptions options) { throw null; }
+ public static int CompareString(System.String? strA, System.String? strB) { throw null; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels weird that string will be two times in the sentence when calling these APIs (string.CompareString) I guess there is no better naming/options?... I definitely like this better than including Ordinal in the name though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Downside is that the good names were taken by the method we want to steer people away from. :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is the thing that urks me. Our goal is to present the developer with a simpler view of the world, but the new names look fairly misplaced and odd :-(

Copy link

@iSazonov iSazonov Apr 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PowerShell repo we follow simple rule - "use explicitly culture/ordinal/ignorecase everywhere".
You will find that this code is full of explicit Ordinal and OrdinalIgnorecase names
This makes the code incredibly understandable and maintainable.
Despite this, we've fixed a few case of unexpected use of culture. So the PR with new conception is great!

But the new names looks weird for me too.
As I said, having Ordinal and OrdinalIgnorecase names in PowerShell code makes it better in my opinion.
Notice, we already have CompareOrdinal() (and internal EqualsOrdinal/EqualsOrdinalIgnoreCase). Why not follow this pattern?

CompareToString -> CompareToOrdinal
CompareToStringIgnoreCase -> CompareToOrdinalIgnoreCase

And this is completely consistent with the new concept.
It also makes it easier to learn and understand for people coming from other languages.

This concept (must) be consistent with the fact that the best practice is to explicitly express intent. Here this means using explicit method names or options with Ordinal/OrdinalIgnorecase.

This makes creating fixers easier. We can't just say "replace CompareTo() with CompareToOrdinal()" because the source code may contain a bug. But we can recommend both options (ordinal and culture).


1. `string` and related APIs should default to ordinal behavior unless the caller makes an explicit request for a different behavior.

2. Developers should not be exposed to any globalization-related concepts unless they explicitly seek globalization support.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this claim specific to APIs on String? For example, how does this relate to formatting/parsing APIs on types like Int32, DateTime, etc., which all factor in the current culture by default?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I should clarify this. The concept of "ordinal" only applies to APIs which operate on or transform strings / sequences of chars. (Think: ReadOnlySpan<char> APIs in MemoryExtensions.)

For APIs which are about changing representation between string and non-string types (e.g., string <-> int), I don't suggest any behavioral change. These APIs necessarily have to perform some kind of linguistic translation, so they may as well fall back to current culture as they do today. That also helps solidify Parse and ToString as helpers for "please convert this data from / to a representation which is understandable by the current user."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concept of "ordinal" only applies to APIs which operate on or transform strings / sequences of chars.

Yes. I was commenting on the second goal/requirement: "Developers should not be exposed to any globalization-related concepts unless they explicitly seek globalization support." There are many globalization-related concepts unrelated to ordinal.


And we propose to take the following actions on `string`-like APIs.

1. Mark all linguistic-related APIs as `[EditorBrowsable(EditorBrowsableState.Never)]`, but do not obsolete them. (Exception: do not mark APIs if they're in the _System.Globalization_ namespace.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to my previous comment, how does this relate to methods that implement interfaces or override virtuals? e.g. Int32.ToString() factoring in the current current. The point earlier talked about not needing to understand globalization concepts... was that too broad, and this is all really just about linguistics-related globalization concepts?


While searching through internal and public sources, we discovered significant usage of the members `StringComparison.InvariantCulture` and `StringComparison.InvariantCultureIgnoreCase`. The usage numbers for these APIs is much higher than would be reasonably expected given their scenario of performing linguistic (somewhat "fuzzy") text matching. This may be explained as an interesting byproduct of having drilled [__CA1305__](https://docs.microsoft.com/dotnet/fundamentals/code-analysis/quality-rules/ca1305) (specify `CultureInfo.InvariantCulture` when calling `ToString(IFormatProvider, ...)`) and ["prefer `string.ToUpperInvariant` over `string.ToUpper`"](https://docs.microsoft.com/dotnet/api/system.string.toupper) into developers' muscle memory over the past decades. This can form a positive association that makes developers believe that __"invariant"__ is always an appropriate default to use.

So when rules like [__CA1307__](https://docs.microsoft.com/dotnet/fundamentals/code-analysis/quality-rules/ca1307) (pass an explicit `StringComparison` value) are enforced, the developer is now forced to choose between ordinal, invariant, or current-culture. It is not reasonable to expect the average developer to understand the difference between ordinal and linguistic behavior. But it _is_ reasonable for an average developer to think to themselves, "I was told _invariant_ was correct in all other situations where I encountered it. It must also be the appropriate default here.". (It would further be reasonable that this decision may be made via muscle memory, with the developer never consciously realizing that they were presented a choice.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the argument for hiding CurrentCulture/CurrentCultureIgnoreCase. If someone explicitly types "CurrentCulture", that's what they get; it's very clear.


## Future work and ecosystem implications

This work has implications for GUI stacks, such as WPF and WinForms. GUI scenarios often involve performing linguistic, current-culture collation of data, such as alphabetically sorting entries in a dropdown list. It should be expected that WinForms, WPF, and similar code bases will be disproportionately hit with warnings. They have a few options available to them. In descending order of preference:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar code bases will be disproportionately hit with warnings

Not only hit with warnings, but there's a good chance pushing them to use ordinal behavior will actually introduce bugs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the weird part about all of this. The rule of thumb is: if you're sorting data for display directly in client UI, the default (current culture) behavior is probably appropriate. If you're on a server, a current culture comparison is probably not appropriate. So it largely depends on the context.

However, there is one universal truth: the pattern if (string.Compare(...) == 0) is almost certainly a bug, regardless of context. There are exceptions, sure, but they're very rare.

Maybe the answer is that we don't hide the original methods, but we instead query what you're doing with the result. If we see a comparison result == 0 or != 0 we flag it. This will catch fewer misusages of the comparison APIs, but experience tells me that the warnings will be more actionable / of higher quality.

@terrajobst
Copy link
Member

@GrabYourPitchforks should this be marked with api-ready-for-review so we can discuss this?

@danmoseley
Copy link
Member

I expect @stephentoub would want to have a chance to be part of API review before anything's definitely decided, but no doubt helpful to start discussions.

@GrabYourPitchforks
Copy link
Member Author

@terrajobst I suspect if we wanted to have a public discussion, it should be a design-specific discussion with no expectation that we would finalize an API immediately. We'd want Steve and usability experts to be involved prior to finalization. This document also suggests usability testing to help test our theories.

@KevinCathcart
Copy link

While I appreciate the goal of having it be easy to do the correct thing and avoid accidental use of lingusitic operations, I have a lot of concerns here about ergonomics, and poor user experience. You already mention the new developer using an old tutotrial and seeing warnings. But what about experienced developers who move to a newer .NET and are confused/frustrated about overloads they know exist (and are use to using, because they are the "correct" way to do things as of right now) don't show up in Intellisense because they have been made EBNever.

I do have to wonder if for application code, users would not be better served by have an easy way to set the Default culture of all threads to some special variant of the Invariant culture that does ordinal comparisons/equality etc instead of the unicode default algorithms. This would actually make the existing APIs generally do the right thing, and would not not be seriously regressing things for people creating GUI apps, Sadly such users will always need to decide for each case if they want a linguistic operation, or a ordinal one. This design doc says "It is not reasonable to expect the average developer to understand the difference between ordinal and linguistic behavior", but for GUI app developers kind of do need to understand. For things like parsing file formats, they often want ordinal, but for things related to user input, they often want linguistic.

Obviously libraries would still need to avoid accidentally using linguistic methods if they really want ordinal, unless the libraries will only be used by programs that use this special default culture.

@mhutch
Copy link
Member

mhutch commented May 4, 2021

I love the goal of this proposal but the new *String method names seem odd/ugly. Given the prevalence of these methods, would it be worth considering more radical methods to "fix" the current API e.g. runtime changes to enable assembly-level default comparison?

@GrabYourPitchforks
Copy link
Member Author

I do have to wonder if for application code, users would not be better served by have an easy way to set the Default culture of all threads to some special variant of the Invariant culture that does ordinal comparisons/equality etc instead of the unicode default algorithms. This would actually make the existing APIs generally do the right thing, and would not not be seriously regressing things for people creating GUI apps,

We considered but rejected this proposal. The reason is that it would affect more fundamental APIs like int.ToString, which we generally do want to match to the user's current locale by default. It would also significantly impact people writing GUI apps, who rely on things culture-aware sorting when displaying items in a list box.

@mrange
Copy link

mrange commented Aug 14, 2021

I am positive towards increasing the pit of success when it comes to string APIs. No developers expect strings API to behave differently depending on global thread static variable (current culture) until the first long-night debugging session due to a localization bug.

I am concerned about the string interpolation feature in C# and F# and how that can be made less dependent on current culture. I realize there's a FormattableString.Invariant() method but I doubt I get developers to type : FormattableString.Invariant($"Hello {x}") over $"Hello {x}" as large part of the charm with string interpolation is the succinctness which is then lost.

If possible, I would like to see an optional analyzer that warns for all usages string interpolation without FormattableString.Invariant so it's at least easy to find all the places.

@huoyaoyuan
Copy link
Member

The reason is that it would affect more fundamental APIs like int.ToString, which we generally do want to match to the user's current locale by default.

Being culture aware in those members are so problematic:

  • int.ToString or int.Parse can't be optimized further since they have to touch current culture, or access InvariantCulture. Converting between int and string very likely to be used frequently. Passing CultureInfo.InvariantCulture or NumberFormatInfo.InvariantInfo is even more annoying. double is more impactful.
  • Round-tripping of DataTime is a nightmare. Formatting in ddmmyy culture and parsing in mmddyy culture can easily lead to bugs, which I have already heard about.
  • Usage of IEquatable<string> is merely unavoidable, especially for tuples and records. Manually making strings ordinal in them will lose almost all the convenience of them.

With my personal bias, as a Chinese speaker, casing and sorting are totally nothing in Chinese. The only thing I benefit from being culture aware is displaying DateTime.

@danmoseley
Copy link
Member

We will soon be planning for .NET 8. Is it an appropriate time to pick this up again?

@drewnoakes
Copy link
Member

Pinging @dotnet/project-system as we are considering modernising the resource editor in Visual Studio.

@Smaug123
Copy link

Smaug123 commented Nov 15, 2023

Minor suggestion that doesn't change anything fundamental about the discussion: how about String.HasPrefix and String.HasSuffix rather than {Starts,Ends}WithString? English conveniently has (at least) two terse ways to express "B starts with A".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.