Description
This is a tracking issue for RFC 3349 (rust-lang/rfcs#3349).
The feature gate for the issue is `#![feature(mixed_utf8_literals)]`.
From the RFC:
Relax the restrictions on which characters and escape codes are allowed in string, char, byte string, and byte literals.
Most importantly, this means we accept the exact same characters and escape codes in `"…"` and `b"…"` literals. That is:
- Allow unicode characters, including `\u{…}` escape codes, in byte string literals. E.g. `b"hello\xff我叫\u{1F980}"`
- Also allow non-ASCII `\x…` escape codes in regular string literals, as long as they are valid UTF-8. E.g. `"\xf0\x9f\xa6\x80"`
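To make the first example concrete, here is a minimal sketch on stable Rust of the bytes that `b"hello\xff我叫\u{1F980}"` would presumably denote with the feature enabled. It assumes, following the RFC text, that unicode characters and `\u{…}` escapes contribute their UTF-8 encoding. The function name `mixed_literal_bytes` is made up for illustration.

```rust
/// Builds, on stable Rust, the bytes that `b"hello\xff我叫\u{1F980}"` is
/// expected to denote with `mixed_utf8_literals` enabled (an assumption based
/// on the RFC text; this helper is illustrative only).
fn mixed_literal_bytes() -> Vec<u8> {
    let mut bytes = Vec::new();
    // ASCII characters plus one raw `\xff` byte: already allowed in `b""`.
    bytes.extend_from_slice(b"hello\xff");
    // Unicode characters, stored as their UTF-8 encoding.
    bytes.extend_from_slice("我叫".as_bytes());
    // A `\u{…}` escape, also stored as its UTF-8 encoding.
    bytes.extend_from_slice("\u{1F980}".as_bytes());
    bytes
}
```

The result as a whole is not valid UTF-8 because of the `\xff` byte, which is fine for a byte string; only regular string literals would still have to be valid UTF-8.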
About tracking issues
Tracking issues are used to record the overall progress of implementation. They are also used as hubs connecting to other relevant issues, e.g., bugs or open design questions. A tracking issue is however not meant for large scale discussion, questions, or bug reports about a feature. Instead, open a dedicated issue for the specific matter and add the relevant feature gate label.
Steps
- Implement the RFC
- Adjust documentation (see instructions on rustc-dev-guide)
- Stabilization PR (see instructions on rustc-dev-guide)
Unresolved Questions
- Should `concat!("\xf0\x9f", "\xa6\x80")` work? (The string literals are not valid UTF-8 individually, but are valid UTF-8 after being concatenated.)
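The situation behind this question can be reproduced on stable Rust with byte strings. The sketch below is only an illustration of the byte-level facts, not of how `concat!` is or would be implemented; the names `FIRST`, `SECOND`, `is_valid_utf8`, and `joined` are made up for the example.

```rust
use std::str;

/// The two halves from the `concat!` example, written as byte strings so that
/// they compile on stable Rust.
const FIRST: &[u8] = b"\xf0\x9f";
const SECOND: &[u8] = b"\xa6\x80";

/// Returns whether `bytes` is valid UTF-8.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

/// Joins the two halves into one buffer, which is the byte sequence a
/// concatenated literal would have to contain.
fn joined() -> Vec<u8> {
    [FIRST, SECOND].concat()
}
```

Here `is_valid_utf8(FIRST)` and `is_valid_utf8(SECOND)` are both false, while `is_valid_utf8(&joined())` is true: the joined bytes are the UTF-8 encoding of `\u{1F980}`.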
Activity
traviscross commented on Oct 18, 2023
@rustbot labels +T-lang
traviscross commented on Oct 19, 2023
@rustbot labels +B-rfc-approved
`c"…"` string literals #105723
nnethercote commented on Dec 6, 2023
I would like to take this one.
nnethercote commented on Dec 13, 2023
I have a partial implementation of this RFC working locally (EDIT: now at #120286). The RFC proposes five changes to literal syntax. I think three of them are good, and two of them aren't necessary.
`b""`: add unicode chars
Adding them fixes the first of two cases where `b""` syntax isn't a superset of `""` syntax. This is good, and facilitates "conventionally UTF-8" string literals.
`br""`: add unicode chars
Adding them fixes the one case where `br""` syntax isn't a superset of `r""` syntax. After this, `br""` syntax and `r""` syntax are the same. This is good, and also facilitates "conventionally UTF-8" string literals.
`b""`: add `\u{NN}` escapes
Adding them fixes the second of two cases where `b""` syntax isn't a superset of `""` syntax, and fits well with adding unicode chars. This is good.
Note: After adding this, the one thing `b""` syntax has that `""` syntax does not is `\x80`-`\xff` bytes.
`""`: add `\x80`-`\xff`
Is this necessary? What useful new functionality does this provide?
It would make `""` and `b""` syntax identical, but strings and byte strings aren't identical types, so that identicalness isn't needed.
The RFC says "Allowing all characters and all known escape codes in both types of string literals reduces the complexity of the language. We'd no longer have different escape codes for different literal types. We'd only require regular string literals to be valid UTF-8." So it has just traded one exception for another. IMO that's not a simplification.
It's odd that it would be possible to write a `""` that isn't valid UTF-8... both conceptually, and in the implementation. For the latter you can no longer start with an empty `String` and append chars one at a time knowing it'll be valid UTF-8 the whole way, which is how it's currently handled. Instead you need to start with a `Vec<u8>`, append chars as byte sequences, and then UTF-8 validate at the end. It's not that difficult, but it's not needed for any other literal kind, and it's weird enough that, combined with the other points above, it makes me question it.
Not doing this keeps `""` syntax consistent with `''`, which makes sense given that `""` and `''` are both unicode-oriented rather than byte-oriented. This is another refutation of the complexity argument above.
Not doing this was suggested in the "Alternatives" section of the RFC.
Not doing this also renders moot the unresolved question of what to do with `concat!("\xf0\x9f", "\xa6\x80")`.
`b''`: add `\u{00}`-`\u{7f}`
Is this necessary? It doesn't provide any useful new functionality.
The `\x` syntax is strictly more powerful, covering the range `\x00`-`\xff`. And supporting just the ASCII subset of `\u` escapes doesn't match the behaviour of any of the other literal syntaxes. Byte literals are about a single byte, so why introduce Unicode-related stuff?
The quote from the RFC I mentioned above about complexity applies again, but again, it's just trading one exception for another.
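The implementation concern raised above for `""` with `\x80`-`\xff` can be sketched as follows. This is not the rustc code: the `Unit` type and the `unescape_to_string` function are hypothetical stand-ins for already-parsed literal contents. The sketch only shows the "collect into a `Vec<u8>`, then validate once at the end" approach the comment describes.

```rust
/// One already-parsed unit of a string literal (hypothetical type, not rustc's).
enum Unit {
    /// A source character or a `\u{…}` escape.
    Char(char),
    /// A `\xNN` escape, which under the proposal may be a non-ASCII byte.
    Byte(u8),
}

/// Appends every unit as bytes and validates UTF-8 once at the end, instead of
/// pushing `char`s onto a `String` that is valid at every step.
fn unescape_to_string(units: &[Unit]) -> Result<String, std::string::FromUtf8Error> {
    let mut buf: Vec<u8> = Vec::new();
    for unit in units {
        match unit {
            Unit::Char(c) => {
                // Encode the character as UTF-8 into a small scratch buffer.
                let mut tmp = [0u8; 4];
                buf.extend_from_slice(c.encode_utf8(&mut tmp).as_bytes());
            }
            Unit::Byte(b) => buf.push(*b),
        }
    }
    // The buffer may be invalid part-way through, so validation happens here.
    String::from_utf8(buf)
}
```

With this shape, four `Unit::Byte` values `0xf0, 0x9f, 0xa6, 0x80` produce the same string as a single `Unit::Char('\u{1F980}')`, while a lone `Unit::Byte(0xff)` is rejected only at the final validation step.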
cc @rust-lang/lang @m-ou-se
nnethercote commented on Dec 13, 2023
Here's an alternative version of the table that I've been using and found helpful. It shows all the escapes directly instead of grouping them by name, it shows the changes proposed by the RFC (affected literal kinds have two lines connected by a `-->`, where the second line shows what changed), and it includes C string literals. The proposed changes I don't like are marked with `?`.
This makes it easier to see things like how adding `\x80`-`\xff` to `""` syntax would make it identical to `b""` syntax, but also make `""` syntax different to `''` syntax.
nnethercote commented on Dec 13, 2023
BTW, I have implemented the first three changes. They were pretty easy, and piggy-backed naturally off the existing support for mixed utf8 in C string literals, requiring only minor changes.
I haven't implemented the last two. They would both have required new kinds of checks, somewhat annoying to implement, which is what got me thinking about whether they are necessary.
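As an illustration of the kind of extra check the fifth change would need, here is a hypothetical helper (not taken from rustc) that accepts a `\u{…}` code point inside `b''` only when it lies in the ASCII range `\u{00}`-`\u{7f}`, since a byte literal denotes a single byte. The name `byte_from_unicode_escape` is an assumption made for this example.

```rust
/// Hypothetical check for a `\u{…}` escape inside a byte literal: only code
/// points in the ASCII range fit into the single byte that `b''` denotes.
fn byte_from_unicode_escape(code_point: u32) -> Option<u8> {
    if code_point <= 0x7f {
        // ASCII code points map to the byte with the same value.
        Some(code_point as u8)
    } else {
        // Anything larger would need more than one byte in UTF-8.
        None
    }
}
```

Every value this accepts can already be written with a `\x` escape, which is the redundancy pointed out earlier in the thread.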