Tracking Issue for unicode and escape codes in literals #116907

Description

@traviscross
Contributor

This is a tracking issue for RFC 3349 (rust-lang/rfcs#3349).

The feature gate for the issue is #![feature(mixed_utf8_literals)].

From the RFC:

Relax the restrictions on which characters and escape codes are allowed in string, char, byte string, and byte literals.

Most importantly, this means we accept the exact same characters and escape codes in "…" and b"…" literals. That is:

  • Allow unicode characters, including \u{…} escape codes, in byte string literals. E.g. b"hello\xff我叫\u{1F980}"
  • Also allow non-ASCII \x… escape codes in regular string literals, as long as they are valid UTF-8. E.g. "\xf0\x9f\xa6\x80"
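The byte sequences these two example literals are meant to denote can be checked on stable Rust today. The following is a minimal sketch, assuming (as the RFC describes) that a unicode character or \u{…} escape inside a byte string contributes its UTF-8 encoding. The function names are hypothetical and exist only for illustration.

```rust
/// Bytes that `b"hello\xff我叫\u{1F980}"` would denote under the RFC,
/// assembled from pieces that stable Rust already accepts.
fn proposed_byte_string() -> Vec<u8> {
    let mut bytes = Vec::new();
    bytes.extend_from_slice(b"hello\xff"); // ASCII plus one raw 0xff byte
    bytes.extend_from_slice("我叫".as_bytes()); // UTF-8 encoding of the two characters
    bytes.extend_from_slice("\u{1F980}".as_bytes()); // UTF-8 encoding of U+1F980
    bytes
}

/// The string that `"\xf0\x9f\xa6\x80"` would denote under the RFC.
fn proposed_string() -> &'static str {
    // The four bytes are the UTF-8 encoding of U+1F980, so validation succeeds.
    std::str::from_utf8(b"\xf0\x9f\xa6\x80").expect("bytes are valid UTF-8")
}

fn main() {
    println!("{:x?}", proposed_byte_string());
    println!("{}", proposed_string());
}
```

Note that the byte string as a whole is not valid UTF-8 because of the \xff byte, which is fine for a byte string but would be rejected in a regular string literal.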

About tracking issues

Tracking issues are used to record the overall progress of implementation. They are also used as hubs connecting to other relevant issues, e.g., bugs or open design questions. A tracking issue is however not meant for large scale discussion, questions, or bug reports about a feature. Instead, open a dedicated issue for the specific matter and add the relevant feature gate label.

Steps

Unresolved Questions

  • Should concat!("\xf0\x9f", "\xa6\x80") work? (The string literals are not valid UTF-8 individually, but are valid UTF-8 after being concatenated.)
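The situation behind this question can be reproduced on stable Rust with byte strings. This is a small sketch, not a statement about how concat! would behave; the function name is hypothetical.

```rust
/// Returns (first half valid, second half valid, concatenation valid)
/// for the two byte sequences from the unresolved question.
fn utf8_validity() -> (bool, bool, bool) {
    let first: &[u8] = b"\xf0\x9f";
    let second: &[u8] = b"\xa6\x80";
    // Join the halves into one owned buffer before validating.
    let whole: Vec<u8> = [first, second].concat();
    (
        std::str::from_utf8(first).is_ok(),
        std::str::from_utf8(second).is_ok(),
        std::str::from_utf8(&whole).is_ok(),
    )
}

fn main() {
    println!("{:?}", utf8_validity());
}
```

Each half is an incomplete multi-byte sequence on its own, and only the joined bytes form a complete encoding of U+1F980.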

Activity

added C-tracking-issue (Category: An issue tracking the progress of sth. like the implementation of an RFC) on Oct 18, 2023

traviscross commented on Oct 18, 2023

Contributor, Author

@rustbot labels +T-lang

traviscross commented on Oct 19, 2023

Contributor, Author

@rustbot labels +B-rfc-approved

added B-RFC-approved (Blocker: Approved by a merged RFC but not yet implemented.) on Oct 19, 2023

self-assigned this on Dec 6, 2023

nnethercote commented on Dec 6, 2023

Contributor

I would like to take this one.

nnethercote commented on Dec 13, 2023

Contributor

I have a partial implementation of this RFC working locally (EDIT: now at #120286). The RFC proposes five changes to literal syntax. I think three of them are good, and two of them aren't necessary.

b"": add unicode chars

Adding them fixes the first of two cases where b"" syntax isn't a superset of "" syntax. This is good, and facilitates "conventionally UTF-8" string literals.

br"": add unicode chars

Adding them fixes the one case where br"" syntax isn't a superset of r"" syntax. After this, br"" syntax and r"" syntax are the same. This is good, and also facilitates "conventionally UTF-8" string literals.

b"": add \u{NN} escapes

Adding them fixes the second of two cases where b"" syntax isn't a superset of "" syntax, and fits well with adding unicode chars. This is good.

Note: After adding this, the one thing b"" syntax has that "" syntax does not is \x80-\xff bytes.

"": add \x80-\xff

Is this necessary? What useful new functionality does this provide?

It would make "" and b"" syntax identical, but strings and byte strings aren't identical types, so that identicalness isn't needed.

The RFC says "Allowing all characters and all known escape codes in both types of string literals reduces the complexity of the language. We'd no longer have different escape codes for different literal types. We'd only require regular string literals to be valid UTF-8." So it has just traded one exception for another. IMO that's not a simplification.

It's odd that it would be possible to write a "" that isn't valid UTF-8... both conceptually, and in the implementation. For the latter you can no longer start with an empty String and append chars one at a time knowing it'll be valid UTF-8 the whole way, which is how it's currently handled. Instead you need to start with a Vec<u8>, append chars as byte sequences, and then UTF-8 validate at the end. It's not that difficult, but it's not needed for any other literal kind, and it's weird enough that, combined with the other points above, I question it.
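The Vec<u8>-then-validate approach described above can be sketched as a toy unescaper. This is a hypothetical illustration, not rustc's implementation: unescape_str is an invented name, and it handles only literal characters, \xNN, and \u{…}.

```rust
/// Toy unescaper for the body of a `"…"` literal (hypothetical; not rustc code).
/// Bytes are accumulated in a `Vec<u8>` and validated as UTF-8 only at the end.
fn unescape_str(src: &str) -> Result<String, String> {
    let mut buf: Vec<u8> = Vec::new();
    let mut chars = src.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            // Literal character: append its UTF-8 encoding.
            let mut tmp = [0u8; 4];
            buf.extend_from_slice(c.encode_utf8(&mut tmp).as_bytes());
            continue;
        }
        match chars.next() {
            Some('x') => {
                // `\xNN`: exactly two hex digits, appended as one raw byte.
                let hex: String = chars.by_ref().take(2).collect();
                if hex.len() != 2 || !hex.chars().all(|d| d.is_ascii_hexdigit()) {
                    return Err(format!("bad \\x escape: {:?}", hex));
                }
                let byte = u8::from_str_radix(&hex, 16).map_err(|e| e.to_string())?;
                buf.push(byte);
            }
            Some('u') => {
                // `\u{…}`: a code point, appended as its UTF-8 encoding.
                if chars.next() != Some('{') {
                    return Err("expected `{` after \\u".to_string());
                }
                let hex: String = chars.by_ref().take_while(|&d| d != '}').collect();
                if !hex.chars().all(|d| d.is_ascii_hexdigit()) {
                    return Err(format!("bad \\u escape: {:?}", hex));
                }
                let code = u32::from_str_radix(&hex, 16).map_err(|e| e.to_string())?;
                let ch = char::from_u32(code)
                    .ok_or_else(|| "invalid code point".to_string())?;
                let mut tmp = [0u8; 4];
                buf.extend_from_slice(ch.encode_utf8(&mut tmp).as_bytes());
            }
            other => return Err(format!("unsupported escape: {:?}", other)),
        }
    }
    // The whole buffer must be valid UTF-8; individual `\x` bytes need not be.
    String::from_utf8(buf).map_err(|e| e.to_string())
}

fn main() {
    println!("{:?}", unescape_str("\\xf0\\x9f\\xa6\\x80"));
    println!("{:?}", unescape_str("\\xff"));
}
```

The validation step at the end is the extra check that the other literal kinds would not need.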

Not doing this keeps "" syntax consistent with '', which makes sense given that "" and '' are both unicode-oriented rather than byte-oriented. This is another refutation of the complexity argument above.

Not doing this was suggested in the "Alternatives" section of the RFC.

Not doing this also renders moot the unresolved question of what to do with concat!("\xf0\x9f", "\xa6\x80").

b'': add \u{00}-\u{7f}

Is this necessary? It doesn't provide any useful new functionality.

The \x syntax is strictly more powerful, covering the range \x00-\xff. And supporting just the ASCII subset of \u escapes doesn't match the behaviour of any of the other literal syntaxes. Byte literals are about a single byte, why introduce Unicode-related stuff?

The quote from the RFC I mentioned above about complexity applies again, but again, it's just trading one exception for another.
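To illustrate the overlap, here is a minimal sketch on stable Rust, assuming that b'\u{41}' would denote the single byte 0x41. The function name is hypothetical.

```rust
/// The byte that a `\u{…}` escape in a byte literal would denote, if the
/// code point falls in the ASCII range the RFC proposes to allow.
fn ascii_code_point_as_byte(c: char) -> Option<u8> {
    if c.is_ascii() {
        Some(c as u8) // U+0000..=U+007F map one-to-one onto bytes 0x00..=0x7f
    } else {
        None // anything above U+007F has no single-byte meaning here
    }
}

fn main() {
    // The same byte is already expressible with `\x` or a plain ASCII character.
    println!("{:?}", ascii_code_point_as_byte('\u{41}') == Some(b'\x41'));
    println!("{:?}", ascii_code_point_as_byte('\u{80}'));
}
```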

cc @rust-lang/lang @m-ou-se

nnethercote commented on Dec 13, 2023

Contributor

Here's an alternative version of the table that I've been using and found helpful. It shows all the escapes directly instead of grouping them by name, it shows the changes proposed by the RFC (affected literal kinds have two lines connected by a -->, where the second line shows what changed), and it includes C string literals. The proposed changes I don't like are marked with ?.

        chars    escapes                                        mixed utf8
        -----    -------                                        ----------
- ''    unicode  \' \" \n \r \t \\ \0 \x00-\x7f \u{..}          no
    
- b''   ascii    \' \" \n \r \t \\ \0 \x00-\xff                 no    
  -->                                           \u{0}..\u{7f}?  yes?
    
- ""    unicode  \' \" \n \r \t \\ \0 \x00-\x7f \u{..}          no
  -->                                 \x00-\xff?                yes?

- r""   unicode  N/A                                            no

- b""   ascii    \' \" \n \r \t \\ \0 \x00-\xff                 no
  -->   unicode                                 \u{..}          yes
    
- br""  ascii    N/A                                            no
  -->   unicode
  
- c""   unicode  \' \" \n \r \t \\ __ \x01-\xff \u{..}          yes

- cr""  unicode  N/A                                            no

This makes it easier to see that, for example, adding \x80-\xff to "" syntax would make it identical to b"" syntax, but would also make "" syntax different to '' syntax.

nnethercote commented on Dec 13, 2023

Contributor

BTW, I have implemented the first three changes. They were pretty easy, and piggy-backed naturally off the existing support for mixed utf8 in C string literals, requiring only minor changes.

I haven't implemented the last two. They would both have required new kinds of checks, somewhat annoying to implement, which is what got me thinking about whether they are necessary.

Metadata

Assignees

Labels

  • B-RFC-approved: Blocker: Approved by a merged RFC but not yet implemented.
  • C-tracking-issue: Category: An issue tracking the progress of sth. like the implementation of an RFC
  • F-mixed_utf8_literals: #![feature(mixed_utf8_literals)]
  • I-lang-radar: Items that are on lang's radar and will need eventual work or consideration.
  • T-lang: Relevant to the language team

      Participants

      @joshtriplett @m-ou-se @traviscross @nnethercote @rustbot
