Commit 029f630

[lex.charset] Fix various issues with the description of UCNs.
Clarify that \U sequences not beginning 00 are ill-formed. Clarify handling of code points naming reserved or noncharacter code points. Remove unnecessary circumlocution through "short identifiers" by directly talking about code points. Use code point values directly rather than using C++ 0x notation. [lex.string] Fix description of what UCNs mean, and convert it to a note.
1 parent 035d46b commit 029f630

1 file changed, with 22 additions and 20 deletions

source/lex.tex

Lines changed: 22 additions & 20 deletions
@@ -241,18 +241,17 @@
 \terminal{\textbackslash U} hex-quad hex-quad
 \end{bnf}
 
-The character designated by the \grammarterm{universal-character-name} \tcode{\textbackslash
-U00NNNNNN} is that character
-that has \tcode{U+NNNNNN} as a code point short identifier;
-the character designated by the \grammarterm{universal-character-name}
-\tcode{\textbackslash uNNNN} is that character
-that has \tcode{U+NNNN} as a code point short identifier.
-If a \grammarterm{universal-character-name} does not correspond to
-a code point in ISO/IEC 10646 or
-if a \grammarterm{universal-character-name} corresponds to
-a surrogate code point,
-the program is ill-formed. Additionally, if
-a \grammarterm{universal-character-name} outside
+A \grammarterm{universal-character-name}
+designates the character in ISO/IEC 10646 (if any)
+whose code point is the hexadecimal number represented by
+the sequence of \grammarterm{hexadecimal-digit}s
+in the \grammarterm{universal-character-name}.
+The program is ill-formed if that number is not a code point
+or if it is a surrogate code point.
+Noncharacter code points and reserved code points
+are considered to designate separate characters distinct from
+any ISO/IEC 10646 character.
+If a \grammarterm{universal-character-name} outside
 the \grammarterm{c-char-sequence}, \grammarterm{s-char-sequence}, or
 \grammarterm{r-char-sequence} of
 a character or
@@ -262,10 +261,10 @@
 \grammarterm{r-char-sequence}\iref{lex.string} does not form a
 \grammarterm{universal-character-name}.}
 \begin{note}
-ISO/IEC 10646 code points are within the range 0x0-0x10FFFF (inclusive).
-A surrogate code point is a value in the range 0xD800-0xDFFF (inclusive).
+ISO/IEC 10646 code points are integers in the range $[0, \mathrm{10FFFF}]$ (hexadecimal).
+A surrogate code point is a value in the range $[\mathrm{D800}, \mathrm{DFFF}]$ (hexadecimal).
 A control character is a character whose code point is
-in either of the ranges 0x0-0x1F or 0x7F-0x9F (both inclusive).
+in either of the ranges $[0, \mathrm{1F}]$ or $[\mathrm{7F}, \mathrm{9F}]$ (hexadecimal).
 \end{note}
 
 \pnum
@@ -1219,7 +1218,7 @@
 provided that the code point value
 can be encoded as a single UTF-8 code unit.
 \begin{note}
-That is, provided the code point value is in the range 0x0-0x7F (inclusive).
+That is, provided the code point value is in the range $[0, \mathrm{7F}]$ (hexadecimal).
 \end{note}
 If the value is not representable with a single UTF-8 code unit,
 the program is ill-formed.
@@ -1238,7 +1237,7 @@
 provided that the code point value is
 representable with a single 16-bit code unit.
 \begin{note}
-That is, provided the code point value is in the range 0x0-0xFFFF (inclusive).
+That is, provided the code point value is in the range $[0, \mathrm{FFFF}]$ (hexadecimal).
 \end{note}
 If the value is not representable
 with a single 16-bit code unit, the program is ill-formed.
@@ -1771,9 +1770,12 @@
 string literal is the number of code units, not the number of
 characters.
 \end{note}
-Within \tcode{char32_t} and \tcode{char16_t}
-string literals, any \grammarterm{universal-character-name}{s} shall be within the range
-\tcode{0x0} to \tcode{0x10FFFF}. The size of a narrow string literal is
+\begin{note}
+Any \grammarterm{universal-character-name}{s} are required to
+correspond to a code point in the range
+$[0, \mathrm{D800})$ or $[\mathrm{E000}, \mathrm{10FFFF}]$ (hexadecimal)\iref{lex.charset}.
+\end{note}
+The size of a narrow string literal is
 the total number of escape sequences and other characters, plus at least
 one for the multibyte encoding of each \grammarterm{universal-character-name}, plus
 one for the terminating \tcode{'\textbackslash 0'}.
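
The ranges stated in the revised wording can be written as simple predicates. The following is a minimal sketch, not part of the commit or of the standard text: all function names are hypothetical and chosen only for illustration, and the code assumes the numeric value of a universal-character-name has already been parsed from its hexadecimal digits.

```cpp
#include <cstdint>

// Hypothetical helpers (assumed names, not from the standard or the commit).

// Code points are integers in the range [0, 10FFFF] (hexadecimal).
constexpr bool is_code_point(std::uint32_t value) {
    return value <= 0x10FFFF;
}

// Surrogate code points are values in the range [D800, DFFF] (hexadecimal).
constexpr bool is_surrogate(std::uint32_t value) {
    return value >= 0xD800 && value <= 0xDFFF;
}

// Per the revised wording, the program is ill-formed if the number is not
// a code point or if it is a surrogate code point. Equivalent range:
// [0, D800) or [E000, 10FFFF].
constexpr bool is_valid_ucn_value(std::uint32_t value) {
    return is_code_point(value) && !is_surrogate(value);
}

// Control characters have code points in [0, 1F] or [7F, 9F].
constexpr bool is_control_character(std::uint32_t value) {
    return value <= 0x1F || (value >= 0x7F && value <= 0x9F);
}

// Encodable as a single UTF-8 code unit: [0, 7F].
constexpr bool fits_single_utf8_code_unit(std::uint32_t value) {
    return value <= 0x7F;
}

// Representable with a single 16-bit code unit: [0, FFFF].
constexpr bool fits_single_utf16_code_unit(std::uint32_t value) {
    return value <= 0xFFFF;
}

// Compile-time checks of the boundary cases.
static_assert(is_valid_ucn_value(0x00E9), "U+00E9 is an ordinary code point");
static_assert(!is_valid_ucn_value(0xD800), "surrogates are rejected");
static_assert(!is_valid_ucn_value(0x01000000), "a \\U value not beginning 00 is out of range");
static_assert(is_valid_ucn_value(0xFFFF), "a noncharacter is still a code point");
```

Noncharacter and reserved code points pass `is_valid_ucn_value` in this sketch, which matches the commit's statement that they designate separate characters rather than making the program ill-formed.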
