This is an unofficial snapshot of the ISO/IEC JTC1 SC22 WG21 Core Issues List revision 114a. See http://www.open-std.org/jtc1/sc22/wg21/ for the official list.

2024-04-18


1403. Universal-character-names in comments

Section: 5.7  [lex.comment]     Status: CD6     Submitter: David Krauss     Date: 2011-10-05

[ Resolved by P2314R4, adopted in October, 2021. ]

According to 5.3 [lex.charset] paragraph 2,

If the hexadecimal value for a universal-character-name corresponds to a surrogate code point (in the range 0xD800-0xDFFF, inclusive), the program is ill-formed. Additionally, if the hexadecimal value for a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character (in either of the ranges 0x00-0x1F or 0x7F-0x9F, both inclusive) or to a character in the basic source character set, the program is ill-formed.

These restrictions should not apply to comment text. Arguably the prohibitions of control characters and characters in the basic character set already do not apply, as they require that the preprocessing tokens for literals have already been recognized; this occurs in phase 3, which also replaces comments with single spaces. However, the prohibition of surrogate code points is not so limited and might conceivably be applied within comments.

Probably the most straightforward way of addressing this problem would be simply to state in 5.7 [lex.comment] that character sequences that resemble universal-character-names are not recognized as such within comment text.

Additional note (February, 2022):

P2314R4 Character sets and encodings (approved in October, 2021) effected changes so that extended characters are no longer translated to UCNs in phase 1.