[lex.charset] p5 The associated character type of a code unit is not clear #5247

xmh0511 · 2022-02-02T15:11:49Z

[lex.charset] p5 states only:

> A code unit is an integer value of character type ([basic.fundamental]). Characters in a character-literal other than a multicharacter or non-encodable character literal or in a string-literal are encoded as a sequence of one or more code units, as determined by the encoding-prefix ([lex.ccon], [lex.string]);

Which concrete character type is meant, and what determines it? The paragraph only says that the characters are encoded as a sequence of one or more code units, in other words, a sequence of integer values of some character type. Specifying that character type clearly is significant. Consider this example:

    auto c = 'ʉ';

The Unicode code point of the character ʉ is U+0289. [lex.ccon] p1 states:

> A non-encodable character literal is a character-literal whose c-char-sequence consists of a single c-char that is not a numeric-escape-sequence and that specifies a character that either lacks representation in the literal's associated character encoding or that cannot be encoded as a single code unit.

So, whether it is a non-encodable character literal depends on whether the character:

- lacks representation in the literal's associated character encoding, or
- cannot be encoded as a single code unit.

Assume that the first bullet is false in a given case. Then whether 'ʉ' is a non-encodable character literal depends on the range of values a code unit can represent, that is, on the representable values of the character type. However, we do not explicitly specify the character type of the code unit for each kind of character-literal or string-literal, although it is implied by the Type column of the corresponding table.
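To make the dependency on the code unit's range concrete, here is a minimal sketch that counts the code units ʉ needs under each UTF encoding-prefix. The helper `code_unit_count` and the variable names are made up for illustration; the character is spelled as the universal-character-name `\u0289` so that the result does not depend on the source file encoding.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string-literal,
// excluding the terminating null code unit.
template <typename CharT, std::size_t N>
constexpr std::size_t code_unit_count(const CharT (&)[N]) { return N - 1; }

// U+0289 needs two 8-bit code units in UTF-8 (0xCA 0x89) ...
constexpr std::size_t utf8_units  = code_unit_count(u8"\u0289");
// ... but only one 16-bit or 32-bit code unit in UTF-16 / UTF-32.
constexpr std::size_t utf16_units = code_unit_count(u"\u0289");
constexpr std::size_t utf32_units = code_unit_count(U"\u0289");

// With the u prefix the single code unit fits into char16_t,
// so the character-literal is encodable and has the code point as its value.
constexpr char16_t utf16_unit = u'\u0289';
```

Under these assumptions the same character is encodable as a single code unit for `u` and `U` but not for `u8`, which is exactly why the character type behind "a single code unit" needs to be pinned down.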

Should we improve [lex.charset] p5 to make that meaning clearer?

> A code unit is an integer value of character type ([basic.fundamental]). Characters in a character-literal other than a multicharacter or non-encodable character literal or in a string-literal are encoded as a sequence of one or more code units, as determined by the encoding-prefix ([lex.ccon], [lex.string]); where the character type of a code unit is specified by the type or element type.
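The "type or element type" in the wording above is what the Type columns of the tables in [lex.ccon] and [lex.string] give. A rough sketch of how that mapping can be observed in a conforming implementation follows; the alias `bare_t` and the variable names are illustrative only, and the `u8` case is guarded because `char8_t` exists only since C++20.

```cpp
#include <type_traits>

// "a"[0] is an lvalue of const-qualified element type, so strip the
// reference and cv-qualifiers before comparing.
template <typename T>
using bare_t = std::remove_cv_t<std::remove_reference_t<T>>;

// No prefix: type char, element type char.
constexpr bool ordinary_is_char =
    std::is_same_v<decltype('a'), char> &&
    std::is_same_v<bare_t<decltype("a"[0])>, char>;

// L prefix: type wchar_t, element type wchar_t.
constexpr bool wide_is_wchar_t =
    std::is_same_v<decltype(L'a'), wchar_t> &&
    std::is_same_v<bare_t<decltype(L"a"[0])>, wchar_t>;

// u prefix: type char16_t, element type char16_t.
constexpr bool utf16_is_char16_t =
    std::is_same_v<decltype(u'a'), char16_t> &&
    std::is_same_v<bare_t<decltype(u"a"[0])>, char16_t>;

// U prefix: type char32_t, element type char32_t.
constexpr bool utf32_is_char32_t =
    std::is_same_v<decltype(U'a'), char32_t> &&
    std::is_same_v<bare_t<decltype(U"a"[0])>, char32_t>;

#ifdef __cpp_char8_t
// u8 prefix (C++20 and later): type char8_t, element type char8_t.
constexpr bool utf8_is_char8_t =
    std::is_same_v<decltype(u8'a'), char8_t> &&
    std::is_same_v<bare_t<decltype(u8"a"[0])>, char8_t>;
#endif
```

This only shows the type of the literal or of its elements; that this same type is the character type of each code unit is the link the issue asks to have stated explicitly.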


xmh0511 mentioned this issue Mar 28, 2023

CWG2779 [lex.ccon] What are the types of single code units of character-literal and string-literral? cplusplus/CWG#285

Open
