[lex.charset] p5 The associated character type of a code unit is not clear #5247

Open
xmh0511 opened this issue Feb 2, 2022 · 0 comments
[lex.charset] p5 just states

A code unit is an integer value of character type ([basic.fundamental]). Characters in a character-literal other than a multicharacter or non-encodable character literal or in a string-literal are encoded as a sequence of one or more code units, as determined by the encoding-prefix ([lex.ccon], [lex.string]);

What is the concrete character type, and what determines it? We just say that the characters are encoded as a sequence of one or more code units, in other words, a sequence of integer values of the character type. Stating clearly which character type is meant is significant. Consider this example:

 auto c = 'ʉ';

The character ʉ is U+0289, so its Unicode code point value is 0x289 (649 in decimal). [lex.ccon] p1 just states

A non-encodable character literal is a character-literal whose c-char-sequence consists of a single c-char that is not a numeric-escape-sequence and that specifies a character that either lacks representation in the literal's associated character encoding or that cannot be encoded as a single code unit.

So whether a literal is a non-encodable character literal depends on whether the character:

  • lacks representation in the literal's associated character encoding, or
  • cannot be encoded as a single code unit.

Assume the first bullet never applies, that is, the associated encoding can always represent the character. Then whether ʉ is a non-encodable character literal depends on the range of values a single code unit can represent, which is the range of representable values of its character type. However, we never explicitly specify the character type of the code units for each kind of character-literal or string-literal, although it is implied by the Type column in the corresponding table.
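To make that dependency concrete, here is a small sketch (mine, not taken from the standard) that counts how many code units U+0289 needs under the `u8`, `u`, and `U` prefixes. The helper name `code_units` is hypothetical, and the universal-character-name `\u0289` stands in for ʉ so that the result does not depend on the source file encoding.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null code unit.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+0289 needs two UTF-8 code units (0xCA 0x89), so it cannot be
// encoded as the single code unit of a u8 character literal.
static_assert(code_units(u8"\u0289") == 2);

// One UTF-16 code unit or one UTF-32 code unit is enough.
static_assert(code_units(u"\u0289") == 1);
static_assert(code_units(U"\u0289") == 1);

// In those cases the single code unit holds the code point value.
static_assert(u'\u0289' == 0x289);
static_assert(U'\u0289' == 0x289);
```

The same character is therefore encodable or not depending only on the prefix, which is why the width of the code unit's character type matters for the second bullet.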

Should we improve [lex.charset] p5 to make that meaning clearer?

A code unit is an integer value of character type ([basic.fundamental]). Characters in a character-literal other than a multicharacter or non-encodable character literal or in a string-literal are encoded as a sequence of one or more code units, as determined by the encoding-prefix ([lex.ccon], [lex.string]); where the character type of a code unit is specified by the type or element type.
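As an illustration of what "the type or element type" would refer to, the following sketch checks the type of a character-literal and the element type of a string-literal for several prefixes. The helper `element_type_is` is a hypothetical name introduced only for this example. The `u8` prefix is left to a size check because its element type is `char8_t` from C++20 onward and `char` before that.

```cpp
#include <cstddef>
#include <type_traits>

// Type of a character-literal, selected by its encoding-prefix.
static_assert(std::is_same_v<decltype('a'), char>);
static_assert(std::is_same_v<decltype(L'a'), wchar_t>);
static_assert(std::is_same_v<decltype(u'a'), char16_t>);
static_assert(std::is_same_v<decltype(U'a'), char32_t>);

// Hypothetical helper: true if the string literal's element type,
// ignoring const, is Expected.
template <typename Expected, typename CharT, std::size_t N>
constexpr bool element_type_is(const CharT (&)[N]) {
    return std::is_same_v<Expected, CharT>;
}

// Element type of a string-literal, selected by its encoding-prefix.
static_assert(element_type_is<char>("a"));
static_assert(element_type_is<wchar_t>(L"a"));
static_assert(element_type_is<char16_t>(u"a"));
static_assert(element_type_is<char32_t>(U"a"));

// u8 code units are one byte wide in either language version.
static_assert(sizeof(u8"a"[0]) == 1);
```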
