[lex.ccon] What is the single code unit for an ordinary character literal or wide character literal? #4517

Open
xmh0511 opened this issue Feb 24, 2021 · 3 comments

@xmh0511
Contributor

xmh0511 commented Feb 24, 2021

[lex.ccon]#1 specifies the following special rule:

A non-encodable character literal is a character-literal whose c-char-sequence consists of a single c-char that is not a numeric-escape-sequence and that specifies a character that either lacks representation in the literal's associated character encoding or that cannot be encoded as a single code unit.

The Unicode Standard specifies how large a code unit is for UTF-8, UTF-16, and UTF-32 (8, 16, and 32 bits respectively), which matches the meaning given in the Wikipedia article Character_encoding. However, the C++ standard does not state how large a code unit is for the encoding of the execution (wide-)character set. So, in this case, how can one determine whether the code point value of a character in an ordinary or wide character literal can be encoded as a single code unit for the corresponding kind of character literal?
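
To make the contrast concrete, here is a minimal sketch (my own illustration, not wording from the standard; it assumes a C++11 or later compiler) of how the code unit counts are fixed for the UTF encodings. String literals are used so that characters needing more than one code unit are still well-formed. `code_units` is a hypothetical helper name.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null code unit.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+00E9 needs two UTF-8 code units but only one in UTF-16 or UTF-32.
static_assert(code_units(u8"\u00e9") == 2, "UTF-8: two 8-bit code units");
static_assert(code_units(u"\u00e9") == 1, "UTF-16: one 16-bit code unit");
static_assert(code_units(U"\u00e9") == 1, "UTF-32: one 32-bit code unit");

// U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair.
static_assert(code_units(u8"\U0001F600") == 4, "UTF-8: four code units");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16: surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32: one code unit");
```

No comparable assertion can be written portably for `"\u00e9"` or `L"\u00e9"`, because the result depends on the implementation-defined encoding of the execution (wide-)character set, which is the gap this issue is about.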

Would it be a good idea to change the wording "cannot be encoded as a single code unit" to "cannot be represented by an object with the type of the corresponding kind of character-literal"?
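
As a rough illustration of what the suggested wording would test, the sketch below checks whether a code point value lies within the value range of a literal's type. `representable_in` is a hypothetical helper, not something from the standard, and it looks only at the range of the type; it says nothing about which encoding the implementation uses.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: true if `value` lies within the range of CharT.
template <typename CharT>
constexpr bool representable_in(std::uintmax_t value) {
    return value <=
           static_cast<std::uintmax_t>(std::numeric_limits<CharT>::max());
}

static_assert(representable_in<char>(0x41), "0x41 fits in char");
static_assert(representable_in<char32_t>(0x1F600), "fits in char32_t");
// Assumes char16_t is exactly 16 bits wide, as on common implementations.
static_assert(!representable_in<char16_t>(0x1F600),
              "does not fit in char16_t");
```

For values between 0x80 and 0xFF the answer for `char` depends on whether `char` is signed, so even this simple criterion has implementation-specific results.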

xmh0511 changed the title from "what is the single code unit for a ordinary character literal" to "what is the single code unit for an ordinary character literal or wide character literal" on Feb 24, 2021
@jensmaurer
Member

I think [basic.fundamental] p7 and p8 try to establish the relationship between the type and code unit, but this could certainly be clearer.

@xmh0511
Contributor Author

xmh0511 commented Feb 25, 2021

I think [basic.fundamental] p7 and p8 try to establish the relationship between the type and code unit, but this could certainly be clearer.

p7 states:

The values of type char can represent distinct codes for all members of the implementation's basic character set.

However, it is unclear whether the wording "implementation's basic character set" refers to the "basic source character set" or the "basic execution character set". Presumably, it refers to the latter. But, as stated in [lex.charset#3], the execution character set is a superset of the basic execution character set.

Take the execution character set as set S and the basic execution character set as set A, where A ⊆ S.

As the table [lex.ccon]#tab:lex.ccon.literal indicates, we don't know whether an element in the absolute complement (∁_U A) of the basic execution character set can be encoded in a char object. After all, the standard does not specify how the execution character set is encoded, except that it specifies the value 0 for the null character.

jensmaurer changed the title from "what is the single code unit for an ordinary character literal or wide character literal" to "[lex.ccon] What is the single code unit for an ordinary character literal or wide character literal?" on Mar 9, 2021
@jensmaurer
Member

jensmaurer commented Mar 26, 2021

This is being addressed by P2314 "Character sets and encodings" (cplusplus/papers#998).
