[lex.charset] Change "short name" to "short identifier" to match ISO 10646 #2201

Closed
wants to merge 1 commit into from


@rmartinho rmartinho commented Jun 20, 2018

ISO 10646 doesn't have a "short name" concept (there is a "Jamo short name", but that is specific to the Hangul script and clearly not the intended meaning here). What ISO 10646 does have is a "short identifier" concept, which is clearly what is intended here. I have made minimal changes to this wording in order to use the "short identifier" concept.

For clarity, I am reproducing here the relevant text from ISO 10646.

a) The six-digit form of short identifier consists of the sequence of six hexadecimal digits that represents the code point of the character (see 6.2).

b) The four-to-five-digit form of short identifier shall consist of the last four to five digits of the six-digit form. Leading zeroes beyond four digits are suppressed.

c) The character “+” (PLUS SIGN) may, as an option, precede the digit form of short identifier.

d) The prefix letter “U” (LATIN CAPITAL LETTER U) may, as an option, precede any of the three forms of short identifier defined in a) to c) above.

The capital letters A to F, and U that appear within short identifiers may be replaced by the corresponding small letters.
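
As an illustration of rules a) to d) quoted above, here is a minimal C++ sketch that produces some of the permitted spellings for a code point. The function names (`six_digit_form`, `four_to_five_digit_form`, `u_plus_form`) are hypothetical and chosen only for this example; they come from neither ISO 10646 nor the C++ draft. The sketch assumes the value fits in six hexadecimal digits.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Rule a): six hexadecimal digits representing the code point.
// Hypothetical helper; assumes cp fits in six hex digits.
std::string six_digit_form(std::uint32_t cp) {
    assert(cp <= 0xFFFFFF);
    char buf[16];
    std::snprintf(buf, sizeof buf, "%06X", static_cast<unsigned>(cp));
    return buf;
}

// Rule b): the last four to five digits of the six-digit form, with
// leading zeros beyond four digits suppressed. Only meaningful when the
// first digit of the six-digit form is zero, i.e. cp <= 0xFFFFF.
std::string four_to_five_digit_form(std::uint32_t cp) {
    assert(cp <= 0xFFFFF);
    char buf[16];
    std::snprintf(buf, sizeof buf, "%04X", static_cast<unsigned>(cp));
    return buf;
}

// Rules c) and d): "+" and "U" are optional prefixes. This helper picks
// the U+ spelling with the shortest digit form that applies.
std::string u_plus_form(std::uint32_t cp) {
    return "U+" + (cp <= 0xFFFFF ? four_to_five_digit_form(cp)
                                 : six_digit_form(cp));
}
```

Under these assumptions, `six_digit_form(0x4A)` yields `00004A` and `four_to_five_digit_form(0x4A)` yields `004A`; both identify the same character.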

Also note that "short identifier" is already used in [cpp.predefined], 2.4 (http://eel.is/c++draft/cpp.predefined#2.4).

Fixes #2109.

@rmartinho rmartinho changed the title on Jun 20, 2018
from: Change "short name" to "short identifier" to match ISO 10646
to: [lex.charset] Change "short name" to "short identifier" to match ISO 10646
source/lex.tex Outdated
\tcode{NNNNNNNN}; the character designated by the \grammarterm{universal-character-name}
\tcode{\textbackslash uNNNN} is that character whose character short name in
ISO/IEC 10646 is \tcode{0000NNNN}. If the hexadecimal value for a
U00NNNNNN} is that character whose character short identifer in ISO/IEC 10646 is
Member

typo: "identifer"

\tcode{\textbackslash uNNNN} is that character whose character short name in
ISO/IEC 10646 is \tcode{0000NNNN}. If the hexadecimal value for a
U00NNNNNN} is that character whose character short identifer in ISO/IEC 10646 is
\tcode{NNNNNN}; the character designated by the \grammarterm{universal-character-name}
Member

What if I write \U00000041 in my source code? Does the character short identifier "000041" exist in Unicode? If it does exist, what about \u0041; is the character short identifier here "0041"? Why are there two identifiers naming the same thing? What about lowercase vs. uppercase hex digits? Should we refer to the hexadecimal value somehow?

Author

@rmartinho rmartinho Jun 27, 2018

Yes, the short identifier concept gives more than one identifier for each character. For the character with scalar value 0x4A, all of the following are valid short identifiers: 00004A, 004A, +00004A, +004A, U00004A, U004A, U+00004A, U+004A, 00004a, 004a, +00004a, +004a, U00004a, U004a, U+00004a, U+004a, u00004A, u004A, u+00004A, u+004A, u00004a, u004a, u+00004a, u+004a. Any of those unambiguously identifies the same character. If there were more A-F digits, there would be even more possible identifiers (there are 384 possible short identifiers for 0xAAAAA). The syntax is given with the description I quoted above, and also with the following BNF:

{ U | u } {+}(xxxx | xxxxx | xxxxxx)

where “x” represents one hexadecimal digit (0 to 9, A to F, or a to f), and with the additional requirement that the 5-digit form is not allowed to have leading zeros (so 0041 and 000041 are both valid, but 00041 isn't). I don't know why the choice was made to have this much flexibility.
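
The grammar and the counts above can be checked with a short C++ sketch. The names `is_short_identifier`, `hex_digits`, and `all_short_identifiers` are hypothetical helpers written for this illustration only; the validator follows the BNF as quoted in this comment, plus the rule that the five-digit form has no leading zero.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>
#include <vector>

// True if s matches { U | u } {+} (xxxx | xxxxx | xxxxxx), where the
// five-digit form must not start with a zero.
bool is_short_identifier(const std::string& s) {
    std::size_t i = 0;
    if (i < s.size() && (s[i] == 'U' || s[i] == 'u')) ++i;
    if (i < s.size() && s[i] == '+') ++i;
    const std::size_t digits = s.size() - i;
    if (digits < 4 || digits > 6) return false;
    if (digits == 5 && s[i] == '0') return false;
    for (; i < s.size(); ++i)
        if (!std::isxdigit(static_cast<unsigned char>(s[i]))) return false;
    return true;
}

// Uppercase hexadecimal digits of cp, zero-padded to the given width.
std::string hex_digits(std::uint32_t cp, int width) {
    static const char* const kHex = "0123456789ABCDEF";
    std::string out(width, '0');
    for (int i = width - 1; i >= 0; --i, cp >>= 4) out[i] = kHex[cp & 0xF];
    return out;
}

// Every short identifier for cp: each digit form, each upper/lower case
// choice for the A-F digits, and each of the six prefix choices.
std::vector<std::string> all_short_identifiers(std::uint32_t cp) {
    assert(cp <= 0xFFFFFF);
    std::vector<std::string> forms{hex_digits(cp, 6)};
    if (cp <= 0xFFFF) forms.push_back(hex_digits(cp, 4));
    else if (cp <= 0xFFFFF) forms.push_back(hex_digits(cp, 5));

    static const char* const prefixes[] = {"", "+", "U", "U+", "u", "u+"};
    std::vector<std::string> result;
    for (const std::string& form : forms) {
        std::vector<std::size_t> letters;  // positions of A-F digits
        for (std::size_t i = 0; i < form.size(); ++i)
            if (form[i] >= 'A') letters.push_back(i);
        for (unsigned mask = 0; mask < (1u << letters.size()); ++mask) {
            std::string digits = form;
            for (std::size_t b = 0; b < letters.size(); ++b)
                if (mask & (1u << b))
                    digits[letters[b]] = static_cast<char>(std::tolower(
                        static_cast<unsigned char>(digits[letters[b]])));
            for (const char* prefix : prefixes)
                result.push_back(prefix + digits);
        }
    }
    return result;
}
```

With these helpers, `all_short_identifiers(0x4A).size()` is 24 and `all_short_identifiers(0xAAAAA).size()` is 384, matching the figures given above: two digit forms, times the case choices for each A-F digit, times six prefixes.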

Referring to the hexadecimal value may actually be a better choice; that wording is actually used in the very next sentence to forbid surrogates. If we want to do it this way, I would rewrite in the following manner.

The character designated by the universal-character-name \UNNNNNNNN is that character whose code point in ISO/IEC 10646 has the hexadecimal value NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose code point in ISO/IEC 10646 has the hexadecimal value 0000NNNN.

(That last bit can also be "has the hexadecimal value NNNN", without the leading zeros.)
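
A small C++ sketch of what the "hexadecimal value" phrasing amounts to: the four-digit and eight-digit spellings are read as plain hexadecimal numbers, so leading zeros and digit case do not matter. The helper `hex_value` is hypothetical and exists only for this illustration; the `static_assert`s rely on a `char32_t` character literal having the value of its code point.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>

// Hypothetical helper: the value of the digits NNNN or NNNNNNNN of a
// universal-character-name, read as a hexadecimal number.
std::uint32_t hex_value(const std::string& digits) {
    assert(digits.size() == 4 || digits.size() == 8);
    std::uint32_t value = 0;
    for (char c : digits) {
        const unsigned char uc = static_cast<unsigned char>(c);
        assert(std::isxdigit(uc));
        const int d = std::isdigit(uc) ? c - '0' : std::tolower(uc) - 'a' + 10;
        value = value * 16 + static_cast<std::uint32_t>(d);
    }
    return value;
}

// \uNNNN and \U0000NNNN designate the same character.
static_assert(U'\u00E9' == U'\U000000E9', "both spellings name U+00E9");
static_assert(static_cast<std::uint32_t>(U'\U0001F600') == 0x1F600,
              "the literal's value is the code point");
```

For example, `hex_value("00E9")` and `hex_value("000000e9")` both give 0xE9, so there is a single value per character even though several spellings exist.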

Also for clarity, ISO 10646 defines "code point" as "value in the UCS codespace" (UCS being short for the character set specified by ISO 10646).

I originally just did s/short name/short identifier/ because that produced minimal changes, but I can rephrase it in terms of hexadecimal value as above if that's preferred.

Member

@jensmaurer jensmaurer left a comment

Thanks for the explanation. I think the current set of changes is good, although I have a few questions that should be answered by a core issue. For example, what happens if I say \U99004141? Is that ill-formed, undefined behavior, or something else? Also, I would be very much in favor of harmonizing towards U+1234 references when talking about Unicode characters.

@jensmaurer
Member

jensmaurer commented Jun 27, 2018

Oh, could you please squash all commits and force-push? Thanks. And the commit message should have "[lex.charset]" in front.

U00NNNNNN} is that character whose character short identifier in ISO/IEC 10646 is
\tcode{NNNNNN}; the character designated by the \grammarterm{universal-character-name}
\tcode{\textbackslash uNNNN} is that character whose character short identifier in
ISO/IEC 10646 is \tcode{NNNN}. If the hexadecimal value for a
Contributor

Sorry for the drive-by, but why do we say "hexadecimal value"? Why not just "value"? In which way does the value depend on a particular serialization format?

Member

Can we keep that separate, please? This is enough of a tar pit already, and might benefit from a more wholesale rework.

@rmartinho
Author

I'll squash and fix the commit message.
Regarding the core issue, should I open one, then?
And harmonizing to U+ notation, would that need a paper? I can write it if so.

@jensmaurer
Member

@rmartinho, I think with the editorial change we're currently looking at, we've got a good improvement: from "undefined term" to "well-defined term".
I must admit I'm not so enthusiastic trying to meddle with the words here even more, but if you feel like writing a short paper (essentially showing a single-line summary for each issue addressed plus the wording changes), and that also cleans up @tkoeppe's concern, let's go for it. No need to have a core issue on top of that paper.
The thrust of the paper should be to use the terms (such as surrogate pair) from ISO 10646 as defined there and to make sure that we keep all explanations of such terms (e.g. value ranges) to non-normative text.

@rmartinho
Author

Squashed and fixed the commit message. I'll give that paper a thought, then.

@zygoloid
Member

I would like to change from NNNNNN to U+NNNNNN (for this particular wording) in this change; we're already using U+NNNN in other places, and it seems to be the more common form for unambiguously writing Unicode character short identifiers (though I don't know if ISO/IEC 10646 specifies a preference between the valid forms).

@zygoloid
Member

We should also agree on what typeface to use for Unicode short identifiers. In [time.duration.io]p4, we use body text font, complete with its not-especially-aesthetically-appealing plus sign with slightly unsatisfying kerning.

In Table 2, we use teletype font (and no U+ prefix).

http://www.unicode.org/versions/Unicode11.0.0/appA.pdf says that dropping the U+ prefix is appropriate in tables and in ranges, so what we're doing in Table 2 seems fine. It uses body text font, but has a more appealing plus sign than appears in our body font.

\tcode{NNNNNNNN}; the character designated by the \grammarterm{universal-character-name}
\tcode{\textbackslash uNNNN} is that character whose character short name in
ISO/IEC 10646 is \tcode{0000NNNN}. If the hexadecimal value for a
U00NNNNNN} is that character whose character short identifier in ISO/IEC 10646 is
Member

This rewording has lost the specification for a universal-character-name beginning with \U01 (etc). I think we need a normative change to properly address this -- it doesn't seem right to just remove the specification for these cases, but the old specification is clearly wrong, as there is no character with the specified short identifier.

@zygoloid zygoloid added the "cwg" label (Issue must be reviewed by CWG.) and removed the "cwg" label Jun 29, 2018
@zygoloid
Member

Question for CWG: what is the status of a program like:

char32_t x[] = U"\U00110000";  // U+110000 is a Unicode short identifier but there is no such character
char32_t y[] = U"\U01000000"; // U+1000000 is not a Unicode short identifier

@zygoloid zygoloid force-pushed the master branch 2 times, most recently from e3dbfe2 to 1a21a65, July 7, 2018 23:19
@jensmaurer
Member

Regarding @zygoloid's questions, it seems we should not require the compiler to contain a list of valid characters (in particular, since that list is updated from time to time). Thus, "x" should be syntactically valid and produce the expected number. In contrast, "y" should be ill-formed.
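
One possible reading of this comment, sketched in C++: the check looks only at whether the value can be written as a short identifier (at most six hexadecimal digits) and never consults a list of assigned characters. The names `UcnStatus`, `classify_ucn_value`, and `check_examples` are hypothetical, and this is not the wording that was eventually adopted; the surrogate case reflects the existing rule mentioned earlier in the thread, using the range 0xD800 to 0xDFFF.

```cpp
#include <cassert>
#include <cstdint>

enum class UcnStatus { Ok, Surrogate, NotAShortIdentifier };

// Assumption for this sketch: a value is acceptable if it fits in the
// six-digit form of a short identifier and is not a surrogate code point.
// No table of assigned characters is needed.
UcnStatus classify_ucn_value(std::uint32_t value) {
    if (value > 0xFFFFFF) return UcnStatus::NotAShortIdentifier;
    if (value >= 0xD800 && value <= 0xDFFF) return UcnStatus::Surrogate;
    return UcnStatus::Ok;
}

// The two values from the question above.
void check_examples() {
    // \U00110000: U+110000 is a short identifier, so "x" is accepted here
    // even though no such character exists.
    assert(classify_ucn_value(0x00110000) == UcnStatus::Ok);
    // \U01000000: needs seven digits, so "y" is rejected.
    assert(classify_ucn_value(0x01000000) == UcnStatus::NotAShortIdentifier);
}
```

The design choice illustrated is that the test is purely numeric, so it stays stable when new characters are assigned.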

@jensmaurer jensmaurer added the "not-editorial" label (Issue is not deemed editorial; the editorial issue is kept open for tracking.) Oct 11, 2018
@rmartinho
Author

rmartinho commented Jan 21, 2019

I have submitted P1139R0 to address the remaining issues as discussed here.

@jensmaurer
Member

Fixed by "P1139R2 Address wording issues related to ISO 10646" (#2687).

@jensmaurer jensmaurer closed this Mar 8, 2019
Labels
cwg: Issue must be reviewed by CWG.
not-editorial: Issue is not deemed editorial; the editorial issue is kept open for tracking.
Development

Successfully merging this pull request may close these issues.

[lex.charset] ISO/IEC 10646 does not define "character short name"
4 participants