[lex.charset] Change "short name" to "short identifier" to match ISO 10646 #2201

rmartinho · 2018-06-20T12:47:54Z

ISO 10646 doesn't have a "short name" concept (there is a "Jamo short name", but that is specific to the Hangul script and clearly not the intended meaning here). What ISO 10646 does have is a "short identifier" concept, which is clearly what is intended here. I have made minimal changes to this wording in order to use the "short identifier" concept.

For clarity, I am reproducing here the relevant text from ISO 10646.

a) The six-digit form of short identifier consists of the sequence of six hexadecimal digits that represents the code point of the character (see 6.2).

b) The four-to-five-digit form of short identifier shall consist of the last four to five digits of the six-digit form. Leading zeroes beyond four digits are suppressed.

c) The character “+” (PLUS SIGN) may, as an option, precede the digit form of short identifier.

d) The prefix letter “U” (LATIN CAPITAL LETTER U) may, as an option, precede any of the three forms of short identifier defined in a) to c) above.

The capital letters A to F, and U that appear within short identifiers may be replaced by the corresponding small letters.
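
As an illustration only (this is not part of the quoted ISO text), rules a) to d) and the letter-case rule can be sketched as a function that enumerates every spelling of the short identifier for a code point. The name `short_identifiers` is hypothetical, and the sketch assumes that code points at or above 0x100000 have only the six-digit form, since dropping their leading digit would name a different character.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper, not part of any standard API: returns all spellings of
// the short identifier for `cp` under rules a) to d) plus the letter-case rule.
std::set<std::string> short_identifiers(std::uint32_t cp) {
    std::set<std::string> out;
    if (cp > 0xFFFFFFu) return out;  // does not fit in six hexadecimal digits

    // a) six-digit form, upper-case digits first
    static const char digits[] = "0123456789ABCDEF";
    std::string six(6, '0');
    std::uint32_t v = cp;
    for (int i = 5; i >= 0; --i, v >>= 4) six[static_cast<std::size_t>(i)] = digits[v & 0xFu];

    // b) four-to-five-digit form: leading zeros beyond four digits are suppressed
    // (assumption: no such form when all six digits are significant)
    std::vector<std::string> forms{six};
    if (cp < 0x10000u) forms.push_back(six.substr(2));
    else if (cp < 0x100000u) forms.push_back(six.substr(1));

    // c) optional "+", d) optional "U", and "U" may be written "u"
    const char* prefixes[] = {"", "+", "U", "U+", "u", "u+"};

    for (const std::string& form : forms) {
        // positions of the letters A to F, each of which may be lower-cased
        std::vector<std::size_t> letters;
        for (std::size_t i = 0; i < form.size(); ++i)
            if (form[i] >= 'A' && form[i] <= 'F') letters.push_back(i);

        for (std::uint32_t mask = 0; mask < (1u << letters.size()); ++mask) {
            std::string s = form;
            for (std::size_t k = 0; k < letters.size(); ++k)
                if (mask & (1u << k))
                    s[letters[k]] = static_cast<char>(s[letters[k]] - 'A' + 'a');
            for (const char* p : prefixes) out.insert(p + s);
        }
    }
    return out;
}
```

Under these assumptions, `short_identifiers(0x4A).size()` is 24 and `short_identifiers(0xAAAAA).size()` is 384 (two digit forms, six prefixes, and two cases per letter digit).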

Also note that "short identifier" is already used in [cpp.predefined], 2.4 (http://eel.is/c++draft/cpp.predefined#2.4)

Fixes #2109.

jensmaurer · 2018-06-27T11:56:50Z

source/lex.tex

-\tcode{NNNNNNNN}; the character designated by the \grammarterm{universal-character-name}
-\tcode{\textbackslash uNNNN} is that character whose character short name in
-ISO/IEC 10646 is \tcode{0000NNNN}. If the hexadecimal value for a
+U00NNNNNN} is that character whose character short identifer in ISO/IEC 10646 is


typo: "identifer"

jensmaurer · 2018-06-27T12:00:20Z

source/lex.tex

-\tcode{\textbackslash uNNNN} is that character whose character short name in
-ISO/IEC 10646 is \tcode{0000NNNN}. If the hexadecimal value for a
+U00NNNNNN} is that character whose character short identifer in ISO/IEC 10646 is
+\tcode{NNNNNN}; the character designated by the \grammarterm{universal-character-name}


What if I write \U00000041 in my source code? Does the character short identifier "000041" exist in Unicode? If it does exist, what about \u0041; is the character short identifier here "0041"? Why are there two identifiers naming the same thing? What about lowercase vs. uppercase hex digits? Should we refer to the hexadecimal value somehow?

rmartinho · 2018-06-27

Yes, the short identifier concept gives more than one identifier for each character. For the character with scalar value 0x4A, all of the following are valid short identifiers: 00004A, 004A, +00004A, +004A, U00004A, U004A, U+00004A, U+004A, 00004a, 004a, +00004a, +004a, U00004a, U004a, U+00004a, U+004a, u00004A, u004A, u+00004A, u+004A, u00004a, u004a, u+00004a, u+004a. Any of those unambiguously identifies the same character. If there were more A-F digits, there would be even more possible identifiers (there are 384 possible short identifiers for 0xAAAAA). The syntax is given with the description I quoted above, and also with the following BNF:

{ U | u } {+}(xxxx | xxxxx | xxxxxx)

where “x” represents one hexadecimal digit (0 to 9, A to F, or a to f), and with the additional requirement that the 5-digit form is not allowed to have leading zeros (so 0041 and 000041 are both valid, but 00041 isn't). I don't know why the choice was made to have this much flexibility.
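
A minimal sketch of a checker for that BNF, for illustration: `is_short_identifier` is a hypothetical name, and the function tests only the syntax described above (optional U or u, optional +, then four to six hexadecimal digits, with no leading zero in the five-digit form). It does not check whether a character is actually assigned to the code point.

```cpp
#include <cctype>
#include <string_view>

// Hypothetical helper: true if `s` matches { U | u } {+}(xxxx | xxxxx | xxxxxx)
// with the extra rule that the five-digit form has no leading zero.
bool is_short_identifier(std::string_view s) {
    // optional prefix letter U or u
    if (!s.empty() && (s.front() == 'U' || s.front() == 'u')) s.remove_prefix(1);
    // optional plus sign
    if (!s.empty() && s.front() == '+') s.remove_prefix(1);

    // four, five, or six hexadecimal digits must remain
    if (s.size() < 4 || s.size() > 6) return false;
    for (char c : s)
        if (!std::isxdigit(static_cast<unsigned char>(c))) return false;

    // 0041 and 000041 are valid, 00041 is not
    if (s.size() == 5 && s.front() == '0') return false;
    return true;
}
```

With this sketch, `is_short_identifier("U+004A")` and `is_short_identifier("000041")` hold, while `is_short_identifier("00041")` and `is_short_identifier("U+1000000")` do not.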

Referring to the hexadecimal value may actually be a better choice; that wording is actually used in the very next sentence to forbid surrogates. If we want to do it this way, I would rewrite in the following manner.

The character designated by the universal-character-name \UNNNNNNNN is that character whose code point in ISO/IEC 10646 has the hexadecimal value NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose code point in ISO/IEC 10646 has the hexadecimal value 0000NNNN.

(That last bit can also be "has the hexadecimal value NNNN", without the leading zeros.)
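
To make the "hexadecimal value" reading concrete, here is a small sketch that maps the spelling of a universal-character-name to the value it names. The function name `ucn_value` is hypothetical. It only reads the digits: it deliberately does not reject surrogates or values outside the codespace, since those restrictions are stated separately in the wording.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper (not from the standard or this PR): returns the value
// named by a universal-character-name spelled \uNNNN or \UNNNNNNNN, or
// std::nullopt if the spelling does not have one of those two shapes.
std::optional<std::uint32_t> ucn_value(std::string_view ucn) {
    if (ucn.size() < 2 || ucn[0] != '\\') return std::nullopt;
    const std::size_t digits = ucn[1] == 'u' ? 4 : ucn[1] == 'U' ? 8 : 0;
    if (digits == 0 || ucn.size() != 2 + digits) return std::nullopt;

    std::uint32_t value = 0;
    for (char c : ucn.substr(2)) {
        std::uint32_t d = 0;
        if (c >= '0' && c <= '9') d = static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') d = static_cast<std::uint32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') d = static_cast<std::uint32_t>(c - 'A' + 10);
        else return std::nullopt;
        value = value * 16 + d;  // \uNNNN yields 0000NNNN, which equals NNNN
    }
    return value;
}
```

Both `ucn_value("\\u0041")` and `ucn_value("\\U00000041")` produce 0x41, which is the point of phrasing the rule in terms of the value rather than a particular identifier spelling.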

Also for clarity, ISO 10646 defines "code point" as "value in the UCS codespace" (UCS being short for the character set specified by ISO 10646).

I originally just did s/short name/short identifier/ because that produced minimal changes, but I can rephrase it in terms of hexadecimal value as above if that's preferred.

jensmaurer

Thanks for the explanation. I think the current set of changes is good, although I have a few questions that should be answered by a core issue. For example, what happens if I say \U99004141? Is that ill-formed, undefined behavior, or something else? Also, I would be very much in favor of harmonizing towards U+1234 references when talking about Unicode characters.

jensmaurer · 2018-06-27T20:45:13Z

Oh, could you please squash all commits and force-push? Thanks. And the commit message should have "[lex.charset]" in front.

tkoeppe · 2018-06-27T20:48:23Z

source/lex.tex

+U00NNNNNN} is that character whose character short identifier in ISO/IEC 10646 is
+\tcode{NNNNNN}; the character designated by the \grammarterm{universal-character-name}
+\tcode{\textbackslash uNNNN} is that character whose character short identifier in
+ISO/IEC 10646 is \tcode{NNNN}. If the hexadecimal value for a


Sorry for the driveby, but why do we say "hexadecimal value"? Why not just "value"? In which way does the value depend on a particular serialization format?

jensmaurer · 2018-06-27

Can we keep that separate, please? This is enough of a tar pit already, and might benefit from a more wholesale rework.

rmartinho · 2018-06-27T20:54:00Z

I'll squash and fix the commit message.
Regarding the core issue, should I open one, then?
And harmonizing to U+ notation, would that need a paper? I can write it if so.

jensmaurer · 2018-06-27T21:05:03Z

@rmartinho, I think with the editorial change we're currently looking at, we've got a good improvement: from "undefined term" to "well-defined term".
I must admit I'm not so enthusiastic about meddling with the words here even more, but if you feel like writing a short paper (essentially showing a single-line summary for each issue addressed plus the wording changes), and that also addresses @tkoeppe's concern, let's go for it. No need to have a core issue on top of that paper.
The thrust of the paper should be to use the terms (such as surrogate pair) from ISO 10646 as defined there and to make sure that we keep all explanations of such terms (e.g. value ranges) to non-normative text.

rmartinho · 2018-06-28T09:58:57Z

Squashed and fixed the commit message. I'll give that paper a thought, then.

zygoloid · 2018-06-29T22:33:10Z

I would like to change from NNNNNN to U+NNNNNN (for this particular wording) in this change; we're already using U+NNNN in other places, and it seems to be the more common form for unambiguously writing Unicode character short identifiers (though I don't know if ISO/IEC 10646 specifies a preference between the valid forms).

zygoloid · 2018-06-29T22:40:59Z

We should also agree on what typeface to use for Unicode short identifiers. In [time.duration.io]p4, we use body text font, complete with its not-especially-aesthetically-appealing plus sign with slightly unsatisfying kerning.

In Table 2, we use teletype font (and no U+ prefix).

http://www.unicode.org/versions/Unicode11.0.0/appA.pdf says that dropping the U+ prefix is appropriate in tables and in ranges, so what we're doing in Table 2 seems fine. It uses body text font, but has a more appealing plus sign than appears in our body font.

zygoloid · 2018-06-29T22:45:47Z

source/lex.tex

-\tcode{NNNNNNNN}; the character designated by the \grammarterm{universal-character-name}
-\tcode{\textbackslash uNNNN} is that character whose character short name in
-ISO/IEC 10646 is \tcode{0000NNNN}. If the hexadecimal value for a
+U00NNNNNN} is that character whose character short identifier in ISO/IEC 10646 is


This rewording has lost the specification for a universal-character-name beginning with \U01 (etc). I think we need a normative change to properly address this -- it doesn't seem right to just remove the specification for these cases, but the old specification is clearly wrong, as there is no character with the specified short identifier.

zygoloid · 2018-06-29T22:56:35Z

Question for CWG: what is the status of a program like:

char32_t x[] = U"\U00110000";  // U+110000 is a Unicode short identifier but there is no such character
char32_t y[] = U"\U01000000"; // U+1000000 is not a Unicode short identifier

jensmaurer · 2018-10-11T19:19:24Z

Regarding @zygoloid's questions, it seems we should not require the compiler to contain a list of valid characters (in particular, since that list is updated from time to time). Thus, "x" should be syntactically valid and produce the expected number. In contrast, "y" should be ill-formed.
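
The rule suggested in this comment can be sketched as a range check that needs no table of assigned characters. This is only an illustration of the suggestion as stated here, not the wording that was eventually adopted; the name `fits_short_identifier` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical check sketched from the comment above: accept any value that
// can be written as a six-digit short identifier, without consulting a list
// of assigned characters; anything larger would be ill-formed.
constexpr bool fits_short_identifier(std::uint32_t value) {
    return value <= 0xFFFFFFu;
}
```

Under this sketch, the value of `\U00110000` (the "x" example) is accepted and the value of `\U01000000` (the "y" example) is rejected.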

rmartinho · 2019-01-21T13:46:54Z

I have submitted P1139R0 to address the remaining issues as discussed here.

jensmaurer · 2019-03-08T00:20:55Z

Fixed by P1139R2 Address wording issues related to ISO 10646 #2687.

rmartinho changed the title ~~Change "short name" to "short identifier" to match ISO 10646~~ [lex.charset] Change "short name" to "short identifier" to match ISO 10646 Jun 20, 2018

zygoloid force-pushed the master branch from 46b410b to 88dc8aa Compare June 21, 2018 22:15

jensmaurer requested changes Jun 27, 2018

jensmaurer approved these changes Jun 27, 2018

tkoeppe reviewed Jun 27, 2018

[lex.charset] Change "short name" to "short identifier" to match ISO …

88373ce

…10646 ISO 10646 doesn't have "short name".

rmartinho force-pushed the master branch from e3775fe to 88373ce Compare June 28, 2018 09:57

This was referenced Jun 28, 2018

Explicitly disallow unnamed Unicode codepoints in http://eel.is/c++draft/lex.charset#2 sg16-unicode/sg16#8

Closed

D1139R0: Add paper to reword stuff related to ISO 10646 sg16-unicode/sg16#29

Merged

zygoloid reviewed Jun 29, 2018

zygoloid added and removed the cwg label (Issue must be reviewed by CWG.) Jun 29, 2018

zygoloid force-pushed the master branch 2 times, most recently from e3dbfe2 to 1a21a65 Compare July 7, 2018 23:19

jensmaurer added the not-editorial label (Issue is not deemed editorial; the editorial issue is kept open for tracking.) Oct 11, 2018

jensmaurer mentioned this pull request Oct 11, 2018

[lex.charset] ISO/IEC 10646 does not define "character short name" #2109

Closed

rmartinho mentioned this pull request Jan 22, 2019

Add P1139R0 & P1139R1 sg16-unicode/sg16#41

Merged

jensmaurer closed this Mar 8, 2019
