[lex.charset] p2 \00NNNNNN should use placeholder #2752

jensmaurer · 2019-03-07T22:16:47Z

The NNNNNN here and in the vicinity are placeholders, not literal characters, and thus should use \placeholder.

zygoloid · 2019-03-11T21:46:23Z

Other things to fix in this vicinity:

We define only that the character designated by a UCN is a certain other character, but then talk about a UCN "correspond[ing] to a code point" without saying what that means.
We don't specify any meaning whatsoever for \UAABBBBBB where AA != 00.

Something like this would seem much better:

The universal-character-name \U00NNNNNN corresponds to the code point U+NNNNNN; the universal-character-name \uNNNN corresponds to the code point U+NNNN. A universal-character-name shall correspond to a code point in ISO/IEC 10646 that is not a surrogate code point, and designates the character whose code point short identifier is the corresponding code point.

... except that the term "short identifier" does not actually appear anywhere in the latest version of the Unicode specification (https://www.unicode.org/versions/Unicode12.0.0/UnicodeStandard-12.0.pdf). I'm not sure if that's an ISO 10646 invention, but the Unicode Consortium claims that "The Unicode Standard, Version 12.0 is aligned with Amendments 1 and 2 to ISO/IEC 10646:2017", so I suspect not.

zygoloid · 2019-03-11T22:23:41Z

.. except that the term "short identifier" does not actually appear anywhere in the latest version of the Unicode specification

OK, it does appear in ISO 10646. However, there are many different short identifiers defined for each character, so talking about what the short identifier for a character "is" is meaningless. We can talk about the character for which U+blah is a short identifier, though.

Also, U+NNNN is not a code point; a code point is really just a number (expressed in ISO 10646 as a hexadecimal number with no prefix).

zygoloid · 2019-03-11T22:50:39Z

Another problem: we say

If a universal-character-name does not correspond to a code point in ISO/IEC 10646

... but what does that mean? What is "a code point in ISO/IEC 10646"? Does this mean a UCN naming an unassigned code point is ill-formed? Or does it just mean the values \U00NNNNNN for which NNNNNN is not actually a code point at all? The term "code point" is effectively defined by ISO/IEC 10646 as an integer between 0 and 10FFFF (hexadecimal, inclusive).

I think our phrasing here is very unclear and confusing. What we're trying to say is something very simple:

A universal-character-name designates the ISO/IEC 10646 character whose code point is the hexadecimal number represented by the sequence of hexadecimal-digits in the universal-character-name. The program is ill-formed if that number is not a code point (that is, it is not in the range [0, 10FFFF] hexadecimal) or if it is the code point of a surrogate character (that is, it is in the range [D800, DFFF] hexadecimal). If a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character [...]

Any mention of short identifiers appears to be unnecessary circumlocution.
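
To make the suggested rule concrete, here is a minimal C++17 sketch of the check it describes. It is an illustration only, not compiler code or standard wording: the names parse_hex_digits, is_code_point, is_surrogate and ucn_code_point are hypothetical, and the input is assumed to be just the 4 or 8 hexadecimal digits that follow \u or \U. The elided control-character rule is deliberately not modelled.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper: convert the hexadecimal-digits of a UCN (the part
// after \u or \U) into the number they represent. Returns nullopt if the
// input is not exactly 4 or 8 hexadecimal digits.
constexpr std::optional<std::uint32_t> parse_hex_digits(std::string_view digits) {
    if (digits.size() != 4 && digits.size() != 8) return std::nullopt;
    std::uint32_t value = 0;
    for (char c : digits) {
        std::uint32_t d = 0;
        if (c >= '0' && c <= '9') d = static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') d = static_cast<std::uint32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') d = static_cast<std::uint32_t>(c - 'A' + 10);
        else return std::nullopt;
        value = value * 16 + d;  // 8 hex digits always fit in 32 bits
    }
    return value;
}

// A code point is just a number in the range [0, 10FFFF] hexadecimal.
constexpr bool is_code_point(std::uint32_t n) { return n <= 0x10FFFF; }

// Surrogate code points occupy the range [D800, DFFF] hexadecimal.
constexpr bool is_surrogate(std::uint32_t n) { return n >= 0xD800 && n <= 0xDFFF; }

// Returns the code point a UCN would designate under the wording suggested
// above, or nullopt where that wording makes the program ill-formed.
constexpr std::optional<std::uint32_t> ucn_code_point(std::string_view digits) {
    const auto n = parse_hex_digits(digits);
    if (!n || !is_code_point(*n) || is_surrogate(*n)) return std::nullopt;
    return n;
}
```

Under this sketch, \U01000041 (the AA != 00 case raised earlier) is rejected simply because 1000041 hexadecimal is not a code point, with no special case needed for the leading digits.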

zygoloid · 2019-03-11T23:11:39Z

... and it gets worse. There are three kinds of code point that do not correspond to a character: surrogates, noncharacters, and reserved code points. We want to allow the second and third kind in universal-character-names, which means that UCNs do not name characters at all, they just name code points.
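
A rough C++ sketch of that distinction follows, assuming the usual Unicode definition of noncharacters (U+FDD0 through U+FDEF, plus the last two code points of each plane). The names CodePointKind, classify and allowed_in_ucn are hypothetical. Reserved code points are not identified separately, because doing so would need the character database of a specific Unicode version; they fall under Other here together with assigned characters.

```cpp
#include <cstdint>

// Hypothetical classification of a number for the purposes of this thread.
enum class CodePointKind { NotACodePoint, Surrogate, Noncharacter, Other };

constexpr CodePointKind classify(std::uint32_t n) {
    if (n > 0x10FFFF) return CodePointKind::NotACodePoint;
    if (n >= 0xD800 && n <= 0xDFFF) return CodePointKind::Surrogate;
    // Noncharacters: FDD0..FDEF, and any code point whose low 16 bits
    // are FFFE or FFFF (the last two code points of each plane).
    if ((n >= 0xFDD0 && n <= 0xFDEF) || (n & 0xFFFE) == 0xFFFE)
        return CodePointKind::Noncharacter;
    // Assigned characters and reserved code points both end up here.
    return CodePointKind::Other;
}

// Per the comment above: only surrogates (and numbers that are not code
// points at all) are excluded; noncharacters and reserved code points
// remain nameable by a universal-character-name.
constexpr bool allowed_in_ucn(std::uint32_t n) {
    const CodePointKind k = classify(n);
    return k == CodePointKind::Noncharacter || k == CodePointKind::Other;
}
```

The point of the sketch is that allowed_in_ucn never asks whether a character exists at the code point, which is why a UCN is better described as naming a code point than a character.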

zygoloid mentioned this issue Mar 11, 2019

Lex.charset fixes #2768

Merged

jensmaurer assigned zygoloid Mar 13, 2019

zygoloid closed this as completed in #2768 Nov 26, 2019
