[lex.charset] p2 \00NNNNNN should use placeholder #2752

Closed
jensmaurer opened this issue Mar 7, 2019 · 4 comments · Fixed by #2768

@jensmaurer
Member

The NNNNNN here and in the vicinity are placeholders, not literal characters, and thus should use \placeholder.

@zygoloid
Member

Other things to fix in this vicinity:

  • We define only that the character designated by a UCN is a certain other character, but then talk about a UCN "correspond[ing] to a code point" without saying what that means.
  • We don't specify any meaning whatsoever for \UAABBBBBB where AA != 00.

Something like this would seem much better:

The universal-character-name \U00NNNNNN corresponds to the code point U+NNNNNN; the universal-character-name \uNNNN corresponds to the code point U+NNNN. A universal-character-name shall correspond to a code point in ISO/IEC 10646 that is not a surrogate code point, and designates the character whose code point short identifier is the corresponding code point.

... except that the term "short identifier" does not actually appear anywhere in the latest version of the Unicode specification (https://www.unicode.org/versions/Unicode12.0.0/UnicodeStandard-12.0.pdf). I'm not sure if that's an ISO 10646 invention, but the Unicode Consortium claims that "The Unicode Standard, Version 12.0 is aligned with Amendments 1 and 2 to ISO/IEC 10646:2017", so I suspect not.

@zygoloid
Member

... except that the term "short identifier" does not actually appear anywhere in the latest version of the Unicode specification

OK, it does appear in ISO 10646. However, there are many different short identifiers defined for each character, so talking about what the short identifier for a character "is" is meaningless. We can talk about the character for which U+blah is a short identifier, though.

Also, U+NNNN is not a code point; a code point is really just a number (expressed in ISO 10646 as a hexadecimal number with no prefix).

@zygoloid
Member

Another problem: we say

If a universal-character-name does not correspond to a code point in ISO/IEC 10646

... but what does that mean? What is "a code point in ISO/IEC 10646"? Does this mean a UCN naming an unassigned code point is ill-formed? Or does it just mean the values \U00NNNNNN for which NNNNNN is not actually a code point at all? The term "code point" is effectively defined by ISO/IEC 10646 as an integer between 0 and 10FFFF (hexadecimal, inclusive).

I think our phrasing here is very unclear and confusing. What we're trying to say is something very simple:

A universal-character-name designates the ISO/IEC 10646 character whose code point is the hexadecimal number represented by the sequence of hexadecimal-digits in the universal-character-name. The program is ill-formed if that number is not a code point (that is, it is not in the range [0, 10FFFF] hexadecimal) or if it is the code point of a surrogate character (that is, it is in the range [D800, DFFF] hexadecimal). If a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character [...]

Any mention of short identifiers appears to be unnecessary circumlocution.
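
To make the proposed rule concrete, here is a minimal C++17 sketch of the check it describes. The function name `ucn_code_point` is hypothetical and does not come from the standard or any compiler. It takes the spelling of a universal-character-name, reads the hexadecimal digits as a number, and rejects values above 10FFFF or in the surrogate range [D800, DFFF]. The additional restriction on control characters outside literals (the part elided with "[...]" above) is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper illustrating the proposed wording; not a real compiler API.
// Input: the spelling of a UCN, i.e. "\uNNNN" or "\UNNNNNNNN".
// Returns the designated code point, or std::nullopt if the spelling is
// malformed or the program would be ill-formed under the proposed rule.
inline std::optional<std::uint32_t> ucn_code_point(std::string_view ucn) {
    if (ucn.size() < 2 || ucn[0] != '\\') return std::nullopt;

    std::size_t digits = 0;
    if (ucn[1] == 'u') digits = 4;        // \uNNNN
    else if (ucn[1] == 'U') digits = 8;   // \UNNNNNNNN
    else return std::nullopt;
    if (ucn.size() != 2 + digits) return std::nullopt;

    // The hexadecimal number represented by the sequence of hex digits.
    std::uint32_t value = 0;
    for (char c : ucn.substr(2)) {
        std::uint32_t d = 0;
        if (c >= '0' && c <= '9') d = static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') d = static_cast<std::uint32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') d = static_cast<std::uint32_t>(c - 'A' + 10);
        else return std::nullopt;
        value = value * 16 + d;  // at most 8 digits, so no 32-bit overflow
    }

    if (value > 0x10FFFF) return std::nullopt;                    // not a code point
    if (value >= 0xD800 && value <= 0xDFFF) return std::nullopt;  // surrogate
    return value;
}
```

Under this sketch, a spelling such as \U01000041 (the \UAABBBBBB case with AA != 00) needs no special handling: its digits denote a number above 10FFFF, so it is rejected by the range check.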

@zygoloid
Member

... and it gets worse. There are three kinds of code point that do not correspond to a character: surrogates, noncharacters, and reserved code points. We want to allow the second and third kind in universal-character-names, which means that UCNs do not name characters at all, they just name code points.
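
The distinction can be sketched as a small classifier. This is an illustration only: the names `CodePointKind`, `classify`, and `allowed_in_ucn` are hypothetical, and the noncharacter ranges used (FDD0 through FDEF, plus the last two code points of each plane) are the ones defined by the Unicode Standard rather than anything stated in this thread. Telling reserved code points apart from assigned ones requires the Unicode Character Database, so both fall under `Other` here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical classification for illustration; not part of any standard API.
enum class CodePointKind { NotACodePoint, Surrogate, Noncharacter, Other };

constexpr CodePointKind classify(std::uint32_t cp) {
    if (cp > 0x10FFFF) return CodePointKind::NotACodePoint;
    if (cp >= 0xD800 && cp <= 0xDFFF) return CodePointKind::Surrogate;
    // Noncharacters: FDD0..FDEF and every code point whose low 16 bits
    // are FFFE or FFFF (the last two code points of each plane).
    if (cp >= 0xFDD0 && cp <= 0xFDEF) return CodePointKind::Noncharacter;
    if ((cp & 0xFFFE) == 0xFFFE) return CodePointKind::Noncharacter;
    // Assigned characters and reserved (unassigned) code points; separating
    // them would need data from the Unicode Character Database.
    return CodePointKind::Other;
}

// Per the comment above: noncharacters and reserved code points should be
// allowed in a universal-character-name, surrogates should not.
constexpr bool allowed_in_ucn(std::uint32_t cp) {
    const CodePointKind k = classify(cp);
    return k == CodePointKind::Noncharacter || k == CodePointKind::Other;
}
```

Because `allowed_in_ucn` accepts code points that name no character, it reflects the conclusion that a UCN names a code point rather than a character.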
