CWG Issue 787

This is an unofficial snapshot of the ISO/IEC JTC1 SC22 WG21 Core Issues List revision 114a. See http://www.open-std.org/jtc1/sc22/wg21/ for the official list.

2024-04-18

787. Unnecessary lexical undefined behavior

Section: 5.2 [lex.phases] Status: CD2 Submitter: UK Date: 3 March, 2009

N2800 comment UK 9
N2800 comment UK 12

[Voted into WP at March, 2010 meeting.]

There are several instances of undefined behavior in lexical processing:

5.2 [lex.phases] paragraph 1, phase 2: a universal-character-name resulting from a line splice.
5.2 [lex.phases] paragraph 1, phase 2: a file ending without a new-line character or with a new-line character that is spliced away.
5.2 [lex.phases] paragraph 1, phase 4: a universal-character-name resulting from macro token concatenation.
5.8 [lex.header] paragraph 2: ', \, /*, //, or " appearing in a header-name.

These would be more appropriately handled as conditionally-supported behavior, requiring implementations either to document their handling of these constructs or to issue a diagnostic.

Additional note, March, 2009:

The undefined behavior referred to above regarding universal-character-names is the result of the considerations described in the C99 Rationale, section 5.2.1, in the part entitled “UCN models.” Three different models for support of UCNs are described, each involving different conversions between UCNs and wide characters and/or at different times during program translation. Implementations, as well as the specification in a language standard, can employ any of the three, but it must be impossible for a well-defined program to determine which model was actually employed by the implementation. The implication of this “equivalence principle” is that any construct that would give different results under the different models must be classified as undefined behavior. For example, an apparent UCN resulting from a line-splice would be recognized as a UCN by an implementation in which all wide characters were translated immediately into UCNs, as described in C++ phase 1, but would not be recognized as a UCN by another implementation in which all UCNs were translated immediately into wide characters (a possibility mentioned parenthetically in C++ phase 1).
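
The line-splice example can be made concrete with a small textual simulation. The following is only a sketch, not how any particular compiler works: the helper names splice_lines and contains_ucn_spelling are hypothetical, and the scan is purely textual (it does not track string literals, comments, or doubled backslashes). It shows that the spelling of a UCN can be absent from the source before phase 2 and present after it, which is exactly the case in which the models would disagree.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: delete each backslash that is immediately followed by
// a new-line, together with that new-line (the phase 2 line splice).
std::string splice_lines(const std::string& src) {
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
            ++i;  // skip the new-line as well
            continue;
        }
        out += src[i];
    }
    return out;
}

// Hypothetical helper: true if the text contains the spelling \uXXXX or
// \UXXXXXXXX, where each X is a hexadecimal digit. Purely textual check.
bool contains_ucn_spelling(const std::string& text) {
    for (std::size_t i = 0; i + 1 < text.size(); ++i) {
        if (text[i] != '\\') continue;
        const std::size_t digits =
            text[i + 1] == 'u' ? 4 : (text[i + 1] == 'U' ? 8 : 0);
        if (digits == 0 || i + 2 + digits > text.size()) continue;
        bool all_hex = true;
        for (std::size_t k = 0; k < digits; ++k) {
            if (!std::isxdigit(static_cast<unsigned char>(text[i + 2 + k]))) {
                all_hex = false;
            }
        }
        if (all_hex) return true;
    }
    return false;
}
```

Given the eight-character input consisting of \u00, a backslash, a new-line, and E9, contains_ucn_spelling reports no UCN on the raw text but does report one on the result of splice_lines. An implementation that converts UCNs to wide characters before splicing would therefore never see this UCN, while one that works on UCN spellings after splicing would.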

There are additional implications for this “equivalence principle” beyond the ones identified in the UK CD comments. See also issue 578; presumably a string like the one in that issue should also be described as having undefined behavior. Also, because C++'s model introduces backslash characters as part of UCNs for any character outside the basic source character set, any header-name that contains such a character (e.g., #include "@.h") will have undefined behavior in C++. This is also the reason that UCNs are translated into wide characters inside raw strings: two of the three models articulated in the C99 Rationale translate to or from UCNs in phase 1, before raw strings are recognized as tokens in phase 3, so raw strings cannot treat UCNs differently from the way they are treated in other contexts. See also issue 789 for similar points regarding trigraphs.

Notes from the October, 2009 meeting:

The CWG decided that the non-UCN aspects of this issue should be resolved, while the overall questions regarding trigraphs, UCNs, and raw strings will be investigated separately.

Proposed resolution (February, 2010):

Change 5.2 [lex.phases] paragraph 1 phase 2 as follows:

...~~If a~~ A source file that is not empty and that does not end in a new-line character, or that ends in a new-line character immediately preceded by a backslash character before any such splicing takes place, ~~the behavior is undefined~~ shall be processed as if an additional new-line character were appended to the file.
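
As an illustration of the revised phase 2 wording, the sketch below models the rule on a source file held in a string. It is an assumption-laden simplification: the function name phase2_with_resolution is hypothetical, the file is treated as a plain sequence of char, and nothing else from the translation phases is modeled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical model of phase 2 under the proposed resolution.
std::string phase2_with_resolution(std::string src) {
    // A non-empty file that does not end in a new-line, or that ends in a
    // new-line immediately preceded by a backslash (checked before any
    // splicing), is processed as if one more new-line were appended.
    if (!src.empty()) {
        const std::size_t n = src.size();
        const bool missing_newline = src[n - 1] != '\n';
        const bool spliced_newline =
            n >= 2 && src[n - 1] == '\n' && src[n - 2] == '\\';
        if (missing_newline || spliced_newline) {
            src += '\n';
        }
    }

    // Line splicing: delete each backslash followed by a new-line.
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
            ++i;  // skip the new-line as well
            continue;
        }
        out += src[i];
    }
    return out;
}
```

Under this model, a file containing only int x; with no trailing new-line, and a file containing int x; followed by a backslash and a new-line, both come out of phase 2 as int x; followed by a single new-line. An empty file stays empty.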

Change 5.8 [lex.header] paragraph 2 as follows:

~~If~~ The appearance of either of the characters ' or \, or of either of the character sequences /* or // ~~appears~~ in a q-char-sequence or ~~a~~ an h-char-sequence is conditionally-supported with implementation-defined semantics, ~~or~~ as is the appearance of the character " ~~appears~~ in ~~a~~ an h-char-sequence~~, the behavior is undefined~~. [Footnote: Thus, a sequence~~s~~ of characters that resembles an escape sequence~~s cause undefined behavior~~ might result in an error, be interpreted as the character corresponding to the escape sequence, or have a completely different meaning, depending on the implementation. —end footnote]
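
The set of header-name spellings covered by this wording can be expressed as a small predicate. This is a sketch for illustration only: the enumeration HeaderKind and the function has_conditionally_supported_spelling are hypothetical names, and the argument is assumed to be the characters between the delimiters (the h-char-sequence or q-char-sequence), not the full #include directive.

```cpp
#include <string>

// Hypothetical tag: Angle for <h-char-sequence>, Quoted for "q-char-sequence".
enum class HeaderKind { Angle, Quoted };

// True if the sequence contains a character or character sequence whose
// appearance the revised 5.8 [lex.header] paragraph 2 makes
// conditionally-supported with implementation-defined semantics.
bool has_conditionally_supported_spelling(const std::string& chars,
                                          HeaderKind kind) {
    const bool common_case =
        chars.find('\'') != std::string::npos ||
        chars.find('\\') != std::string::npos ||
        chars.find("/*") != std::string::npos ||
        chars.find("//") != std::string::npos;
    if (common_case) return true;
    // A double quote is only listed for an h-char-sequence; in a
    // q-char-sequence it would end the header-name instead.
    return kind == HeaderKind::Angle &&
           chars.find('"') != std::string::npos;
}
```

For example, the sequence vector is unaffected, whereas sys\types.h, a//b.h, and x'y.h all fall under the rule in either form, and a double quote does so only inside angle brackets. What an implementation then does with such a name (diagnose it, treat \t as a tab, or treat the backslash as a path separator) is what the footnote leaves to the implementation.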