[lex.phases] Whitespace characters should be kept for the sake of the stringize operator #6627

ContingencyOfTautologicalContradictions · 2023-10-15T15:57:14Z

The following sentence at https://eel.is/c++draft/lex.phases#1.3 states that it is unspecified whether a sequence of whitespace characters other than new-line is replaced by a single space character or retained as-is:

Whether each nonempty sequence of whitespace characters other than new-line is retained or replaced by one space character is unspecified.

This can be troublesome when the string literal produced by the stringize preprocessor operator depends on a run of several whitespace characters: should the run be kept in the resulting string literal or replaced by a single space? Both outcomes are currently permitted, because the behaviour is unspecified and the implementation may pick either, but it is usually more intuitive to expect the whitespace to be kept as-is. The following example shows the problem:

#define foo(x) #x

const char* bar = foo(hello   world);

Note the 3 whitespace characters between hello and world. Here, an implementation could opt to keep the 3 spaces as-is, "hello   world", or replace them with a single space, "hello world". Only the former behaviour is intuitive; the latter could give unexpected results that lead to erroneous behaviour.

To solve this, I propose the following change to the wording of https://eel.is/c++draft/lex.phases#1.3:

The source file is decomposed into preprocessing tokens ([lex.pptoken]) and sequences
of whitespace characters (including comments). A source file shall not end in a partial
preprocessing token or in a partial comment. Each comment is replaced by one space
character. Whitespace characters are retained. As characters from the source file are
consumed to form the next preprocessing token (i.e., not being consumed as part of a
comment or other forms of whitespace), except when matching a c-char-sequence,
s-char-sequence, r-char-sequence, h-char-sequence, or q-char-sequence,
universal-character-names are recognized and replaced by the designated element of the
translation character set. The process of dividing a source file's characters into
preprocessing tokens is context-dependent.

jensmaurer · 2023-10-15T16:17:54Z

What are the major implementations currently doing for your example?

ContingencyOfTautologicalContradictions · 2023-10-15T16:26:00Z

They replace the whitespace sequence with a single space character, as shown by https://godbolt.org/z/sfMhz9fh3
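A minimal sketch of the kind of check behind that observation, assuming the linked snippet resembles the example from the original post (its exact contents are not reproduced here). It also reflects [cpp.stringize], which says that each occurrence of whitespace between the stringizing argument's preprocessing tokens becomes a single space character in the resulting string literal, so the collapsed result should not depend on the choice left open in phase 3. The names `bar` and `quoted` are illustrative, and C++17 or later is assumed for the `constexpr std::string_view` comparisons.

```cpp
#include <string_view>

// Same macro as in the original post: # turns the argument into a string literal.
#define foo(x) #x

// Three spaces separate the two preprocessing tokens in the macro argument.
constexpr std::string_view bar = foo(hello   world);

// The run of whitespace between the tokens is stringized as one space.
static_assert(bar == "hello world", "whitespace between tokens becomes one space");

// Whitespace inside a string literal belongs to a single preprocessing token,
// so it survives stringization (the surrounding quotes are escaped).
constexpr std::string_view quoted = foo("hello   world");
static_assert(quoted == "\"hello   world\"", "whitespace inside a literal is kept");
```

If the three spaces have to appear in the final string, passing a string literal to the macro, as in the second assertion, is one way to get them without relying on how whitespace between tokens is handled.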

jensmaurer · 2023-10-15T17:00:11Z

Great.

So, by removing the liberty we currently have in the standard, we'd make the major implementations non-conforming. Sounds like this is not an editorial issue (which cannot cause behavior changes), and not a CWG issue (the existing specification is clear).

If you believe you've got good rationale for asking for a change in behavior here, a paper to EWG is the appropriate approach: https://isocpp.org/std/submit-a-proposal

ContingencyOfTautologicalContradictions changed the title from "Whitespace characters should be kept for the sake of the stringize operator" to "[lex.phases] Whitespace characters should be kept for the sake of the stringize operator" on Oct 15, 2023

jensmaurer closed this as not planned on Oct 15, 2023
