[lex.phases] Whitespace characters should be kept for the sake of the stringize operator #6627

Closed
ContingencyOfTautologicalContradictions opened this issue Oct 15, 2023 · 3 comments

Comments

@ContingencyOfTautologicalContradictions

The following sentence at https://eel.is/c++draft/lex.phases#1.3 leaves it unspecified whether a nonempty sequence of whitespace characters other than new-line is retained as-is or replaced by a single space character:

Whether each nonempty sequence of whitespace characters other than new-line is retained or replaced by one space character is unspecified.

This can be troublesome where the content of the string literal produced by the stringize preprocessing operator (#) depends on runs of multiple whitespace characters: should they be kept in the resulting string literal or replaced by a single space? Both outcomes are currently conforming, because the choice is unspecified and an implementation may pick either, but it is more intuitive to expect the whitespace to be preserved as-is. The following example illustrates the problem:

#define foo(x) #x

const char* bar = foo(hello   world);

Note the three spaces between hello and world. Here, an implementation could either keep the three spaces as-is, "hello   world", or replace them with a single space, "hello world". Only the former behaviour is intuitive; the latter could give unexpected results that lead to erroneous behaviour.

To solve this, I propose the following change to the wording of https://eel.is/c++draft/lex.phases#1.3:

The source file is decomposed into preprocessing tokens ([lex.pptoken]) and sequences
of whitespace characters (including comments). A source file shall not end in a partial
preprocessing token or in a partial comment. Each comment is replaced by one space
character. Whitespace characters are retained. As characters from the source file are
consumed to form the next preprocessing token (i.e., not being consumed as part of a
comment or other forms of whitespace), except when matching a c-char-sequence,
s-char-sequence, r-char-sequence, h-char-sequence, or q-char-sequence,
universal-character-names are recognized and replaced by the designated element of the
translation character set. The process of dividing a source file's characters into
preprocessing tokens is context-dependent.
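
For context on what the stringize operator does with other forms of whitespace, here is a minimal sketch, assuming C++17 (for constexpr std::string_view). The names with_comment and quoted are hypothetical. The expected results rely on the [cpp.stringize] rules as I understand them: whitespace between the argument's preprocessing tokens becomes one space in the result, while a string literal inside the argument is a single token whose spelling is kept.

```cpp
#include <string_view>

#define foo(x) #x

// A comment between two tokens: translation phase 3 replaces the comment
// with one space, which then separates the tokens in the stringized result.
constexpr std::string_view with_comment = foo(hello/* note */world);

// A string literal is one preprocessing token, so the three spaces inside it
// are part of the token's spelling and are preserved (quotes get escaped).
constexpr std::string_view quoted = foo("a   b");

static_assert(with_comment == "hello world");
static_assert(quoted == "\"a   b\"");
static_assert(quoted.size() == 7);  // two quote characters plus a, three spaces, b
```

Compiling this translation unit is enough to run the checks, since they are all static_assert declarations.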

@ContingencyOfTautologicalContradictions changed the title from "Whitespace characters should be kept for the sake of the stringize operator" to "[lex.phases] Whitespace characters should be kept for the sake of the stringize operator" on Oct 15, 2023
@jensmaurer
Member

What are the major implementations currently doing for your example?

@ContingencyOfTautologicalContradictions

The major implementations replace the sequence with a single space character, as shown by https://godbolt.org/z/sfMhz9fh3
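
A check along these lines can be sketched as follows. This is not the code behind the link above, only a minimal reconstruction of the example from this issue, assuming C++17 (for constexpr std::string_view). The single-space expectation also matches the [cpp.stringize] wording, under which each occurrence of whitespace between the argument's preprocessing tokens becomes a single space character in the resulting string literal.

```cpp
#include <string_view>

#define foo(x) #x

// Three spaces between the two identifiers, as in the example from this issue.
constexpr std::string_view bar = foo(hello   world);

// The checks pass only if the run of spaces was collapsed to one space.
static_assert(bar == "hello world");
static_assert(bar.size() == 11);
```

A compiler that preserved the three spaces would reject this translation unit at the first static_assert.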

@jensmaurer
Member

Great.

So, by removing the liberty we currently have in the standard, we'd make the major implementations non-conforming. Sounds like this is not an editorial issue (which cannot cause behavior changes), and not a CWG issue (the existing specification is clear).

If you believe you've got good rationale for asking for a change in behavior here, a paper to EWG is the appropriate approach: https://isocpp.org/std/submit-a-proposal

@jensmaurer closed this as not planned on Oct 15, 2023