Document Number: | P0482R3 |
---|---|
Date: | 2018-05-07 |
Audience: | SG16, Evolution Working Group, Library Evolution Working Group |
Reply-to: | Tom Honermann <tom@honermann.net> |
C++11 introduced support for UTF-8, UTF-16, and UTF-32 encoded string literals via N2249 [N2249]. New char16_t and char32_t types were added to hold values of code units for the UTF-16 and UTF-32 variants, but a new type was not added for the UTF-8 variants. Instead, UTF-8 character literals (added in C++17 via N4197 [N4197]) and UTF-8 string literals were defined in terms of the char type used for the code unit type of ordinary character and string literals.
UTF-8 is the only text encoding mandated to be supported by the C++ standard for which there is no distinct code unit type. Lack of a distinct type for UTF-8 encoded character and string literals prevents the use of overloading and template specialization in interfaces designed for interoperability with encoded text. The inability to infer an encoding for narrow characters and strings limits design possibilities and hinders the production of elegant interfaces that work seamlessly in generic code. Library authors must choose to limit encoding support, design interfaces that require users to explicitly specify encodings, or provide distinct interfaces for, at least, the implementation defined execution and UTF-8 encodings.
Whether char is a signed or unsigned type is implementation defined and implementations that use an 8-bit signed char are at a disadvantage with respect to working with UTF-8 encoded text due to the necessity of having to rely on conversions to unsigned types in order to correctly process leading and continuation code units of multi-byte encoded code points.
The lack of a distinct type and the use of a code unit type with a range that does not portably include the full unsigned range of UTF-8 code units presents challenges for working with UTF-8 encoded text that are not present when working with UTF-16 or UTF-32 encoded text. Enclosed is a proposal for a new char8_t fundamental type and related library enhancements intended to remove barriers to working with UTF-8 encoded text and to enable generic interfaces that work with all five of the standard mandated text encodings in a consistent manner.
Consider the following string literal expressions, all of which encode U+0123, LATIN SMALL LETTER G WITH CEDILLA:
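As a concrete sketch, the five literal forms for U+0123 can be written as shown below. The helper code_units and the variable names are illustrative assumptions, not part of this proposal. The code unit counts for the ordinary and wide literals depend on the implementation, so they are computed but not asserted.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null code unit.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// Implementation defined encodings: these counts vary by platform.
constexpr std::size_t ordinary_units = code_units("\u0123");   // execution encoding
constexpr std::size_t wide_units     = code_units(L"\u0123");  // execution wide encoding

// Portable, well-defined code unit sequences.
constexpr std::size_t utf8_units  = code_units(u8"\u0123");  // 0xC4 0xA3
constexpr std::size_t utf16_units = code_units(u"\u0123");   // 0x0123
constexpr std::size_t utf32_units = code_units(U"\u0123");   // 0x00000123

static_assert(utf8_units == 2,  "U+0123 takes two UTF-8 code units");
static_assert(utf16_units == 1, "U+0123 takes one UTF-16 code unit");
static_assert(utf32_units == 1, "U+0123 takes one UTF-32 code unit");
```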
The UTF-8, UTF-16, and UTF-32 string literals have well-defined and portable sequences of code unit values. The ordinary and wide string literal code unit sequences depend on the implementation defined execution and execution wide encodings respectively. Code that is designed to work with text encodings must be able to differentiate these strings. This is straightforward for wide, UTF-16, and UTF-32 string literals since they each have a distinct code unit type suitable for differentiation via function overloading or template specialization. But for ordinary and UTF-8 string literals, differentiating between them requires additional information since they have the same code unit type. That additional information might be provided implicitly via differently named functions, or explicitly via additional function or template arguments. For example:
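A minimal sketch of both approaches follows. All names here (encoding, describe_execution_text, describe_utf8_text, describe_text, and the tag types) are hypothetical and chosen only for illustration. The UTF-8 data is spelled with byte escapes so that it has type char regardless of how u8 literals are typed.

```cpp
enum class encoding { execution, utf8 };

// Approach 1: the encoding is implied by the function name.
constexpr encoding describe_execution_text(const char*) { return encoding::execution; }
constexpr encoding describe_utf8_text(const char*)      { return encoding::utf8; }

// Approach 2: the encoding is passed explicitly as a tag argument.
struct execution_tag {};
struct utf8_tag {};
constexpr encoding describe_text(execution_tag, const char*) { return encoding::execution; }
constexpr encoding describe_text(utf8_tag, const char*)      { return encoding::utf8; }

// The UTF-8 code units of U+0123 held in char. Nothing in the type says
// which encoding these bytes use; only the caller knows.
constexpr char bytes[] = "\xC4\xA3";

static_assert(describe_utf8_text(bytes) == encoding::utf8, "name selects encoding");
static_assert(describe_text(utf8_tag{}, bytes) == encoding::utf8, "tag selects encoding");
```

In both forms the compiler cannot catch a caller who passes UTF-8 data to the execution-encoding interface, because the argument type is the same.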
The requirement to, in some way, specify the text encoding, other than through the type of the string, limits the ability to provide elegant encoding sensitive interfaces. Consider the following invocations of the make_text_view function proposed in P0244R2 [P0244R2]:
For each invocation, the encoding of the string literal is known at compile time, so having to explicitly specify the encoding tag is redundant. If UTF-8 string literals had a distinct type, then the encoding type could be inferred, while still allowing an overriding tag to be supplied:
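The inference described here can be sketched with a small overload set. This is not the make_text_view interface from P0244R2; encoding_of and utf8_tag are hypothetical stand-ins. The sketch also assumes that an implementation providing char8_t defines the __cpp_char8_t feature-test macro, so that it compiles with or without the proposed type.

```cpp
enum class encoding { execution, wide, utf8, utf16, utf32 };

// The encoding is inferred from the code unit type of the argument.
constexpr encoding encoding_of(const char*)     { return encoding::execution; }
constexpr encoding encoding_of(const wchar_t*)  { return encoding::wide; }
constexpr encoding encoding_of(const char16_t*) { return encoding::utf16; }
constexpr encoding encoding_of(const char32_t*) { return encoding::utf32; }

#if defined(__cpp_char8_t)
// With a distinct type, UTF-8 literals select their own overload.
constexpr encoding encoding_of(const char8_t*) { return encoding::utf8; }
constexpr encoding expected_for_u8_literal = encoding::utf8;
#else
// Without it, a UTF-8 literal is indistinguishable from an ordinary one.
constexpr encoding expected_for_u8_literal = encoding::execution;
#endif

// An overriding tag can still be supplied for UTF-8 data held in char.
struct utf8_tag {};
constexpr encoding encoding_of(utf8_tag, const char*) { return encoding::utf8; }

static_assert(encoding_of(u8"text") == expected_for_u8_literal, "inferred from type");
```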
The inability to infer an encoding for narrow strings doesn't just limit the interfaces of new features under consideration. Compromised interfaces are already present in the standard library.
Consider the design of the codecvt class template. The standard specifies the following specializations of codecvt be provided to enable transcoding text from one encoding to another.
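The four specializations can be named as follows. The alias names cvt1 through cvt4 and the helper functions are illustrative assumptions, numbered to match the #1 to #4 references in the discussion of these specializations.

```cpp
#include <cassert>
#include <cwchar>   // std::mbstate_t
#include <locale>   // std::codecvt, std::has_facet, std::locale

using cvt1 = std::codecvt<char,     char, std::mbstate_t>;  // #1
using cvt2 = std::codecvt<wchar_t,  char, std::mbstate_t>;  // #2
using cvt3 = std::codecvt<char16_t, char, std::mbstate_t>;  // #3: char holds UTF-8 here
using cvt4 = std::codecvt<char32_t, char, std::mbstate_t>;  // #4: char holds UTF-8 here

// Reports whether the classic locale provides all four facets.
inline bool classic_locale_provides_all() {
    const std::locale& loc = std::locale::classic();
    return std::has_facet<cvt1>(loc) && std::has_facet<cvt2>(loc) &&
           std::has_facet<cvt3>(loc) && std::has_facet<cvt4>(loc);
}

// Usage: expected to hold on a conforming implementation.
inline void check_required_facets() { assert(classic_locale_provides_all()); }
```

Because the external type is char in every case, the template arguments alone cannot say whether that char sequence is UTF-8 or the execution encoding.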
#1 performs no conversions. #2 converts between strings encoded in the implementation defined wide and narrow encodings. #3 and #4 convert between either the UTF-16 or UTF-32 encoding and the UTF-8 encoding. Specializations are not currently specified for conversion between the implementation defined narrow and wide encodings and any of the UTF-8, UTF-16, or UTF-32 encodings. However, if support for such conversions were to be added, the desired interfaces are already taken by #1, #3 and #4.
The file system interface adopted for C++17 via P0218R1 [P0218R1] provides an example of a feature that supports all five of the standard mandated encodings, but does so with an asymmetric interface due to the inability to overload functions for UTF-8 encoded strings. Class std::filesystem::path provides the following constructors to initialize a path object based on a range of code unit values where the encoding is inferred based on the value type of the range.
§ 30.11.7.2.2 [fs.path.type.cvt] describes how the source encoding is determined based on whether the source range value type is char, wchar_t, char16_t, or char32_t. A range with value type char is interpreted using the implementation defined execution encoding. It is not possible to construct a path object from UTF-8 encoded text using these constructors.
To accommodate UTF-8 encoded text, the file system library specifies the following factory functions. Matching factory functions are not provided for other encodings.
The requirement to construct path objects using one interface for UTF-8 strings vs another interface for all other supported encodings creates unnecessary difficulties for portable code. Consider an application that uses UTF-8 as its internal encoding on POSIX systems, but uses UTF-16 on Windows. Conditional compilation or other abstractions must be implemented and used in otherwise platform neutral code to construct path objects.
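The asymmetry can be sketched as follows, using the C++17 interfaces std::filesystem::u8path and the std::filesystem::path constructor. The wrapper names path_from_utf8 and path_from_utf16 are hypothetical, and the equality check assumes an implementation that can represent U+0123 in its native path encoding.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// UTF-8 held in char: must go through the u8path factory function,
// because path(const std::string&) would assume the execution encoding.
inline fs::path path_from_utf8(const std::string& utf8_bytes) {
    return fs::u8path(utf8_bytes);
}

// UTF-16: the ordinary constructor infers the encoding from char16_t.
inline fs::path path_from_utf16(const std::u16string& utf16) {
    return fs::path(utf16);
}

// Usage: both spellings are intended to name the same file.
inline void check_same_path() {
    assert(path_from_utf8("\xC4\xA3.txt") == path_from_utf16(u"\u0123.txt"));
}
```

Portable code that switches its internal encoding per platform has to select between these two spellings, typically with conditional compilation.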
The inability to infer an encoding based on string type is not the only challenge posed by use of char as the UTF-8 code unit type. The following code exhibits implementation defined behavior.
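A minimal sketch of such code, assuming a hypothetical helper is_non_ascii_naive that tries to detect leading and continuation code units:

```cpp
#include <limits>

// Naive test: c is promoted to int before the comparison, so the result
// depends on whether char is signed on this implementation.
constexpr bool is_non_ascii_naive(char c) {
    return c >= 0x80;
}

// 0xC4 is the leading code unit of the UTF-8 encoding of U+0123.
constexpr char lead = static_cast<char>(0xC4);

// True only where char is unsigned; false where char is signed.
static_assert(is_non_ascii_naive(lead) == !std::numeric_limits<char>::is_signed,
              "result tracks the signedness of char");
```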
UTF-8 leading and continuation code units have values in the range 128 (0x80) to 255 (0xFF). In the common case where char is implemented as a signed 8-bit type with a two's complement representation and a range of -128 (-0x80) to 127 (0x7F), these values exceed the range of the char type. Such implementations typically store such code units as unsigned values which are then reinterpreted as signed values when read. In the code above, integral promotion rules result in c being promoted to type int for comparison to the 0x80 operand. If c holds a value corresponding to a leading or continuation code unit value, then its value will be interpreted as negative and the promoted value of type int will likewise be negative. The result is that the comparison is always false for these implementations.
To correct the code above, explicit conversions are required. For example:
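One possible correction, sketched with hypothetical helper names, converts the code unit to unsigned char before examining it:

```cpp
// Convert to unsigned char first so that values 0x80..0xFF compare as
// intended regardless of whether char is signed.
constexpr bool is_non_ascii(char c) {
    return static_cast<unsigned char>(c) >= 0x80;
}

// Continuation code units have the bit pattern 10xxxxxx.
constexpr bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

static_assert(is_non_ascii(static_cast<char>(0xC4)), "leading code unit of U+0123");
static_assert(is_continuation(static_cast<char>(0xA3)), "continuation code unit of U+0123");
```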
Finally, processing of UTF-8 strings is currently subject to a pessimization because glvalue expressions of type char may alias objects of other types. Use of a distinct type that does not share this aliasing behavior may allow for further compiler optimizations.
As of November 2017, UTF-8 is now used by more than 90% of all websites [W3Techs]. The C++ standard must improve support for UTF-8 by removing the existing barriers that result in redundant tagging of character encodings, non-generic UTF-8 specific workarounds like u8path, and the need for static casts to examine UTF-8 code unit values.
The proposed changes are intended to bring the standard to the state the author believes it would likely be in had char8_t been added at the same time that char16_t and char32_t were added. This includes the ability to differentiate ordinary and UTF-8 literals in function overloading, template specializations, and user-defined literal operator signatures. The following core language changes are proposed in order to facilitate these capabilities:
The following library changes are proposed to address concerns like those raised in the motivation section above, and to take advantage of the new core features:
These changes necessarily impact backward compatibility as described in the Backward compatibility section.
This proposal does not specify any backward compatibility features other than to retain interfaces that it deprecates. The author believes such features are necessary, but that a single set of such features would unnecessarily compromise the goals of this proposal. Rather, the expectation is that implementations will provide options to enable more fine-grained compatibility features.
The following sections discuss backward compatibility impact.
Declarations of arrays of char may currently be initialized with UTF-8 string literals. Under this proposal, such initializations would become ill-formed. This is intended to maintain consistency with initialization of arrays of wchar_t, char16_t, and char32_t, all of which require the initializing string literal to have a matching element type as specified in § 11.6.2 [dcl.init.string].
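A sketch of the affected initialization follows. It assumes the __cpp_char8_t feature-test macro is defined when char8_t is available, so that the same source shows both the current and the proposed form. The name greeting is illustrative.

```cpp
#include <cstddef>

#if defined(__cpp_char8_t)
// Proposed behavior: the array element type must match the literal.
constexpr char8_t greeting[] = u8"text";
#else
// Well-formed before this proposal: a char array initialized from a
// UTF-8 string literal.
constexpr char greeting[] = u8"text";
#endif

// Four code units plus the terminating null in either case.
constexpr std::size_t greeting_size = sizeof(greeting);
static_assert(greeting_size == 5, "size is unaffected by the element type");
```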
Implementations are encouraged to add options to allow the above initializations (with a warning) to assist users in migrating their code.
Declarations of variables of type char initialized with a UTF-8 character literal remain well-formed and are initialized following the standard conversion rules.
Under this proposal, UTF-8 string literals no longer bind to references to array of type const char nor do they implicitly convert to pointer to const char. The following code is currently well-formed, but would become ill-formed under this proposal:
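The effect can be observed with type traits rather than with declarations that fail to compile. In this sketch the names are illustrative, and __cpp_char8_t is assumed to be defined when char8_t is available.

```cpp
#include <type_traits>

#if defined(__cpp_char8_t)
constexpr bool has_char8_t = true;
#else
constexpr bool has_char8_t = false;
#endif

// 'const char (&)[5]' before the proposal, 'const char8_t (&)[5]' after.
using u8_literal_type = decltype(u8"text");

// Corresponds to: const char* p = u8"text";
constexpr bool decays_to_const_char_ptr =
    std::is_convertible<u8_literal_type, const char*>::value;

// Corresponds to: const char (&r)[5] = u8"text";
constexpr bool binds_to_const_char_array =
    std::is_convertible<u8_literal_type, const char (&)[5]>::value;

static_assert(decays_to_const_char_ptr == !has_char8_t, "pointer conversion");
static_assert(binds_to_const_char_array == !has_char8_t, "reference binding");
```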
Implementations are encouraged to add options to allow the above conversions (with a warning) to assist users in migrating their code. Such options would require allowing aliasing of char and char8_t. Note that it may be useful to permit these conversions only for UTF-8 string literals and not for general expressions of array of char8_t type.
Under this proposal, UTF-8 string and character literals have type array of const char8_t and char8_t respectively. This affects the types deduced for placeholder types and template parameter types.
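A short sketch of the deduction change, again assuming __cpp_char8_t is defined when char8_t is available. The alias u8_code_unit and the helper templates are hypothetical.

```cpp
#include <type_traits>

#if defined(__cpp_char8_t)
using u8_code_unit = char8_t;  // proposed behavior
#else
using u8_code_unit = char;     // behavior before this proposal
#endif

// Placeholder type deduction from a UTF-8 character literal.
constexpr auto c = u8'c';
static_assert(std::is_same<decltype(c), const u8_code_unit>::value,
              "auto deduces the literal's code unit type");

// Template argument deduction for a by-value parameter.
template <class T>
constexpr bool deduces_code_unit(T) { return std::is_same<T, u8_code_unit>::value; }

// String literals decay to a pointer to their code unit type.
template <class T>
constexpr bool deduces_pointer(T) { return std::is_same<T, const u8_code_unit*>::value; }

static_assert(deduces_code_unit(u8'c'), "character literal");
static_assert(deduces_pointer(u8"text"), "string literal");
```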
This change in behavior is a primary objective of this proposal. Implementations are encouraged to add options to disable char8_t support entirely when necessary to preserve compatibility with C++17.
The following code is currently well-formed, and would remain well-formed under this proposal, but would behave differently:
The following code is currently well-formed, but would become ill-formed under this proposal:
These changes in behavior are a primary objective of this proposal. Implementations are encouraged to add options to disable char8_t support entirely when necessary to preserve compatibility with C++17.
The following code is currently well-formed, and would remain well-formed under this proposal, but would behave differently:
This change in behavior is a primary objective of this proposal. Implementations are encouraged to add options to disable char8_t support entirely when necessary to preserve compatibility with C++17.
This proposal includes a new specialization of std::basic_string for the new char8_t type, a new std::u8string type alias, and changes to the u8string and generic_u8string member functions of filesystem::path to return std::u8string instead of std::string. This change renders ill-formed the following code that is currently well-formed.
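The change in return type can be sketched as follows. The helper utf8_bytes_of is a hypothetical migration aid, not part of this proposal, and the sketch assumes the library defines the __cpp_lib_char8_t feature-test macro when it provides std::u8string.

```cpp
#include <filesystem>
#include <string>
#include <type_traits>
#include <utility>

#if defined(__cpp_lib_char8_t)
constexpr bool library_has_char8_t = true;
#else
constexpr bool library_has_char8_t = false;
#endif

// std::string before the proposal, std::u8string after.
using u8string_result =
    decltype(std::declval<const std::filesystem::path&>().u8string());

// Corresponds to: std::string s = p.u8string();
constexpr bool assignable_to_std_string =
    std::is_convertible<u8string_result, std::string>::value;
static_assert(assignable_to_std_string == !library_has_char8_t, "return type changed");

// Hypothetical helper that compiles either way by copying code units.
inline std::string utf8_bytes_of(const std::filesystem::path& p) {
    const auto encoded = p.u8string();
    return std::string(encoded.begin(), encoded.end());
}
```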
Implementations are encouraged to add an option that allows implicit conversion of std::u8string to std::string to assist in a gradual migration of code that calls these functions.
This proposal includes new overloads of operator ""s and operator ""sv that return char8_t specializations of std::basic_string and std::basic_string_view respectively. This change renders ill-formed the following code that is currently well-formed.
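A sketch of the affected pattern, assuming the library defines __cpp_lib_char8_t when it provides the char8_t overloads. The helper to_std_string is a hypothetical migration aid.

```cpp
#include <string>
#include <type_traits>

using namespace std::string_literals;

#if defined(__cpp_lib_char8_t)
constexpr bool library_has_char8_t = true;
#else
constexpr bool library_has_char8_t = false;
#endif

// std::string before the proposal, std::u8string after.
using u8_literal_string = decltype(u8"text"s);

// Corresponds to: std::string s = u8"text"s;
constexpr bool initializes_std_string =
    std::is_convertible<u8_literal_string, std::string>::value;
static_assert(initializes_std_string == !library_has_char8_t, "operator\"\"s result changed");

// Hypothetical helper that compiles either way by copying code units.
inline std::string to_std_string(const u8_literal_string& s) {
    return std::string(s.begin(), s.end());
}
```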
Implementations are encouraged to add an option that allows implicit conversion of std::u8string to std::string to assist in a gradual migration of code that calls these functions.
UTF-8 literals are maintained as narrow literals in this proposal.
There are several choices for the underlying type of char8_t. Use of unsigned char closely aligns with historical use. Use of uint_least8_t would maintain consistency with how the underlying types of char16_t and char32_t are specified.
This proposal specifies unsigned char as the underlying type as noted in the changes to § 6.7.1 [basic.fundamental] paragraph 5.
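The consequences of this choice can be checked at compile time. The sketch below assumes __cpp_char8_t is defined when char8_t is available, and the constant names are illustrative.

```cpp
#include <limits>
#include <type_traits>

#if defined(__cpp_char8_t)
static_assert(sizeof(char8_t) == sizeof(unsigned char), "same size");
static_assert(alignof(char8_t) == alignof(unsigned char), "same alignment");
static_assert(static_cast<char8_t>(-1) > static_cast<char8_t>(0), "unsigned");
static_assert(!std::is_same<char8_t, unsigned char>::value, "distinct type");
constexpr bool u8_code_units_are_unsigned = true;
#else
// Without char8_t, UTF-8 code units inherit the signedness of char.
constexpr bool u8_code_units_are_unsigned = !std::numeric_limits<char>::is_signed;
#endif

// With an unsigned code unit type, no cast is needed to detect the
// leading code unit 0xC4 of U+0123.
constexpr bool lead_compares_high = u8"\u0123"[0] >= 0x80;
static_assert(lead_compares_high == u8_code_units_are_unsigned, "comparison follows signedness");
```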
An implementation is available in the char8_t branch of a gcc fork hosted on GitHub at https://github.com/tahonermann/gcc/tree/char8_t. This implementation is believed to be complete for both the proposed core language and library features. New -fchar8_t and -fno-char8_t compiler options support enabling and disabling the new features. No backward compatibility features are currently implemented.
Richard Smith implemented support for the proposed core wording changes for the next release of Clang. The changes are guarded by new -fchar8_t and -fno-char8_t options matching the gcc implementation. No backward compatibility features are currently implemented. Support for the proposed library features has not yet been implemented in libc++. Richard's changes can be found at http://llvm.org/viewvc/llvm-project?view=revision&revision=331244
These changes are relative to N4713 [N4713].
Where noted, these changes presume the adoption of proposal WG14 N2231 [WG14 N2231] into the ISO/IEC 9899 standard for C, and that the next revision of the C++ standard will update dependencies on ISO/IEC 9899 accordingly.
Add char8_t to the list of keywords in table 5 in 5.11 [lex.key] paragraph 1.
Change in 5.13.3 [lex.ccon] paragraph 3:
A character literal that begins with u8, such as u8'w', is a character literal of type ~~char~~ char8_t, known as a UTF-8 character literal. […]
Change in 5.13.5 [lex.string] paragraph 6:
After translation phase 6, a string-literal that does not begin with an encoding-prefix is an ordinary string literal. An ordinary string literal has type "array of n const char" where n is the size of the string as defined below, has static storage duration (6.6.4), and is initialized with the given characters.
Change in 5.13.5 [lex.string] paragraph 7:
A string-literal that begins with u8, such as u8"asdf", is a UTF-8 string literal, also referred to as a char8_t string literal. A char8_t string literal has type "array of n const char8_t", where n is the size of the string as defined below; each successive element of the object representation (6.7) has the value of the corresponding code unit of the UTF-8 encoding of the string.
Change in 5.13.5 [lex.string] paragraph 8:
Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. ~~A narrow string literal has type "array of n const char", where n is the size of the string as defined below, and has static storage duration (6.6.4).~~
Drafting note: The deleted paragraph 8 content was incorporated in the changes to paragraphs 6 and 7.
Remove 5.13.5 [lex.string] paragraph 9:
For a UTF-8 string literal, each successive element of the object representation (6.7) has the value of the corresponding code unit of the UTF-8 encoding of the string.
Drafting note: The paragraph 9 content was incorporated in the changes to paragraph 7.
Change in 5.13.5 [lex.string] paragraph 15:
[…] In a narrow string literal, a universal-character-name may map to more than one char or char8_t element due to multibyte encoding. […]
Change in 6.7.1 [basic.fundamental] paragraph 1:
Objects declared ~~as characters (char)~~ with type char shall be large enough to store any member of the implementation’s basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character. It is implementation-defined whether a char object can hold negative values. Characters declared with type char can be explicitly declared unsigned or signed. Plain char, signed char, and unsigned char are three distinct types, collectively called ~~narrow~~ ordinary character types. The ordinary character types and char8_t are collectively called narrow character types. A char, a signed char, ~~and~~ an unsigned char, and a char8_t occupy the same amount of storage and have the same alignment requirements (6.6.5); that is, they have the same object representation. For narrow character types, all bits of the object representation participate in the value representation. [ Note: A bit-field of narrow character type whose length is larger than the number of bits in the object representation of that type has padding bits; see 6.7. — end note ] For unsigned narrow character types, each possible bit pattern of the value representation represents a distinct number. These requirements do not hold for other types. In any particular implementation, a plain char object can take on either the same values as a signed char or an unsigned char; which one is implementation-defined. For each value i of type unsigned char or char8_t in the range 0 to 255 inclusive, there exists a value j of type char such that the result of an integral conversion (7.8) from i to char is j, and the result of an integral conversion from j to unsigned char or char8_t is i.
Change in 6.7.1 [basic.fundamental] paragraph 5:
[…] Type wchar_t shall have the same size, signedness, and alignment requirements (6.6.5) as one of the other integral types, called its underlying type. Type char8_t denotes a distinct type with the same size, signedness, and alignment as unsigned char, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <cstdint>, called the underlying types. […]
Change in 6.7.1 [basic.fundamental] paragraph 7:
Types bool, char, char8_t, char16_t, char32_t, wchar_t, and the signed and unsigned integer types are collectively called integral types. […]
Change in 6.7.4 [conv.rank] paragraph 1:
[…]
(1.8) — The ranks of char8_t, char16_t, char32_t, and wchar_t shall equal the ranks of their underlying types (6.7.1).
[…]
Change to footnote 64 associated with 8.3 [expr.arith.conv] paragraph 1 (1.5):
As a consequence, operands of type bool, char8_t, char16_t, char32_t, wchar_t, or an enumerated type are converted to some integral type.
Change in 8.5.2.3 [expr.sizeof] paragraph 1:
[…] ~~sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1.~~ The result of sizeof applied to any of the narrow character types is 1. The result of sizeof applied to any other fundamental type is implementation-defined. […]
Change in 10.1.7.2 [dcl.type.simple] paragraph 1:
The simple type specifiers are
simple-type-specifier:
[…]
char
char8_t
char16_t
char32_t
[…]
Change in table 11 of 10.1.7.2 [dcl.type.simple] paragraph 2:
[…]
Table 11 — simple-type-specifiers and the types they specify
Specifier(s) | Type |
---|---|
[…] | […] |
char | “char” |
unsigned char | “unsigned char” |
signed char | “signed char” |
char8_t | “char8_t” |
char16_t | “char16_t” |
char32_t | “char32_t” |
[…] | […] |
[…]
Change in 11.6 [dcl.init] paragraph 17:
[…]
(17.3) — If the destination type is an array of characters, an array of char8_t, an array of char16_t, an array of char32_t, or an array of wchar_t, and the initializer is a string literal, see 11.6.2.
[…]
Change in 11.6.2 [dcl.init.string] paragraph 1:
An array of ~~narrow~~ ordinary character type (6.7.1), char8_t array, char16_t array, char32_t array, or wchar_t array can be initialized by ~~a narrow~~ an ordinary string literal, char8_t string literal, char16_t string literal, char32_t string literal, or wide string literal, respectively, […]
Change in 16.5.8 [over.literal] paragraph 3:
The declaration of a literal operator shall have a parameter-declaration-clause equivalent to one of the following:[…]
char
wchar_t
char8_t
char16_t
char32_t
const char*, std::size_t
const wchar_t*, std::size_t
const char8_t*, std::size_t
const char16_t*, std::size_t
const char32_t*, std::size_t
[…]
Change in 20.1 [library.general] paragraph 7:
The strings library (Clause 24) provides support for manipulating text represented as sequences of type char, sequences of type char8_t, sequences of type char16_t, sequences of type char32_t, sequences of type wchar_t, and sequences of any other character-like type.
Change in 20.3.2 [defns.character]:
[…]
[ Note 1 to entry: The term does not mean only char, char8_t, char16_t, char32_t, and wchar_t objects, but any value that can be represented by a type that provides the definitions specified in these Clauses. — end note ]
Change in 21.3.2 [limits.syn]:
[…]
template<> class numeric_limits<char>;
template<> class numeric_limits<signed char>;
template<> class numeric_limits<unsigned char>;
template<> class numeric_limits<char8_t>;
template<> class numeric_limits<char16_t>;
template<> class numeric_limits<char32_t>;
template<> class numeric_limits<wchar_t>;
[…]
Change in 24.2 [char.traits] paragraph 1:
This subclause defines requirements on classes representing character traits, and defines a class template char_traits<charT>, along with ~~four~~ five specializations, char_traits<char>, char_traits<char8_t>, char_traits<char16_t>, char_traits<char32_t>, and char_traits<wchar_t>, that satisfy those requirements.
Change in 24.2 [char.traits] paragraph 4:
This subclause specifies a class template, char_traits<charT>, and ~~four~~ five explicit specializations of it, char_traits<char>, char_traits<char8_t>, char_traits<char16_t>, char_traits<char32_t>, and char_traits<wchar_t>, all of which appear in the header <string> and satisfy the requirements below.
Drafting note: 24.2p4 appears to unnecessarily duplicate information previously presented in 24.2p1.
Change in 24.2.3 [char.traits.specializations]:
namespace std {
template<> struct char_traits<char>;
template<> struct char_traits<char8_t>;
template<> struct char_traits<char16_t>;
template<> struct char_traits<char32_t>;
template<> struct char_traits<wchar_t>;
}
Change in 24.2.3 [char.traits.specializations] paragraph 1:
The header <string> shall define ~~four~~ five specializations of the class template char_traits: char_traits<char>, char_traits<char8_t>, char_traits<char16_t>, char_traits<char32_t>, and char_traits<wchar_t>.
Add a new subclause after 24.2.3.1 [char.traits.specializations.char]:
24.2.3.? struct char_traits<char8_t> [char.traits.specializations.char8_t]
namespace std {
template<> struct char_traits<char8_t> {
using char_type = char8_t;
using int_type = unsigned int;
using off_type = streamoff;
using pos_type = u8streampos;
using state_type = mbstate_t;
static constexpr void assign(char_type& c1, const char_type& c2) noexcept;
static constexpr bool eq(char_type c1, char_type c2) noexcept;
static constexpr bool lt(char_type c1, char_type c2) noexcept;
static constexpr int compare(const char_type* s1, const char_type* s2, size_t n);
static constexpr size_t length(const char_type* s);
static constexpr const char_type* find(const char_type* s, size_t n,
const char_type& a);
static char_type* move(char_type* s1, const char_type* s2, size_t n);
static char_type* copy(char_type* s1, const char_type* s2, size_t n);
static char_type* assign(char_type* s, size_t n, char_type a);
static constexpr int_type not_eof(int_type c) noexcept;
static constexpr char_type to_char_type(int_type c) noexcept;
static constexpr int_type to_int_type(char_type c) noexcept;
static constexpr bool eq_int_type(int_type c1, int_type c2) noexcept;
static constexpr int_type eof() noexcept;
};
}
Add paragraph 1:
The type u8streampos shall be an implementation-defined type that satisfies the requirements for pos_type in 30.2.2 and 30.3.
Add paragraph 2:
The two-argument members assign, eq, and lt shall be defined identically to the built-in operators =, ==, and < respectively.
Add paragraph 3:
The member eof() shall return an implementation-defined constant that cannot appear as a valid UTF-8 code unit.
Change in 24.3 [string.classes] paragraph 1:
The header <string> defines the basic_string class template for manipulating varying-length sequences of char-like objects and ~~four~~ five typedef-names, string, u8string, u16string, u32string, and wstring, that name the specializations basic_string<char>, basic_string<char8_t>, basic_string<char16_t>, basic_string<char32_t>, and basic_string<wchar_t>, respectively.
Change in 24.3.1 [string.syn]:
Header <string> synopsis
#include <initializer_list>
namespace std {
// 24.2, character traits:
template<class charT> struct char_traits;
template<> struct char_traits<char>;
template<> struct char_traits<char8_t>;
template<> struct char_traits<char16_t>;
template<> struct char_traits<char32_t>;
template<> struct char_traits<wchar_t>;
[…]
// basic_string typedef names
using string = basic_string<char>;
using u8string = basic_string<char8_t>;
using u16string = basic_string<char16_t>;
using u32string = basic_string<char32_t>;
using wstring = basic_string<wchar_t>;
[…]
// 24.3.5, hash support:
template<class T> struct hash;
template<> struct hash<string>;
template<> struct hash<u8string>;
template<> struct hash<u16string>;
template<> struct hash<u32string>;
template<> struct hash<wstring>;
namespace pmr {
template <class charT, class traits = char_traits<charT>>
using basic_string = std::basic_string<charT, traits, polymorphic_allocator<charT>>;
using string = basic_string<char>;
using u8string = basic_string<char8_t>;
using u16string = basic_string<char16_t>;
using u32string = basic_string<char32_t>;
using wstring = basic_string<wchar_t>;
}
inline namespace literals {
inline namespace string_literals {
// 24.3.6, suffix for basic_string literals:
string operator "" s(const char* str, size_t len);
u8string operator "" s(const char8_t* str, size_t len);
u16string operator "" s(const char16_t* str, size_t len);
u32string operator "" s(const char32_t* str, size_t len);
wstring operator "" s(const wchar_t* str, size_t len);
}
}
}
Change in 24.3.5 [basic.string.hash]:
template<> struct hash<string>;
template<> struct hash<u8string>;
template<> struct hash<u16string>;
template<> struct hash<u32string>;
template<> struct hash<wstring>;
Add a new paragraph after 24.3.6 [basic.string.literals] paragraph 1:
u8string operator "" s(const char8_t* str, size_t len);
Returns: u8string{str, len}.
Change in 24.4.1 [string.view.synop]:
[…]
// basic_string_view typedef names
using string_view = basic_string_view<char>;
using u8string_view = basic_string_view<char8_t>;
using u16string_view = basic_string_view<char16_t>;
using u32string_view = basic_string_view<char32_t>;
using wstring_view = basic_string_view<wchar_t>;
// 24.4.5, hash support
template<class T> struct hash;
template<> struct hash<string_view>;
template<> struct hash<u8string_view>;
template<> struct hash<u16string_view>;
template<> struct hash<u32string_view>;
template<> struct hash<wstring_view>;
inline namespace literals {
inline namespace string_view_literals {
// 24.4.6, suffix for basic_string_view literals
constexpr string_view operator""sv(const char* str, size_t len) noexcept;
constexpr u8string_view operator""sv(const char8_t* str, size_t len) noexcept;
constexpr u16string_view operator""sv(const char16_t* str, size_t len) noexcept;
constexpr u32string_view operator""sv(const char32_t* str, size_t len) noexcept;
constexpr wstring_view operator""sv(const wchar_t* str, size_t len) noexcept;
}
}
[…]
Change in 24.4.5 [string.view.hash]:
template<> struct hash<string_view>;
template<> struct hash<u8string_view>;
template<> struct hash<u16string_view>;
template<> struct hash<u32string_view>;
template<> struct hash<wstring_view>;
Add a new paragraph after 24.4.6 [string.view.literals] paragraph 1:
constexpr u8string_view operator""sv(const char8_t* str, size_t len) noexcept;
Returns: u8string_view{str, len}.
Change in 24.5.5 [cuchar.syn]:
namespace std {
using mbstate_t = see below;
using size_t = see 21.2.4;
size_t mbrtoc8(char8_t* pc8, const char* s, size_t n, mbstate_t* ps);
size_t c8rtomb(char* s, char8_t c8, mbstate_t* ps);
size_t mbrtoc16(char16_t* pc16, const char* s, size_t n, mbstate_t* ps);
size_t c16rtomb(char* s, char16_t c16, mbstate_t* ps);
size_t mbrtoc32(char32_t* pc32, const char* s, size_t n, mbstate_t* ps);
size_t c32rtomb(char* s, char32_t c32, mbstate_t* ps);
}
Change in 24.5.5 [cuchar.syn] paragraph 1:
The contents and meaning of the header <cuchar> are the same as the C standard library header <uchar.h>, except that it does not declare types char8_t, char16_t ~~nor~~, or char32_t.
See also: ISO C 7.28
Drafting note: The changes to 24.5.5 are dependent upon WG14 adopting N2231 [WG14 N2231]
Change in table 69 of 25.3.1.1.1 [locale.category]:
Table 69 — Locale category facets
Category | Includes facets |
---|---|
[…] | […] |
ctype | ctype<char>, ctype<wchar_t>, codecvt<char,char,mbstate_t>, ~~codecvt<char16_t,char,mbstate_t>~~, ~~codecvt<char32_t,char,mbstate_t>~~, codecvt<char16_t,char8_t,mbstate_t>, codecvt<char32_t,char8_t,mbstate_t>, codecvt<wchar_t,char,mbstate_t> |
[…] | […] |
Drafting note: The deleted char based codecvt specializations have been deprecated and moved to annex D, [depr.locale.category].
Change in table 70 of 25.3.1.1.1 [locale.category]:
Table 70 — Required specializations
Category | Includes facets |
---|---|
[…] | […] |
ctype | ctype_byname<char>, ctype_byname<wchar_t>, codecvt_byname<char,char,mbstate_t>, ~~codecvt_byname<char16_t,char,mbstate_t>~~, ~~codecvt_byname<char32_t,char,mbstate_t>~~, codecvt_byname<char16_t,char8_t,mbstate_t>, codecvt_byname<char32_t,char8_t,mbstate_t>, codecvt_byname<wchar_t,char,mbstate_t> |
[…] | […] |
Drafting note: The deleted char based codecvt_byname specializations have been deprecated and moved to annex D, [depr.locale.category].
Change in 25.4.1.4 [locale.codecvt] paragraph 3:
The specializations required in Table 69 (25.3.1.1.1) convert the implementation-defined native character set. codecvt<char, char, mbstate_t> implements a degenerate conversion; it does not convert at all. The specialization codecvt<char16_t, ~~char~~ char8_t, mbstate_t> converts between the UTF-16 and UTF-8 encoding forms, and the specialization codecvt<char32_t, ~~char~~ char8_t, mbstate_t> converts between the UTF-32 and UTF-8 encoding forms. codecvt<wchar_t,char,mbstate_t> converts between the native character sets for ~~narrow~~ ordinary and wide characters. Specializations on mbstate_t perform conversion between encodings known to the library implementer. Other encodings can be converted by specializing on a user-defined stateT type. Objects of type stateT can contain any state that is useful to communicate to or from the specialized do_in or do_out members.
Change in 30.3.1 [iosfwd.syn]:
[…]
template<class charT> class char_traits;
template<> class char_traits<char>;
template<> class char_traits<char8_t>;
template<> class char_traits<char16_t>;
template<> class char_traits<char32_t>;
template<> class char_traits<wchar_t>;
[…]
Change in 30.7.4.1 [istream]:
namespace std {
[…]
// 30.7.4.2.3, character extraction templates
template<class charT, class traits>
basic_istream<charT, traits>& operator>>(basic_istream<charT, traits>&, charT&);
template<class traits>
basic_istream<char, traits>& operator>>(basic_istream<char, traits>&, unsigned char&);
template<class traits>
basic_istream<char, traits>& operator>>(basic_istream<char, traits>&, signed char&);
template<class traits>
basic_istream<char, traits>& operator>>(basic_istream<char, traits>&, char8_t&);
template<class charT, class traits>
basic_istream<charT, traits>& operator>>(basic_istream<charT, traits>&, charT*);
template<class traits>
basic_istream<char, traits>& operator>>(basic_istream<char, traits>&, unsigned char*);
template<class traits>
basic_istream<char, traits>& operator>>(basic_istream<char, traits>&, signed char*);
template<class traits>
basic_istream<char, traits>& operator>>(basic_istream<char, traits>&, char8_t*);
}
Change in 30.7.4.2.3 [istream.extractors] paragraph 7:
template<class charT, class traits>
basic_istream<charT, traits>& operator>>(basic_istream<charT, traits>& in, charT* s);
template<class traits>
basic_istream<char, traits>& operator>>(basic_istream<char, traits>& in, unsigned char* s);
template<class traits>
basic_istream<char, traits>& operator>>(basic_istream<char, traits>& in, signed char* s);
template<class traits>
basic_istream<char, traits>& operator>>(basic_istream<char, traits>& in, char8_t* s);
7 Effects: […]
Change in 30.7.4.2.3 [istream.extractors] paragraph 12:
template<class charT, class traits>
basic_istream<charT, traits>& operator>>(basic_istream<charT, traits>& in, charT& c);
template<class traits>
basic_istream<char, traits>& operator>>(basic_istream<char, traits>& in, unsigned char& c);
template<class traits>
basic_istream<char, traits>& operator>>(basic_istream<char, traits>& in, signed char& c);
template<class traits>
basic_istream<char, traits>& operator>>(basic_istream<char, traits>& in, char8_t& c);
12 Effects: […]
Change in 30.7.5.1 [ostream]:
namespace std {
[…]
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>&, signed char);
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>&, unsigned char);
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>&, char8_t);
[…]
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>&, const signed char*);
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>&, const unsigned char*);
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>&, const char8_t*);
}
Change in 30.7.5.2.4 [ostream.inserters.character] paragraph 1:
template<class charT, class traits>
basic_ostream<charT, traits>& operator<<(basic_ostream<charT, traits>& out, charT c);
template<class charT, class traits>
basic_ostream<charT, traits>& operator<<(basic_ostream<charT, traits>& out, char c);
// specialization
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, char c);
// signed, unsigned, and char8_t
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, signed char c);
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, unsigned char c);
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, char8_t c);
1 Effects: […]
Change in 30.7.5.2.4 [ostream.inserters.character] paragraph 3:
template<class charT, class traits>
basic_ostream<charT, traits>& operator<<(basic_ostream<charT, traits>& out, const charT *s);
template<class charT, class traits>
basic_ostream<charT, traits>& operator<<(basic_ostream<charT, traits>& out, const char *s);
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, const char *s);
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, const signed char *s);
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, const unsigned char *s);
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, const char8_t *s);
3 Requires: s shall not be a null pointer.
Change in 30.11.4 [fs.req] paragraph 1:
Throughout this subclause, char, wchar_t, char8_t, char16_t, and char32_t are collectively called encoded character types.
Change in 30.11.5 [fs.filesystem.syn]:
// 30.11.7.6.2, path factory functions
template <class Source>
path u8path(const Source& source);
template <class InputIterator>
path u8path(InputIterator first, InputIterator last);
Drafting note: The deleted u8path factory functions have been deprecated and moved to annex D, [depr.fs.path.factory].
Change in 30.11.7 [fs.class.path] paragraph 6:
[…]
std::string string() const;
std::wstring wstring() const;
std::u8string u8string() const;
std::u16string u16string() const;
std::u32string u32string() const;
[…]
std::string generic_string() const;
std::wstring generic_wstring() const;
std::u8string generic_u8string() const;
std::u16string generic_u16string() const;
std::u32string generic_u32string() const;
[…]
Change in 30.11.7.2.2 [fs.path.type.cvt] paragraph 1:
The native encoding of an ordinary character string is the operating system dependent current encoding for pathnames (30.11.7). The native encoding for wide character strings is the implementation-defined execution wide-character set encoding (5.3).
Change in 30.11.7.2.2 [fs.path.type.cvt] subparagraph (2.1):
(2.1) — char: The encoding is the native ordinary encoding. The method of conversion, if any, is operating system dependent. [ Note: For POSIX-based operating systems path::value_type is char so no conversion from char value type arguments or to char value type return values is performed. For Windows-based operating systems, the native ordinary encoding is determined by calling a Windows API function. — end note ] [ Note: This results in behavior identical to other C and C++ standard library functions that perform file operations using ordinary character strings to identify paths. Changing this behavior would be surprising and error prone. — end note ]
Add a new subparagraph after 30.11.7.2.2 [fs.path.type.cvt] subparagraph (2.2):
(2.?) — char8_t: The encoding is UTF-8. The method of conversion is unspecified.
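As an illustration of the conversion rule above, the following is a minimal sketch of constructing a path directly from a UTF-8 literal. The helper name utf8_path_example is hypothetical. Under the proposed wording the literal has type const char8_t[] and is treated as UTF-8; in C++17 the same literal has type const char[] and is interpreted in the native ordinary encoding instead.

```cpp
#include <filesystem>

// Hypothetical helper (not part of the proposal): builds a path from a
// UTF-8 string literal. With char8_t, the path constructor selects the
// UTF-8 conversion described in subparagraph (2.?) from the literal's type
// alone, so no u8path() factory call is needed.
inline std::filesystem::path utf8_path_example() {
    // \u00E9 is LATIN SMALL LETTER E WITH ACUTE, two code units in UTF-8.
    return std::filesystem::path(u8"r\u00E9sum\u00E9.txt");
}
```

The point of the sketch is that the encoding is carried by the argument type rather than by the choice of factory function.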
Change in 30.11.7.4.1 [fs.path.construct] subparagraph (7.2):
— Otherwise a conversion is performed using the codecvt<wchar_t, char, mbstate_t> facet of loc, and then a second conversion to the current ordinary encoding.
Drafting note: Is the requirement for a second conversion stated above correct? codecvt<wchar_t, char, mbstate_t> already converts to the ordinary character encoding.
Change in 30.11.7.4.1 [fs.path.construct] paragraph 8:
[…]
For POSIX-based operating systems, the path is constructed by first using latin1_facet to convert ISO/IEC 8859-1 encoded latin1_string to a wide character string in the native wide encoding (30.11.7.2.2). The resulting wide string is then converted to an ordinary character pathname string in the current native ordinary encoding. If the native wide encoding is UTF-16 or UTF-32, and the current native ordinary encoding is UTF-8, all of the characters in the ISO/IEC 8859-1 character set will be converted to their Unicode representation, but for other native ordinary encodings some characters may have no representation. […]
Change in 30.11.7.4.6 [fs.path.native.obs] paragraph 8:
std::string string() const;
std::wstring wstring() const;
std::u8string u8string() const;
std::u16string u16string() const;
std::u32string u32string() const;
Returns: native().
Change in 30.11.7.4.6 [fs.path.native.obs] paragraph 9:
Remarks: Conversion, if any, is performed as specified by 30.11.7.2. The encoding of the string returned by u8string() is always UTF-8.
Change in 30.11.7.4.7 [fs.path.generic.obs] paragraph 5:
std::string generic_string() const;
std::wstring generic_wstring() const;
std::u8string generic_u8string() const;
std::u16string generic_u16string() const;
std::u32string generic_u32string() const;
Returns: The pathname in the generic format.
Change in 30.11.7.4.7 [fs.path.generic.obs] paragraph 6:
Remarks: Conversion, if any, is specified by 30.11.7.2. The encoding of the string returned by generic_u8string() is always UTF-8.
Remove subclause 30.11.7.6.2 [fs.path.factory].
template<class Source>
path u8path(const Source& source);
template<class InputIterator>
path u8path(InputIterator first, InputIterator last);
1 Requires: The source and [first, last) sequences are UTF-8 encoded. The value type of Source and InputIterator is char.
2 Returns:
(2.1) — If value_type is char and the current native narrow encoding (30.11.7.2.2) is UTF-8, return path(source) or path(first, last); otherwise,
(2.2) — if value_type is wchar_t and the native wide encoding is UTF-16, or if value_type is char16_t or char32_t, convert source or [first, last) to a temporary, tmp, of type string_type and return path(tmp); otherwise,
(2.3) — convert source or [first, last) to a temporary, tmp, of type u32string and return path(tmp).
3 Remarks: Argument format conversion (30.11.7.2.1) applies to the arguments for these functions. How Unicode encoding conversions are performed is unspecified.
4 [ Example: A string is to be read from a database that is encoded in UTF-8, and used to create a directory using the native encoding for filenames:
namespace fs = std::filesystem;
std::string utf8_string = read_utf8_data();
fs::create_directory(fs::u8path(utf8_string));
For POSIX-based operating systems with the native narrow encoding set to UTF-8, no encoding or type conversion occurs.
For POSIX-based operating systems with the native narrow encoding not set to UTF-8, a conversion to UTF-32 occurs, followed by a conversion to the current native narrow encoding. Some Unicode characters may have no native character set representation.
For Windows-based operating systems a conversion from UTF-8 to UTF-16 occurs. — end example ]
Drafting note: The u8path factory function templates have been deprecated and moved to annex D, [depr.fs.path.factory].
Change in 32.2 [atomics.syn]:
[…]
// 32.5, lock-free property
#define ATOMIC_BOOL_LOCK_FREE unspecified
#define ATOMIC_CHAR_LOCK_FREE unspecified
#define ATOMIC_CHAR8_T_LOCK_FREE unspecified
#define ATOMIC_CHAR16_T_LOCK_FREE unspecified
#define ATOMIC_CHAR32_T_LOCK_FREE unspecified
#define ATOMIC_WCHAR_T_LOCK_FREE unspecified
[…]
using atomic_ullong = atomic<unsigned long long>;
using atomic_char8_t = atomic<char8_t>;
using atomic_char16_t = atomic<char16_t>;
using atomic_char32_t = atomic<char32_t>;
using atomic_wchar_t = atomic<wchar_t>;
Change in 32.5 [atomics.lockfree]:
#define ATOMIC_BOOL_LOCK_FREE unspecified
#define ATOMIC_CHAR_LOCK_FREE unspecified
#define ATOMIC_CHAR8_T_LOCK_FREE unspecified
#define ATOMIC_CHAR16_T_LOCK_FREE unspecified
#define ATOMIC_CHAR32_T_LOCK_FREE unspecified
#define ATOMIC_WCHAR_T_LOCK_FREE unspecified
[…]
Change in 32.6.2 [atomics.types.int] paragraph 1:
There are specializations of the atomic template for the integral types char, signed char, unsigned char, short, unsigned short, int, unsigned int, long, unsigned long, long long, unsigned long long, char8_t, char16_t, char32_t, wchar_t, and any other types needed by the typedefs in the header <cstdint>. […]
[…]
Change in A.6 [gram.dcl]:
[…]
simple-type-specifier: […]
char
char8_t
char16_t
char32_t
wchar_t
[…]
[…]
Change in C.1.1 [diff.lex]:
[…]
Change: String literals made const.
The type of a string literal is changed from "array of char" to "array of const char". The type of a char8_t string literal is changed from "array of some-integer-type" to "array of const char8_t". The type of a char16_t string literal is changed from "array of some-integer-type" to "array of const char16_t". The type of a char32_t string literal is changed from "array of some-integer-type" to "array of const char32_t". The type of a wide string literal is changed from "array of wchar_t" to "array of const wchar_t".
[…]
Change in C.5.1 [diff.cpp17.lex] paragraph 1:
Affected subclause: 5.11
Change: New keywords
Rationale: Required for new features. The requires keyword is added to introduce constraints through a requires-clause or a requires-expression. The concept keyword is added to enable the definition of concepts (17.6.8). The char8_t keyword is added to differentiate the types of ordinary and UTF-8 literals (5.13.5).
Effect on original feature: Valid ISO C++ 2017 code using concept, requires, or char8_t as an identifier is not valid in this International Standard.
Add a new paragraph to C.5.1 [diff.cpp17.lex]:
Affected subclause: 5.13
Change: Type of UTF-8 string and character literals.
Rationale: Required for new features. The changed types enable function overloading, template specialization, and type deduction to distinguish ordinary and UTF-8 string and character literals.
Effect on original feature: Valid ISO C++ 2017 code that depends on UTF-8 string literals having type "array of const char" and UTF-8 character literals having type "char" is not valid in this International Standard.
const auto *u8s = u8"text"; // u8s previously deduced as const char *; now deduced as const char8_t *.
const char *ps = u8s; // ill-formed; previously well-formed.
auto u8c = u8'c'; // u8c previously deduced as char; now deduced as char8_t.
char *pc = &u8c; // ill-formed; previously well-formed.
std::string s = u8"text"; // ill-formed; previously well-formed.
void f(const char *s);
f(u8"text"); // ill-formed; previously well-formed.
template<typename> struct ct;
template<> struct ct<char> { using type = char; };
ct<decltype(u8'c')>::type x; // ill-formed; previously well-formed.
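The rationale above mentions function overloading as one use of the changed literal types. The following is a rough sketch of what that enables, with a hypothetical encoding_of overload set and a hypothetical kHasChar8 constant; neither name comes from the proposal. The char8_t overload is guarded by the __cpp_char8_t feature-test macro so that the sketch also compiles where the type is not available.

```cpp
#include <string>

// Hypothetical constant: true when the compiler provides char8_t.
#if defined(__cpp_char8_t)
constexpr bool kHasChar8 = true;
#else
constexpr bool kHasChar8 = false;
#endif

// Hypothetical overload set: each overload names the encoding implied by
// the code unit type of its argument.
inline std::string encoding_of(const char*)     { return "ordinary"; }
#if defined(__cpp_char8_t)
// Only possible once UTF-8 literals have a distinct code unit type.
inline std::string encoding_of(const char8_t*)  { return "UTF-8"; }
#endif
inline std::string encoding_of(const char16_t*) { return "UTF-16"; }
inline std::string encoding_of(const char32_t*) { return "UTF-32"; }
inline std::string encoding_of(const wchar_t*)  { return "wide"; }
```

Without char8_t, a call such as encoding_of(u8"text") selects the const char* overload and the UTF-8 encoding of the argument cannot be detected; with char8_t it selects the UTF-8 overload.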
Add a new subclause after C.5.4 [diff.cpp17.library]:
C.5.? Clause 30: Input/Output library [diff.cpp17.input.output]
Affected subclause: 30.11.7
Change: Return type of filesystem path format observer member functions.
Rationale: Required for new features.
Effect on original feature: Valid ISO C++ 2017 code that depends on the u8string() and generic_u8string() member functions of std::filesystem::path returning std::string is not valid in this International Standard.
std::filesystem::path p;
std::string s1 = p.u8string(); // ill-formed; previously well-formed.
std::string s2 = p.generic_u8string(); // ill-formed; previously well-formed.
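Code affected by this change may still need the UTF-8 code units in a std::string, for example to pass them to an existing char-based interface. One possible migration, sketched below with a hypothetical helper named u8string_as_bytes, is to copy the code units through the iterator-range constructor, which works whether u8string() returns std::string or std::u8string.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper (not part of the proposal): returns the UTF-8 code
// units of a path as a std::string. The iterator-range constructor converts
// each code unit to char, so the same source compiles both before and after
// the return type of u8string() changes.
inline std::string u8string_as_bytes(const std::filesystem::path& p) {
    const auto u8 = p.u8string();  // std::string in C++17, std::u8string as proposed
    return std::string(u8.begin(), u8.end());
}
```

The resulting std::string no longer records that its contents are UTF-8, so this is best confined to boundaries with interfaces that cannot accept char8_t.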
Add a new subclause after D.18 [depr.conversions]:
D.19 Deprecated locale category facets [depr.locale.category]
1 The ctype locale category includes the following facets in addition to those specified in table 69 of 25.3.1.1.1 [locale.category].
codecvt<char16_t, char, mbstate_t>
codecvt<char32_t, char, mbstate_t>
2 The ctype locale category includes the following facets in addition to those specified in table 70 of 25.3.1.1.1 [locale.category].
codecvt_byname<char16_t, char, mbstate_t>
codecvt_byname<char32_t, char, mbstate_t>
3 The following class template specializations are required in addition to those specified in 25.4.1.4 [locale.codecvt]. The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encoding forms, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding forms.
Add another new subclause after D.18 [depr.conversions]:
D.20 Deprecated filesystem path factory functions [depr.fs.path.factory]
1 The header <filesystem> has the following additions:
namespace std::filesystem {
template <class Source>
path u8path(const Source& source);
template <class InputIterator>
path u8path(InputIterator first, InputIterator last);
}
2 Requires: The source and [first, last) sequences are UTF-8 encoded. The value type of Source and InputIterator is char.
3 Returns:
(3.1) — If value_type is char and the current native narrow encoding (30.11.7.2.2) is UTF-8, return path(source) or path(first, last); otherwise,
(3.2) — if value_type is wchar_t and the native wide encoding is UTF-16, or if value_type is char16_t or char32_t, convert source or [first, last) to a temporary, tmp, of type string_type and return path(tmp); otherwise,
(3.3) — convert source or [first, last) to a temporary, tmp, of type u32string and return path(tmp).
4 Remarks: Argument format conversion (30.11.7.2.1) applies to the arguments for these functions. How Unicode encoding conversions are performed is unspecified.
5 [ Example: A string is to be read from a database that is encoded in UTF-8, and used to create a directory using the native encoding for filenames:
namespace fs = std::filesystem;
std::string utf8_string = read_utf8_data();
fs::create_directory(fs::u8path(utf8_string));
For POSIX-based operating systems with the native narrow encoding set to UTF-8, no encoding or type conversion occurs.
For POSIX-based operating systems with the native narrow encoding not set to UTF-8, a conversion to UTF-32 occurs, followed by a conversion to the current native narrow encoding. Some Unicode characters may have no native character set representation.
For Windows-based operating systems a conversion from UTF-8 to UTF-16 occurs. — end example ]
Drafting note: The contents of paragraph 1 correspond to the text removed from 30.11.5 [fs.filesystem.syn]. The contents of paragraphs 2-5 correspond to the text removed from 30.11.7.6.2 [fs.path.factory].
These changes are relative to P0096R5 (Feature-testing recommendations for C++) [P0096R5].
Add a new subclause before 3.4, "C++17 features":
[3.X] C++Maybe features
[1] The following table itemizes changes in consideration for a future WG21 working draft. (Changes that were made as specified in a core or library issue are not generally included.)
[2] The table is sorted by the section of the standard primarily affected. The "Doc. No." column links to the paper itself on the committee web site. The "Macro Name" column links to the relevant portion of the "Detailed explanation and rationale" section of this document. When the recommendation is to change the value of a macro previously recommended to be defined, the "Value" column links to the table entry for the previous recommendation.
[3] For library features, the "Header" column identifies the header that is expected to define the macro, although the macro may also be predefined. For language features, the macro must be predefined.
Significant changes under consideration
Doc. No. | Title | Primary Section | Macro Name | Value | Header |
---|---|---|---|---|---|
P0482R3 | char8_t: A type for UTF-8 characters and strings | 5.13, 6.7.1 | __cpp_char8_t | 201803 | predefined |
P0482R3 | char8_t: A type for UTF-8 characters and strings | 21.3.2, 24.3.1, 24.4.1, 25.2, 30.10.5, 32.2 | __cpp_lib_char8_t | 201803 | <atomic> <filesystem> <istream> <limits> <locale> <ostream> <string> <string_view> |
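As a sketch of how the __cpp_char8_t macro in the table might be used, the following selects a UTF-8 code unit type that matches the type of u8 literals on the compiler at hand. The names utf8_unit, utf8_string, and utf8_length are hypothetical and are not part of the proposal.

```cpp
#include <cstddef>
#include <string>

// Hypothetical alias: char8_t where the compiler predefines __cpp_char8_t,
// otherwise char, which is the code unit type of u8 literals in C++17.
#if defined(__cpp_char8_t)
using utf8_unit = char8_t;
#else
using utf8_unit = char;
#endif

// Hypothetical string alias built on the selected code unit type.
using utf8_string = std::basic_string<utf8_unit>;

// Counts the code units (not code points) in a null-terminated UTF-8 string.
inline std::size_t utf8_length(const utf8_unit* s) {
    return std::char_traits<utf8_unit>::length(s);
}
```

Because the alias tracks the literal type, a call such as utf8_length(u8"abc") compiles both with and without the feature.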
Michael Spencer and Davide C. C. Italiano first proposed adding a new char8_t fundamental type in P0372R0 [P0372R0].
Thanks to Alisdair Meredith for reviewing wording and providing feedback in advance of the Rapperswil meeting.
[W3Techs] |
"Usage of UTF-8 for websites", W3Techs, 2017. https://w3techs.com/technologies/details/en-utf8/all/all |
[N2249] |
Lawrence Crowl,
"New Character Types in C++", N2249, 2007. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html |
[N4197] |
Richard Smith,
"Adding u8 character literals", N4197, 2014. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4197.html |
[N4713] |
"Working Draft, Standard for Programming Language C++", N4713, 2017. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/n4713.pdf |
[P0096R5] |
Clark Nelson,
"Feature-testing recommendations for C++", P0096R5, 2017. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0096r5.html |
[P0372R0] |
Michael Spencer and Davide C. C. Italiano,
"A type for utf-8 data", P0372R0, 2016. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0372r0.html |
[P0244R2] |
Tom Honermann,
"Text_view: A C++ concepts and range based character encoding and code point enumeration library", P0244R2, 2017. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0244r2.html |
[P0218R1] |
Beman Dawes,
"Adopt the File System TS for C++17", P0218R1, 2016. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0218r1.html |
[WG14 N2231] |
Tom Honermann,
"char8_t: A type for UTF-8 characters and strings", WG14 N2231, 2018. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm |