Update The Reference To The Unicode Standard

Document number: P1025R1
Date: 2018-06-07
Authors: Steve Downey <sdowney2@bloomberg.net>
JeanHeyd Meneide <phdofthehouse@gmail.com>
Martinho Fernandes <cpp@rmf.io>
Audience: Core, LWG, SG16

Changelog

r1 - 2018-06-07: Do not remove the reference to ISO/IEC 10646-1:1993, as it should remain for D.18 to make sense. Remove Fallback Reference section, as it no longer applies. Add Core discussion of not using Unicode Reference until such an algorithm outside of 10646 is proposed.

Abstract

The reference to ISO/IEC 10646 in the C++ Standard should be updated to the stable base standard or any successor standard.

References

P0417R1: C++17 should refer to ISO/IEC 10646 2014 instead of 1994 (R1)

Preferred New Reference

The Unicode Consortium, the entity responsible for the Unicode Standard, documents the preferred citations for the Unicode Standard. The current version is 11.0. We initially believed the existing reference should be changed to:

The Unicode Standard, Version 11.0 or later

For existing purposes, the C++ Standard is only concerned with character codes and encoding forms. To standardise any Unicode text processing, the algorithms and character data will need to be referenced. We initially believed that we might as well add such a reference now. However, we have decided to only focus on updating the ISO/IEC 10646 reference.

Immediate Effects

The edition of ISO/IEC 10646 that the C++ Standard refers to predates UTF-16 and UTF-32 and instead defines UCS2 and UCS4. Moving to a newer edition would make the terms UTF-16 and UTF-32 well defined in the C++ Standard. It has been argued that the ECMAScript standard, which the C++ Standard already references, uses a newer Unicode standard in which those terms are defined, so they are defined for the C++ Standard by transitive reference. If that argument is accepted, then moving to the newer version makes the intent explicit.

In addition, in 1996, as part of amendments 5, 6 and 7, the original set of Hangul characters was removed and re-added at a new location, and Tibetan characters were added again. This places the current citation of "ISO/IEC 10646-1:1993" in the standard in conflict with the version imported by way of the ECMAScript standard. In practice, all implementors adopt the later version for conversion operations.

The Wikipedia article on Unicode has a summary of the changes over the years.

In keeping with the discussion with Core, an undated reference to the Unicode Standard will be introduced only when a paper that actually proposes algorithms defined outside ISO/IEC 10646 is put forward. This paper focuses on fixing the ISO/IEC 10646 reference.

UCS2 and UCS4 in `codecvt` facets

The last proposal to update the Unicode Standard reference, P0417R1, was entangled with the deprecation of UCS2 and UCS4. The remaining references are in the now-deprecated codecvt facets [depr.locale.stdcvt.req]. There is resistance to changing those to UTF-16 and UTF-32 because, particularly for UCS2, there are real changes in behavior. UTF-32 can be viewed as UCS4. UTF-16 cannot be similarly viewed as UCS2. Since there may be users of the facility who depend on the behavior as it was when standardized, this paper does not propose changing the UCS2 references, but instead leaves a normative reference to the old ISO/IEC 10646-1:1993 standard that is used only for those facilities.

Following the discussion with Core, we keep a normative, dated reference to ISO/IEC 10646-1:1993 and add an undated reference to ISO/IEC 10646 to denote the latest edition. ISO/IEC 10646 is a well-behaved standard whose updates will not break the C++ Standard. It is also impossible to observe the difference between UCS4 and UTF-32 in any C++ implementation. The references to UCS4 have therefore been updated to UTF-32, while UCS2 has been left in place because it is semantically and observably different from UTF-16.
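The observable difference between UCS2 and UTF-16 can be illustrated with a minimal sketch. The helper names `encode_utf16` and `representable_in_ucs2` are hypothetical and are not part of this paper's wording or of the standard library. The sketch assumes its input is a valid Unicode scalar value and treats UCS2 as a fixed-width 16-bit form limited to the Basic Multilingual Plane.

```cpp
#include <vector>

// Hypothetical helper (illustration only): encodes one Unicode scalar
// value as UTF-16 code units. Assumes cp is a valid scalar value, i.e.
// not a surrogate and not above U+10FFFF.
std::vector<char16_t> encode_utf16(char32_t cp) {
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single code unit, the same value
        // a UCS2 encoding would use.
        return {static_cast<char16_t>(cp)};
    }
    // Supplementary planes: UTF-16 needs a surrogate pair.
    const char32_t v = cp - 0x10000;
    return {static_cast<char16_t>(0xD800 + (v >> 10)),    // high surrogate
            static_cast<char16_t>(0xDC00 + (v & 0x3FF))}; // low surrogate
}

// Hypothetical helper (illustration only): under the assumption that
// UCS2 is one 16-bit code unit per character, only BMP code points fit.
bool representable_in_ucs2(char32_t cp) {
    return cp <= 0xFFFF;
}
```

For U+20AC both forms yield the single code unit 0x20AC. For U+1F600, `encode_utf16` yields the pair 0xD83D 0xDE00, while `representable_in_ucs2` reports that no UCS2 encoding exists. UTF-32 and UCS4 have no such split, since each stores the code point value directly in one 32-bit code unit.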

`__STDC_ISO_10646__` macro

The macro `__STDC_ISO_10646__` in [cpp.predefined] can be left unchanged. The ISO/IEC 10646 version it refers to will be the latest edition.
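As a small illustration of how a program might inspect this macro, the sketch below wraps it in a function. The name `iso_10646_version` is hypothetical. The macro is conditionally defined, so the sketch returns 0 when the implementation does not define it. When defined, its value is an integer literal of the form yyyymmL.

```cpp
// Hypothetical helper (illustration only): returns the value of
// __STDC_ISO_10646__ if the implementation defines it, or 0 otherwise.
long iso_10646_version() {
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;  // integer literal of the form yyyymmL
#else
    return 0L;                  // macro not defined by this implementation
#endif
}
```

The returned value depends on the implementation, so no particular result should be expected.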

Proposed Changes

Add the wording highlighted in green. Remove the wording ~~highlighted in red~~.

This proposed wording is in relation to N4750.

1.2 Normative references [intro.refs]

Add to paragraph 1, above 1.7:

— ISO/IEC 10646, Information technology — Universal Coded Character Set (UCS)

— ISO/IEC 10646-1:1993, Information technology — Universal Multiple-Octet Coded Character Set (UCS) — Part 1: Architecture and Basic Multilingual Plane

Add after paragraph 4:

[Note—References to ISO/IEC 10646-1:1993 are used only to support deprecated features (D.18).—end note]

D.18 Deprecated standard code conversion facets [depr.locale.stdcvt]

Change paragraph 2, 2.1:

— The facet shall convert between UTF-8 multibyte sequences and UCS2 or ~~UCS4~~UTF-32 (depending on the size of Elem) within the program.

Change paragraph 3, 3.1:

— The facet shall convert between UTF-16 multibyte sequences and UCS2 or ~~UCS4~~UTF-32 (depending on the size of Elem) within the program.