- From: Kenneth Whistler <kenw@sybase.com>
- Date: Fri, 12 Apr 2002 09:26:14 -0700 (PDT)
- To: duerst@w3.org
- Cc: ietf-charsets@iana.org
Martin,

> At 18:10 02/04/11 -0700, Kenneth Whistler wrote:
>
> >I agree, even though the Unicode Standard only describes UTF-8
> >out to U+10FFFF. 10646 still gives the full scheme to U-7FFFFFFF,
> >and it will be awhile (if ever) before we can change that to
> >deprecate all the 5- and 6-byte values.
>
> I thought ISO had adopted a standing policy on not allocating
> anything beyond U+10FFFF. Ken, do you know the exact status of
> this? Can you tell us?

Mark pointed out the quotation of WG2 Resolution M38.6 in UAX #19
(UTF-32), which was a key point for WG2 in synching up 10646 with the
Unicode concept of no use of code points beyond U+10FFFF, to ensure
interoperability between the UTFs.

However, we can go beyond that now. Amendment 1 to 10646-1:2000
is now officially published. The relevant clause is clause 9.1:

<quote>
9.1 Planes reserved for future standardization

Planes 11 to FF in Group 00 and all planes in any other groups
... are reserved for future standardization, and thus those code
positions shall not be used for any other purpose.

Code positions in these planes do not have a mapping to the
UTF-16 form (see Annex C).

NOTE - To ensure continued interoperability between the UTF-16
form and other coded representations of the UCS, it is intended
that no characters will be allocated to code positions in
Planes 11 to FF in Group 00 or any planes in any other groups.
</quote>

Thus the normative text of the standard says everything beyond
U+10FFFF is reserved, and thou shalt not use it. The note, while
not a normative part of the text, of course, states pretty clearly
the intent of WG2 not to encode past U+10FFFF.

That doesn't constitute a formal "policy" per se, because neither
WG2 nor SC2 has a mechanism for publishing formal policies (only
JTC-1 can do so, in the JTC-1 Guidelines, and those don't apply to
technical content).
However, WG2 has a standing "Principles and Procedures" document
that it uses to guide its technical work on 10646, and the intent
not to encode past U+10FFFF is reiterated there, as well, for the
benefit of the national bodies participating in WG2.

But what I was referring to before was the definition of UTF-8,
per se, in Annex D. That specification is still done in terms of
a transformation of code positions for the entire codespace of
10646, and so defines 1- to 6-octet forms of UTF-8. Changing that
to formally constrain it to 1- to 4-octet forms applicable only
to U+0000..U+10FFFF, as the Unicode Standard does, was a lower
priority for amending 10646, since in effect the unused sequences
past F4 8F BF BF "do no harm". No one uses them, no one *can*
use them, because they refer to code positions that are reserved
and that "shall not be used for any other purpose".

At some point it would be nice to go back and add a little more
clarificatory text to Annex D in 10646, pointing out that the
constraints of Clause 9.1 effectively mean that all UTF-8 sequences
past F4 8F BF BF shall not be used, so that in practice UTF-8
is a 1- to 4-octet transform.

Note also that UTF-8 is described in Annex D as "an alternative
coded representation form for all of the characters of the UCS".
Since Clause 9.1 says no characters will be encoded in Plane 11 or
beyond, this already implies that UTF-8 will not be used as an
alternative representation form for anything in those planes. But it
is easy to miss this point, since the tables spell out everything
for all the code positions to U-7FFFFFFF.

>
> >So I see no good reason
> >right now to put RFC 2279 out of synch with 10646, particularly
> >if it would slow down a revision of RFC 2279 now.
>
> I think the new document should clearly state that codepoints above
> U+10FFFF cannot be encoded in UTF-16, that the Unicode consortium
> won't allocate any codepoints above that, that ISO has some relevant
> policy (if they do),...

See citation above, which is as close as you will get to a "policy".

--Ken
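
To illustrate the point about sequences past F4 8F BF BF, here is a minimal Python sketch, assuming the 1- to 6-octet table as given in RFC 2279. The helper name utf8_full_scheme is hypothetical, and the sketch deliberately does no validation beyond the range check (for instance, it does not exclude surrogate code points). It only shows where the 4-octet boundary falls in the full scheme.

```python
def utf8_full_scheme(cp: int) -> bytes:
    """Encode a code position using the original 1- to 6-octet UTF-8 table.

    Hypothetical helper for illustration only; it covers the full
    U-00000000..U-7FFFFFFF range, including positions that are reserved.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code position outside U-00000000..U-7FFFFFFF")
    if cp < 0x80:
        return bytes([cp])  # 1-octet form
    # (exclusive upper bound, lead-octet marker, number of continuation octets)
    table = [
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    ]
    for limit, lead, ncont in table:
        if cp < limit:
            break
    # Lead octet carries the top bits; each continuation octet carries 6 bits.
    out = [lead | (cp >> (6 * ncont))]
    for i in range(ncont - 1, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)


# Last code position that has a UTF-16 mapping.
print(utf8_full_scheme(0x10FFFF).hex(" ").upper())  # F4 8F BF BF
# First reserved code position (Plane 11): still 4 octets, but past the limit.
print(utf8_full_scheme(0x110000).hex(" ").upper())  # F4 90 80 80
# The 5- and 6-octet forms only arise for reserved code positions.
print(len(utf8_full_scheme(0x200000)), len(utf8_full_scheme(0x7FFFFFFF)))  # 5 6
```

Python's built-in codec follows the 4-octet restriction, so chr(0x10FFFF).encode("utf-8") gives the same F4 8F BF BF, while decoding F4 90 80 80 raises UnicodeDecodeError.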
Received on Friday, 12 April 2002 12:27:01 UTC