- From: Kenneth Whistler <kenw@sybase.com>
- Date: Fri, 03 Jan 2003 11:59:38 -0800 (PST)
- To: imcdonald@sharplabs.com
- Cc: ietf-charsets@iana.org, kenw@sybase.com
Ira McDonald stated:

> But Unicode 3.2 (Unicode Standard Annex #28, March 2002)
> makes very clear in Table 3.1B "Legal UTF-8 Byte Sequences"
> that there is _not_ a 6-byte UTF-8 representation of non-BMP
> characters.

Correct. And the text of Unicode 4.0 (forthcoming) will make
this absolutely clear for everyone.

> Also, section VIII "Relation to ISO/IEC 10646" of Unicode 3.2
> describes ISO Amendment 1 to ISO/IEC 10646-1:2000, which
> limits future ISO/IEC 10646 code point assignments to the
> range of UTF-16.

Also correct. This is now a done deal, since Amendment 1
to 10646-1 is published. The text in Clause 9 of what will be the
third version of 10646 (ISO/IEC 10646:2003, with the two parts
merged, also forthcoming), states:

  NOTE -- To ensure continued interoperability between the UTF-16
  form and other coded representations of the UCS, it is intended
  that no characters will be allocated to code positions in
  Planes 11 to FF in Group 00 or any planes in any other groups.

And no private use planes are allocated past Plane 10 (= 16),
so there is nothing to which a 5- or 6-byte form of UTF-8 can
refer in 10646, other than code positions intended to be
reserved in perpetuity.

> Therefore, UTF-8 is always the _same_ size (4 bytes) for
> non-BMP characters that both UTF-16 and UTF-32 are.

I think what Murata-san may be worried about are the ill-formed
6-byte sequences for referring to non-BMP characters: sequences
created by encoding each of a sequence of two surrogate code
points (10646-ese: "unpaired RC-elements") as a 3-byte "UTF-8"
sequence.

Such sequences are unambiguously labelled as ill-formed in the
Unicode Standard. They are illegal in UTF-8 defined by Annex D of
10646, and illegal in UTF-8 defined in the RFC. But there is a
specification for them: CESU-8. And they do exist in the wild, so
to speak, and they may cause interoperability problems for people
using the supplementary characters.

--Ken
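
To make the distinction above concrete, here is a minimal Python sketch of
the two encodings of one supplementary character. The function names
(utf8_encode, cesu8_encode, is_well_formed_utf8) are hypothetical and chosen
for illustration only; the sketch relies on Python 3's built-in "utf-8"
codec rejecting encoded surrogates, and it is not taken from any
specification text.

```python
def utf8_encode(cp: int) -> bytes:
    """Well-formed UTF-8 for one Unicode scalar value."""
    return chr(cp).encode("utf-8")


def _three_byte(unit: int) -> bytes:
    """Apply the 3-byte UTF-8 bit layout to a 16-bit value."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def cesu8_encode(cp: int) -> bytes:
    """CESU-8 style encoding: supplementary characters are split into a
    UTF-16 surrogate pair, and each surrogate is encoded as 3 bytes."""
    if cp < 0x10000:
        return utf8_encode(cp)  # BMP characters are the same as in UTF-8
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)    # high (leading) surrogate
    low = 0xDC00 | (v & 0x3FF)   # low (trailing) surrogate
    return _three_byte(high) + _three_byte(low)


def is_well_formed_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    cp = 0x10000  # first supplementary code point
    print(utf8_encode(cp).hex(" "))    # f0 90 80 80
    print(cesu8_encode(cp).hex(" "))   # ed a0 80 ed b0 80
    print(is_well_formed_utf8(cesu8_encode(cp)))  # False
```

The 6-byte result is what a strict UTF-8 decoder has to reject, which is
where the interoperability problems come from when one side emits CESU-8
and the other expects UTF-8.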
Received on Friday, 3 January 2003 15:00:19 UTC