RE: internationalization/ISO10646 question from Kenneth Whistler on 2003-01-03 (ietf-charsets@w3.org from January to March 2003)

From: Kenneth Whistler <kenw@sybase.com>
Date: Fri, 03 Jan 2003 11:59:38 -0800 (PST)
To: imcdonald@sharplabs.com
Cc: ietf-charsets@iana.org, kenw@sybase.com
Message-id: <200301031959.LAA19711@birdie.sybase.com>

Ira McDonald stated:

> But Unicode 3.2 (Unicode Standard Annex #28, March 2002) 
> makes very clear in Table 3.1B "Legal UTF-8 Byte Sequences"
> that there is _not_ a 6-byte UTF-8 representation of non-BMP 
> characters.

Correct. And the text of Unicode 4.0 (forthcoming) will make
this absolutely clear for everyone.
  
> 
> Also, section VIII "Relation to ISO/IEC 10646" of Unicode 3.2
> describes ISO Amendment 1 to ISO/IEC 10646-1:2000, which
> limits future ISO/IEC 10646 code point assignments to the 
> range of UTF-16.

Also correct. This is now a done deal, since the Amendment 1 to
10646-1 is published.

The text in Clause 9 of what will be the third version of
10646 (ISO/IEC 10646:2003, with the two parts merged, also
forthcoming) states:

   NOTE -- To ensure continued interoperability between the
   UTF-16 form and other coded representations of the UCS,
   it is intended that no characters will be allocated to code
   positions in Planes 11 to FF in Group 00 or any planes in
   any other groups.
   
And no private use planes are allocated past Plane 10 (hex 10 =
decimal 16), so there is nothing in 10646 to which a 5- or 6-byte
form of UTF-8 can refer, other than code positions intended to
be reserved in perpetuity.
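
As a rough illustration of why the longer forms have nothing to
point at, here is a minimal Python sketch of the byte counts under
the original 31-bit definition of UTF-8. The function name
utf8_length_31bit is made up for this example; it is not part of
any standard or library.

```python
def utf8_length_31bit(code_point: int) -> int:
    """Byte count under the original 31-bit UTF-8 scheme.

    Hypothetical helper for illustration only.
    """
    if code_point < 0:
        raise ValueError("negative code point")
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    if code_point < 0x200000:
        return 4
    if code_point < 0x4000000:
        return 5
    if code_point < 0x80000000:
        return 6
    raise ValueError("beyond the 31-bit UCS code space")


# The last code position of Plane 10 (hex) still fits in 4 bytes.
LAST_UTF16_REACHABLE = 0x10FFFF
print(utf8_length_31bit(LAST_UTF16_REACHABLE))
```

The 5-byte form only starts at code position 200000 (hex), well
past the end of Plane 10, so every position reachable from UTF-16
needs at most 4 bytes.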

> 
> Therefore, UTF-8 is always the _same_ size (4 bytes) for 
> non-BMP characters that both UTF-16 and UTF-32 are.

I think what Murata-san may be worried about are the
ill-formed 6-byte sequences for referring to non-BMP
characters: sequences created by encoding each of a sequence
of two surrogate code points (10646-ese: "unpaired RC-elements")
as a 3-byte "UTF-8" sequence.

Such sequences are unambiguously labelled as ill-formed in
the Unicode Standard. They are illegal in UTF-8 as defined by
Annex D of 10646, and illegal in UTF-8 as defined in the RFC.

But there is a specification for them: CESU-8. And they
do exist in the wild, so to speak, and they may cause
interoperability problems for people using the supplementary
characters.
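
To make the 6-byte case concrete, here is a minimal Python sketch
that encodes a supplementary character the CESU-8 way (each
surrogate as its own 3-byte sequence) and compares it with real
UTF-8. The helper cesu8_encode is hypothetical, written only for
this illustration, and assumes well-formed input with no lone
surrogates.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text CESU-8 style (illustrative helper, not a library call)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair first.
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)
            low = 0xDC00 | (v & 0x3FF)
            for surrogate in (high, low):
                # "surrogatepass" lets Python emit the 3-byte form
                # that strict UTF-8 forbids.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


ch = "\U00010000"                  # first supplementary code point
print(ch.encode("utf-8").hex())    # 4-byte UTF-8 form
print(cesu8_encode(ch).hex())      # 6-byte CESU-8 form

try:
    cesu8_encode(ch).decode("utf-8")   # a strict decoder rejects it
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

A strict UTF-8 decoder refuses the 6-byte form, which is where the
interoperability trouble comes from: one side writes CESU-8, and
the other side either rejects the data or has to be deliberately
lenient about it.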

--Ken

Received on Friday, 3 January 2003 15:00:19 UTC