RE: Proposed changes to UTF-8 draft from Misha Wolf on 2003-01-10 (ietf-charsets@w3.org from January to March 2003)

From: Misha Wolf <Misha.Wolf@reuters.com>
Date: Fri, 10 Jan 2003 16:38:49 +0000
To: Francois Yergeau <FYergeau@alis.com>, ietf-charsets@iana.org
Message-id: <T5fb55e93ecc407b70766c@DTCSEUVIG3.dtc.lon.ime.reuters.com>

Makes sense.

Misha

-----Original Message-----
From: Francois Yergeau [mailto:FYergeau@alis.com] 
Sent: 10 January 2003 16:24
To: ietf-charsets@iana.org
Subject: Proposed changes to UTF-8 draft

I wish to propose 2 changes to the UTF-8 draft:

(1) restrict to 4-byte sequences, i.e. remove the 5- and 6-byte sequences

(2) refer normatively to Unicode 3.2

The rationale for (1) is that Unicode is restricted to the 0-10FFFF range of
code points and therefore 5- and 6-byte sequences cannot occur.  10646 is
not officially so restricted but has a policy to not encode anything past
10FFFF and has actually removed Private Use Areas beyond 10FFFF to
accommodate Unicode.  Another reason is that there is much Fear, Uncertainty
and Doubt regarding this issue; an example is this mail excerpt received
this morning on the ietf-822@imc.org list:

Bruce Lilly wrote:
>  From the point of view of parsing some stream of octets, 
> according to one "utf-8" specification a certain sequence
> *is* a utf-8 sequence, and according to other "utf-8"
> specifications it is *not* a utf-8 sequence. I.e. one
> cannot design a parser to recognize "utf-8" from a sequence
> of octets unless one specifies *which* of the mutually-incompatible
> "utf-8" specifications is to be used, viz. whether or not the 5-
> and 6-byte sequences are or are not "utf-8".

It seems worthwhile to close that issue once and for all.
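
As a rough illustration of what change (1) means for a parser, the sketch
below classifies a lead octet by the sequence length it announces.  The
function name and the choice to return 0 for "not a valid lead octet" are
assumptions made for this example, not text from the draft.  Under the
older 6-byte definition, lead octets F8-FB announced 5-byte sequences and
FC-FD announced 6-byte sequences; with the 0-10FFFF restriction nothing
above F4 can start a sequence.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length announced by a lead octet, or 0 if the
    octet cannot start a sequence under the 4-byte restriction.

    Hypothetical helper for illustration only.
    """
    if lead <= 0x7F:
        return 1          # single-octet (US-ASCII) range
    if 0xC2 <= lead <= 0xDF:
        return 2          # C0 and C1 could only begin overlong forms
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4          # F5-F7 would encode values above 10FFFF
    # 80-BF are continuation octets; C0, C1 and F5-FF are never valid
    # lead octets, which includes the former 5- and 6-byte leads F8-FD.
    return 0
```

This checks the lead octet only.  A complete validator would also have to
check the continuation octets, for instance that F4 is followed by an octet
in the range 80-8F so the result stays at or below 10FFFF.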

The rationale for (2) is that Unicode 3.2 now has a better, stricter
definition of UTF-8 than 10646.  Specifically, the difference concerns the
encoding of surrogate code points, in the range D800-DFFF.  10646 only has a
Note (presumably non-normative) pointing out that the mapping of those code
points to UTF-8 is undefined; it doesn't make it an error to decode UTF-8 to
those code points, although it discusses other error cases, and therefore
opens the door to the dangerous practice of decoding double-surrogate 6-byte
sequences into a single non-BMP character.  The recent Unicode 3.2 spec of
UTF-8 clearly and squarely forbids this practice and is therefore, IMHO,
what the Internet spec of UTF-8 needs.  Using Unicode is also more
consistent with (1).  10646 could remain as the normative reference for the
characters themselves.
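
To make the surrogate issue in (2) concrete, here is a minimal sketch of a
decoder for a single 3-byte sequence that treats a result in D800-DFFF as
an error.  The function name and the use of ValueError are assumptions for
this example only.  With such a check the double-surrogate form of U+10000,
the six octets ED A0 80 ED B0 80, fails at its first half, and the only
accepted encoding of that character is the 4-byte sequence F0 90 80 80.

```python
def decode_three_byte(b0: int, b1: int, b2: int) -> int:
    """Decode one 3-byte UTF-8 sequence to a code point.

    Hypothetical helper for illustration; raises ValueError on
    malformed, overlong, or surrogate input.
    """
    if not (0xE0 <= b0 <= 0xEF and 0x80 <= b1 <= 0xBF and 0x80 <= b2 <= 0xBF):
        raise ValueError("not a well-formed 3-byte sequence")
    # Combine 4 payload bits from the lead octet with 6 from each
    # continuation octet.
    cp = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
    if cp < 0x800:
        raise ValueError("overlong encoding")
    if 0xD800 <= cp <= 0xDFFF:
        # The stricter definition: surrogates are never valid in UTF-8.
        raise ValueError("surrogate code point U+%04X is not allowed" % cp)
    return cp
```

For example, decode_three_byte(0xE2, 0x82, 0xAC) yields 0x20AC, while
decode_three_byte(0xED, 0xA0, 0x80) raises ValueError because the octets
would decode to D800.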

Opinions?

-- 
François Yergeau

Received on Friday, 10 January 2003 11:40:12 UTC