Proposed changes to UTF-8 draft from Francois Yergeau on 2003-01-10 (ietf-charsets@w3.org from January to March 2003)

From: Francois Yergeau <FYergeau@alis.com>
Date: Fri, 10 Jan 2003 11:23:46 -0500
To: ietf-charsets@iana.org
Message-id: <F7D4BDA0E5A1D14B99D32C022AEB7366A5078D@alis-2k.alis.domain>

I wish to propose 2 changes to the UTF-8 draft:

(1) restrict to 4-byte sequences, i.e. remove the 5- and 6-byte sequences

(2) refer normatively to Unicode 3.2

The rationale for (1) is that Unicode is restricted to the 0-10FFFF range of
code points and therefore 5- and 6-byte sequences cannot occur.  10646 is
not officially so restricted but has a policy to not encode anything past
10FFFF and has actually removed Private Use Areas beyond 10FFFF to
accommodate Unicode.  Another reason is that there is much Fear, Uncertainty
and Doubt regarding this issue; an example is this mail excerpt received
this morning on the ietf-822@imc.org list:

Bruce Lilly wrote:
>  From the point of view of parsing some stream of octets, 
> according to one "utf-8" specification a certain sequence
> *is* a utf-8 sequence, and according to other "utf-8"
> specifications it is *not* a utf-8 sequence. I.e. one
> cannot design a parser to recognize "utf-8" from a sequence
> of octets unless one specifies *which* of the mutually-incompatible
> "utf-8" specifications is to be used, viz. whether or not the 5-
> and 6-byte sequences are or are not "utf-8".

It seems worthwhile to close that issue once and for all.
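
As a rough illustration of the arithmetic behind (1), the sketch below
computes how many octets a code point needs under the original 1-to-6-octet
scheme.  The function name utf8_length is made up for this example and is
not taken from any spec; the range boundaries are the usual ones for the
6-octet form of UTF-8.

```python
def utf8_length(code_point: int) -> int:
    """Octets needed for a code point under the 1-to-6-octet UTF-8 scheme.

    Hypothetical helper for illustration only.
    """
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    if code_point <= 0x3FFFFFF:
        return 5
    if code_point <= 0x7FFFFFFF:
        return 6
    raise ValueError("code point exceeds the 31-bit range")


# The highest Unicode code point still fits in 4 octets.
print(utf8_length(0x10FFFF))   # 4
# The 5-octet form only starts at 200000, well past 10FFFF.
print(utf8_length(0x200000))   # 5
```

Since every code point up to 10FFFF yields a length of at most 4, a decoder
limited to that range never has a legitimate reason to accept a 5- or 6-byte
lead octet.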

The rationale for (2) is that Unicode 3.2 now has a better, stricter
definition of UTF-8 than 10646.  Specifically, the difference concerns the
encoding of surrogate code points, in the range D800-DFFF.  10646 only has a
Note (presumably non-normative) pointing out that the mapping of those code
points to UTF-8 is undefined; it doesn't make it an error to decode UTF-8 to
those code points, although it discusses other error cases, and therefore
opens the door to the dangerous practice of decoding double-surrogate 6-byte
sequences into a single non-BMP character.  The recent Unicode 3.2 spec of
UTF-8 clearly and squarely forbids this practice and is therefore, IMHO,
what the Internet spec of UTF-8 needs.  Using Unicode is also more
consistent with (1).  10646 could remain as the normative reference for the
characters themselves.
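
To make the difference in (2) concrete, here is a minimal sketch of a
decoder that applies the stricter reading: it rejects any sequence that
decodes to a surrogate code point (D800-DFFF), along with 5- and 6-byte
lead octets, overlong forms, and values past 10FFFF.  The name
decode_strict and the error messages are assumptions made for this example;
this is not text from the Unicode 3.2 or 10646 specifications.

```python
from typing import List


def decode_strict(octets: bytes) -> List[int]:
    """Decode UTF-8 octets to code points, rejecting ill-formed input.

    Hypothetical illustration of a strict decoder, not a reference
    implementation.
    """
    code_points = []
    i = 0
    while i < len(octets):
        lead = octets[i]
        if lead <= 0x7F:
            length, cp, minimum = 1, lead, 0x0
        elif 0xC2 <= lead <= 0xDF:
            length, cp, minimum = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            length, cp, minimum = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF4:
            length, cp, minimum = 4, lead & 0x07, 0x10000
        else:
            # Stray continuation octets, C0/C1, and 5-/6-byte leads land here.
            raise ValueError("invalid lead octet %02X" % lead)
        if i + length > len(octets):
            raise ValueError("truncated sequence")
        for b in octets[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                raise ValueError("invalid continuation octet %02X" % b)
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:
            raise ValueError("overlong encoding")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogate code point %04X" % cp)
        if cp > 0x10FFFF:
            raise ValueError("code point beyond 10FFFF")
        code_points.append(cp)
        i += length
    return code_points


# U+10000 in its proper 4-byte form is accepted.
print([hex(cp) for cp in decode_strict(b"\xF0\x90\x80\x80")])  # ['0x10000']

# The same character as a double-surrogate 6-byte sequence
# (D800 then DC00, each in 3 bytes) is rejected.
try:
    decode_strict(b"\xED\xA0\x80\xED\xB0\x80")
except ValueError as err:
    print("rejected:", err)  # rejected: surrogate code point D800
```

A lenient decoder that let ED A0 80 ED B0 80 through and then paired the two
surrogates would give U+10000 two distinct octet representations, which is
the practice described above as dangerous.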

Opinions?

-- 
François Yergeau

Received on Friday, 10 January 2003 11:24:57 UTC