- From: Francois Yergeau <FYergeau@alis.com>
- Date: Fri, 10 Jan 2003 11:23:46 -0500
- To: ietf-charsets@iana.org
I wish to propose two changes to the UTF-8 draft:

(1) restrict UTF-8 to sequences of at most 4 bytes, i.e. remove the 5- and 6-byte sequences;
(2) refer normatively to Unicode 3.2.

The rationale for (1) is that Unicode is restricted to the 0-10FFFF range of code points, and therefore 5- and 6-byte sequences cannot occur. 10646 is not officially so restricted, but it has a policy of not encoding anything past 10FFFF and has actually removed the Private Use Areas beyond 10FFFF to accommodate Unicode.

Another reason is that there is much Fear, Uncertainty and Doubt regarding this issue; an example is this mail excerpt, received this morning on the ietf-822@imc.org list:

Bruce Lilly wrote:
> From the point of view of parsing some stream of octets,
> according to one "utf-8" specification a certain sequence
> *is* a utf-8 sequence, and according to other "utf-8"
> specifications is is *not* a utf-8 sequence. I.e. one
> cannot design a parser to recognize "utf-8" from a sequence
> of octets unless one specifies *which* of the mutually-incompatible
> "utf-8" specifications is to be used, viz. whether or not the 5-
> and 6-byte sequnces are or are not "utf-8".

It seems worthwhile to close that issue once and for all.

The rationale for (2) is that Unicode 3.2 now has a better, stricter definition of UTF-8 than 10646. Specifically, the difference concerns the encoding of surrogate code points, in the range D800-DFFF. 10646 only has a Note (presumably non-normative) pointing out that the mapping of those code points to UTF-8 is undefined. It does not make it an error to decode UTF-8 to those code points, although it discusses other error cases, and it therefore opens the door to the dangerous practice of decoding double-surrogate 6-byte sequences (two 3-byte sequences, one per surrogate) into a single non-BMP character. The recent Unicode 3.2 spec of UTF-8 clearly and squarely forbids this practice and is therefore, IMHO, what the Internet spec of UTF-8 needs. Using Unicode is also more consistent with (1).
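
To make the effect of the two changes concrete, here is a rough Python sketch of a decoder that accepts only 1- to 4-byte sequences and treats surrogate code points as errors. It is an illustration only, not text from the draft or from Unicode 3.2; the function name `decode_utf8_strict` and the error messages are hypothetical.

```python
from typing import List


def decode_utf8_strict(data: bytes) -> List[int]:
    """Decode octets into code points, allowing at most 4-byte sequences.

    Hypothetical helper: raises ValueError on 5- and 6-byte forms,
    overlong forms, surrogates (D800-DFFF) and values above 10FFFF.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        # Pick sequence length, payload bits of the lead octet, and the
        # smallest code point that legitimately needs this length.
        if b0 < 0x80:
            length, cp, lowest = 1, b0, 0x0
        elif 0xC2 <= b0 <= 0xDF:
            length, cp, lowest = 2, b0 & 0x1F, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            length, cp, lowest = 3, b0 & 0x0F, 0x800
        elif 0xF0 <= b0 <= 0xF4:
            length, cp, lowest = 4, b0 & 0x07, 0x10000
        else:
            # 80-BF: stray continuation octet; C0-C1: always overlong;
            # F5-F7: beyond 10FFFF; F8-FD: old 5- and 6-byte lead octets.
            raise ValueError("invalid lead octet 0x%02X at offset %d" % (b0, i))

        if i + length > n:
            raise ValueError("truncated sequence at offset %d" % i)

        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                raise ValueError("bad continuation octet at offset %d" % i)
            cp = (cp << 6) | (b & 0x3F)

        if cp < lowest:
            raise ValueError("overlong encoding at offset %d" % i)
        if 0xD800 <= cp <= 0xDFFF:
            # Change (2): surrogate code points are never valid in UTF-8.
            raise ValueError("surrogate U+%04X at offset %d" % (cp, i))
        if cp > 0x10FFFF:
            # Change (1): nothing past 10FFFF is representable.
            raise ValueError("code point above 10FFFF at offset %d" % i)

        out.append(cp)
        i += length
    return out
```

With this sketch, a double-surrogate sequence such as `b"\xED\xA0\x80\xED\xB0\x80"` is rejected at the first surrogate rather than being combined into a non-BMP character, and a 5-byte lead octet such as F8 is rejected outright, so there is only one answer to whether a given octet sequence is UTF-8.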

10646 could remain as the normative reference for the characters themselves.

Opinions?

--
François Yergeau
Received on Friday, 10 January 2003 11:24:57 UTC