- From: Martin Duerst <duerst@w3.org>
- Date: Fri, 10 Jan 2003 15:41:37 -0500
- To: Francois Yergeau <FYergeau@alis.com>, ietf-charsets@iana.org
At 11:23 03/01/10 -0500, Francois Yergeau wrote: >I wish to propose 2 changes to the UTF-8 draft: > >(1) restrict to 4-byte sequences, i.e. remove the 5- and 6-byte sequences > >(2) refer normatively to Unicode 3.2 > >The rationale for (1) is that Unicode is restricted to the 0-10FFFF range of >code points and therefore 5- and 6-byte sequences cannot occur. 10646 is >not officially so restricted but has a policy to not encode anything past >10FFFF and has actually removed Private Use Areas beyond 10FFFF to >accomodate Unicode. Another reason is that there is much Fear, Uncertainty >and Doubt regarding this issue; an example is this mail excerpt received >this morning on the ietf-822@imc.org list: > >Bruce Lilly wrote: > > From the point of view of parsing some stream of octets, > > according to one "utf-8" specification a certain sequence > > *is* a utf-8 sequence, and according to other "utf-8" > > specifications is is *not* a utf-8 sequence. I.e. one > > cannot design a parser to recognize "utf-8" from a sequence > > of octets unless one specifies *which* of the mutually-incompatible > > "utf-8" specifications is to be used, viz. whether or not the 5- > > and 6-byte sequnces are or are not "utf-8". > >It seems worthwhile to close that issue once and for all. Just to be sure: Is a 4-byte sequence that encodes a codepoint beyond 10FFFF legal in your new version of the draft or not? Regards, Martin.
Received on Friday, 10 January 2003 16:17:29 UTC