Re: Proposed changes to UTF-8 draft from Martin Duerst on 2003-01-10 (ietf-charsets@w3.org from January to March 2003)

From: Martin Duerst <duerst@w3.org>
Date: Fri, 10 Jan 2003 15:41:37 -0500
To: Francois Yergeau <FYergeau@alis.com>, ietf-charsets@iana.org
Message-id: <4.2.0.58.J.20030110153951.00a9fc98@localhost>

At 11:23 03/01/10 -0500, Francois Yergeau wrote:
>I wish to propose 2 changes to the UTF-8 draft:
>
>(1) restrict to 4-byte sequences, i.e. remove the 5- and 6-byte sequences
>
>(2) refer normatively to Unicode 3.2
>
>The rationale for (1) is that Unicode is restricted to the 0-10FFFF range of
>code points and therefore 5- and 6-byte sequences cannot occur.  10646 is
>not officially so restricted but has a policy to not encode anything past
>10FFFF and has actually removed Private Use Areas beyond 10FFFF to
>accomodate Unicode.  Another reason is that there is much Fear, Uncertainty
>and Doubt regarding this issue; an example is this mail excerpt received
>this morning on the ietf-822@imc.org list:
>
>Bruce Lilly wrote:
> >  From the point of view of parsing some stream of octets,
> > according to one "utf-8" specification a certain sequence
> > *is* a utf-8 sequence, and according to other "utf-8"
> > specifications is is *not* a utf-8 sequence. I.e. one
> > cannot design a parser to recognize "utf-8" from a sequence
> > of octets unless one specifies *which* of the mutually-incompatible
> > "utf-8" specifications is to be used, viz. whether or not the 5-
> > and 6-byte sequnces are or are not "utf-8".
>
>It seems worthwhile to close that issue once and for all.

Just to be sure: Is a 4-byte sequence that encodes a codepoint
beyond 10FFFF legal in your new version of the draft or not?


Regards,     Martin.

Received on Friday, 10 January 2003 16:17:29 UTC