Re: Proposed changes to UTF-8 draft from Martin Duerst on 2003-01-10 (ietf-charsets@w3.org from January to March 2003)

From: Martin Duerst <duerst@w3.org>
Date: Fri, 10 Jan 2003 14:06:29 -0500
To: Keld Jorn Simonsen <keld@dkuug.dk>, Francois Yergeau <FYergeau@alis.com>
Cc: ietf-charsets@iana.org
Message-id: <4.2.0.58.J.20030110135307.03b06290@localhost>
At 17:55 03/01/10 +0100, Keld J��n Simonsen wrote:
>On Fri, Jan 10, 2003 at 11:23:46AM -0500, Francois Yergeau wrote:
> > I wish to propose 2 changes to the UTF-8 draft:
> >
> > (1) restrict to 4-byte sequences, i.e. remove the 5- and 6-byte sequences

Good idea. HTML and XML have restricted themselves to the
UTF-16 range (and therefore 4-byte sequences) for something
like 5 years now.


> > (2) refer normatively to Unicode 3.2

Another good idea. But are you allowed to make such
changes at this stage?


> > The rationale for (1) is that Unicode is restricted to the 0-10FFFF 
> range of
> > code points and therefore 5- and 6-byte sequences cannot occur.  10646 is
> > not officially so restricted but has a policy to not encode anything past
> > 10FFFF and has actually removed Private Use Areas beyond 10FFFF to
> > accomodate Unicode.  Another reason is that there is much Fear, Uncertainty
> > and Doubt regarding this issue; an example is this mail excerpt received
> > this morning on the ietf-822@imc.org list:
>
>I think you should keep the specification aligned with 10646,
>also in the interest in being liberal in what you accept, an old and
>good IETF practice.

It looks like the difference between Unicode and 10646 on this
point is very similar to the difference on the issue of encoding
of surrogate pairs (see below). Unicode is very explicit
about the issue, while 10646 puts it in a note.


 From a recent mail by Ken Whistler to this same mailing list:

 >>>>
The text in Clause 9 of what will be the third version of
10646 (ISO/IEC 10646:2003, with the two parts merged, also
forthcoming), states:

    NOTE -- To ensure continued interoperability between the
    UTF-16 form and other coded representations of the UCS,
    it is intended that no characters will be allocated to code
    positions in Planes 11 to FF in Group 00 or any planes in
    any other groups.

And no private use planes are allocated past Plane 10 (= 16),
so there is nothing to which a 5- or 6-byte form of UTF-8 can
refer to in 10646, other than code positions intended to
be reserved in perpetuity.
 >>>>

As for 'to be liberal in what you accept', that's a good
practice in some circumstances, but not in every case.
If an application should reject overlong encodings and
similar things, for security reasons, I don't see a reason
to accept 5- or 6-byte sequences.


> > The rationale for (2) is that Unicode 3.2 now has a better, stricter
> > definition of UTF-8 than 10646.  Specifically, the difference concerns the
> > encoding of surrogate code points, in the range D800-DFFF.  10646 only 
> has a
> > Note (presumably non-normative) pointing out that the mapping of those code
> > points to UTF-8 is undefined; it doesn't make it an error to decode 
> UTF-8 to
> > those code points, although it discusses other error cases, and therefore
> > opens the door to the dangerous practice of decoding double-surrogate 
> 6-byte
> > sequences into a single non-BMP character.  The recent Unicode 3.2 spec of
> > UTF-8 clearly and squarely forbids this practice and is therefore, IMHO,
> > what the Internet spec of UTF-8 needs.  Using Unicode is also more
> > consistent with (1).  10646 could remain as the normative reference for the
> > characters themselves.
> >
> > Opinions?
>
>I think we should keep ourselves to open standards whenever possible,
>and avoid industry standards like Unicode if we can.

The IETF definitely prefers open standards, but the IETF's
definition of openness may not be the same as yours. The
full text of Unicode, and in particular the relevant
parts of Unicode 3.2, are available on the Web for free.
As far as I understand, that's not the case for 10646.


>10646 is pretty explicit about not using surrogates in UTF-8,
>as far as I know. Always was.

If in doubt, I prefer 'very explicit' to 'pretty explicit',
in particular because this is a security issue.

Regards,    Martin.
Received on Friday, 10 January 2003 14:09:19 UTC