Re: XML erratum: UTF-8 from Misha Wolf on 2001-06-07 (xml-editor@w3.org from April to June 2001)

From: Misha Wolf <Misha.Wolf@reuters.com>
Date: Thu, 07 Jun 2001 18:20:12 +0100
To: xml-editor@w3.org
Cc: w3c-xml-core-wg@w3.org, w3c-i18n-ig@w3.org
Message-Id: <B0017126128@euvig1.dtc.lon.ime.reuters.com>

On 07/06/2001 16:49:58 Martin Duerst wrote:
> This is a followup to Misha's mail.
>
> The very careful analysis below by Peter Constable shows
> that the situation may be a little bit better on the Unicode
> side than Misha's mail implied

It is not.

> (once we are at things like
> Unicode codepoints < U+D800, U+DC00 >, production [2] kicks
> in and we get what we want (document rejected)).

I am not at all concerned with what happens when the parser decides it
has a Unicode code point.  As Martin says, production [2] kicks in and,
having done so, kicks out any garbage.
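
For readers without the spec to hand, the check that production [2]
(Char) of XML 1.0 performs can be sketched roughly as follows.  This
is only an illustration in Python; the function name is_xml_char is
hypothetical and not taken from any parser.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if the code point matches XML 1.0 production [2] (Char)."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# The surrogate range U+D800..U+DFFF falls in the gap between the
# second and third ranges, so lone surrogates are rejected.
surrogates_ok = is_xml_char(0xD800) or is_xml_char(0xDC00)
supplementary_ok = is_xml_char(0x10000)
```

Note that this check only ever sees code points; it has no knowledge
of which octets produced them.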

I am *very* concerned with what happens *before* the parser decides it
has a Unicode code point, i.e. during the mapping from a stream of octets
to a stream of code points.  Various Unicode documents *explicitly*
permit the mapping from *two* 3-octet UTF-8 sequences to *one* Unicode
code point.  In doing so, they disagree with the ISO/IEC 10646
definition of UTF-8, as well as with RFC 2279.
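
To make the octet-level issue concrete, here is a minimal Python
sketch.  It assumes U+10000 as the example character; the helper
names decode_3octet and lenient_decode_pair are hypothetical and only
mimic the permissive behaviour described above, they are not taken
from any real decoder.  Python's own "utf-8" codec stands in for a
strict decoder.

```python
# U+10000 as a single 4-octet UTF-8 sequence.
FOUR_OCTET = b"\xf0\x90\x80\x80"
# The same character as two 3-octet sequences, one per surrogate
# (U+D800 followed by U+DC00).
TWO_THREE_OCTET = b"\xed\xa0\x80\xed\xb0\x80"


def decode_3octet(seq: bytes) -> int:
    """Decode one 3-octet sequence arithmetically, with no surrogate check."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def lenient_decode_pair(data: bytes) -> int:
    """Hypothetical permissive mapping: two 3-octet sequences -> one code point."""
    hi = decode_3octet(data[0:3])
    lo = decode_3octet(data[3:6])
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


def strict_decode(data: bytes) -> list[int]:
    """Strict decoding: encoded surrogates raise UnicodeDecodeError."""
    return [ord(ch) for ch in data.decode("utf-8")]
```

With the permissive mapping, both octet streams yield the single code
point U+10000, which production [2] accepts, so the parser never sees
the surrogates.  With the strict mapping, the six-octet form is
rejected before any code point is produced.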

Misha

> But the fact that one has to do such a careful analysis
> means that nobody actually is doing it, and there are all
> kinds of assumptions that implementers make.
>
> The mail below is forwarded without getting Peter's explicit
> permission, but it appeared on the public unicode@unicode.org
> mailing list.
>
> Regards,   Martin.

Received on Thursday, 7 June 2001 13:26:04 UTC