- From: Misha Wolf <Misha.Wolf@reuters.com>
- Date: Thu, 07 Jun 2001 18:20:12 +0100
- To: xml-editor@w3.org
- Cc: w3c-xml-core-wg@w3.org, w3c-i18n-ig@w3.org
On 07/06/2001 16:49:58 Martin Duerst wrote:
> This is a followup to Misha's mail.
>
> The very careful analysis below by Peter Constable shows
> that the situation may be a little bit better on the Unicode
> side than Misha's mail implied
It is not.
> (once we are at things like
> Unicode codepoints < U+D800, U+DC00 >, production [2] kicks
> in and we get what we want (document rejected)).
I am not at all concerned with what happens when the parser decides it
has a Unicode code point. As Martin says, production [2] kicks in and,
having done so, kicks out any garbage.
I am *very* concerned with what happens *before* the parser decides it
has a Unicode code point, i.e. during the mapping from a stream of octets
to a stream of code points. Various Unicode documents *explicitly*
permit the mapping from *two* 3-octet UTF-8 sequences to *one* Unicode
code point. In doing so, they disagree with the ISO/IEC 10646
definition of UTF-8, as well as with RFC 2279.
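
To make the distinction concrete, here is a minimal sketch in Python. The
function names are hypothetical, and Python's built-in strict decoder is used
only as a stand-in for a decoder that follows the stricter definition. The
octets ED A0 80 ED B0 80 are the 3-octet forms of the surrogates U+D800 and
U+DC00; a permissive decoder pairs them into U+10000, while a strict decoder
rejects them and accepts only the 4-octet form F0 90 80 80.

```python
def decode_3_octets(seq: bytes) -> int:
    """Decode one sequence of the form 1110xxxx 10xxxxxx 10xxxxxx."""
    if len(seq) != 3 or seq[0] >> 4 != 0b1110:
        raise ValueError("not a 3-octet sequence")
    if any(b >> 6 != 0b10 for b in seq[1:]):
        raise ValueError("bad continuation octet")
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


def lenient_pair_to_code_point(octets: bytes) -> int:
    """Permissive reading (hypothetical helper): two 3-octet sequences
    holding a high and a low surrogate are combined into one code point."""
    if len(octets) != 6:
        raise ValueError("expected exactly two 3-octet sequences")
    hi = decode_3_octets(octets[:3])
    lo = decode_3_octets(octets[3:])
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


def strict_accepts(octets: bytes) -> bool:
    """Strict reading: Python's UTF-8 decoder rejects encoded surrogates."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


six_octets = b"\xED\xA0\x80\xED\xB0\x80"   # U+D800, U+DC00 as 3-octet forms
four_octets = b"\xF0\x90\x80\x80"          # U+10000 as a 4-octet sequence

print(hex(lenient_pair_to_code_point(six_octets)))
print(strict_accepts(six_octets), strict_accepts(four_octets))
```

The two decoders disagree before production [2] ever sees a code point: the
permissive one hands the parser a perfectly legal U+10000, so nothing is left
for production [2] to reject.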
Misha
> But the fact that one has to do such a careful analysis
> means that nobody actually is doing it, and there are all
> kinds of assumptions that implementers make.
>
> The mail below is forwarded without getting Peter's explicit
> permission, but it appeared on the public unicode@unicode.org
> mailing list.
>
> Regards, Martin.
Received on Thursday, 7 June 2001 13:26:04 UTC