- From: Misha Wolf <Misha.Wolf@reuters.com>
- Date: Thu, 07 Jun 2001 18:20:12 +0100
- To: xml-editor@w3.org
- Cc: w3c-xml-core-wg@w3.org, w3c-i18n-ig@w3.org
On 07/06/2001 16:49:58 Martin Duerst wrote:

> This is a followup to Misha's mail.
>
> The very careful analysis below by Peter Constable shows
> that the situation may be a little bit better on the Unicode
> side than Misha's mail implied

It is not.

> (once we are at things like
> Unicode codepoints < U+D800, U+DC00 >, production [2] kicks
> in and we get what we want (document rejected).

I am not at all concerned with what happens when the parser decides
it has a Unicode code point. As Martin says, production [2] kicks in
and, having done so, kicks out any garbage.

I am *very* concerned with what happens *before* the parser decides
it has a Unicode code point, i.e. during the mapping from a stream of
octets to a stream of code points. Various Unicode documents
*explicitly* permit the mapping from *two* 3-octet UTF-8 sequences to
*one* Unicode code point. In doing so, they disagree with the
ISO/IEC 10646 definition of UTF-8, as well as with RFC 2279.

Misha

> But the fact that one has to do such a careful analysis
> means that nobody actually is doing it, and there are all
> kinds of assumptions that implementers take.
>
> The mail below is forwarded without getting Peter's explicit
> permission, but it appeared on the public unicode@unicode.org
> mailing list.
>
> Regards, Martin.

-----------------------------------------------------------------
Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of the individual
sender, except where the sender specifically states them to be
the views of Reuters Ltd.
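
The octets-to-code-points step discussed in the message can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the names PAIRED, strict_rejects and decode_lenient are hypothetical, and Python's built-in "utf-8" codec stands in for a decoder that refuses 3-octet sequences encoding surrogates. The lenient path is an assumed model of a decoder that accepts two such sequences and maps them to one code point; it is not taken from any particular parser.

```python
# U+D800 and U+DC00, each written as a 3-octet UTF-8-style sequence.
PAIRED = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])


def strict_rejects(octets: bytes) -> bool:
    """Return True if a strict UTF-8 decoder refuses the octets."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def decode_lenient(octets: bytes) -> str:
    """Hypothetical lenient decoder: accept surrogate sequences, then pair them.

    Step 1 lets the 3-octet surrogate sequences through as lone surrogates.
    Step 2 round-trips through UTF-16 so that a high/low pair collapses
    into a single supplementary code point. An unpaired surrogate still
    fails in step 2.
    """
    surrogates = octets.decode("utf-8", errors="surrogatepass")
    utf16 = surrogates.encode("utf-16-le", errors="surrogatepass")
    return utf16.decode("utf-16-le")


if __name__ == "__main__":
    print(strict_rejects(PAIRED))               # strict path: octets refused
    print(hex(ord(decode_lenient(PAIRED))))     # lenient path: one code point
```

With the lenient path, the six octets arrive at the Char check as the single code point U+10000, which is a legal XML character, so the later check has nothing left to reject. With the strict path, the input is refused before any code point is produced.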
Received on Thursday, 7 June 2001 13:26:04 UTC