Re: XML erratum: UTF-8 from Martin Duerst on 2001-06-07 (xml-editor@w3.org from April to June 2001)

From: Martin Duerst <duerst@w3.org>
Date: Fri, 08 Jun 2001 00:49:58 +0900
To: xml-editor@w3.org
Cc: w3c-xml-core-wg@w3.org, w3c-i18n-ig@w3.org
Message-Id: <4.2.0.58.J.20010607221646.037feaa0@sh.w3.mag.keio.ac.jp>
This is a followup to Misha's mail.

The very careful analysis below by Peter Constable shows
that the situation may be a little bit better on the Unicode
side than Misha's mail implied: once we are at things like
Unicode codepoints < U+D800, U+DC00 >, production [2] kicks
in and we get what we want (document rejected).

But the fact that one has to do such a careful analysis
means that nobody actually is doing it, and there are all
kinds of assumptions that implementers make.

The mail below is forwarded without getting Peter's explicit
permission, but it appeared on the public unicode@unicode.org
mailing list.

Regards,   Martin.



>From: Peter_Constable@sil.org
>Subject: Re: UTF-8 syntax
>To: unicode@unicode.org
>Cc: unicore@unicode.org
>X-Mailer: Lotus Notes Release 5.0.5  September 22, 2000
>Date: Thu, 7 Jun 2001 02:32:45 -0500
>X-MIMETrack: Serialize by Router on DFWCOM/Servers/WCT(Release 5.0.7 
>|March 21, 2001) at
>  06/07/2001 02:32:51 AM
>X-archive-position: 475
>Sender: unicore-bounce@unicode.org
>X-original-sender: Peter_Constable@sil.org
>
>
>[ copying to unicore as I think there are relevant concerns regarding poor
>handling of the definitions in TUS, and more importantly some problems with
>the definitions ]
>
>
>
>On 06/07/2001 12:34:49 AM DougEwell2 wrote:
>
> >But definition D29 says that a UTF must round-trip these invalid code points,
> >so we have no choice but to interpret them as <D800 DC00>.  That is why
> >UTF-8s is ambiguous.  The sequence <ED A0 80 ED B0 80> could be mapped as
> >either <D800 DC00>, because D29 says you have to allow for that, or as
> ><10000>, because that is the real intent.
>
>Well, I don't find round-trip implied in D29, but it does say that the
>mapping from the CCS to 8-bit code sequences is unique:
>
><quote>
>D29  A [Unicode encoding form] transforms each Unicode scalar value into a
>unique sequence of code values.
></quote>
>
>  Thus, U+10000 can be encoded in *only one way* in UTF-8 (or in UTF-8s or
>any other encoding form). D29 states that ambiguity is not allowed.
>
>Also D36 indicates that codepoints are encoded into code units as specified
>in Table 3.1:
>
><quote>
>D36  UTF-8 is the [encoding form] that serializes a Unicode scalar value as
>a sequence of one to four bytes, as specified in Table 3.1.
></quote>
>
>That table clearly requires that U+10000 be encoded as <F0 90 80 80> (and
>D29 tells us that can be the *only* way - not to mention that D36 clearly
>limits to 4-byte sequences). Also, since there is no limitation placed upon
>the range of code points for which this is defined, and since D800 is not
>excluded from the codespace, then U+D800 must be encoded as <ED A0 80>. And
>so, <ED A0 80> must be interpreted as U+D800. Similarly in the case of
>DC00. But the crucial point in this is that ***this is talking about the
>codepoint U+D800 in the Unicode coded character set*** and NOT about a
>UTF-16 code unit. Ditto for DC00. Thus, the definitions as they pertain to
>UTF-8 simply ***do NOT make ANY allowance for <ED A0 80 ED B0 80> to be
>interpreted as U+10000***. It is pure *fantasy* that some have tried to
>conventionalise. (Cf. comments on UTF-8s below.)
>
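To make the encoding argument above concrete, here is a minimal Python sketch of the UTF-8 bit distribution. It is not from the original thread, and the helper name encode_table_3_1 is hypothetical. Following the reading argued in the message, it assumes that D800..DFFF are not excluded from the input range; later versions of Unicode do exclude them.

```python
def encode_table_3_1(cp: int) -> bytes:
    """Distribute the bits of one code point over 1 to 4 bytes.

    Hypothetical helper: it deliberately does NOT reject D800..DFFF,
    mirroring the reading of D36 / Table 3.1 argued in the message.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def hex_bytes(data: bytes) -> str:
    """Format bytes the way the thread writes them, e.g. 'ED A0 80'."""
    return " ".join(f"{b:02X}" for b in data)


print(hex_bytes(encode_table_3_1(0x10000)))  # F0 90 80 80
print(hex_bytes(encode_table_3_1(0xD800)))   # ED A0 80
print(hex_bytes(encode_table_3_1(0xDC00)))   # ED B0 80
```

Under this sketch the six bytes <ED A0 80 ED B0 80> arise only as the encodings of the two separate code points U+D800 and U+DC00; no single input produces them.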
>
>
> >Note that UTF-8 is not ambiguous in this regard, unless you permit these
> >so-called "lenient" processors, which I thought were made non-conformant by
> >the Corrigendum.
>
>The *definitions* before the Corrigendum were ambiguous as to whether the
>unique representation of e.g. U+0020 was supposed to be <20> or <C0 A0>
>etc. The prose note that followed stated "the shortest form that can
>represent those values shall be used", but that wasn't clearly in the
>definition proper. The corrigendum left no doubt.
>
>However, the definitions before the Corrigendum were ***not in any way***
>ambiguous with regard to supplementary plane characters. The ***only***
>sanctioned representation by those definitions was using 4 bytes. The
>Corrigendum did not change that one iota.
>
>As I mentioned in an earlier message, the definitions in Unicode are less
>explicit when it comes to interpretation than they are with regard to
>encoding. For example, D31 says that illegal code value sequences are
>those "that cannot be mapped back to any sequence of Unicode scalar
>values". The problem with this is that the meaning of "mapped back" is
>nowhere defined. The result is that "illegal code value sequence" and
>"irregular code value sequence" are strictly speaking not well defined. We
>are left to infer that "mapped back" means the exact inverse of the mapping
>defined (in the case of UTF-8) in D36. But note: making that inference
>assumes that the mapping in D36 is invertible. That requires that the
>mapping in D36 is injective; i.e. one-to-one, as D29 requires. This
>reinforces that a 6-byte sequence cannot be used to represent a
>supplementary plane character. But note the corollary: 6-byte sequences
>cannot be mapped back to a Unicode scalar value, and therefore *are
>illegal*. This in spite of the fact that D36(c) in the Corrigendum defines
>these as "irregular", which itself is defined in D32 as "ill-formed [but]
>not illegal". Thus, if we make the inference regarding the meaning of
>"mapped back" in D31 that seems likely, then D36(c) is logically
>inconsistent with D32.
>
>Again, this entire business of thinking that a 6-byte UTF-8 sequence can
>mean a supplementary-plane character is absolute hogwash that treats the
>definitions in an incredibly sloppy manner. The only way to maintain the
>notion that <ED A0 80 ED B0 80> can mean U+10000 is to make a different
>inference regarding the meaning of "mapped back" - something other than the
>inverse of the injective mapping in Table 3.1. But that assumption is
>absolutely wide open: we would be left with no limitation as to what
>"mapped back" actually means. Thus, we could map any choice of <A0> or <97>
>or <C0 80> back to U+10000 if we wanted to. But that is absolutely
>ludicrous. Having ruled out the alternative, the notion that a 6-byte
>sequence can be mapped back into a supplementary-plane character, or any
>other character, is absolutely ludicrous.
>
>
> >  The sequence <ED A0 80 ED B0 80> is every bit as much
> >"overlong" as is <C0 80>.
>
>Absolutely.
>
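For comparison, here is a short Python sketch of how one present-day strict decoder treats the sequences discussed above. It is not from the original thread, and the helper name strict_decode is hypothetical. Note the assumption: Python's "utf-8" codec follows later versions of Unicode, in which surrogate code points are excluded from UTF-8 altogether, so it rejects <ED A0 80 ED B0 80> outright rather than yielding < U+D800, U+DC00 >. Only the non-default "surrogatepass" error handler yields the two surrogate code points, and neither mode ever yields U+10000.

```python
def strict_decode(data: bytes):
    """Return code points as 'U+XXXX' strings, or None if ill-formed.

    Hypothetical helper around Python's built-in strict UTF-8 codec.
    """
    try:
        return [f"U+{ord(ch):04X}" for ch in data.decode("utf-8")]
    except UnicodeDecodeError:
        return None


six_byte = bytes.fromhex("ED A0 80 ED B0 80")

print(strict_decode(bytes.fromhex("F0 90 80 80")))  # ['U+10000']
print(strict_decode(six_byte))                      # None (rejected)
print(strict_decode(bytes.fromhex("C0 80")))        # None (overlong)
print(strict_decode(bytes.fromhex("C0 A0")))        # None (overlong)

# With the non-default "surrogatepass" handler, the six bytes come back
# as two separate surrogate code points, not as one supplementary one.
lenient = six_byte.decode("utf-8", "surrogatepass")
print([f"U+{ord(ch):04X}" for ch in lenient])       # ['U+D800', 'U+DC00']
```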
>
>
>That has been UTF-8. Now, coming back to UTF-8s:
>
> >But definition D29 says that a UTF must round-trip these invalid code points,
> >so we have no choice but to interpret them as <D800 DC00>.  That is why
> >UTF-8s is ambiguous.
>
>Not so. All that D29 imposes on UTF-8s is that its mapping from codepoints
>to code units must be injective; i.e. there can be only one sequence for
>any given codepoint. It does not make any further requirements as to the
>nature of the mapping. Therefore, it is possible for UTF-8s to specify that
>the representation of U+10000 is <ED A0 80 ED B0 80> (or anything else, for
>that matter), but it can only specify one representation. D29 requires that
>any UTF-8s, if it were to be defined in Unicode, could *not* be ambiguous.
>
>To the extent that anybody is making use of 6-byte sequences to represent
>supplementary-plane characters today, they are already implementing the
>non-standard UTF-8s. Let's be clear on one thing: they are not implementing
>a variation on UTF-8. (The definitions for UTF-8 do not allow for these
>variations, as I demonstrate above.)
>
>But that does *not* require UTC to reify this as a standard. There are a
>whole lot of people out there using non-standard character encodings. For
>example, there's a bunch of users out there with data in which the code
>unit <80> represents a Devanagari DDHA (or something comparable). Does that
>mean that if a group of users can afford $10,000 and create an organisation
>to represent them, that they can then become full members of the Consortium,
>attend UTC meetings and start convincing the committee that there should be
>a UTF-8x in which the code unit <80> represents U+0922? Of course not. The
>mere existence of implementations should not compel UTC to create a new
>standard encoding form.
>
>The only things that should compel them to do so are (i) if implementations
>are so widespread as to be a de facto standard such that ignoring them would
>amount to becoming irrelevant, or (ii) there are compelling technical
>reasons why it would be A Good Thing. In the case of UTF-8s, the technical
>reasons for UTC to make it a standard encoding form have not been shown to
>be compelling. On the contrary, the reasons *not* to do so are several, and
>at least as compelling if not much more so. As for existing implementations
>of UTF-8s, they are decidedly not widespread at this time. Moreover, it
>will make more sense for UTC to keep that from happening precisely
>because of the lack of compelling technical reasons for it and, more to the
>point, the technical reasons against it. Key among them is the point Rick
>McGowan has made: a multiplicity of encoding forms does not benefit us, but
>only recreates the confusion from which Unicode was intended to extricate
>us.
>
>
>In summary:
>
>- As Doug pointed out, the definitions require that e.g. a code unit
>sequence <ED A0 80 ED B0 80> must be interpreted as the sequence of Unicode
>codepoints < U+D800, U+DC00 > and that it cannot be interpreted as U+10000.
>
>- The definitions in Unicode have never been ambiguous as to the
>representation of supplementary-plane characters in UTF-8 and have never
>allowed for 6-byte sequences. Thus, the entire notion that <ED A0 80 ED B0
>80> can be construed as a UTF-8 sequence meaning U+10000 is grounded in a
>disregard and violation of the definitions of the Unicode standard. I
>maintain that it never should have been and should not now be tolerated.
>
>- D36(c) is logically inconsistent with D32. (Either that, or the
>definitions make the rules of UTF-8 encoding tight but leave the
>interpretation wide open.)
>
>- Contrary to Doug, a UTF-8s could not be made ambiguous if it were defined
>in Unicode. No argument on this basis against a proposed UTF-8s has been
>made.
>
>- There are (presumably) some existing implementations using a private
>encoding form, UTF-8s (the 6-byte "non-shortest" way of representing
>supplementary-plane characters which some have considered deviant UTF-8 but
>which by the definitions cannot be considered in any way to be UTF-8). The
>existence of such implementations does not alone constitute a reason for
>UTC to sanction UTF-8s.
>
>
>
>- Peter
>
>
>---------------------------------------------------------------------------
>Peter Constable
>
>Non-Roman Script Initiative, SIL International
>7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
>Tel: +1 972 708 7485
>E-mail: <peter_constable@sil.org>
>
>
Received on Thursday, 7 June 2001 11:50:17 UTC