Re: XML erratum: UTF-8

On 2001-06-07 I reported [1] that the XML specification doesn't make
100% clear what it means by UTF-8.  Additional points are:

-  ISO/IEC 10646-1993 has been superseded by a newer version, hence its
   use as a normative reference is problematic.  I am not sure how an
   implementor would go about obtaining ISO/IEC 10646-1993 or, more
   specifically, Amendment 2 to ISO/IEC 10646-1993, which defined UTF-8.

-  RFC 2279 also relies on ISO/IEC 10646-1993 for a normative definition
   of UTF-8.

-  RFC 2279 section "6. Security Considerations" does not mention the
   problem of duplicate UTF-8 forms for non-BMP characters, though it
   does warn against other duplicate UTF-8 forms.

-  RFC 2279 section "2. UTF-8 definition" ends with:

      A more detailed algorithm and formulae can be found in [FSS_UTF],
      [UNICODE] or Annex R to [ISO-10646].

   where [UNICODE] refers to:

      The Unicode Consortium, "The Unicode Standard -- Version 2.0",
      Addison-Wesley, 1996.

   This, in turn, has been superseded by newer versions, which change
   the [Unicode Consortium's] definition of UTF-8.
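
To make the duplicate-form problem for non-BMP characters concrete, here
is a minimal sketch, assuming Python 3 and its built-in codecs.  The
helper names (shortest_form, surrogate_wise_form, strict_decoder_accepts)
are hypothetical and come from no specification; the sketch only shows
that one supplementary character can be written as two different octet
sequences, and that a strict decoder accepts only one of them.

```python
# Illustrative sketch only; helper names are hypothetical.

def shortest_form(cp: int) -> bytes:
    """Encode one code point as standard (shortest-form) UTF-8."""
    return chr(cp).encode("utf-8")


def surrogate_wise_form(cp: int) -> bytes:
    """Encode a supplementary code point by mapping each UTF-16 surrogate
    to its own 3-octet sequence (the irregular 6-octet form)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)    # high (leading) surrogate
    low = 0xDC00 + (v & 0x3FF)   # low (trailing) surrogate
    # "surrogatepass" makes Python's codec emit each surrogate as 3 octets.
    return (chr(high) + chr(low)).encode("utf-8", "surrogatepass")


def strict_decoder_accepts(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the octets."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def as_hex(data: bytes) -> str:
    """Format octets as space-separated upper-case hex."""
    return " ".join(f"{b:02X}" for b in data)


if __name__ == "__main__":
    cp = 0x10000  # first supplementary code point
    for label, octets in (("shortest", shortest_form(cp)),
                          ("surrogate-wise", surrogate_wise_form(cp))):
        print(label, as_hex(octets), strict_decoder_accepts(octets))
```

A decoder that also accepted the 6-octet form would map both sequences
to the same character, which is why any check performed on the raw
octets before decoding could be bypassed.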

[1] http://lists.w3.org/Archives/Public/xml-editor/2001AprJun/0009

Misha


On 07/06/2001 13:59:55 Misha Wolf wrote:
> The current discussion on the Unicode Consortium mailing lists re the
> exact definition of UTF-8 and re a proposed (per)version of UTF-8 with
> different handling of the surrogate blocks, has caused me to worry about
> the precise definition of UTF-8 in regard to the XML specification.
> Having taken a look, I remain worried.  Consider:
>
> -  The first two instances of "UTF-8" in the XML spec are not
>    accompanied by an explicit reference.
>
> -  The very first instance occurs in the phrase "the UTF-8 and UTF-16
>    encodings of 10646".  The reader may reasonably infer that s/he
>    should look to (some version of) ISO/IEC 10646 for the definition of
>    UTF-8.
>
> -  The Normative References section provides references for
>    "ISO/IEC 10646" (defined there to be ISO/IEC 10646-1993 plus
>    amendments AM 1 through AM 7) and for ISO/IEC 10646-2000.
>
> -  The third instance of "UTF-8" in the XML spec is accompanied by a
>    reference to RFC 2279.  This reference is located in the Other
>    References section of the XML spec.
>
> -  The Unicode 2.0 and Unicode 3.0 definitions of UTF-8 allow
>    implementations to accept and interpret UTF-8 octet sequences which
>    many of the definitions of UTF-8 consider to be illegal.  These octet
>    sequences are constructed by mapping individual surrogates to UTF-8,
>    resulting in a supplementary character being represented by two
>    3-octet UTF-8 sequences.  This has serious security implications.
>
> -  Other Unicode Consortium documents tackle these matters in ways that
>    appear to be mutually contradictory.  They include:
>    -  Corrigendum to Unicode 3.0.1
>       http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html
>    -  Unicode Technical Report #17, Character Encoding Model
>       http://www.unicode.org/unicode/reports/tr17/
>    -  UTF & BOM
>       http://www.unicode.org/unicode/faq/utf_bom.html
>       <quote>
>          Similarly, it may map the sequence <ED A0 80 ED B0 80> to the
>          Unicode values <D800 DC00>, even though it must never generate
>          it--it must generate the byte sequence <F0 90 80 80> instead.
>       </quote>
>
> Please resolve any confusion in the XML specification relating to the
> definition of UTF-8 and to the processing of illegal octet sequences.
>
> Thanks,
> Misha
>
>


Received on Friday, 15 June 2001 14:09:36 UTC