XML erratum: UTF-8

The current discussion on the Unicode Consortium mailing lists about the
exact definition of UTF-8, and about a proposed (per)version of UTF-8
with different handling of the surrogate blocks, has made me worry about
the precise definition of UTF-8 used by the XML specification.
Having taken a look, I remain worried.  Consider:

-  The first two instances of "UTF-8" in the XML spec are not
   accompanied by an explicit reference.

-  The very first instance occurs in the phrase "the UTF-8 and UTF-16
   encodings of 10646".  The reader may reasonably infer that s/he
   should look to (some version of) ISO/IEC 10646 for the definition of
   UTF-8.

-  The Normative References section provides references for
   "ISO/IEC 10646" (defined there to be ISO/IEC 10646-1993 plus
   amendments AM 1 through AM 7) and for ISO/IEC 10646-2000.

-  The third instance of "UTF-8" in the XML spec is accompanied by a
   reference to RFC 2279.  This reference is located in the Other
   References section of the XML spec.

-  The Unicode 2.0 and Unicode 3.0 definitions of UTF-8 allow
   implementations to accept and interpret UTF-8 octet sequences which
   many other definitions of UTF-8 treat as illegal.  These octet
   sequences are constructed by mapping individual surrogates to UTF-8,
   resulting in a supplementary character being represented by two
   3-octet UTF-8 sequences.  This has serious security implications:
   the same character then has more than one octet representation, so
   checks made on the octets can be bypassed.

-  Other Unicode Consortium documents tackle these matters in ways that
   appear to be mutually contradictory.  They include:
   -  Corrigendum to Unicode 3.0.1
      http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html
   -  Unicode Technical Report #17, Character Encoding Model
      http://www.unicode.org/unicode/reports/tr17/
   -  UTF & BOM
      http://www.unicode.org/unicode/faq/utf_bom.html
      <quote>
         Similarly, it may map the sequence <ED A0 BF ED B0 80> to the
         Unicode values <D800 DC00>, even though it must never generate
         it--it must generate the byte sequence <F0 90 80 80> instead.
      </quote>
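
To make the concern about the surrogate-based sequences concrete, here
is a minimal sketch in Python.  The helper names (pair_surrogates,
lenient_decode, strict_decode) are hypothetical and only illustrate the
two behaviours; the built-in "utf-8" codec and its "surrogatepass" error
handler are the only real APIs used.  The sketch uses the octets
<ED A0 80 ED B0 80>, which are the 3-octet forms of D800 and DC00.

```python
# Hypothetical helpers; an illustration, not a reference decoder.

SIX_OCTETS = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])  # D800, DC00 as 3-octet forms
FOUR_OCTETS = bytes([0xF0, 0x90, 0x80, 0x80])             # U+10000 in shortest form


def pair_surrogates(text: str) -> str:
    """Combine each high/low surrogate pair into one supplementary character."""
    out = []
    i = 0
    while i < len(text):
        hi = ord(text[i])
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(text):
            lo = ord(text[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                out.append(chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)))
                i += 2
                continue
        out.append(text[i])  # not part of a pair: keep as-is
        i += 1
    return "".join(out)


def lenient_decode(data: bytes) -> str:
    """Assumed behaviour of a decoder that accepts 3-octet surrogates."""
    return pair_surrogates(data.decode("utf-8", errors="surrogatepass"))


def strict_decode(data: bytes) -> str:
    """Decoder that rejects surrogate code points encoded in UTF-8."""
    return data.decode("utf-8")


if __name__ == "__main__":
    # Two different octet sequences, one character under the lenient rule.
    print(lenient_decode(SIX_OCTETS) == strict_decode(FOUR_OCTETS))
    try:
        strict_decode(SIX_OCTETS)
    except UnicodeDecodeError as exc:
        print("strict decoder rejects the 6-octet form:", exc.reason)
```

Under the lenient rule both octet sequences yield the same character, so
a filter that inspects only the octets can be sidestepped; the strict
decoder avoids this by refusing the 6-octet form outright.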

Please resolve any confusion in the XML specification relating to the
definition of UTF-8 and to the processing of illegal octet sequences.

Thanks,
Misha




Received on Thursday, 7 June 2001 09:04:41 UTC