- From: Misha Wolf <Misha.Wolf@reuters.com>
- Date: Thu, 07 Jun 2001 13:59:55 +0100
- To: xml-editor@w3.org
- Cc: w3c-xml-core-wg@w3.org, w3c-i18n-ig@w3.org
The current discussion on the Unicode Consortium mailing lists re the
exact definition of UTF-8 and re a proposed (per)version of UTF-8 with
different handling of the surrogate blocks has caused me to worry about
the precise definition of UTF-8 as used in the XML specification.
Having taken a look, I remain worried. Consider:
- The first two instances of "UTF-8" in the XML spec are not
  accompanied by an explicit reference.
- The very first instance occurs in the phrase "the UTF-8 and UTF-16
  encodings of 10646". The reader may reasonably infer that s/he
  should look to (some version of) ISO/IEC 10646 for the definition of
  UTF-8.
- The Normative References section provides references for
  "ISO/IEC 10646" (defined there to be ISO/IEC 10646-1993 plus
  amendments AM 1 through AM 7) and for ISO/IEC 10646-2000.
- The third instance of "UTF-8" in the XML spec is accompanied by a
  reference to RFC 2279. This reference is located in the Other
  References section of the XML spec.
- The Unicode 2.0 and Unicode 3.0 definitions of UTF-8 allow
  implementations to accept and interpret UTF-8 octet sequences which
  many of the other definitions of UTF-8 consider to be illegal. These
  octet sequences are constructed by mapping individual surrogates to
  UTF-8, resulting in a supplementary character being represented by
  two 3-octet UTF-8 sequences. This has serious security implications:
  the same character then has more than one octet representation, so
  a check applied to the octets can miss the alternative form.
- Other Unicode Consortium documents tackle these matters in ways that
  appear to be mutually contradictory. They include:
  - Corrigendum to Unicode 3.0.1
    http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html
  - Unicode Technical Report #17, Character Encoding Model
    http://www.unicode.org/unicode/reports/tr17/
  - UTF & BOM
    http://www.unicode.org/unicode/faq/utf_bom.html
<quote>
Similarly, it may map the sequence <ED A0 80 ED B0 80> to the
Unicode values <D800 DC00>, even though it must never generate
it--it must generate the byte sequence <F0 90 80 80> instead.
</quote>
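To make the two behaviours concrete, here is a minimal sketch in Python. It assumes Python's built-in "utf-8" codec stands in for a strict decoder, and it uses the "surrogatepass" error handler to imitate a lenient one. The function names are hypothetical and are not taken from any of the documents cited above.

```python
# Shortest-form UTF-8 for U+10000, the first supplementary character.
CANONICAL = bytes.fromhex("F0 90 80 80")
# The surrogates D800 and DC00, each encoded separately as a 3-octet sequence.
SURROGATE_PAIR_FORM = bytes.fromhex("ED A0 80 ED B0 80")


def decode_strict(octets: bytes) -> str:
    """Decode with Python's default UTF-8 codec, which rejects encoded surrogates."""
    return octets.decode("utf-8")


def decode_lenient(octets: bytes) -> str:
    """Hypothetical lenient decoder: accept encoded surrogates, then pair them up."""
    code_units = octets.decode("utf-8", errors="surrogatepass")
    # Round-trip through UTF-16 so a high/low surrogate pair becomes one character.
    # An unpaired surrogate makes the final decode raise UnicodeDecodeError.
    return code_units.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")


def strict_rejects(octets: bytes) -> bool:
    """Return True if the strict decoder refuses the octet sequence."""
    try:
        decode_strict(octets)
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    print(strict_rejects(SURROGATE_PAIR_FORM))
    print(decode_lenient(SURROGATE_PAIR_FORM) == decode_strict(CANONICAL))
```

Under the lenient decoder the two different octet sequences yield the same character, which is why a filter that inspects only the octets cannot be relied upon when the consumer decodes leniently.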
Please resolve any confusion in the XML specification relating to the
definition of UTF-8 and to the processing of illegal octet sequences.
Thanks,
Misha
Received on Thursday, 7 June 2001 09:04:41 UTC