- From: Misha Wolf <Misha.Wolf@reuters.com>
- Date: Thu, 07 Jun 2001 13:59:55 +0100
- To: xml-editor@w3.org
- Cc: w3c-xml-core-wg@w3.org, w3c-i18n-ig@w3.org
The current discussion on the Unicode Consortium mailing lists re the exact definition of UTF-8 and re a proposed (per)version of UTF-8 with different handling of the surrogate blocks, has caused me to worry about the precise definition of UTF-8 in regard to the XML specification. Having taken a look, I remain worried. Consider:

- The first two instances of "UTF-8" in the XML spec are not accompanied by an explicit reference.

- The very first instance occurs in the phrase "the UTF-8 and UTF-16 encodings of 10646". The reader may reasonably infer that s/he should look to (some version of) ISO/IEC 10646 for the definition of UTF-8.

- The Normative References section provides references for "ISO/IEC 10646" (defined there to be ISO/IEC 10646-1993 plus amendments AM 1 through AM 7) and for ISO/IEC 10646-2000.

- The third instance of "UTF-8" in the XML spec is accompanied by a reference to RFC 2279. This reference is located in the Other References section of the XML spec.

- The Unicode 2.0 and Unicode 3.0 definitions of UTF-8 allow implementations to accept and interpret UTF-8 octet sequences which many of the definitions of UTF-8 consider to be illegal. These octet sequences are constructed by mapping individual surrogates to UTF-8, resulting in a supplementary character being represented by two 3-octet UTF-8 sequences. This has serious security implications.

- Other Unicode Consortium documents tackle these matters in ways that appear to be mutually contradictory. They include:

  - Corrigendum to Unicode 3.0.1
    http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html

  - Unicode Technical Report #17, Character Encoding Model
    http://www.unicode.org/unicode/reports/tr17/

  - UTF & BOM
    http://www.unicode.org/unicode/faq/utf_bom.html

    <quote>
    Similarly, it may map the sequence <ED A0 BF ED B0 80> to the Unicode values <D800 DC00>, even though it must never generate it--it must generate the byte sequence <F0 90 80 80> instead.
    </quote>

Please resolve any confusion in the XML specification relating to the definition of UTF-8 and to the processing of illegal octet sequences.

Thanks,
Misha
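
To make the two kinds of octet sequence concrete, here is a minimal Python sketch. It is an illustration only, not taken from any of the documents cited above. The helper names `strict_decode` and `lenient_decode` are hypothetical, and the sketch assumes Python 3, whose built-in UTF-8 codec rejects encoded surrogates unless the `surrogatepass` error handler is requested. It uses U+10000, whose surrogate pair is <D800 DC00>: mapping each surrogate separately gives the six octets <ED A0 80 ED B0 80>, while the shortest form is <F0 90 80 80>.

```python
# The 4-octet form of U+10000: the only form an encoder should generate.
four_octet = bytes([0xF0, 0x90, 0x80, 0x80])

# The surrogates D800 and DC00, each mapped on its own to a 3-octet sequence.
six_octet = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])


def strict_decode(data):
    """Decode UTF-8, returning None for sequences the strict codec rejects."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def lenient_decode(data):
    """Hypothetical lenient decoder: accepts encoded surrogates and pairs them up."""
    # "surrogatepass" lets individual surrogates through the UTF-8 decoder.
    text = data.decode("utf-8", errors="surrogatepass")
    # A round trip through UTF-16 joins a high/low pair into one character.
    return text.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")


if __name__ == "__main__":
    print(strict_decode(four_octet) == "\U00010000")
    print(strict_decode(six_octet))
    print(lenient_decode(six_octet) == "\U00010000")
```

A strict decoder sees one character in the first sequence and an error in the second, while the lenient decoder yields the same character from both. Any check performed on the octets before a lenient decode can therefore be bypassed by the six-octet form, which is one way the security concern can arise.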
Received on Thursday, 7 June 2001 09:04:41 UTC