- From: Martin v. Loewis <martin@v.loewis.de>
- Date: Fri, 28 Dec 2001 15:36:18 +0100
- To: www-i18n-comments@w3.org
Looking at the latest revision of the character model, I find some inconsistencies and omissions in section 3.6.

The introduction of section 3.6 says that it is vitally important that the *character encoding scheme* is known at all times. Yet, the remainder of the section fails to properly distinguish between character encoding forms (CEF) and character encoding schemes (CES).

Specifically, 3.6.1 says that the encoding MUST be UTF-8, UTF-16, or UTF-32. While UTF-8 does indeed denote a CES, UTF-16 and UTF-32 are not names of CESs in Unicode 3.1. Unicode 3 has 3 CESs: UTF-8, UTF-16BE, and UTF-16LE (with UTF-32BE and UTF-32LE added in Unicode 3.1). See http://www.unicode.org/unicode/reports/tr17/ for details.

Giving the recommendation that UTF-16 can be specified as a character encoding for W3C standards will likely lead to confusion, as people will think that UTF-16 specifies a serialization of Unicode characters into bytes, which it does not in the Unicode standard.

Assuming that UTF-16 (and likewise UCS-2) specify a byte encoding has been a source of problems for many years. People have interpreted that to mean all of the following things:

1. Supposedly according to ISO 10646, UTF-16 is big-endian.
2. Supposedly according to an RFC, UTF-16 encodings must start with a Byte Order Mark.
3. Supposedly according to the same RFC, UTF-16 can be used only when a higher-layer protocol specifies byte order or use of the BOM.

Please understand that none of these common interpretations is backed up by the Unicode standard. Instead, Unicode specifies two different CESs based on the CEF UTF-16: UTF-16BE and UTF-16LE.

Because of these confusions, I would give the recommendation to allow only UTF-8 when a CES is needed, or to explicitly point out that UTF-16 is meaningless as a CES unless accompanied with a statement on byte order. UTF-16 would be acceptable when a CEF is needed, e.g. in an API that offers 16-bit integers as a primitive type.

Regards,
Martin
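
To illustrate the CEF/CES distinction drawn in the message above, here is a minimal Python sketch. It is not part of the original message: the sample string and the helper name utf16_code_units are assumptions made for the example, and "utf-16-be" / "utf-16-le" are Python's codec labels for the two byte-order-specific schemes. The same sequence of 16-bit code units (the CEF view) yields two different byte sequences depending on the scheme, and decoding with the wrong scheme silently produces different text.

```python
# Sample text (an assumption for this example): one BMP character and
# one supplementary character that needs a surrogate pair in UTF-16.
text = "A\U00010400"


def utf16_code_units(s):
    """Return the UTF-16 code units of s as plain integers (the CEF view).

    Hypothetical helper: it goes through a big-endian serialization only
    as a convenient way to obtain the 16-bit values; the resulting list
    carries no byte order of its own.
    """
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# CEF: a sequence of 16-bit integers.
units = utf16_code_units(text)

# CES: the same code units serialized to bytes in two different ways.
be = text.encode("utf-16-be")
le = text.encode("utf-16-le")

print([hex(u) for u in units])  # code units
print(be.hex(" "))              # big-endian byte serialization
print(le.hex(" "))              # little-endian byte serialization

# Reading big-endian bytes as little-endian raises no error here;
# it simply yields different characters.
misread = be.decode("utf-16-le")
print(misread == text)
```

An API that hands out 16-bit integers only needs the list of code units; anything that writes bytes to a file or a network connection has to commit to one of the two serializations, or state the byte order by some other means.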
Received on Friday, 28 December 2001 09:36:23 UTC