Comments on WD-charmod-20011220 from Martin v. Loewis on 2001-12-28 (www-i18n-comments@w3.org from December 2001)

From: Martin v. Loewis <martin@v.loewis.de>
Date: Fri, 28 Dec 2001 15:36:18 +0100
To: www-i18n-comments@w3.org
Message-Id: <200112281436.fBSEaIU03518@mira.informatik.hu-berlin.de>

Looking at the latest revision of the character model, I find some
inconsistencies and omissions in section 3.6.

The introduction of section 3.6 says that it is vitally important that
the *character encoding scheme* is known at all times. Yet, the
remainder of the section fails to properly distinguish between
character encoding forms (CEF) and character encoding schemes (CES).

Specifically, 3.6.1 says that the encoding MUST be UTF-8, UTF-16, or
UTF-32. While UTF-8 does indeed denote a CES, UTF-16 and UTF-32 are
not names of CESs in Unicode 3.1. Unicode 3.0 defines three CESs:
UTF-8, UTF-16BE, and UTF-16LE (with UTF-32BE and UTF-32LE added in
Unicode 3.1). See

http://www.unicode.org/unicode/reports/tr17/

for details. Recommending that UTF-16 can be specified as a character
encoding for W3C standards will likely lead to confusion, as people
will assume that UTF-16 specifies a serialization of Unicode
characters into bytes, which it does not in the Unicode standard.

Assuming that UTF-16 (and likewise UCS-2) specifies a byte encoding
has been a source of problems for many years. People have interpreted
it to mean any of the following things:
1. Supposedly according to ISO 10646, UTF-16 is big-endian.
2. Supposedly according to an RFC, UTF-16 encodings must start
   with a Byte Order Mark.
3. Supposedly according to the same RFC, UTF-16 can be used only
   when a higher-layer protocol specifies byte order or use of
   the BOM.

Please understand that none of these common interpretations is backed
up by the Unicode standard. Instead, Unicode specifies two different
CESs based on the CEF UTF-16: UTF-16BE and UTF-16LE.
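
To make the difference concrete, here is a minimal Python sketch (my
own illustration, not taken from the character model or from UTR#17).
It serializes the same two characters under both CESs and shows that
the byte sequences differ, so a receiver that guesses the wrong byte
order silently gets different characters. The last lines show what
Python's generic "utf-16" codec does when no byte order is named; that
behaviour is a choice made by Python, not something the bare label
implies.

```python
import codecs

# Two characters: U+0041 (LATIN CAPITAL LETTER A) and U+20AC (EURO SIGN).
text = "A\u20ac"

# The same UTF-16 code units (0x0041, 0x20AC) serialized by two different CESs.
be = text.encode("utf-16-be")   # bytes 00 41 20 AC
le = text.encode("utf-16-le")   # bytes 41 00 AC 20

# Reading big-endian bytes as little-endian raises no error here;
# it just yields other characters (U+4100 and U+AC20).
misread = be.decode("utf-16-le")

# Python's generic "utf-16" codec has to pick something: on encoding it
# prepends a BOM and uses the platform's native byte order.
with_bom = text.encode("utf-16")
starts_with_native_bom = with_bom[:2] == codecs.BOM_UTF16

print(be.hex(), le.hex(), [hex(ord(c)) for c in misread], starts_with_native_bom)
```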

Because of this confusion, I would recommend either allowing only
UTF-8 when a CES is needed, or explicitly pointing out that UTF-16 is
meaningless as a CES unless accompanied by a statement on byte order.

UTF-16 would be acceptable when a CEF is needed, e.g. in an API that
offers 16-bit integers as a primitive type.
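
As a rough sketch of that CEF-level view (again my own illustration;
the function name utf16_code_units is hypothetical), the following
Python maps characters to 16-bit code units without involving any byte
order. Byte order only enters in the last step, when the integers are
packed into bytes, and that step is where a CES has to be chosen.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of 16-bit integers."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code points map to a single code unit.
            units.append(cp)
        else:
            # Supplementary code points map to a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))
            units.append(0xDC00 + (cp & 0x3FF))
    return units

# U+0041 followed by U+10400 (a supplementary character).
units = utf16_code_units("A\U00010400")

# Serializing the integers is a separate decision: ">" packs big-endian
# (matching UTF-16BE), "<" packs little-endian (matching UTF-16LE).
as_be = struct.pack(">%dH" % len(units), *units)
as_le = struct.pack("<%dH" % len(units), *units)

print([hex(u) for u in units], as_be.hex(), as_le.hex())
```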

Regards,
Martin

Received on Friday, 28 December 2001 09:36:23 UTC