Re: I18N issues with the XML Specification from mark.davis@us.ibm.com on 2000-04-11 (xml-editor@w3.org from April to June 2000)

From: <mark.davis@us.ibm.com>
Date: Mon, 10 Apr 2000 18:58:34 -0600
To: Tim Bray <tbray@textuality.com>
cc: John Cowan <jcowan@reutershealth.com>, MURATA Makoto <muraw3c@attglobal.net>, Rick Jelliffe <ricko@gate.sinica.edu.tw>, xml-editor@w3.org, w3c-i18n-ig@w3.org, w3c-xml-core-wg@w3.org
Message-ID: <872568BE.00055D60.00@d53mta08h.boulder.ibm.com>

A. There are a great many circumstances where text should not use a BOM.
One doesn't want it at the start of the contents of every string or field,
for example. If the file system allows charset typing, then the BOM is also
best avoided, since omitting it prevents problems with concatenation. There
are, of course, circumstances where the BOM is quite useful -- no denying
that; where a system protocol requires it, as on Windows, it must be included.

B. In the context of XML, I believe the corrected formulation should be:

1. If there is a BOM as the first code point, then that establishes one of
several Unicode encoding forms, including the endianness. E.g.:

00 00 FE FF => UTF-32, big-endian ( ~= UCS-4: see below)
FF FE 00 00 => UTF-32, little-endian ( ~= UCS-4: see below)

FE FF => UTF-16, big-endian
FF FE => UTF-16, little-endian

EF BB BF => UTF-8

If there is an XML encoding declaration, and it disagrees with the BOM, it
is a fatal error.
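
A rough sketch of rule 1 in Python, for illustration only. The names
detect_bom and check_declaration are hypothetical, and what counts as
"agreeing" with the BOM is my assumption here (the exact form, or the
endianness-neutral family name); the rule above does not spell that out.

```python
# Rule B.1 sketch: map a leading BOM to a Unicode encoding form.
# Order matters: the UTF-32 patterns are tested before the UTF-16 ones,
# because FF FE 00 00 also begins with FF FE.
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def check_declaration(bom_encoding: str, declared: str) -> None:
    """Raise ValueError when the declared encoding disagrees with the BOM.

    Assumption (not from the text above): a declaration agrees if it names
    the exact form, or the family without endianness ("UTF-16", "UTF-32",
    or "UCS-4" for UTF-32). Names are compared case-insensitively.
    """
    name = declared.upper()
    if bom_encoding.endswith(("BE", "LE")):
        family = bom_encoding[:-2]
    else:
        family = bom_encoding
    accepted = {bom_encoding, family}
    if family == "UTF-32":
        accepted.add("UCS-4")
    if name not in accepted:
        # Stands in for the fatal error described in rule 1.
        raise ValueError(f"declared {declared!r} disagrees with BOM ({bom_encoding})")
```

For example, detect_bom(b"\xff\xfe<\x00?\x00") reports UTF-16LE with a
2-byte BOM, and a declaration of "ISO-8859-1" on that stream would raise.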

2.a. If there is no BOM as the first codepoint, then "UTF-8", "UTF-16BE",
"UTF-16LE", "UTF-32BE", and "UTF-32LE" are treated just like any other
encoding. That is, they must have an XML encoding declaration, in which the
first characters must be '<?xml'. This looks like the following:

00 00 00 3C: UTF-32BE
3C 00 00 00: UTF-32LE
00 3C 00 3F: UTF-16BE
3C 00 3F 00: UTF-16LE
3C 3F 78 6D: UTF-8...
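
The table above could be checked along these lines (a Python sketch;
sniff_without_bom is a hypothetical name). It only decides how to read
the first characters well enough to reach the encoding declaration; the
declaration itself still names the actual encoding.

```python
# Rule B.2.a sketch: with no BOM, the document must start with '<?xml',
# so the first four bytes reveal the code unit width and byte order.
SIGNATURES = [
    (b"\x00\x00\x00\x3c", "UTF-32BE"),
    (b"\x3c\x00\x00\x00", "UTF-32LE"),
    (b"\x00\x3c\x00\x3f", "UTF-16BE"),
    (b"\x3c\x00\x3f\x00", "UTF-16LE"),
    # The table row for this pattern ends in "...", which this sketch
    # does not try to expand; it simply reports "UTF-8".
    (b"\x3c\x3f\x78\x6d", "UTF-8"),
]


def sniff_without_bom(data: bytes):
    """Return the encoding implied by the first four bytes, or None."""
    head = data[:4]
    for signature, name in SIGNATURES:
        if head == signature:
            return name
    return None
```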

2.b. If there is no BOM as the first codepoint, then "UTF-16" is treated as
an alias for "UTF-16BE", and both "UTF-32" and "UCS-4" are treated as
equivalent to "UTF-32BE".
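
In code, 2.b amounts to a small alias table consulted only when no BOM
was found. A sketch in Python; resolve_without_bom is a hypothetical
name, and folding the declared name to upper case is my assumption.

```python
# Rule B.2.b sketch: endianness-neutral names default to big-endian
# when the stream carries no BOM.
NO_BOM_ALIASES = {
    "UTF-16": "UTF-16BE",
    "UTF-32": "UTF-32BE",
    "UCS-4": "UTF-32BE",
}


def resolve_without_bom(declared: str) -> str:
    """Map a declared encoding name to the form used when no BOM is present."""
    name = declared.upper()  # assumption: names compare case-insensitively
    return NO_BOM_ALIASES.get(name, name)
```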

C. Note about "UTF-32": The Unicode Consortium recently proposed to ISO/IEC
SC2/WG2 that for interoperability the ranges of UCS-4 and UTF-8 should be
restricted to the same range as UTF-16. WG2 has accepted this, and it will
be slated for balloting. Once it has been formally accepted (and we see no
reason why it will not be), all of the Unicode/10646 encoding forms will
have precisely the same range of valid codepoints, i.e. 0..10FFFF (minus
D800..DFFF, *FFFE and *FFFF).
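
Spelled out as a check (a Python sketch; is_valid_codepoint is a
hypothetical name, and I am reading "*FFFE and *FFFF" as the last two
code points of every plane):

```python
def is_valid_codepoint(cp: int) -> bool:
    """True if cp lies in 0..10FFFF, minus D800..DFFF, *FFFE and *FFFF."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogate range, reserved for UTF-16 pairs.
        return False
    if (cp & 0xFFFE) == 0xFFFE:
        # Low 16 bits are FFFE or FFFF, in any plane.
        return False
    return True
```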

At that time, the terms "UTF-32" and "UCS-4" will become simple aliases of
one another.

Mark
___
Mark Davis, IBM Center for Java Technology, Cupertino
(408) 777-5850 [fax: 5891], mark.davis@us.ibm.com, president@unicode.org
http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=10275+N.+De+Anza&csz=95014



Tim Bray <tbray@textuality.com>@w3.org on 2000.04.10 17:08:06

Sent by:  w3c-i18n-wg-request@w3.org


To:   Mark Davis/Cupertino/IBM@IBMUS, John Cowan <jcowan@reutershealth.com>
cc:   MURATA Makoto <muraw3c@attglobal.net>, Rick Jelliffe
      <ricko@gate.sinica.edu.tw>, xml-editor@w3.org, w3c-i18n-ig@w3.org,
      w3c-xml-core-wg@w3.org
Subject:  Re: I18N issues with the XML Specification



At 05:24 PM 4/10/00 -0600, mark.davis@us.ibm.com wrote:
>There are some guidelines in http://www.unicode.org/unicode/faq/#BOM

from which I quote:

 3.Where the precise type of the data stream is known (e.g. Unicode
   big-endian or Unicode little-endian), the BOM should not be used.

I think that this assertion is highly questionable in the general case,
and completely false in the context of XML. -Tim

Received on Monday, 10 April 2000 20:58:49 UTC