- From: <mark.davis@us.ibm.com>
- Date: Mon, 10 Apr 2000 18:58:34 -0600
- To: Tim Bray <tbray@textuality.com>
- cc: John Cowan <jcowan@reutershealth.com>, MURATA Makoto <muraw3c@attglobal.net>, Rick Jelliffe <ricko@gate.sinica.edu.tw>, xml-editor@w3.org, w3c-i18n-ig@w3.org, w3c-xml-core-wg@w3.org
A. There are a great many circumstances where text should not use a BOM. One doesn't want it at the start of the contents of every string or field, for example. If the file system allows charset typing, then it is also useful to avoid the BOM, since doing so prevents problems with concatenation. There are, of course, circumstances where the BOM is quite useful -- no denying that; where a system protocol requires it, as on Windows, it must be included.

B. In the context of XML, I believe the corrected formulation should be:

1. If there is a BOM as the first code point, then that establishes one of several Unicode encoding forms, including the endianness. E.g.:

   00 00 FE FF => UTF-32, big-endian ( ~= UCS-4: see below)
   FF FE 00 00 => UTF-32, little-endian ( ~= UCS-4: see below)
   FE FF       => UTF-16, big-endian
   FF FE       => UTF-16, little-endian
   EF BB BF    => UTF-8

   If there is an XML encoding declaration, and it disagrees with the BOM, it is a fatal error.

2.a. If there is no BOM as the first code point, then "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-32BE", and "UTF-32LE" are treated just like any other encoding. That is, they must have an XML encoding declaration, in which the first characters must be '<?xml'. This looks like the following:

   00 00 00 3C: UTF-32BE
   3C 00 00 00: UTF-32LE
   00 3C 00 3F: UTF-16BE
   3C 00 3F 00: UTF-16LE
   3C 3F 78 6D: UTF-8...

2.b. If there is no BOM as the first code point, then "UTF-16" is treated as an alias for "UTF-16BE", and both "UTF-32" and "UCS-4" are treated as equivalent to "UTF-32BE".

C. Note about "UTF-32": The Unicode Consortium recently proposed to ISO/IEC SC2/WG2 that for interoperability the ranges of UCS-4 and UTF-8 should be restricted to the same range as UTF-16. WG2 has accepted this, and it will be slated for balloting. Once it has been formally accepted (and we see no reason why it will not be), all of the Unicode/10646 encoding forms will have precisely the same range of valid code points, i.e. 0..10FFFF (minus D800..DFFF, *FFFE and *FFFF). At that time, the terms "UTF-32" and "UCS-4" will become simple aliases of one another.

Mark

___

Mark Davis, IBM Center for Java Technology, Cupertino
(408) 777-5850 [fax: 5891], mark.davis@us.ibm.com, president@unicode.org
http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=10275+N.+De+Anza&csz=95014

Tim Bray <tbray@textuality.com>@w3.org on 2000.04.10 17:08:06
Sent by: w3c-i18n-wg-request@w3.org
To: Mark Davis/Cupertino/IBM@IBMUS, John Cowan <jcowan@reutershealth.com>
cc: MURATA Makoto <muraw3c@attglobal.net>, Rick Jelliffe <ricko@gate.sinica.edu.tw>, xml-editor@w3.org, w3c-i18n-ig@w3.org, w3c-xml-core-wg@w3.org
Subject: Re: I18N issues with the XML Specification

At 05:24 PM 4/10/00 -0600, mark.davis@us.ibm.com wrote:
>There are some guidelines in http://www.unicode.org/unicode/faq/#BOM from which I quote:
>
>   3. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used.

I think that this assertion is highly questionable in the general case, and completely false in the context of XML.

-Tim
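For illustration only, the byte patterns in section B of Mark's message could be sketched as a small lookup in Python. This is a rough sketch under stated assumptions, not code from any XML parser: the names `BOMS`, `NO_BOM_SIGNATURES`, `NO_BOM_ALIASES`, `sniff_encoding`, and `resolve_no_bom_label` are hypothetical, and the comparison of the result against the XML encoding declaration (the fatal-error rule in B.1) is left out.

```python
# Hypothetical sketch of the byte patterns in B.1, B.2.a and B.2.b.
# None of these names come from an existing XML library.

# B.1: BOM patterns. The four-byte forms are listed first because
# FF FE 00 00 (UTF-32, little-endian) begins with FF FE (UTF-16, little-endian).
BOMS = [
    (b"\x00\x00\xFE\xFF", "UTF-32BE"),
    (b"\xFF\xFE\x00\x00", "UTF-32LE"),
    (b"\xFE\xFF", "UTF-16BE"),
    (b"\xFF\xFE", "UTF-16LE"),
    (b"\xEF\xBB\xBF", "UTF-8"),
]

# B.2.a: no BOM, so the stream must open with '<?xml' in some encoding.
# The message lists the last row as "UTF-8..."; the label here is only a
# starting guess, and the encoding declaration names the actual encoding.
NO_BOM_SIGNATURES = [
    (b"\x00\x00\x00\x3C", "UTF-32BE"),
    (b"\x3C\x00\x00\x00", "UTF-32LE"),
    (b"\x00\x3C\x00\x3F", "UTF-16BE"),
    (b"\x3C\x00\x3F\x00", "UTF-16LE"),
    (b"\x3C\x3F\x78\x6D", "UTF-8"),
]

# B.2.b: how unqualified labels are read when no BOM is present.
NO_BOM_ALIASES = {
    "UTF-16": "UTF-16BE",
    "UTF-32": "UTF-32BE",
    "UCS-4": "UTF-32BE",
}


def sniff_encoding(data: bytes):
    """Return (label, has_bom); (None, False) if no pattern matches."""
    for prefix, label in BOMS:
        if data.startswith(prefix):
            return label, True
    for prefix, label in NO_BOM_SIGNATURES:
        if data.startswith(prefix):
            return label, False
    return None, False


def resolve_no_bom_label(declared: str) -> str:
    """Apply the B.2.b aliases to a declared label when there is no BOM."""
    label = declared.upper()
    return NO_BOM_ALIASES.get(label, label)
```

A caller would pass the first few bytes of the entity to `sniff_encoding`, then read the encoding declaration using the returned label and compare the two; how strictly "disagrees" is interpreted is not something this sketch settles.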
Received on Monday, 10 April 2000 20:58:49 UTC