- From: <mark.davis@us.ibm.com>
- Date: Tue, 11 Apr 2000 12:15:43 -0600
- To: François Yergeau <yergeau@alis.com>
- cc: "'Tim Bray'" <tbray@textuality.com>, "'John Cowan'" <jcowan@reutershealth.com>, "'MURATA Makoto'" <muraw3c@attglobal.net>, "'Rick Jelliffe'" <ricko@gate.sinica.edu.tw>, xml-editor@w3.org, w3c-i18n-ig@w3.org, w3c-xml-core-wg@w3.org
1. UTF-8: Right -- I wrote too hurriedly.

2. UTF-16: I realize that this would be a change for the spec. Neither the UTC nor the RFC requires a BOM with a designation of UTF-16. If there is none, the text is assumed to be big-endian. Of course, the XML spec can impose a further restriction on the use of that designation. The only reason to do so would be forward compatibility, but that reason may be compelling enough to require use of a BOM when there is no "LE" or "BE" suffix.

3. UTF-32/UCS-4: The use of UTF-32 should parallel UTF-16.

Mark
___
Mark Davis, IBM Center for Java Technology, Cupertino
(408) 777-5850 [fax: 5891], mark.davis@us.ibm.com, president@unicode.org
http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=10275+N.+De+Anza&csz=95014

François Yergeau <yergeau@alis.com>@w3.org on 2000.04.10 19:51:41
Sent by: w3c-i18n-wg-request@w3.org
To: Mark Davis/Cupertino/IBM@IBMUS, "'Tim Bray'" <tbray@textuality.com>
cc: "'John Cowan'" <jcowan@reutershealth.com>, "'MURATA Makoto'" <muraw3c@attglobal.net>, "'Rick Jelliffe'" <ricko@gate.sinica.edu.tw>, xml-editor@w3.org, w3c-i18n-ig@w3.org, w3c-xml-core-wg@w3.org
Subject: RE: I18N issues with the XML Specification

> From: mark.davis@us.ibm.com
> Date: Monday, 10 April 2000 20:59
>
> B. In the context of XML, I believe the corrected formulation
> should be:
>
> 2.a. If there is no BOM as the first codepoint, then "UTF-8",
> "UTF-16BE", "UTF-16LE", "UTF-32BE", and "UTF-32LE" are treated
> just like any other encoding. That is, they must have an XML
> encoding declaration

Not quite. UTF-8 does not need an encoding declaration; it has been the default from day one. I agree with the others: "just like any other encoding", decoding is fully specified by the tag alone, and XML parsers are not required to support them.

> 2.b. If there is no BOM as the first codepoint, then "UTF-16"
> is treated as an alias for "UTF-16BE",

I believe this is in contradiction with the spec. If you say "UTF-16", you MUST have a BOM to tell the endianness. Changing that would be a significant change, for which I don't really see a justification.

> and both "UTF-32" and "UCS-4" are treated as
> equivalent to "UTF-32BE".

This is not currently in the XML spec, but perhaps these semantics could be added to the registrations of "UTF-32" and "UCS-4" as MIME charset tags. Not sure it's a good idea, though. Why not use a BOM or a specific tag?

--
François
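
For illustration only, the two readings of the plain "UTF-16" label discussed in this thread can be sketched in Python. This is a rough sketch, not text from either spec: the function names `resolve_utf16` and `decode_utf16` and the `require_bom` flag are hypothetical. With `require_bom=False` a missing BOM falls back to big-endian (the UTC/RFC reading); with `require_bom=True` a missing BOM is an error (the stricter XML reading).

```python
import codecs


def resolve_utf16(data, require_bom):
    """Return (codec name, BOM bytes to skip) for data labelled plain "UTF-16".

    Hypothetical helper for illustration; not taken from any specification.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le", 2
    if require_bom:
        # Stricter reading: the plain label needs a BOM to give the endianness.
        raise ValueError('"UTF-16" label without a BOM')
    # Lenient reading: no BOM means big-endian.
    return "utf-16-be", 0


def decode_utf16(data, require_bom=False):
    """Decode bytes labelled "UTF-16" under the chosen reading."""
    codec, skip = resolve_utf16(data, require_bom)
    return data[skip:].decode(codec)
```

A parallel treatment of "UTF-32" would presumably do the same with the 4-byte BOMs and a "utf-32-be" fallback.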
Received on Tuesday, 11 April 2000 14:16:08 UTC