RE: I18N issues with the XML Specification

1. UTF-8: Right -- I wrote too hurriedly.

2. UTF-16: I realize that this would be a change for the spec. Neither the
UTC nor the RFC requires a BOM when the designation is UTF-16. If there is
none, the data is assumed to be big-endian. Of course, the XML spec can
impose a further restriction on the use of that designation. The only reason
to do so would be forward compatibility, but that reason may be compelling
enough to require a BOM when there is no "LE" or "BE" suffix.

3. UTF-32/UCS-4: The use of UTF-32 should parallel UTF-16.
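
To make the two readings concrete, here is a minimal sketch in Python. It
is an illustration only, not text from the XML spec, the RFC, or the
Unicode Standard. The helper name resolve_byte_order and the require_bom
flag are hypothetical, and treating "UCS-4" the same as "UTF-32" is an
assumption taken from the proposal quoted below. With require_bom=False an
unsuffixed label without a BOM falls back to big-endian; with
require_bom=True the missing BOM is reported as an error.

```python
import codecs

# BOM byte sequences from the standard library, paired with the
# byte-order-specific label each one implies.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def resolve_byte_order(label, data, require_bom=False):
    """Hypothetical helper: map a declared label and the leading bytes
    of an entity to a label with an explicit byte order."""
    name = label.upper()
    if name == "UCS-4":
        # Assumption: UCS-4 is handled exactly like UTF-32.
        name = "UTF-32"
    if name not in ("UTF-16", "UTF-32"):
        # Suffixed labels and all other encodings are taken as declared.
        return name
    # Only consider BOMs of the declared family, so FF FE under "UTF-16"
    # is never read as the start of the UTF-32LE BOM (FF FE 00 00).
    for bom, resolved in _BOMS:
        if resolved.startswith(name) and data.startswith(bom):
            return resolved
    if require_bom:
        # Stricter reading: an unsuffixed label must come with a BOM.
        raise ValueError(name + " declared without a BOM")
    # Lenient reading: no BOM means big-endian.
    return name + "BE"
```

For example, resolve_byte_order("UTF-16", data) returns "UTF-16BE" for
BOM-less data, while the same call with require_bom=True raises ValueError.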

Mark
___
Mark Davis, IBM Center for Java Technology, Cupertino
(408) 777-5850 [fax: 5891], mark.davis@us.ibm.com, president@unicode.org
http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=10275+N.+De+Anza&csz=95014



François Yergeau <yergeau@alis.com>@w3.org on 2000.04.10 19:51:41

Sent by:  w3c-i18n-wg-request@w3.org


To:   Mark Davis/Cupertino/IBM@IBMUS, "'Tim Bray'" <tbray@textuality.com>
cc:   "'John Cowan'" <jcowan@reutershealth.com>, "'MURATA Makoto'"
      <muraw3c@attglobal.net>, "'Rick Jelliffe'"
      <ricko@gate.sinica.edu.tw>, xml-editor@w3.org, w3c-i18n-ig@w3.org,
      w3c-xml-core-wg@w3.org
Subject:  RE: I18N issues with the XML Specification



> From: mark.davis@us.ibm.com
> Date: Monday, 10 April 2000 20:59
>
> B. In the context of XML, I believe the corrected formulation
> should be:
>
> 2.a. If there is no BOM as the first codepoint, then "UTF-8", "UTF-16BE",
> "UTF-16LE", "UTF-32BE", and "UTF-32LE" are treated just like any other
> encoding. That is, they must have an XML encoding declaration

Not quite.  UTF-8 does not need an encoding declaration; it has been the
default from day one.  I agree with the others: "just like any other
encoding", decoding is fully specified by the tag alone, and XML parsers
are not required to support them.

> 2.b. If there is no BOM as the first codepoint, then "UTF-16" is treated
> as an alias for "UTF-16BE",

I believe this is in contradiction with the spec.  If you say "UTF-16", you
MUST have a BOM to tell the endianness.  Changing that would be a
significant change, for which I don't really see a justification.

> and both "UTF-32" and "UCS-4" are treated as
> equivalent to "UTF-32BE".

This is not currently in the XML spec, but perhaps these semantics could be
added to the registrations of "UTF-32" and "UCS-4" as MIME charset tags.
Not sure it's a good idea, though.  Why not use a BOM or a specific tag?

--
François

Received on Tuesday, 11 April 2000 14:16:08 UTC