RE: I18N issues with the XML Specification from François Yergeau on 2000-04-11 (xml-editor@w3.org from April to June 2000)

From: François Yergeau <yergeau@alis.com>
Date: Mon, 10 Apr 2000 22:51:41 -0400
To: mark.davis@us.ibm.com, "'Tim Bray'" <tbray@textuality.com>
Cc: "'John Cowan'" <jcowan@reutershealth.com>, "'MURATA Makoto'" <muraw3c@attglobal.net>, "'Rick Jelliffe'" <ricko@gate.sinica.edu.tw>, xml-editor@w3.org, w3c-i18n-ig@w3.org, w3c-xml-core-wg@w3.org
Message-id: <000e01bfa360$def50250$f46efdcf@fyergeau2.intra.alis.com>

> From: mark.davis@us.ibm.com
> Date: Monday 10 April 2000 20:59
>
> B. In the context of XML, I believe the corrected formulation
> should be:
>
> 2.a. If there is no BOM as the first codepoint, then "UTF-8", "UTF-16BE",
> "UTF-16LE", "UTF-32BE", and "UTF-32LE" are treated just like any other
> encoding. That is, they must have an XML encoding declaration

Not quite.  UTF-8 does not need an encoding declaration; it has been the
default from day one.  I agree with the others: "just like any other
encoding", decoding is fully specified by the tag alone, and XML parsers are
not required to support them.
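
To make the no-BOM case concrete, here is a minimal Python sketch.  It
assumes the encoding declaration has already been read and is passed in as a
string (or None when absent); the function names are hypothetical, and the
set of mandatory encodings reflects my reading of XML 1.0 rather than
anything new.

```python
from typing import Optional

# Encodings every XML processor must be able to read (XML 1.0, section 4.3.3).
MANDATORY = {"UTF-8", "UTF-16"}


def label_without_bom(declared: Optional[str]) -> str:
    """Pick the encoding label for an entity that has no BOM.

    `declared` is the value of the XML encoding declaration, or None
    if the entity has no declaration (hypothetical interface).
    """
    if declared is None:
        # No BOM and no declaration: UTF-8 is the default.
        return "UTF-8"
    # Any declared label is taken at face value, like any other encoding.
    return declared.upper()


def is_mandatory(label: str) -> bool:
    """True if a conforming parser is required to support this label."""
    return label.upper() in MANDATORY
```

With this, label_without_bom(None) yields "UTF-8", while a declared
"UTF-16LE" is simply used as given, and is_mandatory("UTF-16LE") is false.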

> 2.b. If there is no BOM as the first codepoint, then "UTF-16" is treated as
> an alias for "UTF-16BE",

I believe this contradicts the spec.  If you say "UTF-16", you MUST have a
BOM to indicate the endianness.  Changing that would be a significant
change, for which I don't really see a justification.
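
As an illustration of the current rule, a rough Python sketch follows.  The
function name is hypothetical; it assumes the entity is already known to be
labelled "UTF-16" and only inspects the first two bytes.

```python
import codecs


def utf16_byte_order(data: bytes) -> str:
    """Resolve the byte order of an entity labelled "UTF-16".

    Returns "UTF-16BE" or "UTF-16LE" according to the BOM, and raises
    ValueError when there is no BOM, instead of guessing big-endian.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "UTF-16BE"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "UTF-16LE"
    raise ValueError('entity labelled "UTF-16" has no BOM')
```

Under the proposed 2.b, the final raise would instead become a return of
"UTF-16BE", which is exactly the change in behaviour I am questioning.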

> and both "UTF-32" and "UCS-4" are treated as
> equivalent to "UTF-32BE".

This is not currently in the XML spec, but perhaps these semantics could be
added to the registrations of "UTF-32" and "UCS-4" as MIME charset tags.
Not sure it's a good idea, though.  Why not use a BOM or a specific tag?

--
François

Received on Monday, 10 April 2000 23:16:48 UTC