Re: I18N issues with the XML Specification from Rick Jelliffe on 2000-04-04 (xml-editor@w3.org from April to June 2000)

From: Rick Jelliffe <ricko@gate.sinica.edu.tw>
Date: Wed, 5 Apr 2000 03:43:01 +0800 (CST)
To: xml-editor@w3.org, yergeau@alis.com
cc: w3c-i18n-ig@w3.org
Message-ID: <Pine.GSO.4.21.0004050304110.26981-100000@gate>

On Tue, 4 Apr 2000, Misha Wolf wrote:

> [Autodetection] http://www.w3.org/International/Group/issues/xml/Overview.html#charset.autodetection

I really think the new paragraph suggested in E44 for Appendix F puts the
cart before the horse and is unacceptable:

"Note: Since external parsed entities in UTF-16 may begin with any
character, this autodetection does not always work. Also, because
of the overloaded usage it makes of ASCII-valued bytes, the UTF-7 encoding
may fail to be reliably detected."

For the second sentence: this gives the misleading impression that the
autodetection rules are completely defined by Appendix F. In fact Appendix
F merely gives a nice list of the common cases.

UTF-7 can be handled by a smarter routine: as long as the label is present
it can be reliably detected. Rather than say that UTF-7 may be unreliable,
it would be better to put in an example of how it can be detected
reliably, or to remain silent. It is not the general algorithm (find
signature, read text according to encoding family, parse the text to
find encoding attribute) that is faulty, it is that for UTF-7 the last
stage (parsing) is not specified in this version of Appendix F. (UTF-7
text can still be parsed as ASCII but using different delimiter
recognition, surely.)
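
A minimal sketch, in Python, of the three stages named above (find
signature, read text according to encoding family, parse the text to find
the encoding attribute). It is an illustration under stated assumptions,
not the Appendix F algorithm itself: the names SIGNATURES, DECLARATION and
detect_encoding are hypothetical, the "+ADw" entry for UTF-7 is an
assumption that is not in Appendix F, and the fallback labels are guesses.

```python
import re

# Stage 1 table: (leading bytes, Python codec for reading the declaration).
# Four-byte signatures come first so FF FE 00 00 is not mistaken for FF FE.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),  # UCS-4 with byte order mark
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\x00\x00\x00<", "utf-32-be"),     # UCS-4, no mark, starts with "<"
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),        # UTF-16, no mark, starts with "<?"
    (b"<\x00?\x00", "utf-16-le"),
    (b"\x4c\x6f\xa7\x94", "cp037"),      # "<?xm" in EBCDIC
    (b"<?xm", "ascii"),                  # any ASCII-compatible encoding
    (b"+ADw", "utf-7"),                  # assumption: "<" written as +ADw
    (b"\xef\xbb\xbf", "utf-8"),          # UTF-8 with byte order mark
    (b"\xfe\xff", "utf-16-be"),          # UTF-16 with byte order mark
    (b"\xff\xfe", "utf-16-le"),
]

# Stage 3 pattern: an optional byte order mark, then "<?xml ... encoding='label'".
DECLARATION = re.compile(
    r"""\ufeff?<\?xml\s[^>]*?encoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1"""
)


def detect_encoding(data, default="utf-8"):
    """Return the declared encoding label, or a guess when none is declared."""
    # Stage 1: find the signature to learn the encoding family.
    family = next(
        (codec for sig, codec in SIGNATURES if data.startswith(sig)), None
    )
    if family is None:
        return default  # no signature and no declaration: assume UTF-8
    # Stage 2: read just enough text using the family's codec.
    text = data[:256].decode(family, errors="replace")
    # Stage 3: parse the declaration for the encoding label.
    match = DECLARATION.match(text)
    if match:
        return match.group(2)
    # No label found: fall back on the family (a guess for EBCDIC and UTF-7).
    return default if family == "ascii" else family
```

With this table, a UTF-7 entity whose declaration is written in direct
characters is caught by the "<?xm" row and identified by its label alone;
the "+ADw" row covers the case where the delimiters are shifted. A UTF-16
entity with neither byte order mark nor declaration matches no row and
falls through to the default.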

Why is it true that external parsed entities in UTF-16 may begin with any
character? That is a bug which should be fixed. In the absence of
overriding higher-level out-of-band signalling, an XML entity must be
required to identify its encoding unambiguously. The wrong thing to do
would be to say "Autodetection is unreliable"--it must be reliable, and
the rest of XML 1.0 must not have anything that prevents it from being
reliable.

To put it another way, if a character encoding cannot reliably be
autodetected, it should be banned from use with XML. But I have
yet to find any encoding that fits into this category.

Of course, the wording mooted above is a comment on the particular
details of Appendix F. But unless we are careful, people will not see
that Appendix F, being non-normative, is a reference description of a
general approach rather than an exhaustive algorithm that must be
implemented in toto by all. I have thought for a long while that
Appendix F is a little inadequate in giving specifics while neglecting
the principles on which they are based.

Rick Jelliffe

Academia Sinica (W3C Member)
w3c-i18n-ig
w3c-xml-schema-wg

Received on Tuesday, 4 April 2000 15:43:17 UTC