Re: I18N issues with the XML Specification from Rick Jelliffe on 2000-04-05 (xml-editor@w3.org from April to June 2000)

From: Rick Jelliffe <ricko@gate.sinica.edu.tw>
Date: Thu, 6 Apr 2000 07:00:23 +0800 (CST)
To: yergeau@alis.com
cc: xml-editor@w3.org, w3c-i18n-ig@w3.org
Message-ID: <Pine.GSO.4.21.0004060610100.28656-100000@gate>

On Wed, 5 Apr 2000, John Cowan wrote:

> In principle and as XML 1.0 is written, there might be an encoding
> named "UTF-+ADc-" in which case there would be no straightforward
> way of discriminating between it and UTF-7 to a processor which understood
> both.

I think we can afford to wait for this problem to arise. (And in any case,
I think "+" is not an allowed character in a MIME Content-Type Header
Field according to RFC 2045, so the problem would only occur if someone
made an encoding called UTF-n where "n" is any character allowed by MIME
except 7 and where that encoding codes "n" as "+ADc-". I would be
surprised if IANA would let anyone but Unicode/ISO register another
UTF-n, and I would be most surprised if such a UTF-n had that property.
In fact, if XML had propagated a method relying on UTF-+ADc- meaning
UTF-7, I would think it most improbable that IANA would register
another encoding under the naughty name.)
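
To make the ambiguity concrete, here is a minimal sketch using Python's
built-in "utf-7" codec. It assumes only that a UTF-7 encoder is free to
shift-encode the digit "7", which the encoding permits. The variable
names are illustrative.

```python
# The bytes a processor sees when the "7" in the encoding name has been
# shift-encoded by a UTF-7 encoder (permitted, though unusual).
raw_name = b"UTF-+ADc-"

# Read as ASCII, the name is taken literally.
as_ascii = raw_name.decode("ascii")

# Read as UTF-7, "+ADc-" is a base64 shift sequence for U+0037 ("7").
as_utf7 = raw_name.decode("utf-7")

print(as_ascii)
print(as_utf7)
```

A processor that knew both UTF-7 and a hypothetical encoding literally
named "UTF-+ADc-" could not tell from these bytes alone which was meant.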

> The meaning is that the procedure *of Appendix F* does not reliably
> detect UTF-7.

Yes, that is why I would like better wording. A sentence like

"Limitations: An implementation of autodetection which follows the
algorithm given in this appendix will fail to detect the encoding of a
UTF-7 entity if its XML header contains encoded characters. The
autodetection algorithm given in this appendix may be enhanced to cope
with this and with other rarer or anomalous encodings."

would be fine if the WG does not want to spell out the +ADc- case.
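
One possible enhancement of the kind that sentence allows might look like
the following. This is a sketch, not the Appendix F algorithm: the
function name and regular expression are hypothetical, and it only
handles the case where the encoding name itself carries shift-encoded
characters while the rest of the XML header is still readable as ASCII.

```python
import re

# Hypothetical pattern: finds encoding="..." in an ASCII-readable header.
_ENCODING_RE = re.compile(rb"""<\?xml[^>]*?encoding\s*=\s*(["'])(.*?)\1""")


def declared_encoding(head: bytes):
    """Return the declared encoding name, or None if none is found."""
    match = _ENCODING_RE.search(head)
    if match is None:
        return None
    raw = match.group(2)
    name = raw.decode("ascii", errors="replace")
    if "+" in name:
        # "+" opens a UTF-7 shift sequence, so try reading the name
        # as UTF-7 before taking it literally.
        try:
            unshifted = raw.decode("utf-7")
        except UnicodeDecodeError:
            return name
        if unshifted.upper() == "UTF-7":
            return "UTF-7"
    return name
```

Called on b'<?xml version="1.0" encoding="UTF-+ADc-"?>' this reports
UTF-7, while ordinary names such as UTF-8 pass through unchanged.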

(I don't think we need to put in a warning about sending an external
parsed entity in UTF-8 if it starts with an XML header for UTF-7 encoded
as UTF-7.  If one of our i18n hopes is that everything will converge 
towards UTF-*, then I think we have to be scrupulous to avoid giving
developers the idea that anything to do with character encodings is worse
or more difficult than it is. We need to foster a can-do, "I can do that"
attitude; developers will run a mile if they think things are too hard,
and the direction they run might not be towards UTF-*. If they think the
infrastructure is broken, they won't use it. And this is one part of the
infrastructure that is proving itself not broken AFAIK.)

> That is true of Appendix F autodetection, which is explicitly described
> as non-normative.  The most that is said is that autodetection is
> "not entirely hopeless".

I think this should be removed. If there is no known problem with entities
that have explicit XML headers, then autodetection is not "not entirely
hopeless" but "entirely satisfactory".  (I take that phrase as rhetorical 
rather than descriptive; a palliative to prevent panic and depression by
readers who might come anticipating problems rather than solutions.)

Rick Jelliffe

Received on Wednesday, 5 April 2000 19:00:53 UTC