- From: Rick Jelliffe <ricko@gate.sinica.edu.tw>
- Date: Wed, 5 Apr 2000 03:43:01 +0800 (CST)
- To: xml-editor@w3.org, yergeau@alis.com
- cc: w3c-i18n-ig@w3.org
On Tue, 4 Apr 2000, Misha Wolf wrote:

> [Autodetection]
> http://www.w3.org/International/Group/issues/xml/Overview.html#charset.autodetection

I really think the new paragraph suggested in E44 for Appendix F puts the cart before the horse and is unacceptable:

  "Note: Since external parsed entities in UTF-16 may begin with any character, this autodetection does not always work. Also, because of the overloaded usage it makes of ASCII-valued bytes, the UTF-7 encoding may fail to be reliably detected."

For the second sentence: this gives the misleading impression that the autodetection rules are completely defined by Appendix F. In fact Appendix F merely gives a nice list of the common cases. UTF-7 can be handled by a smarter routine: as long as the label is present, it can be reliably detected. Rather than say that UTF-7 may be unreliable, it would be better to put in an example of how it can be detected reliably, or to remain silent.

It is not the general algorithm (find signature, read text according to encoding family, parse the text to find the encoding attribute) that is faulty. The problem is that for UTF-7 the last stage (parsing) is not specified in this version of Appendix F. (UTF-7 text can still be parsed as ASCII but using different delimiter recognition, surely.)

For the first sentence: why is it true that external parsed entities in UTF-16 may begin with any character? That is a bug which should be fixed. In the absence of overriding higher-level out-of-band signalling, an XML entity must be required to identify its encoding unambiguously.

The wrong thing to do would be to say "Autodetection is unreliable". It must be reliable, and the rest of XML 1.0 must not contain anything that prevents it from being reliable. To put it another way, if a character encoding cannot reliably be autodetected, it should be banned from use with XML. But I have yet to find any encoding that fits into this category.
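To make the three stages of the general algorithm concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the Appendix F procedure itself: the names SIGNATURES and detect_encoding, the 512-byte prefix, the regular expression, and the choice of cp037 as the EBCDIC reading codec are all hypothetical, and the signature table covers only some of the common cases.

```python
import re

# Stage 1 table: (leading bytes, codec used to read the declaration,
# name returned when no encoding label is found).
# A subset of the common cases only; the fallback names are assumptions.
SIGNATURES = [
    (b"\x00\x00\x00\x3c", "utf-32-be", "utf-32-be"),  # UCS-4, big-endian
    (b"\x3c\x00\x00\x00", "utf-32-le", "utf-32-le"),  # UCS-4, little-endian
    (b"\xfe\xff",         "utf-16-be", "utf-16-be"),  # UTF-16 with BOM
    (b"\xff\xfe",         "utf-16-le", "utf-16-le"),  # UTF-16 with BOM
    (b"\x00\x3c\x00\x3f", "utf-16-be", "utf-16-be"),  # "<?" without BOM
    (b"\x3c\x00\x3f\x00", "utf-16-le", "utf-16-le"),  # "<?" without BOM
    (b"\x3c\x3f\x78\x6d", "ascii",     "utf-8"),      # "<?xm", ASCII family
    (b"\x4c\x6f\xa7\x94", "cp037",     "cp037"),      # "<?xm" in EBCDIC
]

# Stage 3: pull the encoding label out of the XML or text declaration.
DECLARATION = re.compile(
    r"""<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def detect_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the declared encoding label, or a family-level fallback."""
    for signature, read_codec, fallback in SIGNATURES:
        if data.startswith(signature):
            break
    else:
        # No known signature: assume UTF-8 without a declaration.
        return default

    # Stage 2: read a prefix using a representative codec of the family.
    # Undecodable bytes are replaced so a truncated prefix cannot raise.
    head = data[:512].decode(read_codec, errors="replace")

    match = DECLARATION.search(head)
    return match.group(1) if match else fallback
```

For example, detect_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>') returns the label "ISO-8859-1". A UTF-7 entry would need its own stage 3 recogniser, which is exactly the part this sketch leaves out.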
Of course, the wording mooted above is a comment on the particular details of Appendix F. But unless we are careful, people will not see that the non-normative nature of Appendix F serves to make it a reference description of a general approach rather than an exhaustive algorithm that must be implemented in toto by all. I have thought for a long while that Appendix F was a little inadequate in giving specifics while neglecting the principles on which they are based.

Rick Jelliffe
Academia Sinica (W3C Member)
w3c-i18n-ig
w3c-xml-schema-wg
Received on Tuesday, 4 April 2000 15:43:17 UTC