- From: Rick Jelliffe <ricko@gate.sinica.edu.tw>
- Date: Thu, 6 Apr 2000 03:36:28 +0800 (CST)
- To: MURATA Makoto <muraw3c@attglobal.net>
- cc: xml-editor@w3.org, w3c-i18n-ig@w3.org, w3c-xml-core-wg@w3.org
On Wed, 5 Apr 2000, MURATA Makoto wrote:

> In message "Re: I18N issues with the XML Specification",
> Rick Jelliffe wrote...
> >To put it another way, if a character encoding cannot reliably be
> >autodetected, it should be banned from being used with XML. But I have
> >still yet to find any encodings that fit into this category.
>
> In RFC 2781 (UTF-16, an encoding of ISO 10646), we have three dialects
> of UTF-16. Their charset names are "utf-16", "utf-16le" (BOM-less
> little endian), and "utf-16be" (BOM-less big endian).

If several encodings share a signature (BOM, etc.), then the XML header
must be specified. When the XML header is specified, a processor can
always determine 1) whether it is an encoding it can read, and 2) whether
it is an encoding it can use.

I don't see why there is any need to ban the BOM for UTF-16LE and
UTF-16BE. RFC 2781 imposes an unnecessary burden here. But even if the
BOM is banned, that does not make autodetection unreliable.

> Now, let us suppose that we allow UTF-16LE/BE for XML. Then, what
> will happen? XML document entities, external parameter entities, and
> external DTD subsets begin with predictable character sequences
> such as "<?xml". However, external parsed entities are allowed to
> begin with *any* character. Therefore, if the BOM is absent, we
> cannot reliably detect UTF-16 external parsed entities.

As in my email responding to John Cowan: where did the WG get the idea
that an external parsed entity can begin with any character? Entity
handling occurs before parsing. This seems a major change, and one that
is inconsistent with the parsing model of XML.

> If we decide to allow UTF-16LE/BE for XML, we have to publish
> a new RFC that supersedes RFC 2376, and to publish a new version
> of XML. Then, the sentence should be deleted and the autodetection
> algorithm should be significantly revised so as to handle
> encoding declarations in UTF-16LE/BE correctly.

Why? It is just another encoding. Why can this not be handled merely by
updating Appendix F?
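To make the argument concrete, here is a rough sketch of the kind of detection I have in mind: look at the leading bytes to pick an encoding family, then read the encoding declaration in that family. It is an illustration only, not the Appendix F algorithm itself. The name `sniff_encoding`, the 256-byte window, and the simplified declaration pattern are all assumptions made for this example, and UCS-4 and EBCDIC are left out.

```python
import re

# Simplified pattern for an encoding declaration (an assumption of this
# sketch; a real processor would parse the declaration properly).
_DECL = re.compile(
    r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

# (leading bytes, family label, codec used to read the header, bytes to skip)
_SIGNATURES = [
    (b"\xef\xbb\xbf", "utf-8", "utf-8", 3),          # UTF-8 BOM
    (b"\xfe\xff", "utf-16-be", "utf-16-be", 2),      # big-endian BOM
    (b"\xff\xfe", "utf-16-le", "utf-16-le", 2),      # little-endian BOM
    (b"\x00<\x00?", "utf-16-be", "utf-16-be", 0),    # BOM-less big-endian "<?"
    (b"<\x00?\x00", "utf-16-le", "utf-16-le", 0),    # BOM-less little-endian "<?"
    (b"<?xm", "ascii-compatible", "latin-1", 0),     # header readable as ASCII
]


def sniff_encoding(data):
    """Return (family, declared_name_or_None) for the start of an entity."""
    for prefix, family, codec, skip in _SIGNATURES:
        if data.startswith(prefix):
            # Decode only the first few bytes, enough to read the header.
            head = data[skip:skip + 256].decode(codec, errors="replace")
            match = _DECL.match(head)
            return family, (match.group(1) if match else None)
    # No BOM and no "<?xml": this sketch falls back to UTF-8, the default
    # XML 1.0 assumes when there is no other information.
    return "utf-8", None
```

With this, a BOM-less little-endian entity that carries `encoding="UTF-16LE"` in its header is identified from its first four bytes and its declared name is then read normally. The case the WG is worried about, an entity with neither a BOM nor a header, reaches the fallback line.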
> If we decide to disallow UTF-16LE/BE for XML, we can simply
> delete the sentence or may want to revise it as below:
>
> When external parsed entities are encoded in UTF-16LE/BE (and thus,
> strictly speaking, in error), this autodetection does not work.

Again, my concern is that the WG is saying "this autodetection" (i.e. the
specific algorithm in Appendix F, as far as it goes) but thinking "any
autodetection".

I still have not seen any evidence why it is, strictly speaking, an error
against XML 1.0 for an external parsed entity to be encoded in
UTF-16LE/BE if it has an encoding declaration (whether or not it has a
BOM).

Rick Jelliffe
Received on Wednesday, 5 April 2000 15:36:47 UTC