Re: I18N issues with the XML Specification from Rick Jelliffe on 2000-04-05 (xml-editor@w3.org from April to June 2000)

From: Rick Jelliffe <ricko@gate.sinica.edu.tw>
Date: Thu, 6 Apr 2000 03:36:28 +0800 (CST)
To: MURATA Makoto <muraw3c@attglobal.net>
cc: xml-editor@w3.org, w3c-i18n-ig@w3.org, w3c-xml-core-wg@w3.org
Message-ID: <Pine.GSO.4.21.0004060308550.21048-100000@gate>

On Wed, 5 Apr 2000, MURATA Makoto wrote:

> In message "Re: I18N issues with the XML Specification",
> Rick Jelliffe wrote...

>  >To put it another way, if a character encoding cannot reliably be
>  >autodetected, it should be banned from being used with XML. But I have
>  >still yet to find any encodings that fit into this category. 
> 
> In RFC 2781 (UTF-16, an encoding of ISO 10646), we have three dialects 
> of UTF-16.  Their charset names are "utf-16", "utf-16le" (BOM-less 
> little endian), and "utf-16be" (BOM-less big endian).

If several encodings share a signature (BOM, etc.), then the XML header
must be specified.  When the XML header is specified, a processor can
always detect
 1) whether it is an encoding it can read, and
 2) whether it is an encoding it can use.

I don't see why there is any need to ban the BOM for UTF-16LE and
UTF-16BE. RFC 2781 puts an unnecessary burden here. But even if
it is banned, that does not make autodetection unreliable.
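
As a rough sketch of the kind of detection being discussed (this is not
the Appendix F algorithm itself; the signature table, the 200-byte
window, and the names SIGNATURES, DECL and sniff are illustrative
assumptions): the first bytes select an encoding family, and the
encoding declaration, if present, is then read within that family.

```python
import re

# Illustrative subset of byte signatures: (prefix, family, bytes to skip).
SIGNATURES = [
    (b"\xfe\xff", "utf-16-be", 2),      # BOM, big endian
    (b"\xff\xfe", "utf-16-le", 2),      # BOM, little endian
    (b"\xef\xbb\xbf", "utf-8", 3),      # UTF-8 signature
    (b"\x00<\x00?", "utf-16-be", 0),    # BOM-less "<?" in big endian
    (b"<\x00?\x00", "utf-16-le", 0),    # BOM-less "<?" in little endian
    (b"<?xm", "ascii", 0),              # ASCII-compatible family
]

# Pulls the encoding name out of an XML or text declaration.
DECL = re.compile(
    r'''^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']'''
)

def sniff(data: bytes):
    """Return (family, declared_encoding); either may be None."""
    for sig, family, skip in SIGNATURES:
        if data.startswith(sig):
            # Decode only a short window; the declaration is ASCII-only.
            head = data[skip:skip + 200].decode(family, errors="replace")
            m = DECL.match(head)
            return family, (m.group(1) if m else None)
    return None, None

if __name__ == "__main__":
    doc = '<?xml version="1.0" encoding="UTF-16LE"?><a/>'.encode("utf-16-le")
    print(sniff(doc))
```

In this sketch a BOM-less UTF-16LE entity that carries a declaration is
still recognised from its "<?" byte pattern. An entity with neither a
BOM nor a declaration matches no signature and yields (None, None),
which is the case the quoted message is worried about.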

> Now, let us suppose that we allow UTF-16LE/BE for XML.  Then, what 
> will happen?  XML document entities, external parameter entities, and 
> external DTD subsets begin with predictable character sequences 
> such as "<?xml".  However, external parsed entities are allowed to 
> begin with *any* character.  Therefore, if the BOM is absent, we 
> cannot reliably detect UTF-16 external parsed entities.

As in my email responding to John Cowan, where did the WG get the idea
that an external parsed entity can begin with any character?  Entity
handling occurs before parsing.  This seems a major change, and one that
is inconsistent with the parsing model of XML.

> If we decide to allow UTF-16LE/BE for XML, we have to publish 
> a new RFC that supersedes RFC 2376, and to publish a new version 
> of XML.  Then, the sentence should be deleted and the autodetection 
> algorithm should be significantly revised so as to handle 
> encoding declarations in UTF-16LE/BE correctly.

Why?  It is just another encoding. Why cannot this be handled merely
by updating Appendix F?

> If we decide to disallow UTF-16LE/BE for XML, we can simply 
> delete the sentence or may want to revise it as below:
> 
> 	When external parsed entities are encoded in UTF-16LE/BE (and thus, 
> 	strictly speaking, in error), this autodetection does not work.

Again, my concern is that the WG is saying "this autodetection" (i.e. the
specific algorithm in Appendix F as far as it goes) but thinking "any
autodetection". I still have not seen any evidence why it is an error
against XML 1.0, strictly speaking, for an external parsed entity to be
encoded in UTF-16LE/BE if it has an encoding declaration (whether or not
it has a BOM).

Rick Jelliffe

Received on Wednesday, 5 April 2000 15:36:47 UTC