Encoding heuristics for parse="text"

I've recently implemented some of the encoding heuristics for parse="text" described in the last call working draft (see http://www.ibiblio.org/xml/XInclude/), and I have some notes based on that experience. I am concerned that the treatment of files loaded with parse="text" is still not fully thought out and does not address some very reasonable needs.

1. Encoding heuristics such as a byte order mark are only recognized for documents whose MIME type indicates they're XML. Shouldn't these be recognized for non-XML Unicode documents as well? 

2. The spec allows (perhaps requires?) an XML encoding declaration to specify the encoding. I think an HTML META tag should be allowed as well if the XInclude engine wants to recognize that syntax.

3. Formats other than HTML and XML may also provide in-document means of specifying the encoding. I think processors should be allowed to recognize these as well.

4. I do not think any of these should be required. In particular, I don't think that recognizing the XML encoding declaration should be required. Such a requirement imposes a significant burden on implementers when considered in its full generality.

5. There are many environments in which full MIME types are not available, and yet a document can nonetheless be identified as XML; e.g. when reading from a standard Windows or Unix file system. As with MIME types, misidentification is always possible. However, chances are very good that a *.xml document is in fact an XML document. Processors should be allowed to interpret such a file as XML. (You could argue that this is already allowed by "external encoding information". However, in this case it's a combination of external and internal encoding information that's needed.)

6. You specify that for XML documents, "the encoding is recognized as specified in XML 1.0". However, the appendix of the XML 1.0 specification that really describes this is non-normative. I think a normative spelling out of the rules is needed in XInclude.
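
To make points 1 and 2 concrete, here is a rough Python sketch of the kind of optional sniffing I have in mind. The function names (sniff_bom, sniff_html_meta) and the 1024-byte scan limit are my own illustrative assumptions, not anything taken from the draft, and the META pattern is deliberately loose rather than a real HTML parser.

```python
import codecs
import re

# Byte order marks, longest first: the UTF-32LE mark begins with the
# UTF-16LE mark, so the 4-byte marks must be tested before the 2-byte ones.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is found.

    Works on any byte stream, whether or not it is labeled as XML.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Loose pattern for charset=... inside a <meta> tag (illustrative only).
_META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def sniff_html_meta(data: bytes, limit: int = 1024):
    """Return the charset named in an HTML META tag near the start, or None."""
    match = _META_CHARSET.search(data[:limit])
    return match.group(1).decode("ascii").lower() if match else None
```

An engine that does not care about HTML would simply never call sniff_html_meta; nothing here needs to be mandatory.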

I think there are basically two ways to resolve all this. One is to consider many more options for recognition and spell them out in the spec. The simpler alternative is to describe the possibilities in a non-normative appendix, while explicitly allowing implementations to use any information they can get from anywhere to make better guesses at the proper encoding. Implementations would not be required to support any particular means except the encoding attribute. I prefer the second option.
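
As a sketch of what the second option might look like in an implementation (Python, purely illustrative): only the encoding attribute is required, and every other source of information is an optional plug-in. The names resolve_encoding and utf8_bom_sniffer, the precedence order, and the UTF-8 fallback are all assumptions of mine for the sake of the example; the order in particular is exactly the sort of thing the spec would have to settle.

```python
from typing import Callable, Iterable, Optional

# A sniffer inspects the raw bytes and returns an encoding name or None.
Sniffer = Callable[[bytes], Optional[str]]


def resolve_encoding(
    data: bytes,
    encoding_attr: Optional[str] = None,
    external: Optional[str] = None,
    sniffers: Iterable[Sniffer] = (),
    default: str = "utf-8",
) -> str:
    """Pick an encoding for a parse="text" inclusion.

    Assumed precedence (hypothetical): external information such as a
    MIME charset, then whatever optional heuristics the engine chooses
    to support, then the encoding attribute, then a default.
    """
    if external:
        return external
    for sniff in sniffers:  # optional; an engine may supply none at all
        guess = sniff(data)
        if guess:
            return guess
    if encoding_attr:  # the only mechanism every engine must support
        return encoding_attr
    return default


def utf8_bom_sniffer(data: bytes) -> Optional[str]:
    """Example optional heuristic: recognize a UTF-8 byte order mark."""
    return "utf-8-sig" if data.startswith(b"\xef\xbb\xbf") else None
```

A minimal engine calls resolve_encoding(data, encoding_attr=...) with no sniffers and is still conformant under this scheme; a more ambitious one passes in BOM, XML declaration, META, or file-extension heuristics as it sees fit.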
-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+ 
|          The XML Bible, 2nd Edition (Hungry Minds, 2001)           |
|              http://www.ibiblio.org/xml/books/bible2/              |
|   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      | 
|  Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/     |
+----------------------------------+---------------------------------+

Received on Sunday, 2 September 2001 11:15:05 UTC