- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Thu, 9 Jun 2011 03:32:43 +0200
- To: public-xml-testsuite@w3.org
- Cc: John Cowan <cowan@mercury.ccil.org>
Per a discussion with John Cowan,[1][2] it seems reasonable to conclude, FIRSTLY, that the XML test suite is lacking many relevant encoding tests; SECONDLY, that there is a shortage of tests where there is external encoding information (read: HTTP); THIRDLY, that as a result, many testable 'fatal error' situations described in XML 1.0 do not have tests.

Practical Questions:

1) Do I send the test cases that I see as needed directly to this list?
2) Can we create some specific HTTP tests, online?

Test Descriptions:

All the bugs/tests I have in mind tend to be related to the UTF-8 BOM. To illustrate the kind of tests/bugs, here are some bugs in the RXP parser:

1) HTTP bugs

* Parsers must obey the charset parameter in the Content-Type: header to the extent that they must ignore the BOM and the encoding declaration when they determine the encoding. Thus, if the charset parameter is incorrect, parsers should emit a fatal error. But RXP simply ignores the HTTP Content-Type: charset parameter. Instead, RXP treats HTTP-served files as if they were files on the hard disk. As a result, RXP fails to emit a 'fatal error' if, for instance, HTTP says "ISO-8859-1" when the served document is UTF-8 with the BOM. (Of course, if the Content-Type: header does not have a charset parameter, then the encoding must be determined as if the file were located on the hard disk.)

2) File bugs

* Parsers must emit a 'fatal error' if the BOM disagrees with the XML encoding declaration. But for a UTF-8 encoded file with the BOM which has been labeled with <?xml version="1.0" encoding="ISO-8859-5"?>, RXP simply ignores the BOM and reports the file to be encoded as ISO-8859-5.

Conclusion: RXP accepts non-well-formed XML files.

As a matter of fact, errors related to the UTF-8 BOM are common. I have so far not found a single parser which emits a 'fatal error' if there is a UTF-8 BOM which conflicts with either the charset parameter of the Content-Type: header or with the XML encoding declaration.
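
To make the two test situations above concrete, here is a minimal Python sketch of the check a test harness could run before handing a document to a parser. It is an illustration only, not code from RXP or from the test suite: the function name check_encoding_labels, the conflict labels "http-vs-bom" and "decl-vs-bom", and the use of Python's codec registry to compare encoding names are all assumptions made for this example, and the rules it encodes are the ones described in this message.

```python
import codecs
import re

UTF8_BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

# Matches encoding="..." inside an XML declaration at the very start of the data.
_DECL_RE = re.compile(
    rb'^<\?xml[^>]*?\sencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def _normalize(label):
    """Return Python's canonical codec name for a label, or None if unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


def check_encoding_labels(data, http_charset=None):
    """Return a list of conflicts between a UTF-8 BOM and the other labels.

    An empty list means no conflict was found. Labels used (hypothetical):
      "http-vs-bom": the HTTP charset parameter is not UTF-8, but a BOM is present
      "decl-vs-bom": no HTTP charset was given, and the XML encoding declaration
                     is not UTF-8, but a BOM is present
    """
    has_bom = data.startswith(UTF8_BOM)
    if not has_bom:
        return []

    body = data[len(UTF8_BOM):]
    match = _DECL_RE.match(body)
    declared = match.group(1).decode("ascii") if match else None

    conflicts = []
    if http_charset is not None:
        # With external information, only the charset parameter is compared;
        # the encoding declaration is ignored in this case.
        if _normalize(http_charset) != "utf-8":
            conflicts.append("http-vs-bom")
    elif declared is not None and _normalize(declared) != "utf-8":
        conflicts.append("decl-vs-bom")
    return conflicts
```

Under these assumptions, each non-empty result marks a document for which a test would expect the parser to report a 'fatal error'. For example, a file that starts with the UTF-8 BOM and is labeled encoding="ISO-8859-5" yields "decl-vs-bom", and the same bytes served with charset=ISO-8859-1 yield "http-vs-bom".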
The parsers in the common Web browsers tend to obey the BOM and ignore both HTTP and the encoding declaration, whereas RXP and the XMLmind editor instead seem to ignore the BOM.

The errors are so common that perhaps XML 1.0 should be changed? Effectively, that is what I have proposed in bug 12897 against the HTML5 specification. [3]

If XML 1.0 were to change so that these currently non-well-formed documents are considered well-formed, then there are two choices: the RXP behaviour (= ignoring the BOM) or the behaviour that Web browsers show (adhering to the BOM).

The most logical thing seems to be to change what is in XML's domain, and not to touch the BOM. Thus, my proposal is that XML parsers MUST ignore the HTTP charset parameter as well as the XML encoding declaration *if* the document begins with the UTF-8 BOM.

It seems to me that this will be more fruitful than the current rules, which are broken one way (RXP) or the other (Web browsers). There is also already a precedent for ignoring the XML encoding declaration: namely, it must be ignored if HTTP says so. And finally, I believe there is a prospect for better convergence with HTML if the UTF-8 BOM always has priority.

Of course, it is not this mailing list which eventually effects this change to XML 1.0 revision 6 ... Nevertheless, it is worth keeping these things in mind.

[1] http://lists.w3.org/Archives/Public/www-international/2011AprJun/0094
[2] http://lists.w3.org/Archives/Public/www-international/2011AprJun/0095
[3] http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
-- 
Leif Halvard Silli
Received on Thursday, 9 June 2011 06:19:08 UTC