- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Thu, 9 Jun 2011 03:32:43 +0200
- To: public-xml-testsuite@w3.org
- Cc: John Cowan <cowan@mercury.ccil.org>
Per a discussion with John Cowan,[1][2] it seems reasonable to
conclude,
FIRSTLY, that the XML test suite is lacking many relevant
encoding tests
SECONDLY, that there is a shortage of tests where there is
external encoding information (read: HTTP).
THIRDLY, as a result, many testable 'fatal error' situations
described in XML 1.0 do not have tests.
Practical Questions:
1) Do I send the test cases that I see as needed directly to this list?
2) Can we create some specific HTTP tests, online?
Test Descriptions:
All the bugs/tests I have in mind tend to be related to the UTF-8 BOM.
To illustrate the kind of tests/bugs, here are some bugs in the RXP
parser:
1) HTTP bugs
* Parsers must obey the charset parameter in the Content-Type:
header to the extent that they must ignore the BOM and the
encoding declaration when they determine the encoding.
Thus, if the charset parameter is incorrect, parsers should emit
a fatal error.
But RXP simply ignores the HTTP Content-Type: charset parameter.
Instead RXP treats HTTP served files as if they were files on
the hard disk. As a result, RXP fails to emit 'fatal error'
if for instance HTTP says "ISO-8859-1" when the served document
is UTF-8 and with the BOM.
(Of course, if the Content-Type header does not have a charset
parameter, then the charset must be determined as if the file
were located on the hard disk.)
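To make the expected HTTP behaviour concrete, here is a minimal Python
sketch of the check described above. The names check_http_charset and
FatalEncodingError are hypothetical, invented for illustration only;
this is not how RXP or any other parser is implemented, and the sketch
only looks at the UTF-8 BOM, not at other byte order marks.

```python
import codecs


class FatalEncodingError(Exception):
    """Hypothetical stand-in for an XML 'fatal error'."""


def check_http_charset(body, charset):
    """Return the encoding to use for an HTTP-served entity.

    body    -- raw bytes of the entity
    charset -- value of the Content-Type charset parameter, or None
    """
    has_utf8_bom = body.startswith(codecs.BOM_UTF8)
    if charset is None:
        # No external information: fall back to file-style detection
        # (this sketch only recognises the UTF-8 BOM).
        return "utf-8" if has_utf8_bom else None
    try:
        # Normalise the label so that "UTF8" and "utf-8" compare equal.
        name = codecs.lookup(charset).name
    except LookupError:
        raise FatalEncodingError("unknown charset: %r" % charset)
    if has_utf8_bom and name != "utf-8":
        # HTTP claims one encoding, the BOM signals another.
        raise FatalEncodingError(
            "charset %r conflicts with the UTF-8 BOM" % charset)
    return name
```

With this sketch, a document that starts with the UTF-8 BOM but is
served with charset=ISO-8859-1 raises FatalEncodingError, which is the
outcome RXP fails to produce.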
2) File bugs
* Parsers must emit a 'fatal error' if the BOM disagrees with the
XML encoding declaration.
But for a UTF-8 encoded file with the BOM but which has been
labeled with <?xml version="1.0" encoding="ISO-8859-5"?>, RXP
simply ignores the BOM and reports the file to be encoded as
ISO-8859-5.
Conclusion: RXP accepts non-well-formed XML files.
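The file-level check can be sketched in a few lines of Python. This is
an illustration under stated assumptions, not RXP's code: the function
name bom_conflicts_with_declaration is hypothetical, the regular
expression is a simplification of the XML declaration grammar, and only
the UTF-8 BOM is considered.

```python
import codecs
import re

# Simplified match for the encoding pseudo-attribute of an XML
# declaration at the very start of the data (after any BOM).
_ENCODING_DECL = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def bom_conflicts_with_declaration(data):
    """Return True if a UTF-8 BOM is followed by a non-UTF-8 declaration."""
    if not data.startswith(codecs.BOM_UTF8):
        return False  # no UTF-8 BOM, nothing to compare in this sketch
    match = _ENCODING_DECL.match(data[len(codecs.BOM_UTF8):])
    if match is None:
        return False  # no encoding declaration, so no conflict
    declared = match.group(1).decode("ascii")
    try:
        return codecs.lookup(declared).name != "utf-8"
    except LookupError:
        # Unknown label: treated as a conflict in this sketch.
        return True


sample = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="ISO-8859-5"?><a/>'
print(bom_conflicts_with_declaration(sample))
```

For the sample above, which mirrors the RXP case, the function reports
a conflict; a parser following the rule would then emit a fatal error
instead of reporting the file as ISO-8859-5.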
As a matter of fact, errors related to the UTF-8 BOM are common. I have
so far not found a single parser which emits a 'fatal error' if there
is a UTF-8 BOM which conflicts with either the charset parameter of the
Content-Type: header or with the XML encoding declaration. The parsers
in the common Web browsers tend to obey the BOM and ignore both HTTP
and the encoding declaration, whereas RXP and the XMLmind editor
instead seem to ignore the BOM.
The errors are so common that perhaps XML 1.0 should be changed?
Effectively, that is what I have proposed in bug 12897 against the HTML5
specification. [3] If XML 1.0 were to change so that these currently
non-well-formed documents are considered well-formed, then there are two
choices: the RXP behaviour (= ignoring the BOM) or the behaviour that Web
browsers show (adhering to the BOM). The most logical thing seems to be
to change what is in XML's domain, and not to touch the BOM.
Thus, my proposal is that XML parsers MUST ignore the HTTP charset
parameter as well as the XML encoding declaration *if* the document
begins with the UTF-8 BOM. It seems to me that this will be more
fruitful than the current rules, which are broken one way (RXP) or
the other (web browsers). There is also already a precedent for
ignoring the XML encoding declaration: it must be ignored if
HTTP says so. And finally, I believe there is a prospect of better
convergence with HTML if the UTF-8 BOM always has priority.
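The proposed precedence can be written down as a short Python sketch.
The function proposed_encoding and its parameters are hypothetical
names used only for illustration; the final fallback assumes the
XML 1.0 default of UTF-8 for an entity that has neither a BOM nor an
encoding declaration.

```python
import codecs


def proposed_encoding(data, http_charset=None, declared=None):
    """Choose an encoding under the proposed rule: the UTF-8 BOM wins.

    data         -- raw bytes of the document
    http_charset -- Content-Type charset parameter, or None
    declared     -- value of the XML encoding declaration, or None
    """
    if data.startswith(codecs.BOM_UTF8):
        # Proposed change: the BOM overrides both HTTP and the declaration.
        return "utf-8"
    if http_charset is not None:
        # Existing rule: external information overrides the declaration.
        return http_charset
    if declared is not None:
        return declared
    # Assumed default when no other information is present.
    return "utf-8"
```

Under this rule, the two RXP cases described earlier would both be
read as UTF-8 rather than being treated as fatal errors.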
Of course, it is not this mailing list which eventually effects this
change to XML 1.0 revision 6 ... Nevertheless it is worth keeping these
things in mind.
[1]
http://lists.w3.org/Archives/Public/www-international/2011AprJun/0094
[2]
http://lists.w3.org/Archives/Public/www-international/2011AprJun/0095
[3] http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
--
Leif Halvard Silli
Received on Thursday, 9 June 2011 06:19:08 UTC