Lack of 'fatal error' tests for invalid encodings from Leif Halvard Silli on 2011-06-09 (public-xml-testsuite@w3.org from June 2011)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Thu, 9 Jun 2011 03:32:43 +0200
To: public-xml-testsuite@w3.org
Cc: John Cowan <cowan@mercury.ccil.org>
Message-ID: <20110609033243875895.0f711adc@xn--mlform-iua.no>
Per a discussion with John Cowan,[1][2] it seems reasonable to 
conclude,

    FIRSTLY, that the XML test suite is lacking many relevant
             encoding tests;
   SECONDLY, that there is a shortage of tests where there is 
             external encoding information (read: HTTP);
    THIRDLY, that, as a result, many testable 'fatal error' situations
             described in XML 1.0 do not have tests.

Practical Questions:

 1) Do I send the test cases that I see as needed directly to this list?
 2) Can we create some specific HTTP tests, online?

Test Descriptions:

All the bugs/tests I have in mind tend to be related to the UTF-8 BOM. 
To illustrate the kind of tests/bugs, here are some bugs in the RXP 
parser:

1) HTTP bugs

* Parsers must obey the charset parameter in the Content-Type: 
  header to the extent that they must ignore the BOM and the 
  encoding declaration when they determine the encoding.
  Thus, if the charset parameter is incorrect, parsers should emit
  a fatal error.

  But RXP simply ignores the HTTP Content-Type: charset parameter.
  Instead RXP treats HTTP served files as if they were files on
  the hard disk. As a result, RXP fails to emit a 'fatal error'
  if for instance HTTP says "ISO-8859-1" when the served document
  is UTF-8 with the BOM.

  (Of course, if the Content-Type header does not have a charset
  parameter, then the charset must be determined as if the document
  were located on the hard disk.)
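
To make the expected behaviour concrete, here is a minimal Python
sketch of the check a test harness could apply to an HTTP response.
It only covers the UTF-8 BOM case discussed here. The names
FatalEncodingError and encoding_from_http are made up for this
illustration and do not belong to RXP or to any other parser.

```python
import codecs
from typing import Optional


class FatalEncodingError(Exception):
    """Hypothetical stand-in for an XML 'fatal error'."""


def encoding_from_http(body: bytes, http_charset: Optional[str]) -> Optional[str]:
    """Return the encoding dictated by HTTP, or None if HTTP is silent.

    Raises FatalEncodingError when the body starts with the UTF-8 BOM
    but the charset parameter names some other encoding.
    """
    if http_charset is None:
        # No external information: the caller falls back to the
        # rules for files (BOM, then encoding declaration).
        return None

    # Normalize the label, e.g. "ISO-8859-1" -> "iso8859-1".
    normalized = codecs.lookup(http_charset).name

    if body.startswith(codecs.BOM_UTF8) and normalized != "utf-8":
        raise FatalEncodingError(
            f"HTTP says {http_charset!r} but the body starts with the UTF-8 BOM"
        )
    return normalized
```

With this sketch, a UTF-8 document with the BOM served as
"ISO-8859-1" raises the error, while the same document served without
a charset parameter returns None and is left to the file rules.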

2) File bugs

* Parsers must emit a 'fatal error' if the BOM disagrees with the 
  XML encoding declaration.

  But for a UTF-8 encoded file with the BOM but which has been
  labeled with <?xml version="1.0" encoding="ISO-8859-5"?>, RXP 
  simply ignores the BOM and reports the file to be encoded as
  ISO-8859-5.
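
A test for this case could be built around a check like the following
Python sketch. It reads the encoding declaration with a simple regular
expression, which is a simplification of the real XMLDecl grammar, and
reports a conflict when a UTF-8 BOM is present but the declaration
names another encoding. The function names are hypothetical.

```python
import codecs
import re
from typing import Optional

# Simplified pattern for the encoding pseudo-attribute of an XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the label in the encoding declaration, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    match = _DECL.match(data)
    return match.group(1).decode("ascii") if match else None


def bom_conflicts_with_declaration(data: bytes) -> bool:
    """True if the file has a UTF-8 BOM but declares a non-UTF-8 encoding."""
    if not data.startswith(codecs.BOM_UTF8):
        return False
    declared = declared_encoding(data)
    if declared is None:
        return False
    try:
        return codecs.lookup(declared).name != "utf-8"
    except LookupError:
        # Label unknown to Python; this sketch treats it as not UTF-8.
        return True
```

For the file described above (UTF-8 BOM plus encoding="ISO-8859-5"),
bom_conflicts_with_declaration returns True, which is the situation
where a 'fatal error' would be expected.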

Conclusion: RXP accepts non-well-formed XML files.

As a matter of fact, errors related to the UTF-8 BOM are common. I have 
so far not found a single parser which emits a 'fatal error' if there 
is a UTF-8 BOM which conflicts with either the charset parameter of the 
Content-Type: header or with the XML encoding declaration. The parsers 
in the common Web browsers tend to obey the BOM and ignore both HTTP 
and the encoding declaration, whereas RXP and the XMLmind editor 
instead seem to ignore the BOM.

The errors are so common that perhaps XML 1.0 should be changed? 
Effectively, that is what I have proposed in bug 12897 against the HTML5 
specification. [3] If XML 1.0 were to change so that these currently 
non-well-formed documents are considered well-formed, then there are two 
choices: the RXP behaviour (= ignoring the BOM) or the behaviour that Web 
browsers show (adhering to the BOM). The most logical thing seems to be 
to change what is in XML's domain, and not to touch the BOM.

Thus, my proposal is that XML parsers MUST ignore the HTTP charset 
parameter as well as the XML encoding declaration *if* the document 
begins with the UTF-8 BOM. It seems to me that this will be more 
fruitful than the current rules, which are broken the one way (RXP) or 
the other (web browsers). There is also already a precedent for 
ignoring the XML encoding declaration. Namely, it must be ignored if 
HTTP says so. And finally, I believe there is a prospect of better 
convergence with HTML, if the UTF-8 BOM always has priority.
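
As a rough Python sketch, the precedence order I am proposing would
look as follows. This is the proposed rule, not what XML 1.0 currently
requires. The function name is made up, and the final UTF-8 fallback is
an assumption of the sketch.

```python
import codecs
from typing import Optional


def resolve_encoding_proposed(body: bytes,
                              http_charset: Optional[str] = None,
                              declared: Optional[str] = None) -> str:
    """Pick an encoding under the proposed 'UTF-8 BOM wins' rule."""
    # 1. A UTF-8 BOM overrides everything else.
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # 2. Otherwise the HTTP charset parameter, if present, decides.
    if http_charset is not None:
        return codecs.lookup(http_charset).name
    # 3. Otherwise the XML encoding declaration decides.
    if declared is not None:
        return codecs.lookup(declared).name
    # 4. Assumed fallback when nothing is stated.
    return "utf-8"
```

Under this rule a document with the UTF-8 BOM is never an error merely
because HTTP or the encoding declaration names another encoding.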

Of course, it is not this mailing list which eventually effects this 
change to XML 1.0 revision 6 ... Nevertheless it is worth keeping these 
things in mind.

[1] 
http://lists.w3.org/Archives/Public/www-international/2011AprJun/0094
[2] 
http://lists.w3.org/Archives/Public/www-international/2011AprJun/0095
[3] http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
-- 
Leif Halvard Silli
Received on Thursday, 9 June 2011 06:19:08 UTC