- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Fri, 10 May 2013 08:59:25 +0300
- To: www-validator@w3.org
2013-05-10 2:40, Brian Barnett wrote:

> One URL I test gives the “Unable to Determine Parse Mode!” warning as
> well as the
> “Line 1 <http://validator.w3.org/check?uri=http%3A%2F%2Ftest1.calcxml.com%2Fcalculators%2Fhome-affordability%3Fskn%3D502&charset=%28detect+automatically%29&doctype=Inline&ss=1&outline=1&group=0&No200=1&verbose=1&st=1&user-agent=W3C_Validator%2F1.3+http%3A%2F%2Fvalidator.w3.org%2Fservices>,
> Column 1: end of document in prolog” while a nearly identical URL (HTML
> nearly identical) succeeds without warnings or errors.

Perhaps the most absurd part of this is that the validator announces at the start “Error found while checking this document as HTML 4.01 Transitional!”, yet later says “Unable to Determine Parse Mode!” There is of course only one possible parse mode for HTML 4.01 (SGML parsing). The reason is that the validator tries to be informative by identifying the “document type” prominently at the start, but that information is often absurd.

Anyway, the problem with the page is that there is some character data before the <!DOCTYPE ...> thing.

> Invalid URL: http://test1.calcxml.com/calculators/home-affordability?skn=502

Technically, the validator is not really saying that the document is invalid. Rather, that it is outside the scope of validation: validation proper wasn’t even started.

If you View Source in a browser, or save the page locally, you will not see anything special at the start. The reason is that browsers silently remove the character data before <!DOCTYPE ...>. But if you access the page with http://www.rexswain.com/httpview.html so that on that page, “Display Format” has been set to “Hex”, you will see that the server actually sends data that begins in hex with “3737390D0A”. In UTF-8, the declared encoding, these mean the three-digit string “779” followed by a line break, as Carriage Return, Line Feed (U+000D U+000A). How they get there can only be known by analyzing what the server is doing.
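To check the hex reading above without an online viewer, a minimal Python sketch along these lines may help. The hex string is the one quoted above; the function names and the sample body are hypothetical, made up for illustration and not taken from the actual server response.

```python
# Hex prefix as quoted in the message (shown by the HTTP viewer in "Hex" mode).
LEADING_HEX = "3737390D0A"


def decode_prefix(hex_string: str, encoding: str = "utf-8") -> str:
    """Turn a hex dump into text, using the declared encoding."""
    return bytes.fromhex(hex_string).decode(encoding)


def data_before_doctype(body: bytes) -> bytes:
    """Return the bytes that precede the first <!DOCTYPE (case-insensitive).

    If no <!DOCTYPE is present at all, the whole body is returned.
    """
    index = body.upper().find(b"<!DOCTYPE")
    return body if index == -1 else body[:index]


# Hypothetical response body: the stray prefix followed by a doctype.
sample = bytes.fromhex(LEADING_HEX) + (
    b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'
)

print(repr(decode_prefix(LEADING_HEX)))  # '779\r\n'
print(data_before_doctype(sample))       # b'779\r\n'
```

Any non-empty result from data_before_doctype on the raw bytes of a response would point to the kind of stray character data described here. It says nothing about where that data comes from.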
Strangely, if I submit such a document to validation via File Upload, I get a much more understandable error message.

QUOTE
Error Line 1, Column 1: character "7" not allowed in prolog
779
UNQUOTE

But when processing a server response, the validator seems to get confused.

For comparison, validating by URL, http://validator.nu starts its message as follows (and http://validator.w3.org/nu/ does the same):

QUOTE
Info: The Content-Type was text/html. Using the HTML parser.
Error: End of file seen without seeing a doctype first. Expected e.g. <!DOCTYPE html>.
UNQUOTE

This is a bit more informative, but it raises the question why the validator scans the rest of the document (without parsing any tags) in search of a <!DOCTYPE> string, without seeing it while it clearly is there.

> 1. Why is it unable to determine the parse mode?

Because it failed to note the <!DOCTYPE ...> due to character data before it. *And* because validator.w3.org, unlike the apparently improved validator.w3.org/nu and validator.nu, thinks that the parse mode must not be defaulted to SGML parsing even though the media type is declared text/html.

> 2. Why does it think the html is empty?

Apparently because the character data at the start confuses it so that it reads past the entire document looking for <!DOCTYPE> (which is there but goes unnoticed).

> If you browse to the invalid URL in a browser (except IE10 I believe,
> which is why I am trying to figure this out), you will see that the
> invalid URL does render and indeed returns html.

On IE 10, the page looks completely empty, and doing View Source, I see just

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv="Content-Type" content="text/html; charset=utf-8"></HEAD>
<BODY></BODY></HTML>

So apparently IE 10 has an issue with the character data before <!DOCTYPE>, rather similar to the issue that the validator has.
It presumably does not recognize anything in the document; it just looks for <!DOCTYPE> without finding one, encounters end of data, thereby getting an empty document, and then it constructs HTML markup for it. Notice the 4.0 doctype without a URL, as opposed to the actual 4.01 doctype with a URL in the document.

Yucca
Received on Friday, 10 May 2013 05:59:54 UTC