Re: Unable to Determine Parse Mode and other related problems from Jukka K. Korpela on 2013-05-10 (www-validator@w3.org from May 2013)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Fri, 10 May 2013 08:59:25 +0300
To: www-validator@w3.org
Message-ID: <518C8CBD.1090507@cs.tut.fi>
2013-05-10 2:40, Brian Barnett wrote:

> One URL I test gives the “Unable to Determine Parse Mode!” warning as
> well as the “Line 1
> <http://validator.w3.org/check?uri=http%3A%2F%2Ftest1.calcxml.com%2Fcalculators%2Fhome-affordability%3Fskn%3D502&charset=%28detect+automatically%29&doctype=Inline&ss=1&outline=1&group=0&No200=1&verbose=1&st=1&user-agent=W3C_Validator%2F1.3+http%3A%2F%2Fvalidator.w3.org%2Fservices>,
> Column 1: end of document in prolog” while a nearly identical URL (HTML
> nearly identical) succeeds without warnings or errors.

Perhaps the most absurd part of this is that the validator announces at 
the start “Error found while checking this document as HTML 4.01 
Transitional!”, yet later says “Unable to Determine Parse Mode!” There 
is of course only one possible parse mode for HTML 4.01 (SGML parsing). 
The reason is that the validator tries to be informative by identifying 
the “document type” prominently at the start, but that information is 
often absurd.

Anyway, the problem with the page is that there is some character data 
before the <!DOCTYPE ...> thing.
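
For anyone who wants to test for this condition on their own pages, here is a minimal Python sketch. The function name and the sample bytes are made up by me for illustration (the sample assumes an HTML 4.01 Transitional doctype), and the check is deliberately crude: it does not treat comments or a byte order mark before the doctype specially.

```python
import re
from typing import Optional

# Case-insensitive match for the start of a doctype declaration.
DOCTYPE = re.compile(rb"<!DOCTYPE", re.IGNORECASE)

def data_before_doctype(body: bytes) -> Optional[bytes]:
    """Return whatever precedes the first <!DOCTYPE in body.

    Returns None if no doctype is found at all, and b"" if only
    whitespace (or nothing) precedes it.
    """
    match = DOCTYPE.search(body)
    if match is None:
        return None
    prefix = body[:match.start()]
    # Plain whitespace before the doctype is harmless, so report it as empty.
    return prefix if prefix.strip() else b""

# Hypothetical sample resembling the kind of response discussed here.
sample = (
    b'779\r\n<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
    b'"http://www.w3.org/TR/html4/loose.dtd">\r\n<html><head></head></html>'
)

print(data_before_doctype(sample))
```

A non-empty result means that something other than whitespace is sent ahead of the doctype, which is the situation described above.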

> Invalid URL: http://test1.calcxml.com/calculators/home-affordability?skn=502

Technically, the validator is not really saying that the document is 
invalid. Rather, that it is outside the scope of validation: validation 
proper wasn’t even started.

If you View Source in a browser, or save the page locally, you will not 
see anything special at the start. The reason is that browsers silently 
remove the character data before <!DOCTYPE ...>. But if you access the 
page with
http://www.rexswain.com/httpview.html
with “Display Format” set to “Hex”, you will 
see that the server actually sends data that begins in hex with 
“3737390D0A”. In UTF-8, the declared encoding, these bytes mean the 
three-digit string “779” followed by a line break, as Carriage Return, 
Line Feed (U+000D U+000A). How they get there can only be known by 
analyzing what the server is doing.
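
The decoding of those five bytes is easy to verify. A small Python sketch, with variable names chosen by me just for illustration:

```python
# The first five bytes of the response, as shown in the hex view.
raw = bytes.fromhex("3737390D0A")

# Decode them in the declared encoding.
text = raw.decode("utf-8")

# List each character as a Unicode code point.
code_points = [f"U+{ord(ch):04X}" for ch in text]

print(repr(text))
print(code_points)
```

The three digits come out as U+0037, U+0037, U+0039, followed by U+000D and U+000A.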

Strangely, if I submit such a document for validation via File Upload, I 
get a much more understandable error message:

QUOTE
Error Line 1, Column 1: character "7" not allowed in prolog

779
UNQUOTE

But when processing a server response, the validator seems to get confused.

For comparison, validating by URL, http://validator.nu starts its 
message as follows (and http://validator.w3.org/nu/ does the same):

QUOTE
Info: The Content-Type was text/html. Using the HTML parser.

Error: End of file seen without seeing a doctype first. Expected e.g. 
<!DOCTYPE html>.
UNQUOTE

This is a bit more informative, but it raises the question of why the 
validator scans the rest of the document (without parsing any tags) in 
search of a <!DOCTYPE> string and fails to see it even though it clearly 
is there.

> 1. Why is it unable to determine the parse mode?

Because it failed to note the <!DOCTYPE ...> due to character data 
before it. *And* because validator.w3.org, unlike the apparently 
improved validator.w3.org/nu and validator.nu, thinks that the parse 
mode must not be defaulted to SGML parsing even though the media type is 
declared text/html.

> 2. Why does it think the html is empty?

Apparently because the character data at the start confuses it so that 
it reads past the entire document looking for <!DOCTYPE> (which is there 
but goes unnoticed).

> If you browse to the invalid URL in a browser (except IE10 I believe,
> which is why I am trying to figure this out), you will see that the
> invalid URL does render and indeed returns html.

On IE 10, the page looks completely empty, and doing View Source, I see just

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv="Content-Type" content="text/html; charset=utf-8"></HEAD>
<BODY></BODY></HTML>

So apparently IE 10 has an issue with the character data before 
<!DOCTYPE>, rather similar to the issue that the validator has. It 
presumably does not recognize anything in the document – it just looks 
for <!DOCTYPE> without finding one, encounters end of data, thereby 
getting an empty document, and then it constructs HTML markup for it. 
Notice the 4.0 doctype without URL, as opposed to the actual 4.01 
doctype with URL in the document.

Yucca
Received on Friday, 10 May 2013 05:59:54 UTC