Re: HTML entities and the validator... from sierkb@gmx.de on 2011-04-24 (public-qa-dev@w3.org from April 2011)

From: <sierkb@gmx.de>
Date: Sun, 24 Apr 2011 02:33:06 +0200
To: peasthope@shaw.ca
Cc: public-qa-dev@w3.org
Message-Id: <ECB9670A-289C-4571-9276-B7406EB62EB7@gmx.de>
On 24.04.2011 at 02:09, peasthope@shaw.ca wrote:
> I may not understand properly but might have something helpful.
> [..]
> According to RFCs 2396 and 3986, "&" must be % encoded when 
> not serving in a reserved role in a URI.  So I wonder whether 
> your error message from the validator is a consequence of the 
> "&" not being % escaped.
> 
> Ref. http://en.wikipedia.org/wiki/Uniform_Resource_Identifier


Thanks for the hint, Peter, but I don't think it addresses what my question is about.
My problem _is not_ that an error message _occurs_. My problem is understanding why an error message _does NOT_ occur when the document is parsed as HTML 4.01 but, in contrast, _does_ occur when it is parsed as XHTML 1.0 (with content type 'text/html') or HTML5 (with content type 'text/html').
Either the character sequence "&;" is valid or it is not. Or the validator is at least tolerant of it, but then it is inconsistent in how it handles it. It cannot be both at the same time: valid (or at least tolerated) when parsed against the HTML 4.01 DTD, yet invalid, with an error message, when parsed against the XHTML 1.0 DTD (as far as I know, through the SGML parser when the content type is "text/html" rather than "application/xhtml+xml", "application/xml" or "text/xml") or against HTML5.

So why do the results differ? As far as I know, when XHTML 1.0 is served as "text/html", it is parsed NOT by the XML parser but by the SGML parser, the very same SGML parser that parses HTML 4.01.
Is this difference correct, and am I just too blind and obtuse to see the explanation? Or does the validator have a bug somewhere that prevents it from either reporting an error consistently in all three cases or letting "&;" pass consistently in all three?
So what is correct, bottom line? Letting "&;" pass and reporting "valid/passed/green"? Or rejecting "&;" and reporting "not valid/error/red"?
My confusion and surprise come from the fact that the character string "&;" produces "not valid/error/red" twice and "valid/passed/green" once! Why not a consistent result all three times, either "valid/passed/green" or "not valid/error/red"? What am I missing? Or might there in fact be a bug in the validator's behavior that this exposes?

I know that the HTML5 parser is a separate parser and may be considered separately. But regarding HTML 4 and XHTML 1.0: as far as I know (please correct me if I'm wrong), HTML 4.01 and XHTML 1.0 are parsed by the same SGML parser when, and only when, the XHTML code has the content type text/html (when it is served with the recommended XHTML MIME type application/xhtml+xml, or with application/xml or text/xml, the much more rigorous XML parser is used). So in this case, for the character string "&;", one and the same parser gives two different results: "valid/passed/green" in one case and "not valid/error/red" in the other. Is this inconsistency intentional and correct, or is it perhaps a bug in the validator? My opinion so far is that letting the character string "&;" pass with "valid/passed/green" _should not occur at all_! But it does occur. Once. And twice it does not. And one of those two rejections comes from the very same parser.
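
To make the "much more rigorous XML parser" case concrete, here is a minimal sketch in Python. It assumes that the standard library's expat-based xml.dom.minidom is an acceptable stand-in for a generic XML parser; it is not the validator's own parser, and it shows nothing about how the SGML parser treats "&;". The helper name xml_accepts and the sample markup are hypothetical, chosen only for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def xml_accepts(markup: str) -> bool:
    """Return True if a strict XML parser considers the markup well-formed.

    Hypothetical helper: minidom/expat stands in for "an XML parser" in
    general, not for the parser the validator actually uses.
    """
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        # Raised for any well-formedness error, e.g. a bare "&" that does
        # not start a complete entity or character reference.
        return False


if __name__ == "__main__":
    # "&;" is an ampersand with no entity name before the semicolon.
    print("with '&;'   :", xml_accepts("<p>a &; b</p>"))
    # "&amp;" is the properly escaped form of a literal ampersand.
    print("with '&amp;':", xml_accepts("<p>a &amp; b</p>"))
```

In XML an entity reference needs a name between "&" and ";", so the rejection in the first case comes from the well-formedness rules and does not depend on which DTD is in use.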

Do you see what I mean?

Regards,
Sierk
Received on Sunday, 24 April 2011 00:33:43 UTC