Re: HTML entities and the validator... from sierkb@gmx.de on 2011-04-23 (www-validator@w3.org from April 2011)

From: <sierkb@gmx.de>
Date: Sat, 23 Apr 2011 23:09:39 +0200
To: public-qa-dev Dev <public-qa-dev@w3.org>
Cc: www-validator Community <www-validator@w3.org>
Message-Id: <0365BDA1-F578-458B-9B05-1677567ABBF8@gmx.de>

On 23.04.2011 at 20:28, Jukka K. Korpela wrote:
> sierkb@gmx.de wrote:
> 
>> Question: is there a definition anywhere of whether HTML
>> (specifically HTML 4.01) and/or its parent, SGML, allows an entity
>> reference (like &amp;) to be shortened to "&;"?
> 
> It is not.

That is the essence of my question. If it is not allowed and not valid, why does the W3C validator let it pass and report "valid"?

> Non sequitur. "&;" is not a shortened notation of an entity. Whether it is valid is a different question.

I know that "&;" is not an officially accepted shortened notation for any entity, and that entity references cannot be shortened by definition. I wrote "shortened" with this in mind: shortened, so to speak, by the author instead of being written out correctly.

>> If NOT valid, why does
>> the W3C Markup Validator say it is valid when parsing/validating against
>> HTML 4.01 Strict
> 
> In HTML 4.01, by the formal specifications, SGML rules apply,

Yes. That's clear and not in question.

> so an "&" character simply denotes itself when it is not followed by a NAME character, and ";" is not a NAME character.

Again, my question: is the validator right or wrong to let such a "&;" construct pass in HTML 4.01?

> In XHTML, XML rules apply, and XML never allows an "&" character except as the initial character of a character reference or an entity reference.

Yes, if the document is parsed as XML by the XML parser and not as SGML by the SGML parser, which depends on the MIME type acting as a switch. Am I right?
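
To check my understanding of the two rules described above, here is a minimal sketch in Python. It is only an approximation under my own assumptions, not how the validator or any real SGML/XML parser works: the function names and the regular expressions are hypothetical simplifications (for example, only ASCII letters count as name start characters on the SGML side).

```python
import re

# Assumption: simplified SGML-style rule for HTML 4.01. An "&" only starts a
# reference when followed by a letter (entity reference) or by "#" plus a
# digit or letter (character reference); otherwise it is plain data.
SGML_REF_START = re.compile(r"&(?:[A-Za-z]|#[0-9A-Za-z])")

# Assumption: simplified XML rule. Every "&" must begin a complete entity
# reference (&name;) or character reference (&#123; or &#x1F;).
XML_REF = re.compile(r"&(?:[A-Za-z_:][\w.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);")


def literal_ampersands_sgml(text):
    """Positions of "&" treated as plain data under the simplified SGML rule."""
    return [m.start() for m in re.finditer("&", text)
            if not SGML_REF_START.match(text, m.start())]


def bare_ampersands_xml(text):
    """Positions of "&" that do not start a complete reference (XML errors)."""
    return [m.start() for m in re.finditer("&", text)
            if not XML_REF.match(text, m.start())]


if __name__ == "__main__":
    for sample in ("a &; b", "a & ; b", "a &amp; b"):
        print(repr(sample),
              "SGML literal:", literal_ampersands_sgml(sample),
              "XML errors:", bare_ampersands_xml(sample))
```

Under these assumptions, the "&" in "&;" simply denotes itself in the SGML-style check, while the XML-style check flags the same position as an error.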

>> while sticking to the
>> MIME type text/html)?
> 
> The MIME type does not matter here.

As far as I know, the MIME type is essential in deciding whether XHTML 1.0 is parsed by the XML parser or by the SGML parser, the former being stricter than the latter.

> The HTML5 rules more or less reflect the SGML rules. You can see this if you try "& ;" (i.e., ampersand, space, semicolon) - it passes.
> The W3C Markup validator rejects "&;" in HTML5 mode for some reason that I cannot figure out, as I can find no prohibition against it.

So, is "&;" a valid notation in HTML 4.01, XHTML 1.0 and HTML5? Or is it not valid? Or is it not valid but tolerated? If it is valid, why do the validator's results differ? And if it is not valid, why do they differ all the same?
_Is_ it a bug in the validator that it gives different results for "&;" (ampersand, semicolon) when validating against HTML 4.01, XHTML 1.0 or HTML5? Or is it _not_ a bug that the results differ? Or is "&;" not valid, and the validator not buggy but merely tolerant of it? That's the main question.

Regards,
Sierk

Received on Saturday, 23 April 2011 21:10:14 UTC