Re: HTML entities and the validator... from Jukka K. Korpela on 2011-04-24 (www-validator@w3.org from April 2011)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Sun, 24 Apr 2011 07:45:07 +0300
To: "public-qa-dev Dev" <public-qa-dev@w3.org>
Cc: "www-validator Community" <www-validator@w3.org>
Message-ID: <21A7B2C78E0F4B469AFE6CF67D913104@JukanPC>

sierkb@gmx.de wrote:

> On 23.04.2011 at 20:28, Jukka K. Korpela wrote:
>> sierkb@gmx.de wrote:
>>
>>> Question: is there, by any means, anywhere, a definition of whether in
>>> HTML (concretely: HTML 4.01) and/or its parent, SGML, it's allowed and
>>> valid to shorten an entity (like &amp;) to "&;"
>>
>> It is not.
>
> That is the essence of my question. If it's not allowed and
> not valid, why does the W3C validator let it pass and say "valid"?

You asked whether it is allowed and valid to shorten an entity to "&;". 
It is not, just as it is not allowed to shorten the entity (more correctly, 
entity reference) "&amp;" to ";" or "amp" or "a". This does not imply that 
any of those strings is forbidden as such.

>> In HTML 4.01, by the formal specifications, SGML rules apply,
>
> Yes. That's clear and not in question.
>
>> so an "&" character simply denotes itself when it is not followed by
>> a NAME character, and ";" is not a NAME character.
>
> Again my question: is the validator correct or wrong in letting
> such a "&;" construct pass for HTML 4.01?

Correct, as there is nothing in SGML rules that would disallow it. The 
statement in HTML 4.01 spec that says that authors should use "&amp;" 
instead of "&" in text and in attribute values is not part of the formalized 
syntax that determines what is valid. (And it is not even a prose 
requirement, just a recommendation; "should", not "shall".)
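
As a rough illustration of the recognition rule discussed above, here is a 
minimal Python sketch, not a full SGML parser. It assumes that the characters 
which can start a name are the ASCII letters, and that "&#" starts a character 
reference only when followed by a digit or a letter. The function name 
ampersand_roles is hypothetical and exists only for this example.

```python
import string

# Assumption: a name can only start with an ASCII letter. This is a
# simplification for illustration, not a complete SGML implementation.
NAME_START = set(string.ascii_letters)
DIGITS = set(string.digits)


def ampersand_roles(text):
    """Classify each "&" in text as 'reference' or 'data'.

    Returns a list of (index, role) pairs. An "&" counts as 'reference'
    when it would be recognized as opening an entity or character
    reference under the simplified rule above; otherwise it is plain data.
    """
    roles = []
    for i, ch in enumerate(text):
        if ch != "&":
            continue
        nxt = text[i + 1] if i + 1 < len(text) else ""
        after = text[i + 2] if i + 2 < len(text) else ""
        if nxt in NAME_START:
            # "&" followed by a letter: treated as an entity reference.
            roles.append((i, "reference"))
        elif nxt == "#" and (after in DIGITS or after in NAME_START):
            # "&#" followed by a digit or letter: character reference.
            roles.append((i, "reference"))
        else:
            # Anything else, including "&;", leaves "&" as itself.
            roles.append((i, "data"))
    return roles


print(ampersand_roles("a &; b"))   # [(2, 'data')]
print(ampersand_roles("&amp;"))    # [(0, 'reference')]
```

Under this simplified rule, the "&" in "&;" is just a data character, whereas 
the "&" in "AT&T" would be taken as starting a reference to an entity named "T".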

> Yes. If parsed as XML by the XML parser and not as SGML by the SGML
> parser, depending on the MIME type, which acts as a switch. Am
> I right?

Validity does not depend on MIME types. If you have an XHTML document, then 
its validity is decided by XML rules and the document type definition, 
without any SGML rules stepping in. Serving an XHTML 1.0 document as 
text/html may well make browsers process it as if it were legacy tag-soup 
HTML, but that's an entirely different thing.

>> The W3C Markup validator rejects "&;" in HTML5 mode for some reason
>> that I cannot figure out, as I can find no prohibition against it.
>
> So, "&;" is a valid notation in HTML 4.01, XHTML 1.0 and HTML 5? Or
> not valid? Or is it not valid but tolerated? If valid, then why does
> the validator differ in the results? And if not valid, why does it
> still differ in the results?

I'm not sure whether it helps to repeat the answers, but "&;" is valid HTML 
4.01 (not a "notation" really, just two characters), invalid in XHTML 1.0 
(because "&" is only allowed as beginning an entity reference or a character 
reference), and presumably "valid" in HTML5 but mistakenly rejected by the 
experimental HTML5 checker built into W3C Markup Validator. Regarding HTML5, 
I'm not sure about the status, as I last checked it yesterday. And as there 
is no formalized description of HTML5 syntax, there is no concept of "valid" 
in the same sense as with SGML and XML.
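
The XML side of this can be checked with any conforming XML parser. The 
following minimal sketch uses Python's standard xml.etree.ElementTree module; 
the helper name is_well_formed is hypothetical. Note that it tests only 
well-formedness, not validity against the XHTML 1.0 DTD, but a document that 
is not well-formed cannot be valid either.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if fragment parses as well-formed XML, else False."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        # The parser rejects a bare "&" that does not begin an
        # entity reference or a character reference.
        return False


print(is_well_formed("<p>a &; b</p>"))      # False
print(is_well_formed("<p>a &amp;; b</p>"))  # True
```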

> _Is_ it a bug of the validator to present different results in
> handling "&;" (ampersand, semicolon), when validating against HTML
> 4.01, XHTML 1.0 or HTML5?

Why would it be? The specifications differ.

> Or is it not valid and no bug of the validator but tolerated
> by the validator? That's the main question.

That's an odd question. Regarding SGML and XML validation, it is a bug in a 
validator if it does not report a markup error that violates the syntax 
(general SGML/XML rules or DTD rules). There is no such thing as 
"tolerating" such errors.

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/

Received on Sunday, 24 April 2011 04:45:42 UTC