Should text/html be parsed as SGML or XML? from Nick Kew on 2001-10-08 (www-validator@w3.org from October 2001)

From: Nick Kew <nick@webthing.com>
Date: Mon, 8 Oct 2001 07:28:16 +0100 (BST)
To: www-validator@w3.org
Message-ID: <Pine.BSF.4.21.0110080639500.1366-100000@fenris.webthing.com>

[ if these URLs get wrapped, you'll need to unwrap them ]

In the course of investigating error reports, I've just looked at
http://validator.w3.org/check?uri=http%3A%2F%2Fwww.w3.org%2FArchitecture%2Fqos.html

contrasted with
http://valet.webthing.com/page/val.cgi?url=http://www.w3.org/Architecture/qos.html

The offending document looks like:

<!doctype html>
<p> [ several mistyped links, but content that would be valid in
      an HTML <body> ]
...


The first error generated is, of course, "no internal or external document
type declaration subset; will parse without validation".  The report that
follows therefore depends on the default SGML declaration used in such
cases.

w3-validator (in common with the WDG validator) generates a longish list
of errors, from which it appears to be checking XML well-formedness.
Yet the page in question is served as text/html, which in my book
(and in those Site Valet tools that don't make this a user option)
should still be parsed as SGML, not XML.

I recollect reading some years ago in what I think was an official
W3C spec (probably for HTML 3.2 or 4.0) that for back-compatibility,
legacy documents should be parsed as HTML 2.0 in the absence of an
FPI.  Am I going senile, or has this been completely abandoned?
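
A minimal sketch of the behaviour described above, purely as an
illustration: the parser is chosen from the served media type, and a
text/html document with no FPI falls back to HTML 2.0.  The function name
choose_parse_mode, the set of XML media types, and the HTML 2.0 fallback
itself are assumptions taken from the recollection above; this is not how
either validator is known to work.

```python
import re

# Matches a doctype declaration that carries a public identifier (FPI),
# e.g. <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
FPI_RE = re.compile(
    r'<!doctype\s+[^\s>]+\s+public\s+(["\'])(.*?)\1',
    re.IGNORECASE | re.DOTALL,
)

# Assumed fallback for legacy documents without an FPI (unverified recollection).
FALLBACK_FPI = "-//IETF//DTD HTML 2.0//EN"

# Assumed list of media types that should get an XML parser.
XML_TYPES = {"text/xml", "application/xml", "application/xhtml+xml"}


def choose_parse_mode(content_type, document):
    """Return (parser, fpi) for a document served with the given Content-Type."""
    # Drop parameters such as "; charset=..." and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()

    if media_type in XML_TYPES or media_type.endswith("+xml"):
        parser = "xml"
    else:
        # text/html (and anything else here) is treated as SGML.
        parser = "sgml"

    match = FPI_RE.search(document)
    fpi = match.group(2) if match else None

    # No FPI in an SGML document: fall back to HTML 2.0 rather than
    # switching to XML well-formedness checking.
    if parser == "sgml" and fpi is None:
        fpi = FALLBACK_FPI

    return parser, fpi


if __name__ == "__main__":
    print(choose_parse_mode("text/html", "<!doctype html>\n<p>some text"))
```

Under these assumptions the qos.html case above (served as text/html,
doctype without an FPI) would be handed to the SGML parser with the
HTML 2.0 DTD.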

-- 
Nick Kew

Site Valet - the essential service for anyone with a website.
<URL:http://valet.webthing.com/>

Received on Monday, 8 October 2001 02:28:54 UTC