RE: Modified DTDs from Nick Kew on 2004-08-06 (www-validator@w3.org from August 2004)

From: Nick Kew <nick@webthing.com>
Date: Fri, 6 Aug 2004 20:28:03 +0100 (BST)
To: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
Cc: clong@itlnet.net, www-validator@w3.org
Message-ID: <Pine.LNX.4.53.0408062011140.970@hugin.webthing.com>

On Fri, 6 Aug 2004, Jukka K. Korpela wrote:

> So the validator is unable to process even the DTD correctly.
> I guess the same happens on the W3C validator, with much worse
> error recovery.

Or rather, errors in the DTD are suppressed in the report.  Bear in
mind that failing to do so can have unfortunate side-effects, such
as confusing users by reporting four warnings in the HTML 4.0 DTD
(corrected in 4.01).

> And in fact
> <http://www.htmlhelp.com/cgi-bin/validate.cgi?
> url=http%3A%2F%2Fwww.billnchimene.com%2Findex.html&warnings=yes&xml=yes>
> tells that the document passes validation.

That's the key observation.  The XML flag causes the parser to deal with
XML syntax (subject to some known limitations).

> Apparently the problem is that a validator needs to be told, or it needs
> to guess, whether it is performing the job of an SGML validator or the job
> of an XML validator.

Indeed.  The HTTP headers tell it that.

> With predefined, catalogued DTDs, they presumably use
> the FPI or the URL to resolve this.

Nope.  Well, yes, there's Appendix C which b*****s up believing the
headers, but that's a specific exception that can be detected by
matching specific strings.
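
As a rough illustration of that decision (a sketch only, not the code of
any actual validator): the media type from the Content-Type header picks
the parse mode, and the Appendix C case is caught by matching the
XHTML 1.0 public identifiers.  The function names and the exact list of
media types here are my assumptions.

```python
# Hypothetical sketch: choose SGML or XML parsing from the HTTP headers,
# with the Appendix C exception detected by matching specific FPI strings.

# Public identifiers of the three XHTML 1.0 DTDs.
XHTML_1_0_FPIS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
}


def media_type(content_type):
    """Strip parameters such as charset and normalise case."""
    return content_type.split(";", 1)[0].strip().lower()


def choose_parse_mode(content_type, doctype_fpi=None):
    """Return "xml" or "sgml" for a document (assumed policy, see above)."""
    mt = media_type(content_type)
    # XML media types: believe the headers.
    if mt in ("application/xml", "text/xml") or mt.endswith("+xml"):
        return "xml"
    # Appendix C: XHTML 1.0 served as text/html is still XML.
    if mt == "text/html" and doctype_fpi in XHTML_1_0_FPIS:
        return "xml"
    # Anything else claimed as HTML is taken on trust as SGML.
    return "sgml"
```

Note that an XML document with a custom DTD served as text/html falls
through to "sgml" under this policy, which is the situation discussed
in this thread.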

> But my analysis might be partly wrong. This is all very confusing, since
> validators, believed to perform a well-defined rigorous check, actually
> play fast and loose and "heuristically".

Appendix C is the spec playing fast and loose, not the validator.
In this case, it simply took on trust that the document was HTML.
Based on that it got a DTD that doesn't parse.

The error recovery adopted here appears to be fallback to a default.
How best to report that is indeed an issue: since you regularly complain
of confusing messages, perhaps you'd like to suggest a fix?

Page Valet does the same, but gives an additional system message alerting
the user to the mismatch between the HTML claim and the XML document.
That's my best stab at the problem, but could also doubtless benefit
from further improvement.
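
For the sake of discussion, a mismatch check of that general kind could
be sketched as below.  This is not Page Valet's actual logic: the
heuristic (an XML declaration or an xmlns attribute on the root element)
and the message wording are assumptions of mine.

```python
import re

# Hypothetical sketch: warn when a document served as text/html looks
# like XML.  Heuristic and message text are illustrative assumptions.


def mismatch_warning(content_type, body):
    """Return a warning string, or None if no mismatch is detected."""
    mt = content_type.split(";", 1)[0].strip().lower()
    if mt != "text/html":
        return None
    # Only inspect the start of the document.
    head = body.lstrip()[:1024]
    has_xml_decl = head.startswith("<?xml")
    has_xmlns = re.search(r"<html[^>]*\sxmlns\s*=", head, re.IGNORECASE)
    if has_xml_decl or has_xmlns:
        return ("Document is served as text/html but appears to be XML; "
                "it was parsed as HTML (SGML).")
    return None
```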

-- 
Nick Kew

Received on Friday, 6 August 2004 15:28:59 UTC