WWW-Validator Bug (response to private mail on other topic)

From: Earl Hood (ehood@hydra.acs.uci.edu)
Date: Wed, Jun 16 1999


Message-Id: <199906160906.CAA03110@geneva.acs.uci.edu>
To: www-validator@w3.org
Date: Wed, 16 Jun 1999 02:06:03 -0700
From: Earl Hood <ehood@hydra.acs.uci.edu>
Subject: WWW-Validator Bug (response to private mail on other topic)

On June 15, 1999 at 23:57, someone wrote:

> Wow, I would have never thought to question their
> validator.

Not your fault. You need either a background in SGML or a very
careful reading of the HTML 4.0 spec.

I just checked the source (cgi-bin/check) of their validator, and I
spotted their error.  The bug is in the check_for_doctype() function.  The
check for a doctype declaration is not robust enough to deal with
leading comment declarations that contain tag-like data.  They
have the following statement:

	last if ( $line =~ /<[a-z]/i );		 # found an element

However, it does not take into account that the match could be inside
a comment declaration, so the scan can stop before it ever reaches
the doctype declaration.
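To make the failure concrete, here is a minimal sketch of a line-by-line
scan.  Only the "found an element" test is taken from their source; the
sub name naive_doctype_seen, the doctype test, and the sample document
are my own assumptions for illustration, not the validator's actual code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the validator's line-by-line scan.
# Only the "found an element" test comes from the original source.
sub naive_doctype_seen {
    my @lines = @_;
    for my $line (@lines) {
        return 1 if $line =~ /<!DOCTYPE/i;   # assumed doctype test
        last     if $line =~ /<[a-z]/i;      # found an element
    }
    return 0;
}

# A leading comment declaration that contains tag-like data.
my @doc = (
    "<!-- <p>old markup kept in a comment</p> -->\n",
    "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0//EN\">\n",
    "<html><head><title>t</title></head><body></body></html>\n",
);

print naive_doctype_seen(@doc) ? "doctype found\n" : "doctype missed\n";
```

The "<p" inside the comment satisfies the element test on the first
line, so the loop exits before the doctype line is ever examined.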

Dealing with comment declarations can be ugly since the program reads
the data into an array instead of keeping it in a single scalar string
(I'm unclear why the document is split into an array).  If the data were
passed in as a single string, a comment-stripping regex such as

	s/<!--([^-]|-[^-])*--\s*>//go;

could be applied first, before checking for a doctype declaration.
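A rough sketch of that approach, assuming the whole document is already
in one scalar.  The sub name doctype_seen and the "first markup wins"
test are my own illustration, not code from the validator.  The regex is
the one quoted above, minus the "o" modifier, which only matters when a
pattern interpolates variables.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if a doctype declaration precedes the
# first element once comment declarations have been stripped.
sub doctype_seen {
    my ($doc) = @_;                          # whole document, one string

    # Strip comment declarations so tag-like data inside them is ignored.
    $doc =~ s/<!--([^-]|-[^-])*--\s*>//g;

    # Find the first doctype declaration or element start, whichever
    # comes first.  Processing instructions such as <?xml ...?> are skipped.
    if ($doc =~ /<(!DOCTYPE\b|[a-z])/i) {
        return substr($1, 0, 1) eq '!' ? 1 : 0;
    }
    return 0;                                # no markup found at all
}

my $doc = "<!-- <p>old markup kept in a comment</p> -->\n"
        . "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0//EN\">\n"
        . "<html><head><title>t</title></head><body></body></html>\n";

print doctype_seen($doc) ? "doctype found\n" : "doctype missing\n";
```

Note that this pattern only handles a declaration holding a single
comment; an SGML comment declaration with several comments in it
(<!-- a -- -- b -->) would need a more general pattern.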

Another possible solution is to call nsgmls first and see if
it complains about a missing document type.  One has to be
careful when dealing with an XML document, since the SGML
declaration for XML needs to be passed to nsgmls for parsing (to
avoid invalid-character and other errors).  However, a simple
pattern match checking for XML-specific markup could be used to
determine whether XML-related arguments to nsgmls are needed.
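A sketch of the pattern-match idea, under several assumptions: the
xml.dcl path is hypothetical, the sub names looks_like_xml and
nsgmls_args are mine, and the two tests are only heuristics (an XML
declaration, or an empty-element tag such as <br/>).  I believe SP's
nsgmls accepts -s (suppress normal output) and -wxml, but check the
options against the installed version.  The code only builds the
argument list; it does not run nsgmls.

```perl
use strict;
use warnings;

# Hypothetical location of the SGML declaration for XML.
my $XML_DECL = '/usr/local/lib/sgml/xml.dcl';

# Heuristic only: guess whether the document is XML.
sub looks_like_xml {
    my ($doc) = @_;
    return 1 if $doc =~ /\A\s*<\?xml\b/;        # XML declaration
    return 1 if $doc =~ /<[A-Za-z][^<>]*\/>/;   # empty-element tag
    return 0;
}

# Build the nsgmls argument list for a document stored in $file.
sub nsgmls_args {
    my ($doc, $file) = @_;
    my @args = ('-s');                          # errors only, no ESIS output
    push @args, '-wxml', $XML_DECL if looks_like_xml($doc);
    return (@args, $file);
}

my $xml = "<?xml version=\"1.0\"?>\n<doc/>\n";
print join(' ', 'nsgmls', nsgmls_args($xml, 'doc.xml')), "\n";
```

The returned list could then be handed to system() or a pipe open, and
the error output searched for a complaint about the missing document
type.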

	--ewh

----
             Earl Hood              | University of California: Irvine
      ehood@medusa.acs.uci.edu      |      Electronic Loiterer
http://www.oac.uci.edu/indiv/ehood/ | Dabbler of SGML/WWW/Perl/MIME