[PATCH] check_for_doctype

Here is a patch for the check_for_doctype code. It has some fairly
insignificant performance improvements, but reduces memory consumption by a
little more than the size of the file being validated. It includes the fix
for the greedy regex that barfed on a commented-out DOCTYPE within 5 lines
after the real DOCTYPE. It also enforces the use of matching quote
characters (i.e. either single or double quotes, but the opening and
closing one must be of the same type). It has been rewritten in more
"idiomatic" Perl, and Gerald will probably kill me for the indentation
changes. :-)


I'm not hearing anyone yell loudly at the idea of killing the DOCTYPE
guessing feature, but I'm also not sure of how to interpret that. Gross
disinterest or insufficient information?


I've been looking at the HTML::Parser module (part of LWP) and it looks
like we could do some pretty nifty stuff with it if we could kill the
guessing code in favour of a DOCTYPE override. I suppose the guessing code
could still be supported, but that would be even messier than it is today.

The idea behind using HTML::Parser is that we can dump the DOCTYPE
extraction (for documents that have a DOCTYPE) on it instead of rolling our
own. We lose some control in the short term, but gain cleaner code and make
a whole bunch of new features much easier to implement. The two things off
the top of my head are link checking (A and IMG) and the DOCTYPE override.
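
As a minimal sketch of what handing the DOCTYPE extraction to
HTML::Parser might look like: this assumes the module's version 3
handler interface (api_version, declaration_h and the 'text' argspec),
which may not match the release being looked at here. The helper name
extract_doctype is made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the raw text of the first DOCTYPE
# declaration the parser reports, or undef if there is none.
sub extract_doctype {
    my ($html) = @_;
    my $doctype;

    my $parser = HTML::Parser->new(
        api_version   => 3,
        # Called for each <!...> markup declaration. Comments are reported
        # through a separate handler, so a commented-out DOCTYPE should
        # never reach this callback.
        declaration_h => [
            sub {
                my ($text) = @_;
                $doctype = $text
                    if !defined $doctype && $text =~ /^<!DOCTYPE\b/i;
            },
            'text',
        ],
    );

    $parser->parse($html);
    $parser->eof;
    return $doctype;
}
```

The same parser object could also be given a start-tag handler for A and
IMG, which is roughly where link checking would hook in.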

There is a lot more we could favourably use HTML::Parser for if it proves
up to the task. Outlines and incremental processing spring to mind.

Received on Wednesday, 10 November 1999 21:26:35 UTC