- From: Terje Bless <link@tss.no>
- Date: Wed, 10 Nov 1999 11:13:27 +0100
- To: W3C Validator <www-validator@w3.org>
- Message-Id: <199911110226.DAA09264@vals.intramed.rito.no>
Here is a patch for the check_for_doctype code. It has some fairly insignificant performance improvements, but reduces memory consumption by a little more than the size of the file being validated. It includes the fix for the greedy regex that barfed on a commented-out DOCTYPE within 5 lines after the real DOCTYPE. It also enforces the use of matching quote characters (i.e. either single or double quotes, but the opening and closing one must be of the same type). It is rewritten in more "idiomatic" Perl, and Gerald will probably kill me for the indentation changes. :-)

I'm not hearing anyone yell loudly at the idea of killing the DOCTYPE guessing feature, but I'm also not sure how to interpret that. Gross disinterest or insufficient information?

I've been looking at the HTML::Parser module (part of LWP) and it looks like we could do some pretty nifty stuff with it if we could kill the guessing code in favour of a DOCTYPE override. I suppose the guessing code could still be supported, but that would be even messier than it is today.

The idea behind using HTML::Parser is that we can dump the DOCTYPE extraction (for documents that have a DOCTYPE) on it instead of rolling our own. We lose some control in the short term, but gain cleaner code and make a whole bunch of new features much easier to implement. The two things off the top of my head are link checking (A and IMG) and the DOCTYPE override.

There is a lot more we could favourably use HTML::Parser for if it proves up to the task. Outlines and incremental processing spring to mind.
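
To illustrate the two regex points above (matching quote characters, and not tripping over a commented-out DOCTYPE), here is a minimal sketch. It is not the code from check_for_doctype.diff: it is written in Python rather than the validator's Perl, the names `DOCTYPE_RE`, `COMMENT_RE` and `find_doctype_fpi` are hypothetical, and stripping comments before matching is an assumption made for the example, not necessarily how the patch handles it.

```python
import re

# Hypothetical pattern, not taken from the actual patch.
# (["']) captures the opening quote and \1 demands the same character
# as the closing quote. (?:(?!\1)[^>])* consumes the identifier without
# crossing the closing quote or the end of the declaration.
DOCTYPE_RE = re.compile(
    r"""<!DOCTYPE\s+\w+\s+PUBLIC\s+(["'])((?:(?!\1)[^>])*)\1""",
    re.IGNORECASE,
)

# Non-greedy match so that each comment ends at its own "-->"
# instead of swallowing everything up to the last one in the file.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def find_doctype_fpi(text):
    """Return the public identifier of the first DOCTYPE that is not
    inside a comment, or None if no well-quoted DOCTYPE is found."""
    without_comments = COMMENT_RE.sub("", text)
    match = DOCTYPE_RE.search(without_comments)
    return match.group(2) if match else None
```

With this sketch, a document whose real DOCTYPE is followed by a commented-out one yields the real identifier, and a declaration that opens with a double quote but closes with a single quote is rejected rather than guessed at.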
Attachments
- application/octet-stream attachment: check_for_doctype.diff
Received on Wednesday, 10 November 1999 21:26:35 UTC