Re: [checklink] outsource links extraction? from Ville Skyttä on 2005-10-24 (public-qa-dev@w3.org from October 2005)

From: Ville Skyttä <ville.skytta@iki.fi>
Date: Mon, 24 Oct 2005 09:34:52 +0300
To: QA-dev Dev <public-qa-dev@w3.org>
Message-Id: <1130135692.4920.127.camel@localhost.localdomain>

On Mon, 2005-10-24 at 08:18 +0900, olivier Thereaux wrote:

> I recall that the link checker is based on HTML::Parser, and that's  
> the base object used to parse the documents and extract the links. I  
> noticed recently that there were a couple of libs that we may want to  
> use instead, such as HTML::LinkExtor (actually a subclass of  
> HTML::Parser) or HTML::LinkExtractor.
> 
> Does anyone remember if these have already been considered, and if  
> yes, why we chose not to use them?

I dimly remember having a look at those some time ago.  There are at
least a couple of things worth noting: neither provides any line/column
number locator information, and neither deals with anchors.  Both of
these could probably be taken care of through subclassing, but we'd
need to do some parsing ourselves anyway, so the savings from the
outsourcing might not be that big.

On a semi-related note, HTML::Parser 3.19_94 and later have "line" and
"column" argspecs which I think could be used instead of counting lines
ourselves.
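
A rough, untested-against-checklink sketch of how those argspecs could be
requested, in case it helps the discussion.  The helper name
extract_with_positions and the returned hash layout are made up for
illustration and are not taken from the link checker; only the
HTML::Parser calls and the "tagname, attr, line, column" argspec string
come from the module itself.

```perl
use strict;
use warnings;
use HTML::Parser 3.20;   # "line" and "column" argspecs need 3.19_94 or later

# Hypothetical helper, not checklink code: collect links and anchors
# together with the position of the start tag that declared them.
sub extract_with_positions {
    my ($html) = @_;
    my (@links, @anchors);

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Ask the parser to pass line and column along with each start tag.
        start_h => [
            sub {
                my ($tag, $attr, $line, $column) = @_;

                # Links: only <a href="..."> is handled in this sketch.
                if ($tag eq 'a' && defined $attr->{href}) {
                    push @links,
                        { url => $attr->{href}, line => $line, column => $column };
                }

                # Anchors: <a name="..."> or any element carrying an id.
                for my $key (qw(name id)) {
                    next if $key eq 'name' && $tag ne 'a';
                    next unless defined $attr->{$key};
                    push @anchors,
                        { name => $attr->{$key}, line => $line, column => $column };
                }
            },
            'tagname, attr, line, column',
        ],
    );

    $parser->parse($html);
    $parser->eof;
    return (\@links, \@anchors);
}

my ($links, $anchors) = extract_with_positions(
    qq{<p>\n  <a href="http://www.w3.org/">W3C</a>\n  <a name="top"></a>\n</p>\n}
);
printf "link %s at line %d, column %d\n",   @$_{qw(url line column)}  for @$links;
printf "anchor %s at line %d, column %d\n", @$_{qw(name line column)} for @$anchors;
```

Per the HTML::Parser documentation, lines are counted from 1 and columns
from 0, and the position reported is that of the start of the event (here,
the opening "<" of the tag), so any offset to a specific attribute would
still have to be worked out separately.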

Received on Monday, 24 October 2005 06:34:56 UTC