Re: CPU usage from Ville Skyttä on 2003-07-25 (www-validator@w3.org from July 2003)

From: Ville Skyttä <ville.skytta@iki.fi>
Date: 25 Jul 2003 21:39:05 +0300
To: www-validator@w3.org
Message-Id: <1059158345.25012.81.camel@bobcat.mine.nu>

On Fri, 2003-07-25 at 11:16, Centaur zeus wrote:

[...]
> I found that it actually parsed two documents, one is the one I requested 
> and another is the one of the html link.
[...]

> 1) Why the html link is parsed again ?

There are two situations in which linked documents need to be parsed:

1) Recursive checking.  Obviously, if in recursion, the linked documents
   need to be fetched and parsed in full to extract other links from 
   them.

2) The links contain fragments, e.g. <http://foo/bar#quux>.  To check
   whether there is an ID "quux" in the linked document, it needs to be 
   fetched and parsed.
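
To illustrate case 2, here is a minimal sketch of a fragment check. It is
written in Python rather than checklink's Perl and is not how checklink
itself does it; the names AnchorCollector and fragment_exists are made up
for this example. It assumes that an anchor is either an id attribute on any
element or a name attribute on an <a> element.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class AnchorCollector(HTMLParser):
    """Collect every id attribute, plus name attributes on <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue  # attribute without a value, e.g. <p id>
            if name == "id" or (tag == "a" and name == "name"):
                self.anchors.add(value)


def fragment_exists(url, html_text):
    """Return True if url has no fragment, or its fragment names an anchor."""
    _, fragment = urldefrag(url)
    if not fragment:
        return True  # nothing to verify, so the document need not be parsed
    collector = AnchorCollector()
    collector.feed(html_text)
    collector.close()
    return fragment in collector.anchors
```

The early return is the point: a link without a fragment can be verified
without parsing the target at all, which is why only fragment links (and
recursion) force a full parse.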

> 2) is it appropriate to change if (being_processed) to if (0) and what's the 
> impact ?

You would lose the ability to check fragments' "validity", i.e. whether
the ID a link points to actually exists in the linked document.

> 3) How can I minimize the resource used by the LWP and HTTP package ?

Use nice(1) :)

If you're looking into optimizing the code, one possibility would be to
avoid repeated instantiation of W3C::UserAgent and W3C::CheckLink
objects.  Instantiating these also means instantiating e.g. new
HTML::Parser objects etc.  Since checklink doesn't operate in parallel,
one UserAgent and one CheckLink instance could be enough for one run of
the script.
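
The reuse idea above, as a rough Python sketch (not checklink code).
LinkChecker is a hypothetical stand-in for an object that is expensive to
construct, and its check method is only a placeholder:

```python
class LinkChecker:
    """Hypothetical stand-in for an object that is costly to construct."""

    instances_created = 0  # counts constructions, for illustration only

    def __init__(self):
        LinkChecker.instances_created += 1
        self.checked = []

    def check(self, url):
        # Placeholder check; a real checker would fetch the URL here.
        self.checked.append(url)
        return url.startswith("http")


def check_all(urls):
    """Check every URL with a single shared checker instance."""
    checker = LinkChecker()  # constructed once per run, not once per link
    return {url: checker.check(url) for url in urls}
```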

And looking into checking links in parallel would probably result in
quicker completion times, though most likely at the expense of somewhat
bigger resource usage.
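
For what it's worth, a minimal sketch of the parallel approach, again in
Python rather than Perl and only as an illustration.  The function name
check_links_parallel and the check_one callback are assumptions of this
example; check_one stands for whatever per-link check is used.

```python
from concurrent.futures import ThreadPoolExecutor


def check_links_parallel(urls, check_one, max_workers=4):
    """Apply check_one to every URL using a thread pool; return {url: result}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        results = list(pool.map(check_one, urls))
    return dict(zip(urls, results))
```

max_workers is where the trade-off shows up: more workers mean more
simultaneous connections and parsers, so faster completion but more memory
and CPU in use at any one time.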

If you roll up your sleeves and do some work, please don't hesitate to
send patches!

Cheers,
-- 
\/

Received on Friday, 25 July 2003 14:39:08 UTC