- From: olivier Thereaux <ot@w3.org>
- Date: Mon, 11 Apr 2005 15:56:08 +0900
- To: Ville Skyttä <ville.skytta@iki.fi>
- Cc: QA Dev <public-qa-dev@w3.org>
On Mar 11, 2005, at 16:54, Ville Skyttä wrote:
>> Cat 1: make the link checker faster
>> * RobotUA has a minimum latency of 1s between requests, so we can't
>> make one W3C::UserAgent instance be much faster, but we could use
>> several.
>
> Right, assuming we can get those UserAgents to crawl in parallel. But
> that would require some possibly nontrivial changes to the link
> checker.
Agreed. Still, the more I think of it, the more I am convinced that
this is our best bet.
What we have now in sub check_uri is basically:

  # check document
  # record all the links found
  foreach my $u (keys %links) {
    ...
    &check_validity();
  }
  ...
I suppose that before the foreach loop we could classify the links by
host (or by IP, although the extra DNS queries might slow us down) and
spawn n UAs, one per group, to check their validity.
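To make that more concrete, here is a rough sketch -- not checklink
code; the link list, UA names and error handling are invented, and a
real version would cap the number of children and send results back to
the parent:

  #!/usr/bin/perl
  # Rough sketch only: group the links found in a document by host, then
  # let one child process per host check its share with its own
  # LWP::RobotUA (W3C::UserAgent is a subclass of it).
  use strict;
  use warnings;
  use URI;
  use LWP::RobotUA;

  my %links = map { $_ => 1 } qw(
    http://www.w3.org/       http://www.w3.org/QA/
    http://validator.w3.org/ http://jigsaw.w3.org/css-validator/
  );

  # Classify the links by host; each host gets its own UA, so the
  # per-server politeness delay no longer serializes everything.
  my %by_host;
  for my $u (keys %links) {
    my $uri = URI->new($u);
    next unless $uri->scheme && $uri->scheme =~ /^https?$/;
    push @{ $by_host{$uri->host} }, $u;
  }

  # One child per host.
  my @pids;
  for my $host (keys %by_host) {
    my $pid = fork;
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {
      my $ua = LWP::RobotUA->new('W3C-checklink-sketch/0.1',
                                 'webmaster@example.org');
      $ua->delay(1/60);    # 1 second between requests to this host
      for my $u (@{ $by_host{$host} }) {
        my $res = $ua->head($u);
        printf "%s: %s %s\n", $host, $res->code, $u;
      }
      exit 0;
    }
    push @pids, $pid;
  }
  waitpid $_, 0 for @pids;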
> For example, not too long ago, I specifically made sure that we use
> only
> one UA; before that, IIRC WLC was fetching /robots.txt for every single
> link due to robots info not being shared between different UAs spawned
> here and there.
Are we using / could we use a global hash of checked robots rule files
alongside %processed, %results and %redirects?
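For what it's worth, LWP::RobotUA takes an optional WWW::RobotRules
object, so the same rules store could be handed to every UA we create,
much like %processed and friends. A minimal sketch (the names and the
shared variable are mine, not checklink's):

  use strict;
  use warnings;
  use LWP::RobotUA;
  use WWW::RobotRules;

  # One rules object shared by all UAs, so robots.txt is fetched and
  # parsed at most once per host across the whole run.
  my $shared_rules = WWW::RobotRules->new('W3C-checklink-sketch/0.1');

  sub new_ua {
    my $ua = LWP::RobotUA->new('W3C-checklink-sketch/0.1',
                               'webmaster@example.org',
                               $shared_rules);
    $ua->delay(1/60);
    return $ua;
  }

  my $ua1 = new_ua();
  my $ua2 = new_ua();
  print 'shared rules: ', ($ua1->rules == $ua2->rules ? 'yes' : 'no'), "\n";

If the UAs ended up in separate processes rather than in one, an
on-disk store such as WWW::RobotRules::AnyDBM_File would be needed
instead of an in-memory object.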
> Also, the report on progress needs to be revamped if
> there are multiple UAs working simultaneously.
Not much, if I am not mistaken. As far as I can tell, the messages for
"starting to process URI foo" and "URI foo processed in X time - result
bar" are already separate.
> I have a feeling that it's not feasible to implement this in the
> current
> codebase before m12n.
I would say it's probably feasible, but it makes more sense to do it
after, or during, m12n. I am mostly trying to come up with a solution
we could aim to implement whenever we are ready...
--
olivier