- From: olivier Thereaux <ot@w3.org>
- Date: Mon, 11 Apr 2005 15:56:08 +0900
- To: Ville Skyttä <ville.skytta@iki.fi>
- Cc: QA Dev <public-qa-dev@w3.org>
On Mar 11, 2005, at 16:54, Ville Skyttä wrote:

>> Cat 1: make the link checker faster
>> * RobotUA has a minimum latency of 1s between requests, so we can't
>>   make one W3C::UserAgent instance be much faster, but we could use
>>   several.
>
> Right, assuming we can get those UserAgents to crawl in parallel. But
> that would require some possibly nontrivial changes to the link
> checker.

Agreed. Still, the more I think of it, the more I am convinced that
this is our best bet.

What we have now in sub check_uri is basically:

  # check document
  # Record all the links found
  foreach my $u (keys %links) {
      ...
      &check_validity();
  }
  ...

I suppose that before the foreach loop we could classify the links by
host (or by IP, although the extra DNS queries might slow us down) and
spawn n UAs to check the validity of each group; see the sketch at the
end of this message.

> For example, not too long ago, I specifically made sure that we use
> only one UA; before that, IIRC WLC was fetching /robots.txt for every
> single link due to robots info not being shared between different UAs
> spawned here and there.

Are we using / could we use a global hash of checked robots rules
files, along with %processed, %results and %redirects?

> Also, the report on progress needs to be revamped if there are
> multiple UAs working simultaneously.

Not much, if I am not mistaken. As far as I can tell, the output for
"starting to process URI foo" and "URI foo processed in X time -
result bar" are separate.

> I have a feeling that it's not feasible to implement this in the
> current codebase before m12n.

I would say it's probably feasible, but it makes more sense to do it
after, or during, m12n. I am mostly trying to come up with a solution
we could aim to implement whenever ready...

-- 
olivier
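
For what it's worth, here is a rough sketch of the by-host grouping
idea. This is not the actual checklink code: the %links hash filled
from @ARGV, the agent string, the contact address and the plain
printf reporting are all placeholders for the real internals, and a
real version would cap the number of concurrent children and funnel
results back into %processed / %results / %redirects.

  #!/usr/bin/perl
  # Sketch only, not the actual checklink code: group the links by
  # host, then fork one worker per host, each with its own RobotUA,
  # so RobotUA's per-host request delay no longer serializes the
  # whole crawl.
  use strict;
  use warnings;
  use URI ();
  use LWP::RobotUA ();

  # Stand-in for the %links hash that check_uri() builds.
  my %links = map { $_ => 1 } @ARGV;

  # Classify the links by host so each worker stays on one host.
  my %by_host;
  for my $u (keys %links) {
      my $host = eval { URI->new($u)->host } || '';
      push @{ $by_host{$host} }, $u;
  }

  # Spawn one child per host; each child gets its own RobotUA.
  my @pids;
  for my $host (keys %by_host) {
      my $pid = fork();
      die "fork failed: $!" unless defined $pid;
      if ($pid == 0) {    # child
          my $ua = LWP::RobotUA->new('W3C-checklink-sketch/0.1',
                                     'webmaster@example.org');
          $ua->delay(1/60);    # delay() is in minutes; ~1s as above
          for my $u (@{ $by_host{$host} }) {
              my $res = $ua->get($u);
              # A real version would feed this back to the parent
              # (e.g. through a pipe) into %processed / %results /
              # %redirects instead of just printing.
              printf "%s %s\n", $res->code, $u;
          }
          exit 0;
      }
      push @pids, $pid;    # parent keeps track of workers
  }
  waitpid($_, 0) for @pids;

Because each child only ever talks to one host, its private RobotUA
fetches that host's /robots.txt once, so the robots-sharing concern
above is mostly avoided as long as the results are reported back to
the parent.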
Received on Monday, 11 April 2005 06:56:14 UTC