- From: Ville Skyttä <ville.skytta@iki.fi>
- Date: Fri, 11 Mar 2005 09:54:22 +0200
- To: olivier Thereaux <ot@w3.org>
- Cc: QA Dev <public-qa-dev@w3.org>
On Fri, 2005-03-11 at 16:06 +0900, olivier Thereaux wrote:

A quick reply to the first point (I'm in a hurry now), more next week.

> Cat 1: make the link checker faster
> * RobotUA has a minimum latency of 1s between requests, so we can't
> make one W3C::UserAgent instance be much faster, but we could use
> several.

Right, assuming we can get those UserAgents to crawl in parallel. But
that would require some possibly nontrivial changes to the link
checker. For example, not too long ago, I specifically made sure that
we use only one UA; before that, IIRC WLC was fetching /robots.txt for
every single link due to robots info not being shared between different
UAs spawned here and there. Also, the report on progress needs to be
revamped if there are multiple UAs working simultaneously. I have a
feeling that it's not feasible to implement this in the current
codebase before m12n.

> I assume this is what the following comment means:
> [[
> my $ua = W3C::UserAgent->new($AGENT); # @@@ TODO: admin address
> # @@@ make number of keep-alive connections customizable
> ]]

Not really. A single UA can have a number of kept-alive connections,
that's what this comment is about. I remember thinking about sorting
the list of known to-be-checked URLs (after canonicalization) and
processing in that order for better keep-alive utilization, but I
don't remember if I did anything about that.

> Having a configurable number could make sense. We could also spawn one
> W3C::UserAgent per target host (would require changes in how and when
> the parsing of the links are done, I suppose?)

Probably right. See also above for other things involving sharing
stuff between the agents.

I'll look through this list early next week when I have more time.
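P.S. In case it helps the discussion, here's a rough, untested sketch of
what the shared robots.txt rules, one-UA-per-host and keep-alive ideas
could look like using plain LWP::RobotUA and WWW::RobotRules directly.
The real change would of course go through the W3C::UserAgent wrapper,
and the agent string and admin address below are made up; the actual
concurrency between the per-host UAs (forks, threads or an event loop)
and the revamped progress reporting are the nontrivial parts and are
not shown here.

[[
#!/usr/bin/perl
use strict;
use warnings;

use LWP::RobotUA;
use WWW::RobotRules;
use URI;

my $agent = 'W3C-checklink-experiment';  # made-up agent string
my $from  = 'webmaster@example.org';     # made-up admin address (the TODO above)

# One shared rules object, so /robots.txt is fetched and parsed only
# once per host no matter how many UAs end up being spawned.
my $rules = WWW::RobotRules->new($agent);

# One UA per target host, each with its own kept-alive connections and
# its own 1 second delay towards that host.
my %ua_for_host;

sub ua_for {
    my ($host) = @_;
    return $ua_for_host{$host} ||= do {
        my $ua = LWP::RobotUA->new(
            agent      => $agent,
            from       => $from,
            rules      => $rules,  # robots.txt knowledge shared across UAs
            keep_alive => 2,       # number of persistent connections
        );
        $ua->delay(1/60);          # RobotUA delays are in minutes; 1/60 = 1s
        $ua;
    };
}

# Canonicalize the to-be-checked URLs and process them grouped by host,
# so that consecutive requests to the same host can reuse a kept-alive
# connection.
my @links = grep { ($_->scheme || '') =~ /^https?$/ }
            map  { URI->new($_)->canonical } @ARGV;

for my $uri (sort { $a->host cmp $b->host } @links) {
    my $response = ua_for($uri->host)->get($uri);
    printf "%d %s\n", $response->code, $uri;
}
]]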
Received on Friday, 11 March 2005 07:54:58 UTC