Re: Some bot is visiting my translations from olivier Thereaux on 2005-08-18 (w3c-translators@w3.org from July to September 2005)

From: olivier Thereaux <ot@w3.org>
Date: Thu, 18 Aug 2005 14:32:22 +0900
To: Andrei Stanescu <andre@siteuri.ro>
Cc: w3c-translators@w3.org
Message-Id: <370C4E22-6291-4E74-9600-DB2350089B4A@w3.org>

On 16 Aug 2005, at 05:17, Andrei Stanescu wrote:
> Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0) Fetch API Request
>
> ...has ridiculously high request rates, about 10 / page / day. It  
> only visits my W3C translations, and has done so for months.
>
> Anyone has any idea what this is and whether it is used by W3C?  
> Otherwise I will ban it.

Definitely not a W3C robot. The only thing that qualifies as such is  
the link checker, and it has a different user agent signature. That  
said, 10 requests per page per day isn't incredibly high if your
document is linked from W3C, I think. You wouldn't believe how some
robots behave...

Looking around for a few minutes, I found some reports that this is
the UA signature of a spam harvester, and others that it is just a
particular proxy-cache software refreshing its cache. Nothing certain.

In any case, making said robot send fewer requests is hardly an  
option unless you know who is using it (the best way to figure out  
who is behind the robot would be to look at the IP from which the  
requests come). But there are robots.txt directives to refuse access  
to a robot with a specific signature, e.g.:

User-agent: Fetch API Request
Disallow: /my/area

and if the robot is impolite and does not follow the robots exclusion
protocol, then there's an arsenal of mod_rewrite and "deny from"
possibilities (or their equivalents if your server is not Apache).
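
As a rough sketch of the Apache side, and only under assumptions: I am
guessing that mod_setenvif and mod_rewrite are enabled and that you may
use these directives where you put them (server config or .htaccess).
The variable name "bad_bot" is made up for illustration. Either variant
on its own should be enough to answer the robot with a 403:

```apache
# Variant 1: "deny from" based on an environment variable.
# Flag any request whose User-Agent contains the signature
# (case-insensitive match), then refuse flagged requests.
SetEnvIfNoCase User-Agent "Fetch API Request" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot

# Variant 2: mod_rewrite.
# Return 403 Forbidden for every URL requested with that signature.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "Fetch API Request" [NC]
RewriteRule .* - [F]
```

Matching on the user agent string only stops robots that keep sending
it; if the signature changes, denying by IP address is the fallback.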

Hope this helps.
-- 
olivier

Received on Thursday, 18 August 2005 05:32:25 UTC