Re: Unwanted robot accesses from your site

On Wed, 2002-12-25 at 09:05, Olivier Thereaux wrote:

> On Wednesday, Dec 25, 2002, at 13:48 Asia/Tokyo, Bjoern Hoehrmann wrote:
> > Why doesn't checklink qualify as a robot?
> 
> My own definition of a robot is that it retrieves some data (the 
> documents) or metadata (indexing). I may be wrong.

Checklink definitely retrieves documents.  It doesn't store any
information or present the document to its user, though.  I don't think
that storing, indexing, etc. is a criterion for whether something is a
robot or not.

I don't know if there's an authoritative definition anywhere, but The
Web Robots FAQ [1] has one.  It also mentions "HTML validation" and
"Link validation" as purposes robots can be used for.  The database [2]
on that site doesn't contain the W3C Validator or Link Checker, though. 
But there are other link checkers in the list ("Link Validator",
"LinkScan", "LinkWalker" ...).

> In any case I don't think checklink, even in recursive mode, should 
> follow the robots directives (noindex is irrelevant, and nofollow would 
> make it useless...). I'm interested to hear opposite arguments, though.

I tend to see checklink as a robot, and think it should fully follow
the robots exclusion standard in all modes.  Without an authoritative
specification, this is tough to back up (which is already evident in
this thread).
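For what it's worth, honoring the exclusion standard is cheap to
implement.  A minimal sketch in Python (not checklink's actual Perl
code; the "W3C-checklink" user-agent name here is just illustrative)
showing how a link checker could consult robots.txt rules before
fetching a URL:

```python
# Sketch: check robots exclusion rules before fetching a URL.
# Uses Python's stdlib robots.txt parser; the robots.txt body is
# inlined so the example stays self-contained.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant link checker would skip disallowed URLs:
print(rp.can_fetch("W3C-checklink", "http://example.org/private/page"))  # False
print(rp.can_fetch("W3C-checklink", "http://example.org/public/page"))   # True
```

In a real checker, robots.txt would be fetched once per host and the
parsed rules cached, so the per-link overhead is a single in-memory
lookup.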

See the related stuff also in [3] and [4], and the Bugzilla enhancement
entry at [5].

[1] <http://www.robotstxt.org/wc/faq.html>
[2] <http://www.robotstxt.org/wc/active/html/>
[3] <http://www.kollar.com/robots.html>
[4] <http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt>
[5] <http://www.w3.org/Bugs/Public/show_bug.cgi?id=27>

-- 
\/ille Skyttä
ville.skytta at iki.fi

Received on Wednesday, 25 December 2002 05:40:03 UTC