Re: Question about web spiders... from Lachlan Hunt on 2005-06-26 (www-html@w3.org from June 2005)

From: Lachlan Hunt <lachlan.hunt@lachy.id.au>
Date: Sun, 26 Jun 2005 18:08:27 +1000
To: Peter Kupfer <peter.kupfer@sbcglobal.net>
CC: www-html@w3.org
Message-ID: <42BE627B.8070906@lachy.id.au>

Peter Kupfer wrote:
> Lachlan Hunt wrote:
>> The correct way to control the way a spider indexes your site is to 
>> use robots.txt, assuming the spider in question implements it.
> 
> In a robots.txt file can you control specifically what links a spider 
> will follow on a certain page,

No, it controls which pages on a server the spider can access.

>  or just that it won't go to a certain page.

Essentially, yes.
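
A minimal sketch of that distinction, assuming Python's standard
urllib.robotparser module. The robots.txt content, the bot name
"ExampleBot" and the URLs are all hypothetical, chosen only for
illustration. The rules name paths a robot may not request; nothing in
them says which links on a page to follow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every robot is barred from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The check is made per URL being requested, not per link on a page.
allowed_home = parser.can_fetch("ExampleBot", "http://example.com/index.html")
allowed_private = parser.can_fetch(
    "ExampleBot", "http://example.com/private/page.html"
)
```

A well-behaved robot would make this check before each request, so a
disallowed page stays off limits no matter which page linked to it.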

> I want the spider to eventually hit each subdomain, just not from 
> the home page, I have it start at each subdomain index?

Then HTML is the wrong place to specify such behaviour and robots.txt is 
probably not suitable for you either.  HTML is designed to mark up the 
semantics of the document's content by saying *what* the content is, not 
to describe how the content should be processed by a particular UA.  
Having said that, processing instructions [1] are designed to supply 
system-specific information, but I don't know how suitable they would be 
for your particular needs.

I don't understand why it matters which path is followed to reach 
subdomains, but I think you need to find a way to configure the robot 
itself, not try to give it instructions from within the documents it reads.

> Or, can each subdomain have its own robots.txt.

Yes, AFAIK, spiders look for robots.txt in the root directory of every 
host, regardless of whether it's the main domain or a subdomain, e.g.:
http://example.com/robots.txt
http://subdomain.example.com/robots.txt
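
As a rough sketch of that lookup, assuming Python's standard
urllib.parse module (the helper name robots_url is hypothetical), the
robots.txt location can be derived from any page URL by keeping the
scheme and host and replacing the rest with /robots.txt:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (netloc); drop the path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


main_robots = robots_url("http://example.com/index.html")
sub_robots = robots_url("http://subdomain.example.com/docs/page.html?x=1")
```

Each host name yields its own location, so the rules for a subdomain
are independent of those for the main domain.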

In any case, this is completely off topic for this HTML-related list.

>> nofollow was discussed quite extensively on this list when Google
>> introduced it and the vast majority of this community rejected it.
> 
> I tried to search the archive, but didn't see it there, why was no 
> follow rejected?

Then you didn't look very hard.  A search for "nofollow" in the archives 
returns most of the thread, appearing just below the messages from this 
thread.  For your convenience: it started with a message on 
www-html-editor [2][3], and most of the follow-up discussion took place 
on www-html [4].

[1] http://www.is-thought.co.uk/book/sgml-8.htm#PI
[2] http://lists.w3.org/Archives/Public/www-html-editor/2005JanMar/0010
[3] http://lists.w3.org/Archives/Public/www-html-editor/2005JanMar/thread#10
[4] http://lists.w3.org/Archives/Public/www-html/2005Jan/thread#64

-- 
Lachlan Hunt
http://lachy.id.au/

Received on Sunday, 26 June 2005 08:08:41 UTC