Re: Question about web spiders... from Peter Kupfer on 2005-06-27 (www-html@w3.org from June 2005)

From: Peter Kupfer <peter.kupfer@sbcglobal.net>
Date: Sun, 26 Jun 2005 23:14:46 -0500
To: Lachlan Hunt <lachlan.hunt@lachy.id.au>
CC: www-html@w3.org
Message-ID: <42BF7D36.7040302@sbcglobal.net>

Lachlan Hunt wrote:
> Peter Kupfer wrote:
> 
>> Lachlan Hunt wrote:
>>
>>> The correct way to control the way a spider indexes your site is to 
>>> use robots.txt, assuming the spider in question implements it.
>>
>> In a robots.txt file can you control specifically what links a spider 
>> will follow on a certain page,
> 
> No, it controls which pages on a server the spider can access.
> 
>>  or just that it won't go to a certain page.
> 
> Essentially, yes.

This is what I thought, so, as you concluded, a robots.txt won't fix my 
problem here. :(
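
To make the distinction concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the host name, and the "ExampleBot" user agent are made up for illustration. It shows that a rule applies to the target URL itself, regardless of which page linked to it, so it cannot express "don't follow this link from the home page".

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the path is made up for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(text):
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# The answer depends only on the URL being requested, never on the
# page that contained the link to it.
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/page.html")
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
print(blocked, allowed)
```

A spider that honours robots.txt would skip everything under /private/ no matter where it found the link, which is why the file cannot solve a per-page link-following problem.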

>> I want the spider to eventually hit each subdomain, just not from the 
>> home page; I have it start at each subdomain's index.
> 
> Then HTML is the wrong place to specify such behaviour and robots.txt is 
> probably not suitable for you either.  HTML is designed to mark up the 
> semantics of the document's content by saying *what* the content is, not 
> describe how the content should be processed by a particular UA.  Having 
> said that though, processing instructions [1] are designed to supply 
> system specific information, but I don't know how suitable they would be 
> for your particular needs.

Fair enough.

> 
> I don't understand why it matters which path is followed to reach 
> subdomains, but I think you need to find a way to configure the robot 
> itself, not try to give it instructions from within the documents it reads.

The service I use, FreeFind, builds a site map, and how that site map is 
displayed varies depending on the path the spider takes through the site.
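
As a rough illustration of why the path matters, the sketch below assumes the site map is simply the tree of "first reached from" relationships produced by a breadth-first crawl. That is an assumption for the sake of the example, not a description of how FreeFind actually works, and the link graph and page names are hypothetical.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "home": ["sub1", "sub2"],
    "sub1": ["sub1/a"],
    "sub2": ["sub2/a"],
    "sub1/a": [],
    "sub2/a": [],
}

def crawl_tree(start, links):
    """Breadth-first crawl from `start`.

    Returns a dict mapping each reached page to the page it was first
    reached from (None for the start page).
    """
    parent = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in parent:
                parent[target] = page
                queue.append(target)
    return parent

# Starting at the home page nests every subdomain under "home";
# starting at a subdomain index yields a separate tree rooted there.
print(crawl_tree("home", LINKS))
print(crawl_tree("sub1", LINKS))
```

Under this assumption, the same pages end up arranged differently depending on where the crawl starts and which links it follows first.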

>>> nofollow was discussed quite extensively on this list when Google
>>> introduced it and the vast majority of this community rejected it.
>>
>> I tried to search the archive, but didn't see it there. Why was no 
>> follow rejected?
> 
> Then you didn't look very hard.  A search for "nofollow" in the archives 
> reveals most of the thread, appearing just below the messages from this 
> thread.  For your convenience, it actually started with a message on 
> www-html-editor [2|3], with most of the followup discussion on www-html 
> [4].
> 
> [1] http://www.is-thought.co.uk/book/sgml-8.htm#PI
> [2] http://lists.w3.org/Archives/Public/www-html-editor/2005JanMar/0010
> [3] http://lists.w3.org/Archives/Public/www-html-editor/2005JanMar/thread#10
> [4] http://lists.w3.org/Archives/Public/www-html/2005Jan/thread#64

Perhaps. I searched for no follow, with a space and without quotes, and I 
got subjects like "XML tags are just a cheap rip-off of PHP tags" and 
"DC in XHTML2", and other things that were not what I wanted. I will go 
back and search for "nofollow"; it didn't occur to me to leave out the space.

Thanks!


-- 
Peter Kupfer
peschtra@yahoo.com

Received on Monday, 27 June 2005 04:14:52 UTC