Re: Question about web spiders...

Peter Kupfer <peter.kupfer@sbcglobal.net> wrote:
> Lachlan Hunt wrote:
> > Jasper Bryant-Greene wrote:
 
> > The correct way to control the way a spider indexes your site is to use 
> > robots.txt, assuming the spider in question implements it.

Newer spiders also accept meta elements which can specify whether to
follow links on the current page and whether to index the content on
the current page.  (Meta element names are not, in general, W3C
standards, and this one is not.)
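
For example, assuming the spider honours the de facto robots meta
convention (which, again, is not a W3C standard), a page might carry:

  <meta name="robots" content="noindex,nofollow">

where "noindex" asks the spider not to index the page's content and
"nofollow" asks it not to follow the page's links.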

> In a robots.txt file can you control specifically what links a spider 
> will follow on a certain page, or just that it won't go to a certain 
> page. I want the spider to eventually hit each subdomain, just not from 
> the home page, I have it start at each subdomain index?

robots.txt (which is not a W3C standard) specifies which pages may or
may not be read by the spider.  Since a page the spider never reads
can be neither indexed nor mined for links, it controls both indexing
and following.
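
A minimal sketch (the paths here are hypothetical):

  User-agent: *
  Disallow: /private/

Each Disallow value is a URL path prefix that compliant spiders should
not fetch at all.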

Restricting the routes by which pages can be found is counterproductive
for search engines, so I cannot see the search engine industry being
enthusiastic about implementing such a feature (and in any case, even
if some spiders added it, many would not).

> Or, can each subdomain have its own robots.txt.

Each subdomain must have its own robots.txt.  However, I think you
are misusing "subdomain".  Subdomain refers to the DNS address of the
site, e.g. download.microsoft.com is a different subdomain from
msdn.microsoft.com.  I think you are talking about different
"sub-directories" (the proper URL term is "path segments") in the
same domain.

> I tried to search the archive, but didn't see it there, why was no 
> follow rejected?

(Any formal rejection is done in private, so this is speculation based
on the public discussion list.)

As a rel attribute name, it was outside the scope of the HTML
specification.

In general, the HTML specification is not being updated.

It was a badly thought out quick fix.

It was badly named, in that even behaviourally it doesn't mean "don't
follow".

It was implemented as though it were behavioural, which puts it outside
the scope of the HTML specification.  Clearer thinking would have
resulted in markup that indicated the trust state of not just the link,
but the whole third-party content section.  What nofollow was really
saying is that the page owner did not trust that content not to abuse
the page's function.
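
For reference, the de facto usage (introduced by the search engines,
not by W3C; the URL is illustrative) looks like:

  <a href="http://example.com/page" rel="nofollow">link text</a>

i.e. a rel value on each individual link, rather than a marker on the
untrusted section as a whole.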

Your spider's proprietary nofollow element would, I think, get rejected
as behavioural given its naming, but it's never been suggested on this
list, so, unless the spider developer is a member of W3C and has
proposed it privately, it has probably never been proposed.

If you think there is a good use case for something like this, you need
to work out what it is about the nature of the document section that
might make a spider not want to follow links, and encode that.  My
feeling is that any such feature would be encoded as an attribute,
not an element.  Also, there is a new attribute in XHTML 2 that encodes
the function of an element, and that attribute may well be the one
to use.  It may not even need a new value, as I suspect that there
are multiple underlying reasons for not following links and the search
engine should make the choice of which ones to honour, based on
correct markup using other values.
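
A sketch of what that might look like, assuming the attribute meant is
XHTML 2's role attribute (the value is invented for illustration, with
x: bound to some suitable vocabulary):

  <div role="x:untrusted-content">
    ... third party contributions, including their links ...
  </div>

A spider could then decide for itself that links inside sections
marked that way are not worth following.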

The other approach would be to use a processing instruction, which 
could, I believe, be behavioural in nature.
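
Purely hypothetically (no such instruction exists), that might read:

  <?spider-links follow="no"?>

Processing instructions are explicitly addressed to applications
rather than forming part of the document's semantics, which is why
behaviour is acceptable there.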

> Thanks, again please cc to peschtra@yahoo.com as I do not know how to 
> subscribe to the list.

Subscribing to the list really is easy (it is a subject-controlled
list, so simply using the standard (for non-commercial lists) request
address and a subject of "subscribe" will do it).  Not subscribing is
almost guaranteed to result in people forgetting to CC you (although
these lists have a lot of people who break netiquette and always CC).
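
Concretely, and assuming this is the www-html list (adjust the list
name otherwise), that means sending a message to:

  www-html-request@w3.org

with a Subject: of "subscribe".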

Received on Sunday, 26 June 2005 08:09:08 UTC