- From: David Woolley <david@djwhome.demon.co.uk>
- Date: Sun, 26 Jun 2005 08:53:08 +0100 (BST)
- To: www-html@w3.org
Peter Kupfer <peter.kupfer@sbcglobal.net> wrote:
> Lachlan Hunt wrote:
> > Jasper Bryant-Greene wrote:
> >
> > The correct way to control the way a spider indexes your site is to
> > use robots.txt, assuming the spider in question implements it.

Newer spiders also accept meta elements which can specify whether to
follow links on the current page and whether to index the content on
the current page.  (Meta element names are not, in general, W3C
standards, and this one is not.  An example is given below.)

> In a robots.txt file can you control specifically what links a spider
> will follow on a certain page, or just that it won't go to a certain
> page. I want the spider to eventually hit each subdomain, just not from
> the home page, I have it start at each subdomain index?

robots.txt (which is not a W3C standard) specifies which pages may or
may not be read by the spider; it controls both indexing and following
(see the sketch below).  Restricting the routes by which pages can be
found is counterproductive for search engines, so I cannot see the
search engine industry being enthusiastic about implementing such a
feature (in any case, even if some spiders were updated, many would
not be).

> Or, can each subdomain have its own robots.txt.

Each subdomain must have its own robots.txt.  However, I think you are
misusing "subdomain".  A subdomain refers to the DNS address of the
site, e.g. download.microsoft.com is a different subdomain from
msdn.microsoft.com.  I think you are talking about different
"sub-directories" (not the proper URL term) in the same domain.

> I tried to search the archive, but didn't see it there, why was no
> follow rejected?

(Any formal rejection is done in private, so this is speculation based
on the public discussion list.)

As a rel attribute name, it was outside the scope of the HTML
specification.  In general, the HTML specification is not being
updated.

It was a badly thought out quick fix.  It was badly named, in that
even behaviourally it doesn't mean "don't follow".  It was implemented
as though it were behavioural, which puts it outside the scope of the
HTML specification.  Clearer thinking would have resulted in markup
that indicated the trust state of not just the link, but the whole
third-party content section.  What nofollow was really saying is that
the page owner did not trust that content not to abuse the page's
function.

Your spider's proprietary nofollow element would, I think, get
rejected as behavioural given its naming, but it has never been
suggested on this list, so unless the spider developer is a member of
W3C and has proposed it privately, it has probably never been
proposed.

If you think there is a good use case for something like this, you
need to work out what it is about the nature of the document section
that might make a spider not want to follow links, and encode that.
My feeling is that any such feature would be encoded as an attribute,
not an element.  Also, there is a new attribute in XHTML 2 that
encodes the function of an element, and that attribute may well be the
one to use.  It may not even need a new value, as I suspect that there
are multiple underlying reasons for not following links, and the
search engine should make the choice of which ones to honour, based on
correct markup using other values.

The other approach would be to use a processing instruction, which
could, I believe, be behavioural in nature.
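To make the robots.txt point concrete, here is a minimal sketch (the
paths are made up for illustration):

    # applies to every spider that honours robots.txt
    User-agent: *
    # each Disallow names a path prefix the spider may not fetch;
    # there is no way to say "fetch this page but ignore its links"
    Disallow: /subsite1/
    Disallow: /subsite2/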
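The meta element mentioned earlier is, by convention rather than by
any W3C standard, written like this:

    <!-- ask co-operating spiders neither to index this page's
         content nor to follow its links -->
    <meta name="robots" content="noindex,nofollow">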
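And, for reference, what the search engines actually deployed is a rel
value applied per link; a sketch (example.com is just a placeholder):

    <!-- the page owner does not vouch for this link; co-operating
         spiders will not credit it, though, despite the name, they
         may still fetch the target -->
    <a href="http://example.com/guestbook" rel="nofollow">some link</a>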
> Thanks, again please cc to peschtra@yahoo.com as I do not know how to
> subscribe to the list.

Subscribing to the list really is easy (it is a subject-controlled
list, so simply using the standard (for non-commercial lists) request
address with a subject of "subscribe" will do it).  Not subscribing is
almost guaranteed to result in people forgetting to CC you (although
these lists have a lot of people who break netiquette and always CC).
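For what it is worth, the request-address convention means a message
like the following should work (do check the list's own documentation
if it does not; no body text should be needed):

    To: www-html-request@w3.org
    Subject: subscribe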
Received on Sunday, 26 June 2005 08:09:08 UTC