Re: Recognizing search engine spiders from Mark Nottingham on 2001-03-03 (ietf-discuss@w3.org from March 2001)

From: Mark Nottingham <mnot@akamai.com>
Date: Sat, 3 Mar 2001 08:56:24 -0800
To: Jacob Palme <jpalme@dsv.su.se>
Cc: discuss@apps.ietf.org
Message-ID: <20010303085622.B24919@akamai.com>

On Sat, Mar 03, 2001 at 10:31:02AM +0100, Jacob Palme wrote:
> Is there any standard which search engines use when sending
> HTTP requests during spidering, in order to tell the
> recipient HTTP server that they are search engines?

No. Some servers try to recognize spiders heuristically, with varying
success (the big ones are pretty easy to spot).
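
A minimal sketch of such a heuristic, assuming the server only looks at
the User-Agent request header. The token list and the function name
looks_like_spider are illustrative assumptions, not a standard, and a
crawler that sends a browser-like User-Agent will slip through:

```python
# Hypothetical substrings to look for; crawler names change over time,
# so a real list would need regular maintenance.
CRAWLER_TOKENS = ("googlebot", "slurp", "crawler", "spider", "bot")


def looks_like_spider(user_agent):
    """Guess whether a User-Agent header value belongs to a crawler."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)
```

This is only a guess by substring match, which is why success varies:
it misses unknown crawlers and can misfire on ordinary clients whose
User-Agent happens to contain one of the tokens.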

> I can see multiple uses of this. In our particular case,
> we sometimes intentionally create slightly varying URLs
> of the same document, in order to keep an old version
> in the cache from being used. (Yes, I know there are cache
> control standards, but they do not seem to work in all
> cases.) This might mean that a search engine would
> store multiple copies of nearly the same document,
> and would not recognize that a new version replaces an
> old version of the same document.

There are ways to ensure cache freshness. See
  http://www.mnot.net/cache_docs/
If that isn't good enough, vary the reference's query string; most
search engines will understand.
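
A rough sketch of both options, assuming a Python-based site. The
header values are one common HTTP/1.1 choice, and the names
FRESHNESS_HEADERS, versioned_url and the "v" parameter are made up for
illustration:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Option 1: ask caches to revalidate with the origin before reusing a copy.
FRESHNESS_HEADERS = {"Cache-Control": "max-age=0, must-revalidate"}


def versioned_url(url, version, param="v"):
    """Option 2: return url with its query string carrying a version marker.

    Any earlier value of the (hypothetical) version parameter is replaced,
    so the path stays the same and only the query string varies.
    """
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    query.append((param, str(version)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, versioned_url("http://example.com/doc.html", 3) gives
"http://example.com/doc.html?v=3".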

Also, you might try using robots.txt to shape which documents will be
fetched.
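
As an illustration, here is how a well-behaved robot would read a small
robots.txt, using Python's standard urllib.robotparser. The /drafts/
path and the "ExampleBot" agent name are assumptions for the example,
not anything from the site in question:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep all robots out of one directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
"""


def can_fetch(path, agent="ExampleBot"):
    """Return True if the sample robots.txt lets `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)
```

With this file, can_fetch("/drafts/old.html") is False and
can_fetch("/index.html") is True. Note that robots.txt is advisory:
it only shapes what cooperating robots fetch.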

-- 
Mark Nottingham, Research Scientist
Akamai Technologies (San Mateo, CA USA)

Received on Saturday, 3 March 2001 11:57:06 UTC