
Re: Recognizing search engine spiders

From: Mark Nottingham <mnot@akamai.com>
Date: Sat, 3 Mar 2001 08:56:24 -0800
To: Jacob Palme <jpalme@dsv.su.se>
Cc: discuss@apps.ietf.org
Message-ID: <20010303085622.B24919@akamai.com>

On Sat, Mar 03, 2001 at 10:31:02AM +0100, Jacob Palme wrote:
> Is there any standard which search engines use when sending
> HTTP requests during spidering, in order to tell the
> recipient HTTP server that they are search engines?

No. Some try to use a heuristic, with varying success (the big ones
are pretty easy to get).
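
For illustration, such a heuristic might look something like the rough
Python sketch below (the substrings are just examples of well-known
crawlers, not any kind of standard or exhaustive list):

    # Rough User-Agent heuristic for spotting search-engine spiders.
    # The substrings are illustrative examples only.
    KNOWN_SPIDER_SUBSTRINGS = ("googlebot", "slurp", "scooter", "ia_archiver")

    def looks_like_spider(user_agent):
        """Guess whether a request's User-Agent belongs to a spider."""
        ua = (user_agent or "").lower()
        return any(s in ua for s in KNOWN_SPIDER_SUBSTRINGS)

    # looks_like_spider("Googlebot/2.1 (+http://www.googlebot.com/bot.html)")
    #   -> True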

> I can see multiple uses of this. In our particular case,
> we sometimes intentionally create slightly varying URLs
> of the same document, in order to stop an old version
> in the cache from being used. (Yes, I know there are cache
> control standards, but they do not seem to work in all
> cases.) This might mean that a search engine would
> store multiple copies of nearly the same document,
> and would not recognize that a new version replaces an
> old version of the same document.

There are ways to ensure cache freshness. See
If that isn't good enough, vary the reference's query string; most
search engines will understand.
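
As a sketch of the query-string trick (the parameter name "v" and the
version token are arbitrary choices here, not anything spiders or
caches standardize on):

    # Cache-busting by varying a reference's query string.
    def versioned_url(base_url, version):
        sep = "&" if "?" in base_url else "?"
        return "%s%sv=%s" % (base_url, sep, version)

    # versioned_url("http://example.org/doc.html", "2001-03-03")
    #   -> "http://example.org/doc.html?v=2001-03-03"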

Also, you might try using robots.txt to shape which documents will be
indexed.
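
A hedged sketch of that approach, checking a sample robots.txt with
Python's urllib.robotparser (the paths and the "ExampleBot" name are
hypothetical, picked only for illustration):

    # Keep spiders away from the cache-busting duplicates via robots.txt.
    from urllib.robotparser import RobotFileParser

    sample_robots_txt = [
        "User-agent: *",
        "Disallow: /temp/",
        "Disallow: /old-versions/",
    ]

    rp = RobotFileParser()
    rp.parse(sample_robots_txt)
    print(rp.can_fetch("ExampleBot/1.0",
                       "http://example.org/temp/page-v2.html"))  # False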

Mark Nottingham, Research Scientist
Akamai Technologies (San Mateo, CA USA)
