- From: Mark Nottingham <mnot@akamai.com>
- Date: Sat, 3 Mar 2001 08:56:24 -0800
- To: Jacob Palme <jpalme@dsv.su.se>
- Cc: discuss@apps.ietf.org
On Sat, Mar 03, 2001 at 10:31:02AM +0100, Jacob Palme wrote:
> Is there any standard which search engines use when sending
> HTTP requests during spidering, in order to tell the
> recipient HTTP server that they are search engines?

No. Some try to use a heuristic, with varying success (the big ones are
pretty easy to spot).

> I can see multiple uses of this. In our particular case,
> we sometimes intentionally create slightly varying URLs
> for the same document, in order to stop an old version
> in the cache from being used. (Yes, I know there are cache
> control standards, but they do not seem to work in all
> cases.) This might mean that a search engine would
> store multiple copies of nearly the same document,
> and would not recognize that a new version replaces an
> old version of the same document.

There are ways to assure cache freshness; see
http://www.mnot.net/cache_docs/

If that isn't good enough, vary the reference's query string; most
search engines will understand.

Also, you might try using robots.txt to shape which documents will be
fetched.

--
Mark Nottingham, Research Scientist
Akamai Technologies (San Mateo, CA USA)
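To make the caching suggestion concrete: a response that downstream caches
can keep fresh, and revalidate when it changes, might carry headers along
these lines (the values here are invented for illustration, not taken from
the tutorial above):

    HTTP/1.1 200 OK
    Date: Sat, 03 Mar 2001 17:00:00 GMT
    Last-Modified: Thu, 01 Mar 2001 09:00:00 GMT
    ETag: "version-2"
    Cache-Control: max-age=3600, must-revalidate
    Content-Type: text/html

With a validator (Last-Modified or ETag) present, a cache can ask the
origin whether its stored copy is still current instead of serving a stale
one, which is the usual alternative to minting a new URL for every change.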
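Likewise, a minimal robots.txt that keeps well-behaved spiders out of the
area where the intentionally varying URLs live might look like this (the
path is hypothetical):

    User-agent: *
    Disallow: /versioned/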
Received on Saturday, 3 March 2001 11:57:06 UTC