On Sat, Mar 03, 2001 at 10:31:02AM +0100, Jacob Palme wrote:
> Is there any standard which search engines use when sending
> HTTP requests during spidering, in order to tell the
> recipient HTTP server that they are search engines?

No. Some try to use a heuristic, with varying success (the big ones are
pretty easy to get).

> I can see multiple uses of this. In our particular case,
> we sometimes intentionally create slightly varying URLs
> of the same document, in order to stop an old version
> in the cache to be used. (Yes, I know there are cache
> control standards, but they do not seem to work in all
> cases.) This might mean that a search engine would
> store multiple copies of nearly the same document,
> and would not recognize that a new version replaces an
> old version of the same document.

There are ways to assure cache freshness. See
  http://www.mnot.net/cache_docs/
If that isn't good enough, vary the reference's query string; most search
engines will understand.

Also, you might try using robots.txt to shape which documents will be
fetched.

-- 
Mark Nottingham, Research Scientist
Akamai Technologies (San Mateo, CA USA)

Received on Saturday, 3 March 2001 11:57:06 GMT
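
The two suggestions in the reply (varying the reference's query string, and using robots.txt to shape what gets fetched) could be sketched roughly as below. This is only an illustration under stated assumptions: the parameter name `v`, the `/archive/old/` path, the `ExampleBot` user agent, and the example.org URLs are all hypothetical and do not come from the thread. The robots.txt check uses Python's standard `urllib.robotparser`, which approximates how a well-behaved spider might read the rules; real search engines may interpret them differently.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
from urllib.robotparser import RobotFileParser


def versioned_url(url, version, param="v"):
    """Return url with a version marker in its query string.

    The path stays the same, so only the query string varies
    between versions. The parameter name "v" is an assumption.
    """
    parts = urlsplit(url)
    # Drop any earlier version marker, keep every other parameter.
    query = [(k, val)
             for k, val in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    query.append((param, str(version)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical robots.txt asking all spiders to skip an old-versions area.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/old/
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    current = versioned_url("http://example.org/doc.html", 2)
    stale = "http://example.org/archive/old/doc.html"
    print(current, rules.can_fetch("ExampleBot", current))
    print(stale, rules.can_fetch("ExampleBot", stale))
```

Keeping the path fixed and changing only the query string leaves the document's identity recognizable, while the robots.txt rule keeps superseded copies out of a spider's way entirely.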