Re: Recognizing search engine spiders

At 08.56 -0800 01-03-03, Mark Nottingham wrote:
>There are ways to assure cache freshness. See
>If that isn't good enough, vary the reference's query string, most
>search engines will understand.

After some experimentation, we had the best success
with appending ";12345678", where 12345678 is a
number which changes every time the document is
modified, to the end of those URLs which refer to
frequently updated pages.

You can look at, for example,
to see what our pages look like. If you click on a
link in those pages, ";12345678" with a new
number is appended to most of the page URLs
whenever a page has changed, so that users do not
get old, cached versions of the document they
ask for.
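
To make the scheme concrete, here is a minimal sketch
(not our actual code) of how such a link could be
generated. It assumes the changing number is taken from
the file's modification time, which is only one possible
choice; the document root and file names are made up:

import os

DOC_ROOT = "/var/www/htdocs"   # hypothetical document root

def versioned_url(path):
    """Return the URL for `path` with ";<number>" appended.

    The suffix only changes when the file is modified, so a
    cache never returns an older version for the new URL,
    while an unchanged page keeps the same URL and stays
    cacheable.
    """
    mtime = int(os.path.getmtime(os.path.join(DOC_ROOT, path.lstrip("/"))))
    return "%s;%d" % (path, mtime)

# Example of a link emitted into a frequently updated page:
# print('<a href="%s">latest news</a>' % versioned_url("/news/index.html"))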

But the document you referenced might teach us a method
of avoiding serving stale pages to users without adding
;12345678 at the end of our URLs. It seems like a very
comprehensive tutorial on how to get your documents
handled correctly by caches.
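
If I understand the idea, caches can instead be told to
revalidate through ordinary HTTP response headers. The
sketch below shows standard Cache-Control and Last-Modified
headers that might replace the URL suffix; the exact
directives we would use would of course have to follow the
tutorial's advice:

import os
from wsgiref.handlers import format_date_time

def caching_headers(path):
    """Headers that let caches store the page but make them
    check with the origin server before reusing it."""
    mtime = os.path.getmtime(path)
    return [
        ("Last-Modified", format_date_time(mtime)),
        # "no-cache" means "revalidate before use",
        # not "do not store".
        ("Cache-Control", "no-cache"),
        # For pages that change on a known schedule,
        # something like the following might do instead:
        # ("Cache-Control", "max-age=300, must-revalidate"),
    ]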

>Also, you might try using robots.txt to shape which documents will be

At present we are also using robots.txt to keep
search engines out. Some of our content might,
however, be suitable for search engines, and then
we must ensure that we do not send the ";12345678"
URLs to them, because this would cause them to
store multiple references to almost the same
document (something which is already a big problem
for search engine implementors), all of which would
in reality refer users to the same document.
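
In practice this would mean something like the following
sketch: check the User-Agent and only append the number
for ordinary browsers. The spider tokens listed are merely
illustrative, not a complete or authoritative list:

# Illustrative substrings of crawler User-Agent strings;
# a real deployment would need a maintained list.
KNOWN_SPIDERS = ("googlebot", "slurp", "scooter", "crawler", "spider")

def is_spider(user_agent):
    ua = (user_agent or "").lower()
    return any(token in ua for token in KNOWN_SPIDERS)

def link_for(path, version, user_agent):
    """Spiders get the stable URL; browsers get the
    cache-busting one."""
    if is_spider(user_agent):
        return path
    return "%s;%d" % (path, version)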

I can guess that a standard method of recognizing
when an HTTP client is a search engine might be
controversial, since spammers would of course be
very happy to be able to deliver different content
to search engines than to ordinary users in order
to pull users to their sites. But they have probably
already learned to do that using various heuristics,
as you suggest.

Jacob Palme <> (Stockholm University and KTH)
for more info see URL:
