Caching of GET-based queries

    However, it occurs to me that log data from altavista might provide a
    bit of insight as to whether there is any point in caching such results.
    Perhaps in the real world, there won't be enough commonality in what
    users ask for to bother, even for a heavily used service.
    
Thanks for prodding me to take a look at the logs.  I extracted
the query URLs from the logs, ran these through "sort|uniq -c|sort -n|tail",
and looked at the results.
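
For anyone who would rather not use the shell, roughly the same
counting can be sketched in Python.  This is only an illustration:
the function name and the sample list are made up, and only the
"target=14" URL comes from the logs discussed here.

```python
from collections import Counter

def top_queries(query_urls, n=10):
    """Return the n most frequent query URLs as (count, url) pairs.

    The result is ordered least frequent first, so the most popular
    query comes last, as it does with "sort -n | tail".
    """
    counts = Counter(query_urls)
    # Sort by count, ascending, then keep only the last n entries.
    ranked = sorted(counts.items(), key=lambda item: item[1])
    return [(count, url) for url, count in ranked[-n:]]

# Hypothetical sample input; the second URL is invented.
sample = [
    "/cgi-bin/query?pg=s&target=14",
    "/cgi-bin/query?example=made-up",
    "/cgi-bin/query?pg=s&target=14",
]
for count, url in top_queries(sample):
    print(count, url)
```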

First of all, the most "popular" queries seem to be various
malformed ones, and in general it might be a bad idea to cache
the responses to these.  (We probably need to think about the
issue of "caching of negative results".)

After that, the most popular queries are the few dozen or so
that are generated by following the Altavista "Surprise"
button and then clicking on one of the categories listed there.
As you can see from viewing the source of that page, these
links are of the form "/cgi-bin/query?pg=s&target=14".  Since
they are supposed to return a redirection to a *different* random
link each time, the responses are inherently not cachable!

On Shel's urging, I added this paragraph to my draft proposal:
   Apparently, some applications have used GETs and HEADs with query
   URIs (those containing a ``?'' in the rel_path part) to perform
   operations with significant side effects.  Therefore, caches MUST NOT
   treat responses to such URIs as fresh unless the server provides an
   explicit Fresh-until time.  In other words, if a cache must assign a
   heuristic Fresh-until time to such a response, the value MUST be
   zero.
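
A minimal sketch of how a cache might apply that rule, in Python.
The function and parameter names are hypothetical, lifetimes are
expressed in seconds rather than as an absolute Fresh-until time,
and the "?" test is a simplification of "in the rel_path part":

```python
def freshness_lifetime(uri, explicit_seconds, heuristic_seconds):
    """Return how long (in seconds) a response may be treated as fresh.

    explicit_seconds is the server-provided lifetime, or None if the
    server gave no explicit Fresh-until time.  heuristic_seconds is
    whatever the cache would otherwise guess on its own.
    """
    # An explicit server-provided time always wins.
    if explicit_seconds is not None:
        return explicit_seconds
    # Ignore any fragment before looking for the "?" of a query URI.
    without_fragment = uri.split("#", 1)[0]
    if "?" in without_fragment:
        # Query URI with no explicit time: the heuristic MUST be zero.
        return 0
    return heuristic_seconds
```

With these assumptions, freshness_lifetime("/cgi-bin/query?pg=s&target=14",
None, 3600) yields zero, while the same URI with an explicit 60-second
lifetime from the server yields 60.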

Anyway, once you subtract the random-result queries, it looks like
about 75%-80% of the queries (on a daily basis) are unique.  So
even an infinitely large cache could only get a 20%-25% hit rate.
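
The arithmetic behind that bound can be sketched as follows, assuming
"unique" means the number of distinct query strings as a fraction of
all requests.  An infinitely large cache misses once per distinct
query and hits on every repeat.  The function name is hypothetical.

```python
def max_hit_rate(queries):
    """Upper bound on hit rate for a cache that never evicts anything."""
    total = len(queries)
    if total == 0:
        return 0.0
    distinct = len(set(queries))
    # The first request for each distinct query is necessarily a miss;
    # every other request could be served from the cache.
    return (total - distinct) / total

# Synthetic example: 100 requests, 80 of them distinct.
requests = [str(i) for i in range(80)] + ["0"] * 20
print(max_hit_rate(requests))
```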

And over a shorter time interval (a probably non-representative
15 minutes), the non-unique-query rate was even lower, around 10%.

Anyway, I think what this means is that anyone trying to cache
the results of queries to Web-search engines is likely to be
disappointed.

-Jeff

P.S.: Except, alas, if you are caching the results of queries
for certain prurient topics.

Received on Saturday, 6 January 1996 00:03:06 UTC