Re: New document on "Simple hit-metering for HTTP" from Koen Holtman on 1996-08-17 (ietf-http-wg@w3.org from July to September 1996)

From: Koen Holtman <koen@win.tue.nl>
Date: Sun, 18 Aug 1996 00:01:32 +0200 (MET DST)
To: Jeffrey Mogul <mogul@pa.dec.com>
Cc: koen@win.tue.nl, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Message-Id: <199608172201.AAA06357@wsooti04.win.tue.nl>

Jeffrey Mogul:
>
>If I understand your argument, it is that in order to bound the
>size of the error in the hit count to lie within a reasonable
>range, the max-uses setting would have to be so small that it
>would effectively disable caching.

Yes.  For uncooperative caches.

     [Koen Holtman:]
>    I'd like to see *actual statistics* disprove my argument

>So I got a day's worth of log entries from our proxy.  Here are
>some statistics:
>
>        589705  total log entries
>        529756  after removing non-HTTP URLs with "?", "cgi", or "htbin"
>        245481  unique "cachable" URLs
>        189723  "cachable" URLs referenced only once during the trace
>         55758  "cachable" URLs referenced more than once

It's very tricky to extrapolate from a day's worth of log entries: to
do these statistics right, you would have to count over the lifetime
of a cache entry, which is presumably a lot longer than 1 day for your
cache.  I find it difficult to guess in which direction your end
results would change if you calculated over cache entry lifetimes.

>That's an effective cache hit rate of about 23%, not counting
>things that can't be cached, and ignoring any misses that were
>caused by modifications to the resources.

Eek! I would calculate a ( 529756 - 245481 ) / 529756 * 100% = 54% hit
rate for your figures, also ignoring misses due to modification
(including the semi-modification known as cache busting!).  What is
your definition of hit rate?
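
As a quick check, the 54% figure can be reproduced with the sketch
below.  The variable names are illustrative, and the definition of a
hit (every reference to a URL after the first one) is the assumption
behind the calculation above, not something taken from the proxy
itself.

```python
# Figures quoted from the one-day proxy log summary.
cachable_references = 529756   # log entries left after filtering
unique_cachable_urls = 245481  # each unique URL costs at least one miss

# Assumption: every reference after the first to a given URL is a hit,
# ignoring misses due to modification or cache busting.
hits = cachable_references - unique_cachable_urls
hit_rate = hits / cachable_references

print(f"{hits} hits out of {cachable_references} references")
print(f"hit rate: {hit_rate:.0%}")
```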

>Supposing that, for each of the "cachable" URLs referenced more than
>once, the origin server sent max-uses=3.
[...]
>220936 uses would not have to be forwarded to the origin server,
>which comes out to about 37% of all the references logged.

So if cache busting is replaced by max-uses=3, you expect a 37% cache
hit rate (i.e. RTT savings in 37% of all cases) in an uncooperative
cache, where it earlier had a 0% hit rate for the offending server.

Several factors pollute this figure: the 1 day sample, not factoring
out dynamic and authenticated content, which is uncachable, and not
counting the 8th, 12th, ... hits.  But let's forget about those.

>Now, it's quite true that not every server insists on demographics
>information, and so the actual number of references saved would
>presumably be lower.  But this should give some idea of the
>magnitude of the possible savings, and I don't think it's insignificant.

Your statistics don't answer the main question I have: does max-uses=3
(or max-uses=2 for that matter) give a good enough upper bound to make
sites switch from cache busting to max-uses=3?

Using figures from your post:

         245481 unique "cachable" URLs
         228240 of these were referenced 1, 2, or 3 times
          17215 were referenced more than 3 times

         529756 references on "cachable" URLs
         276401 references to URLs referenced 1, 2, or 3 times.
         253355 of these references were to URLs
                referenced more than 3 times

We can calculate how good the upper bound is.  If we assume
optimistically that all references to `more than 3 times' URLs are
reported under max-uses=3, we have

   228240 + 253355 = 481595 known uses.

For the 1,2,3 URLs, the server handed out 3 * 228240 = 684720 uses
which never led to any reports.  481595 + 684720 = 1166315.  This
means that the origin server knows

  481595  <= actual uses <= 1166315 .

But this upper bound is a factor of 2.4 higher than the lower bound,
which makes it hardly useful.  So max-uses=3 *still* gives you a
useless upper bound, and you can't expect that people will switch from
using cache busting to using max-uses=3.

(Note that the real actual uses, 529756 uses, are only a factor of 1.1
higher than the 481595 reported, but this good figure is due in large
part to the optimistic assumption that all uses of `more than 3
times' URLs are reported.)
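
The bounds can be recomputed with the sketch below.  The variable
names are illustrative; the model (one known use per 1,2,3 URL, full
reporting for all other URLs, and max-uses unreported uses handed out
per 1,2,3 URL) is the optimistic assumption made above, not measured
behaviour.

```python
# Figures derived from the quoted proxy log summary.
max_uses = 3
urls_referenced_1_to_3 = 228240   # unique URLs referenced 1, 2, or 3 times
refs_to_busier_urls = 253355      # references to URLs referenced > 3 times
actual_uses = 529756              # all references to "cachable" URLs

# Lower bound: one request seen per 1,2,3 URL, plus (optimistically)
# every reference to the busier URLs being reported.
known_uses = urls_referenced_1_to_3 + refs_to_busier_urls

# Uses handed out for the 1,2,3 URLs that never produced a report.
unreported_uses = max_uses * urls_referenced_1_to_3

# Upper bound on what the origin server can conclude.
upper_bound = known_uses + unreported_uses

print(f"{known_uses} <= actual uses <= {upper_bound}")
print(f"upper/lower:  {upper_bound / known_uses:.1f}")
print(f"actual/lower: {actual_uses / known_uses:.1f}")
```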

Now, to do all of the above statistics _right_, you would have to have
figures on how many times the contents of a cache slot are served
during the lifetime of the cache slot.  Unfortunately, I don't know of
any data set with these figures.  But I feel safe in saying that we
can forget about the uncooperative cache option.  It won't work, and
should be removed from the draft to make it shorter.

>-Jeff

Koen.
Received on Saturday, 17 August 1996 15:04:48 UTC