Re: New document on "Simple hit-metering for HTTP" from Jeffrey Mogul on 1996-08-14 (ietf-http-wg@w3.org from July to September 1996)

From: Jeffrey Mogul <mogul@pa.dec.com>
Date: Wed, 14 Aug 96 16:50:25 MDT
To: Koen Holtman <koen@win.tue.nl>
Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Message-Id: <9608142350.AA17283@acetes.pa.dec.com>

    However, you have handed out 80*10=800 uses, which gives you 800 hits
    as the upper bound.  So all you know is:
    
      80 <= actual hits <= 800
    
    This is not what I call useful information.  Something like an
    interesting upper bound would be
    
      80 <= actual hits <= 100
    
    but I see no way in which max-uses can provide such a bound.
    
    I suspect that max-uses counts higher than 3 will be disastrously
    ineffective at yielding a useful upper bound if uncooperative caches
    are common.
    
    A proxy not being cooperative and only supporting max-uses seems about
    as bad as a proxy not supporting hit counts at all.
    
If I understand your argument, it is that in order to bound the
size of the error in the hit count to lie within a reasonable
range, the max-uses setting would have to be so small that it
would effectively disable caching.
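
As a rough illustration of the bound in the quoted argument, here is a minimal sketch. It assumes that each max-uses grant handed out by the origin server accounts for at least one actual hit and at most max-uses hits; the function name `hit_count_bounds` is hypothetical and used only for this example.

```python
def hit_count_bounds(grants, max_uses):
    """Return (lower, upper) bounds on actual hits.

    Assumption: every grant is used at least once and at most
    max_uses times by an uncooperative cache that reports nothing back.
    """
    lower = grants
    upper = grants * max_uses
    return lower, upper


low, high = hit_count_bounds(80, 10)
print(f"{low} <= actual hits <= {high}")
```

Under this assumption the upper bound is always max-uses times the lower bound, which is why a tight bound calls for a small max-uses value.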

    I'd like to see *actual statistics* disprove my argument

So I got a day's worth of log entries from our proxy.  Here are
some statistics:

	589705	total log entries
	529756	after removing non-HTTP URLs and URLs with "?", "cgi", or "htbin"
	245481	unique "cachable" URLs
	189723  "cachable" URLs referenced only once during the trace
	 55758	"cachable" URLs referenced more than once

That's an effective cache hit rate of about 23%, not counting
things that can't be cached, and ignoring any misses that were
caused by modifications to the resources.
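
The figures above can be cross-checked with a short sketch. The variable names are made up for this example, and it assumes that the 23% figure is the share of unique "cachable" URLs that were referenced more than once.

```python
# Counts copied from the trace summary above.
cachable_refs = 529756   # log entries left after filtering
unique_urls = 245481     # unique "cachable" URLs
once_urls = 189723       # URLs referenced only once

# URLs referenced more than once.
repeat_urls = unique_urls - once_urls

# Share of unique URLs that were referenced more than once.
repeat_url_share = repeat_urls / unique_urls

# References that went to URLs referenced more than once
# (each once-only URL accounts for exactly one reference).
repeat_refs = cachable_refs - once_urls

print(repeat_urls, f"{repeat_url_share:.1%}", repeat_refs)
```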

Suppose that, for each of the "cachable" URLs referenced more than
once, the origin server sent max-uses=3.

Of the
	 55758 "cachable" URLs referenced more than once
	 28951 (52%) were referenced exactly twice
	  9592 (17%) were referenced exactly 3 times

Or in other words, of the
	340033 references to "cachable" URLs referenced more than once
	28951*2 + 9592*3 = 86678 of these references were to URLs
		referenced 2 or 3 times 
    so
	340033 - 86678 = 253355 of these references were to URLs
		referenced more than 3 times

Now, assume that the servers had all sent max-uses=3 for these
URLs.  Then the first use of each of these URLs (55758 uses)
plus every 4th use of each of the URLs referenced more than
3 times (roughly 253355/4 = 63339 uses) would have to be forwarded
to the origin server.  This means that 340033 - (63339 + 55758) =
220936 uses would not have to be forwarded to the origin server,
which comes out to about 37% of all the references logged.
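
The estimate above can be reproduced with the following sketch. It follows the same simplification as the text (the first use of every repeated URL is forwarded, plus roughly one in four of the references to URLs referenced more than 3 times); the variable names are chosen only for this example.

```python
# Counts copied from the trace summary and breakdown above.
total_entries = 589705   # all log entries
repeat_urls = 55758      # "cachable" URLs referenced more than once
repeat_refs = 340033     # references to those URLs
twice_urls = 28951       # URLs referenced exactly twice
thrice_urls = 9592       # URLs referenced exactly 3 times
MAX_USES = 3

# References to URLs referenced 2 or 3 times, and to the rest.
light_refs = twice_urls * 2 + thrice_urls * 3
heavy_refs = repeat_refs - light_refs

# Forwarded to the origin server: the first use of each repeated URL,
# plus roughly every 4th use of the heavily referenced URLs.
renewals = round(heavy_refs / (MAX_USES + 1))
forwarded = repeat_urls + renewals

# References served from the cache without contacting the origin.
saved = repeat_refs - forwarded
saved_share = saved / total_entries

print(light_refs, heavy_refs, renewals, forwarded, saved, f"{saved_share:.0%}")
```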

Now, it's quite true that not every server insists on demographics
information, and so the actual number of references saved would
presumably be lower.  But this should give some idea of the
magnitude of the possible savings, and I don't think it's insignificant.

-Jeff

Received on Wednesday, 14 August 1996 17:01:02 UTC