- From: James Pitkow <pitkow@cc.gatech.edu>
- Date: Thu, 6 Mar 1997 02:22:11 -0500 (EST)
- To: http-wg@cuckoo.hpl.hp.com
Hello,
"Jeffrey Mogul" wrote at Mar 4, 97 05:57:57 pm:
>
> I'd surely like to see a well-defined description, including some
> analysis, of these other possible techniques, and perhaps James
> Pitkow's paper (when it becomes available) will shed some light.
> But I'm not interested in continuing a debate of the form "I
> know a better way to do this, but I'm not going to provide a
> detailed argument about why it is better."
Sorry for the delay (travel). For some strange reason, a last-minute
paper I wrote on collecting reliable usage data was accepted for WWW6. The
main concerns it raises about the hit-metering proposal (which I think is
very well stated) include:
*) User path data is lost/not collectable.
*) It relies upon the cooperation of independent caches. Since this cannot
be controlled, the amount of gain from implementing the proposal cannot be
determined.
*) Collection periods cannot be reliably controlled. Since caches
are not forced to report by a certain time, an indeterminable amount of
data could be tallied with the next collection period. The usage-limiting
mechanisms can help alleviate this, though a) not completely and
b) at the cost of more traffic (defeating one of the proposal's goals).
*) Failure policies are not specified. While the authors readily admit
this, the amount of error injected into the system cannot be determined.
*) As a result of these limitations, comparisons between collection periods
can be misleading. Was a 5% decrease caused by the content on the site,
a faulty cache, a network failure, or a mis-tallied report?
I argue that there is no way to reliably know.
The paper then outlines various sampling methods for the Web. I argue that:
*) IP address based sampling is tricky, if not impossible, to use to generate
a random sample.
*) Randomly sampling users is better. Only perform cache-busting on randomly
chosen users. This form of sampling does not suffer from the above hit-metering
limitations.
*) The amount of confidence to place in the numbers can be determined.
*) Comparisons between collection periods are more robust.
*) Network failures are correctly handled.
*) Path data can be collected.
*) User privacy is arguably enhanced. This is definitely the case compared
with current full cache-busting; compared with hit-metering, more
information is gathered about fewer users.
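
To make the confidence point above concrete, here is a minimal sketch of
estimating total hits from a simple random sample of users, with an
approximate 95% confidence interval. It is an illustration only, not code
from the paper: the function names, the way users are identified, and the
normal approximation are all assumptions.

```python
import math
import random


def choose_sample(user_ids, n, seed=None):
    """Pick n users uniformly at random; only these would be cache-busted.

    How users are identified is outside this sketch (hypothetical IDs).
    """
    rng = random.Random(seed)
    # Sort first so the same seed always yields the same sample.
    return rng.sample(sorted(user_ids), n)


def estimate_total(sample_hits, population, z=1.96):
    """Estimate total hits for the whole population from sampled users.

    sample_hits: hits observed for each sampled (cache-busted) user.
    population:  total number of users the sample was drawn from.
    Returns (estimate, lower, upper), using a normal approximation and
    the finite population correction for sampling without replacement.
    """
    n = len(sample_hits)
    if n < 2 or n > population:
        raise ValueError("need 2 <= sample size <= population")
    mean = sum(sample_hits) / n
    var = sum((h - mean) ** 2 for h in sample_hits) / (n - 1)
    fpc = 1 - n / population
    std_err = population * math.sqrt(var / n * fpc)
    total = population * mean
    return total, total - z * std_err, total + z * std_err
```

For example, estimate_total([2, 4, 6], population=30) gives a point
estimate of 120 hits together with lower and upper bounds, so two
collection periods can be compared with a known margin of error.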
The paper is temporarily accessible from:
http://www.gvu.gatech.edu/t/PAPER126.html
I readily admit that it is not a stellar paper and was written for non-technical
people, so grains of salt are in order.
Jim.
Received on Wednesday, 5 March 1997 23:24:15 UTC