- From: James Pitkow <pitkow@cc.gatech.edu>
- Date: Thu, 6 Mar 1997 02:22:11 -0500 (EST)
- To: http-wg@cuckoo.hpl.hp.com
Hello,

"Jeffrey Mogul" wrote at Mar 4, 97 05:57:57 pm:
>
> I'd surely like to see a well-defined description, including some
> analysis, of these other possible techniques, and perhaps James
> Pitkow's paper (when it becomes available) will shed some light.
> But I'm not interested in continuing a debate of the form "I
> know a better way to do this, but I'm not going to provide a
> detailed argument about why it is better."

Sorry for the delay (travel).

For some strange reason, a last-minute paper I wrote on collecting
reliable usage data was accepted for WWW6. The main concerns it raises
about hit-metering (which I think is very well stated) include:

*) User path data is lost/not collectable.
*) It relies upon the cooperation of independent caches. Since this
   cannot be controlled, the amount of gain from implementing the
   proposal cannot be determined.
*) Collection periods cannot be reliably controlled. Since caches are
   not forced to report by a certain time, an indeterminable amount of
   data could be tallied with the next collection period. The
   usage-limiting mechanisms can help alleviate this, though a) not
   completely and b) at the cost of more traffic (defeating one of the
   proposal's goals).
*) Failure policies are not specified. While the authors readily admit
   this, the amount of error injected into the system cannot be
   determined.
*) As a result of these limitations, comparisons between collection
   periods can be misleading. Was a 5% decrease caused by the content
   on the site, a faulty cache, a network failure, or a mis-tallied
   report? I argue that there is no way to reliably know.

The paper then outlines various sampling methods for the Web. I argue
that:

*) IP address based sampling is tricky, if not impossible, to use to
   generate a random sample.
*) Randomly sampling users is better. Only perform cache-busting on
   randomly chosen users.

This form of sampling does not suffer from the above hit-metering
limitations.
*) The amount of confidence to place in the numbers can be determined.
*) Comparisons between collection periods are more robust.
*) Network failures are correctly handled.
*) Path data can be collected.
*) User privacy is arguably enhanced. This is definitely the case
   compared with current full cache-busting; compared with
   hit-metering, more information is gathered about fewer users.

The paper is temporarily accessible from:

   http://www.gvu.gatech.edu/t/PAPER126.html

I readily admit that it is not a stellar paper and was written for
non-technical people, so grains of salt are in order.

Jim.
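
The user-sampling idea above (cache-bust only randomly chosen users, then
attach a confidence level to the numbers) might be sketched roughly as
follows. This is only an illustration and is not taken from the paper: the
helper names `choose_sampled_users` and `estimate_proportion`, the 10%
sampling fraction, and the normal-approximation interval are all
assumptions made for the example.

```python
import math
import random


def choose_sampled_users(user_ids, fraction, seed=0):
    """Pick a simple random sample of users to cache-bust.

    Hypothetical helper: the sampling fraction and the seeding scheme
    are assumptions for illustration, not part of any proposal.
    """
    rng = random.Random(seed)
    k = round(len(user_ids) * fraction)
    # Sort first so the result depends only on the seed, not input order.
    return set(rng.sample(sorted(user_ids), k))


def estimate_proportion(successes, n, z=1.96):
    """Estimate a proportion from n sampled users.

    Returns (estimate, lower, upper) using the normal approximation;
    z=1.96 corresponds to roughly 95% confidence.
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


if __name__ == "__main__":
    users = [f"user{i}" for i in range(1000)]
    sampled = choose_sampled_users(users, fraction=0.1)
    print(len(sampled), "users selected for cache-busting")
    # Suppose 30 of the sampled users requested a given page.
    print(estimate_proportion(30, len(sampled)))
```

Only the sampled users' requests bypass the caches, so their full paths
are observed at the origin server, and the interval width shrinks as the
sample grows.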
Received on Wednesday, 5 March 1997 23:24:15 UTC