Re: Comments on draft-mogul-http-hit-metering-01.txt from James Pitkow on 1997-03-06 (ietf-http-wg@w3.org from January to March 1997)

From: James Pitkow <pitkow@cc.gatech.edu>
Date: Thu, 6 Mar 1997 02:22:11 -0500 (EST)
To: http-wg@cuckoo.hpl.hp.com
Message-Id: <199703060722.CAA29428@hapeville.cc.gatech.edu>

Hello,

"Jeffrey Mogul" wrote at Mar 4, 97 05:57:57 pm:
>  
> I'd surely like to see a well-defined description, including some
> analysis, of these other possible techniques, and perhaps James
> Pitkow's paper (when it becomes available) will shed some light.
> But I'm not interested in continuing a debate of the form "I
> know a better way to do this, but I'm not going to provide a
> detailed argument about why it is better."

   Sorry for the delay (travel).  For some strange reason, a last-minute
paper I wrote on collecting reliable usage data was accepted for WWW6.  The
main concerns it raises about the hit-metering proposal (which I think is
very well stated) include:

   *) User path data is lost/not collectable.

   *) It relies upon the cooperation of independent caches.  Since this cannot
      be controlled, the amount of gain from implementing the proposal cannot be
      determined.

   *) Collection periods cannot be reliably controlled.  Since caches
      are not forced to report by a certain time, an indeterminable amount of
      data could be tallied with the next collection period.  The usage-
      limiting mechanisms can help alleviate this, though a) not completely and
      b) at the cost of more traffic (defeating one of the proposal's goals).

   *) Failure policies are not specified.  While the authors readily admit
      this, the amount of error injected into the system cannot be determined.

   *) As a result of these limitations, comparisons between collection periods
      can be misleading.  Was a 5% decrease caused by changes to the content on
      the site, a faulty cache, a network failure, or a mis-tallied report?
      I argue that there is no way to reliably know.

The paper then outlines various sampling methods for the Web.  I argue that:

   *) IP address based sampling is tricky, if not impossible, to use to generate
      a random sample.  

   *) Randomly sampling users is better.  Only perform cache-busting on randomly
      chosen users.  This form of sampling does not suffer from the above
      hit-metering limitations:

        *) The amount of confidence to place in the numbers can be determined.

        *) Comparisons between collection periods are more robust.

        *) Network failures are correctly handled.

        *) Path data can be collected.

   *) User privacy is arguably enhanced.  This is definitely the case over current
      full cache busting, and compared against hit-metering, more information is
      gathered about fewer users.
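
To make the user-sampling idea concrete, here is a rough sketch, not the
method from the paper.  It assumes each user carries some stable identifier
(a hypothetical `user_id`, e.g. from a cookie), picks users by hashing that
identifier, and uses a textbook normal-approximation interval to show how
much confidence to place in a sampled proportion.  The names `in_sample`,
`estimate_proportion`, and the `salt` parameter are all made up for
illustration.

```python
import hashlib
import math


def in_sample(user_id: str, rate: float, salt: str = "period-1") -> bool:
    """Return True if this user falls in the cache-busted sample.

    Assumption: hashing a stable identifier stands in for random selection.
    The same user is always in or out for a given salt, so that user's whole
    path through the site can be observed.
    """
    digest = hashlib.sha256((salt + ":" + user_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def estimate_proportion(hits: int, sample_size: int, z: float = 1.96):
    """Estimate a proportion from sampled users, with a normal-approximation
    confidence interval (z = 1.96 gives roughly 95%)."""
    p = hits / sample_size
    half_width = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10000)]
    sampled = [u for u in users if in_sample(u, rate=0.1)]
    print(len(sampled), "of", len(users), "users would be cache-busted")
    # Suppose 50 of 100 sampled users requested a given page:
    print(estimate_proportion(50, 100))
```

Only the sampled users would get uncacheable responses; everyone else is
served from caches as usual.  Changing the salt per collection period would
draw a fresh sample each time.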

The paper is temporarily accessible from:

	http://www.gvu.gatech.edu/t/PAPER126.html

I readily admit that it is not a stellar paper and was written for non-technical
people, so grains of salt are in order.

Jim.

Received on Wednesday, 5 March 1997 23:24:15 UTC