Re: Comments on draft-mogul-http-hit-metering-01.txt

Thanks for your comments; here are a few replies.

   *) User path data is lost/not collectable.

Some sorts of path data are lost, but not all.  For example, it
is pretty simple to structure things so that you can get separate
counts for each edge of the path-graph.  This can be done either
by using
	Vary: referer
or, if that proves unreliable, by using the specialized URL
mechanism described in section 9 of the proposal.
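
For instance (just a sketch; this particular combination is not
spelled out in the draft, and the max-age value is only illustrative),
the origin server could send

	HTTP/1.1 200 OK
	Cache-Control: max-age=3600
	Vary: Referer

so that a cache stores a separate variant of the page for each
distinct Referer value; under hit-metering, each variant is counted
separately, which yields one count per incoming edge of the path-graph.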

We don't assert that this captures all path information; for
example, it doesn't capture second-order paths.  You can
count the number of times a user got to B from A, and the
number of times a user got to C from B, but if there are
other frequent paths to B, you can't count the number of
times that the path A->B->C was followed (unless you clone
the pages to generate unique URLs).  Also, these techniques
tend to reduce the effectiveness of caching.
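
(Purely as a hypothetical illustration of the cloning idea: page A
could link to B as http://www.shark.com/B-from-A while some other
page D links to it as /B-from-D, and B's own links to C could be
cloned the same way; each clone then gets its own counts, but also
its own cache entry, which is why caching suffers.)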

   *) Collection periods cannot be reliably controlled. Since caches
      are not forced to report by a certain time, an indeterminable
      amount of data could be tallied in the next collection period.
      The usage-limiting mechanisms can help alleviate this, though a)
      not completely and b) at the cost of more traffic (defeating one
      of the proposal's goals).

The draft mentions, in a Note, that we contemplated introducing
a "Meter: timeout=NNN" response directive to solve a somewhat
different problem.  It sounds like this would also solve the
collection-period problem.  Jim and I have exchanged email about
this, and it sounds like we both think it would be a good idea.
I'll add it to the next version, once I figure out the ramifications
(which are somewhat complicated by the presence of multiple levels
of proxies).
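
To illustrate the idea (assuming, say, a value in minutes; the exact
syntax and semantics would have to be settled in the next version),
a response might carry

	Meter: timeout=60

telling proxies to report any accumulated counts within an hour,
which would also bound the length of a collection period.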

   *) As a result of these limitations, comparisons between collection
   periods can be misleading.  Did a 5% decrease have to do with the stuff
   on the site or a faulty cache, or a network failure, or a report
   being mis-tallied?  I argue that there is no way to reliably
   know.

True, but this uncertainty applies whether or not one is using
hit-metering.  E.g., I want to know why the number of references
to www.shark.com was smaller between 1pm and 2pm than it was between
noon and 1pm.  Is it because more people surf the net during their
lunch hours, so more of them find my site?  Or is it because some
router in Chicago was malfunctioning, and users on the opposite
coast couldn't make connections?  Since the Internet is inherently
best-effort, we aren't introducing a qualitatively different level
of failure-uncertainty.

The one thing we are doing that is different is to batch the
counts, so that a successful cache-based retrieval might have
been delivered but the subsequent report was lost.  But in
comparison to cache-busting techniques, this decouples the
reliability of counting from the reliability of actually providing
responses; if cache-busting were widely used, it would reduce the
number of responses delivered during periods of network failures.
So, yes, cache-busting gives a more accurate count in the face
of failures, but it also reduces the perceived reliability of the
service.  I'd bet that almost all content providers view the
reliable delivery of service as their primary reliability requirement,
and the reliability of counting takes second place to that.

   * Randomly sampling users is better.  Only perform cache-busting on
   randomly chosen users.  This form of sampling does not suffer from the
   above hit-metering limitations.

   	*) The amount of confidence to place in the numbers can be determined.

It is certainly reasonable to use periods of random-sampled
cache-busting to check the accuracy of other approaches.
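
(A rough illustration, not taken from either document: if each user
is independently chosen for cache-busting with probability 1%, and
roughly 100,000 distinct users each fetch a page once, the sample
should show about 1,000 hits; scaling up by 100 estimates the total
with a standard error of about sqrt(100,000 * 0.99 / 0.01), i.e.
roughly 3,100, or 3%.)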

However, it's not entirely clear that random-sampled cache-busting is
free of its own biases.  For example, if users actually do make fewer
references to "slow" sites than to "fast" ones, and if cache-busting
increases response times, then the randomly-sampled population might
behave inherently differently from the full population.

I don't know of any studies that have correlated mean server
response time (viewed at the client end) to # of visits per
client.  I may be able to do this analysis on some of our
proxy logs, but this will require a few days at least.  If someone
knows of an existing study, I'd rather refer to that than to
do another log analysis.

Your paper points out this problem with respect to day-sampling
but not with respect to user-sampling.  While it may be possible
to correct for some of this effect by comparing the statistics
for sampled users and non-sampled users, if you can only get
page-reuse counts by disabling the caches, then it might be
very hard to get an unperturbed baseline for this statistic.
The hit-metering proposal solves this problem by allowing reuses to
be counted without substantially changing cache performance (this
depends, of course, on how widely it is implemented).

   *) User privacy is arguably enhanced.  This is definitely the case
   over current full cache-busting, and compared against hit-metering,
   more information is gathered about fewer users.

By the way, in spite of Jim's apologies that his paper is not "stellar",
I think overall he has done a very nice job, and I encourage people
to read the paper (temporarily accessible from
http://www.gvu.gatech.edu/t/PAPER126.html).

-Jeff

Received on Wednesday, 12 March 1997 14:22:23 UTC