Re: Comments on draft-mogul-http-hit-metering-01.txt from James Pitkow on 1997-03-21 (ietf-http-wg@w3.org from January to March 1997)

From: James Pitkow <pitkow@cc.gatech.edu>
Date: Fri, 21 Mar 1997 02:30:22 -0500 (EST)
To: Jeffrey Mogul <mogul@pa.dec.com>
Cc: pitkow@cc.gatech.edu, http-wg@cuckoo.hpl.hp.com
Message-Id: <199703210730.CAA26498@hapeville.cc.gatech.edu>
Jeff,

   Please pardon my sluggish response time.

>    *) User path data is lost/not collectable.
> 
> Some sorts of path data are lost, but not all.  For example, it
> is pretty simple to structure things so that you can get separate
> counts for each edge of the path-graph.  This can either be
> done by using
> 	Vary: referrer
> or, if that proves to be unreliable, using the specialized
> URL mechanism described in section 9 of the proposal.

   In our analysis of sites, we typically see N x N page-to-page 
connectivity matrices being between 1% and 10% full, with the
percent of paths actually followed being about half of this.  While 
this may seem like a small number, and indeed tends to suck if you 
took the time to design the pages, if I request referrer tallies 
for each page and use hourly collection periods, for our site at
GVU with 20k HTML documents, this translates to potentially 12k referrer 
reports/day for our origin server to process per proxy.  Granted, while 
the data can be collated by the proxy into hourly batches, this still 
creates a nice amount of overhead for both proxies and origin servers, 
especially if this report feature becomes popular.  Scheduling incoming 
reports by the origin server is also an issue, which while doable, 
potentially non-trivial.

   Please don't get me wrong, I'm not whining here, some jammin CPU 
will be putting in the extra hours while I kick it back, but I do want 
to make sure that we respect the additional resources required.

> We don't assert that this captures all path information; for
> example, it doesn't capture second-order paths.  You can
> count the number of times a user got to B from A, and the
> number of times a user got to C from B, but if there are
> other frequent paths to B, you can't count the number of
> times that the path A->B->C was followed (unless you clone
> the pages to generate unique URLs).  Also, these techniques
> tend to reduce the effectiveness of caching.

    Shame too, since this tends to be the more interesting information.

>    *) Collection periods can not be reliably controlled. Since caches
>       are not forced to report by a certain time, an indeterminable
>       amount of data could be tallied with the next collection period.
>       The usage-limiting mechanisms can help alleviate this, though a)
>       not completely and b) at the cost of more traffic (defeating one
>       of the proposals goals).
> 
> The draft mentions, in a Note, that we contemplated introducing
> a "Meter: timeout=NNN" response directive to solve a somewhat
> different problem.  It sounds like this would also solve the
> collection-period problem.  Jim and I have exchanged email about
> this, and it sounds like we both think it would be a good idea.
> I'll add it to the next version, once I figure out the ramifications
> (which are somewhat complicated by the presence of multiple levels
> of proxies).

    Definitely better than having to live with no timeouts.

>    *) As a result of these limitations, comparisons between collection
>    periods can be misleading.  Did a 5% decrease have to do with the stuff
>    on the site or a faulty cache, or a network failure, or a report
>    being mis-tallied?  I argue that there is no way to reliably
>    know.
> 
> True, but this uncertainty applies whether or not one is using
> hit-metering.  E.g., I want to know why the number of references
> to www.shark.com was smaller between 1pm and 2pm than it was between
> noon and 1pm.  Is it because more people surf the net during their
> lunch hours, so more of them find my site?  Or is it because some
> router in Chicago was malfunctioning, and users on the opposite
> coast couldn't make connections?  Since the Internet is inherently
> best-effort, we aren't introducing a qualitatively different level
> of failure-uncertainty.

   Indeed, though I'm of the stance that sampling can alleviate & quantify
the amount of failure-uncertainly injected into the system.  

> However, it's not entirely clear that random-sampled cache-busting is
> free of it's own biases.  For example, if users actually do make fewer
> references to "slow" sites rather than to "fast" ones, and if
> cache-busting increase response times, then the randomly-sampled
> population might behave inherently differently from the full
> population.

   Agreed & no solid data that I am aware of on this issue.  

Jim.
Received on Thursday, 20 March 1997 23:32:01 UTC