- From: Jeffrey Mogul <mogul@pa.dec.com>
- Date: Wed, 12 Mar 97 14:12:01 PST
- To: James Pitkow <pitkow@cc.gatech.edu>
- Cc: http-wg@cuckoo.hpl.hp.com
Thanks for your comments; here are a few replies.

*) User path data is lost/not collectable.

Some sorts of path data are lost, but not all. For example, it is pretty simple to structure things so that you can get separate counts for each edge of the path-graph. This can either be done by using Vary: referer or, if that proves to be unreliable, by using the specialized URL mechanism described in section 9 of the proposal.

We don't assert that this captures all path information; for example, it doesn't capture second-order paths. You can count the number of times a user got to B from A, and the number of times a user got to C from B, but if there are other frequent paths to B, you can't count the number of times that the path A->B->C was followed (unless you clone the pages to generate unique URLs). Also, these techniques tend to reduce the effectiveness of caching.
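To make the per-edge counting concrete, here is a minimal sketch (my illustration, not part of the proposal; the page names and the request stream are hypothetical) of tallying first-order edges from referer information, and of why those tallies cannot recover a second-order path like A->B->C:

    # Hypothetical sketch: tally first-order edges of the path-graph
    # from (referer, target) pairs. Pages A, B, C, X are made up.
    from collections import Counter

    edge_counts = Counter()

    def record_hit(referer, target):
        # Count one traversal of the edge referer -> target.
        edge_counts[(referer, target)] += 1

    # Two users reach C through B by different first steps:
    for ref, tgt in [("A", "B"), ("B", "C"),    # user 1: A -> B -> C
                     ("X", "B"), ("B", "C")]:   # user 2: X -> B -> C
        record_hit(ref, tgt)

    print(edge_counts)
    # Counter({('B', 'C'): 2, ('A', 'B'): 1, ('X', 'B'): 1})
    # The per-edge totals are exact, but the two B -> C traversals are
    # indistinguishable, so the count for the full path A -> B -> C
    # cannot be recovered from them.

Whether the edges come from Vary: referer or from the specialized-URL mechanism, the server learns only the previous page for each request, never the whole path.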
*) Collection periods cannot be reliably controlled. Since caches are not forced to report by a certain time, an indeterminable amount of data could be tallied in the next collection period. The usage-limiting mechanisms can help alleviate this, though (a) not completely and (b) at the cost of more traffic (defeating one of the proposal's goals).

The draft mentions, in a Note, that we contemplated introducing a "Meter: timeout=NNN" response directive to solve a somewhat different problem. It sounds like this would also solve the collection-period problem. Jim and I have exchanged email about this, and it sounds like we both think it would be a good idea. I'll add it to the next version, once I figure out the ramifications (which are somewhat complicated by the presence of multiple levels of proxies).

*) As a result of these limitations, comparisons between collection periods can be misleading. Did a 5% decrease have to do with the stuff on the site, or a faulty cache, or a network failure, or a report being mis-tallied? I argue that there is no way to reliably know.

True, but this uncertainty applies whether or not one is using hit-metering. E.g., I want to know why the number of references to www.shark.com was smaller between 1pm and 2pm than it was between noon and 1pm. Is it because more people surf the net during their lunch hours, so more of them find my site? Or is it because some router in Chicago was malfunctioning, and users on the opposite coast couldn't make connections? Since the Internet is inherently best-effort, we aren't introducing a qualitatively different level of failure-uncertainty.

The one thing we are doing that is different is to batch the counts, so that a successful cache-based retrieval might have been delivered but the subsequent report was lost. But in comparison to cache-busting techniques, this decouples the reliability of counting from the reliability of actually providing responses; if cache-busting were widely used, it would reduce the number of responses delivered during periods of network failure. So, yes, cache-busting gives a more accurate count in the face of failures, but it also reduces the perceived reliability of the service. I'd bet that almost all content providers view reliable delivery of service as their primary reliability requirement, and that the reliability of counting takes second place to that.

*) Randomly sampling users is better. Only perform cache-busting on randomly chosen users. This form of sampling does not suffer from the above hit-metering limitations.

*) The amount of confidence to place in the numbers can be determined.
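To illustrate what random-sampled cache-busting might look like in practice (this is my sketch, not a mechanism from the paper; the 5% rate and the hash on the client address are arbitrary assumptions), a server could deterministically assign a small fraction of clients to an uncacheable group:

    # Hypothetical sketch of random-sampled cache-busting. The 5% rate
    # and hashing on the client address are assumptions, not part of
    # any proposal.
    import hashlib

    SAMPLE_RATE = 0.05  # bust caching for roughly 5% of clients

    def is_sampled(client_addr):
        # Deterministic assignment, so a given client stays in the
        # same group across requests.
        digest = hashlib.md5(client_addr.encode()).digest()
        return digest[0] / 256.0 < SAMPLE_RATE

    def cache_headers(client_addr):
        if is_sampled(client_addr):
            # Sampled clients bypass caches, so every page reuse
            # reaches the origin server and shows up in its log.
            return {"Cache-Control": "no-cache"}
        # All other clients get normally cacheable responses.
        return {"Cache-Control": "max-age=3600"}

Note that the sampled clients see uncached (and hence probably slower) responses, which is exactly the potential bias I discuss next.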
It is certainly reasonable to use periods of random-sampled cache-busting to check the accuracy of other approaches. However, it's not entirely clear that random-sampled cache-busting is free of its own biases. For example, if users actually do make fewer references to "slow" sites than to "fast" ones, and if cache-busting increases response times, then the randomly-sampled population might behave inherently differently from the full population.

I don't know of any studies that have correlated mean server response time (viewed at the client end) with the number of visits per client. I may be able to do this analysis on some of our proxy logs, but this will require a few days at least. If someone knows of an existing study, I'd rather refer to that than do another log analysis.

Your paper points out this problem with respect to day-sampling but not with respect to user-sampling. While it may be possible to correct for some of this effect by comparing the statistics for sampled and non-sampled users, if you can only get page-reuse counts by disabling the caches, then it might be very hard to get an unperturbed baseline for this statistic. The hit-metering proposal solves this problem by allowing counting of reuses without substantially changing cache performance (this depends, of course, on how widely implemented it becomes).

*) User privacy is arguably enhanced.

This is definitely the case compared to current full cache-busting; compared against hit-metering, more information is gathered about fewer users.

By the way, in spite of Jim's apologies that his paper is not "stellar", I think overall he has done a very nice job, and I encourage people to read it (temporarily accessible from http://www.gvu.gatech.edu/t/PAPER126.html).

-Jeff

Received on Wednesday, 12 March 1997 14:22:23 UTC