- From: Jeffrey Mogul <mogul@pa.dec.com>
- Date: Thu, 17 Oct 96 16:28:01 MDT
- To: Doug Crow <doug_crow@cacheflow.com>
- Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
> I have seen where groups are publishing SPECweb96 numbers approaching 1000 URLs per second. Using the common log file format and assuming an average entry of 100 bytes, this translates to over 8 gigabytes of data in a 24-hour period. Given the importance of tracking hit information with respect to advertising revenue and, in some cases, user profiling, what are servers doing with this volume of log information?

I would imagine that very few servers are actually running at the 1000 requests/second rate for a 24-hour period. (Actually, the highest reported SPECweb96 number, from a Digital AlphaServer, is still slightly below 1000 requests/second.)

The SPECweb96 rules, which are available at
    http://www.specbench.org/osg/web96/runrules.html
do require that something equivalent to a CLF log entry be written to stable storage for each request. At 1000 requests/sec, and using your figure of 100 bytes/request, this works out to 100 KB/sec, which is at least a factor of ten below the rates that modern disk drives can sustain.

On the other hand, the rules also say:

    RUNTIME, the time of measurement for which results are reported,
    must be the default 600 seconds for reportable results.

    The WARMUP_TIME must be set to the default of 300 seconds for
    reportable results.

In other words, nobody runs one of these benchmarks for 24 hours, and so nobody has to collect 8 GB of logging info to meet the SPECweb96 rules.

The busiest Web site that I know of is still somewhere below 100 million requests/day, the last time I checked. That works out to about 10 GB/day of log info, but this site spreads it out over lots of machines. I would guess that they probably do something like compressing it and storing it offline (I used gzip on a CLF file and got it down to under 7 bytes/request). Or they run it through some software to extract a statistical summary and then delete the full log. Or both.

> Is the hit-metering draft that I have seen reference to going to address this?

Not if you're referring to the one that Paul Leach and I have been working on. We're working on a heavily revised draft, but it still won't discuss log formats. Phill Hallam has issued some drafts on ways for servers to ask proxies for their logs, and I would imagine that he has considered the costs associated with retrieving tens or hundreds of megabytes of log info like this; the hit-metering work that Paul and I are doing is designed to avoid this kind of thing.

> Is there a more appropriate forum for a question such as this?

I think any discussion of log exchange over the network had better face up to the log-volume issue. As for what a server does locally, I think that's probably off-topic for the HTTP-WG.

-Jeff
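
A minimal Python sketch of the back-of-envelope arithmetic in the message above, assuming the 100-byte CLF entry size and the request rates quoted in the thread (these are the thread's figures, not measurements):

    # Back-of-envelope log-volume arithmetic, using the figures quoted above.
    # Assumption: 100 bytes per Common Log Format (CLF) entry.

    BYTES_PER_ENTRY = 100
    SECONDS_PER_DAY = 24 * 60 * 60

    # SPECweb96-style peak rate: 1000 requests/second.
    rate_per_sec = 1000
    print(rate_per_sec * BYTES_PER_ENTRY / 1_000, "KB/sec of log data")      # 100.0 KB/sec
    print(rate_per_sec * BYTES_PER_ENTRY * SECONDS_PER_DAY / 1e9, "GB/day")  # 8.64 GB/day

    # Busiest-site figure: 100 million requests/day.
    requests_per_day = 100_000_000
    print(requests_per_day * BYTES_PER_ENTRY / 1e9, "GB/day")                # 10.0 GB/day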
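
And one hypothetical way to check the gzip figure (under 7 bytes/request) against one's own access log; "access.log" is a placeholder path, not a file from the thread:

    # Hypothetical measurement of the gzip ratio on a CLF access log.
    # "access.log" is a placeholder path; substitute a real CLF file.
    import gzip

    with open("access.log", "rb") as f:
        data = f.read()

    n_entries = max(data.count(b"\n"), 1)   # one CLF entry per line
    compressed = gzip.compress(data)

    print(round(len(data) / n_entries, 1), "bytes/request before gzip")
    print(round(len(compressed) / n_entries, 1), "bytes/request after gzip")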
Received on Thursday, 17 October 1996 16:35:42 UTC