Re: HTTP Log files

    I have seen where groups are publishing specWeb96 numbers
    approaching 1000 URLs per second.  Using the common log file format
    and assuming an average entry of 100 bytes this translates to over
    8 Gigabytes of data in a 24 hour period.

    Given the importance of tracking hit information with respect to
    advertising revenue and in some cases user profiling, what are
    servers doing with the size of this log information?

I would imagine that very few servers are actually running at
the 1000 requests/second rate for a 24 hour period.  (Actually,
the highest reported SPECweb96 number, from a Digital AlphaServer,
is still slightly below 1000 requests/second.)

The SPECweb96 rules, which are available at
	http://www.specbench.org/osg/web96/runrules.html
do require that something equivalent to a CLF log entry be
written to stable storage for each request.  At 1000 requests/sec,
and using your figure of 100 bytes/request, this works out to
100 KB/sec, which is at least a factor of ten below the rates
that modern disk drives can sustain.

On the other hand, the rules also say:
    RUNTIME, the time of measurement for which results are reported,
    must be the default 600 seconds for reportable results. The
    WARMUP_TIME must be set to the default of 300 seconds for
    reportable results.
In other words, nobody runs one of these benchmarks for 24 hours,
and so nobody has to collect 8 GB of logging info to meet the
SPECweb96 rules.
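
A minimal sketch of the arithmetic above, in Python.  The function
name log_volume_bytes is made up for illustration; the 1000
requests/sec and 100 bytes/entry figures are the ones used in this
thread.  Counting the warmup period toward the log volume is an
assumption, since the excerpt quoted above does not say whether
warmup requests have to be logged.

```python
def log_volume_bytes(requests_per_sec, bytes_per_entry, seconds):
    """Total log bytes written over an interval at a steady request rate."""
    return requests_per_sec * bytes_per_entry * seconds

SECONDS_PER_DAY = 24 * 60 * 60

# Figures from the thread: 1000 requests/sec, 100 bytes per log entry.
per_second = log_volume_bytes(1000, 100, 1)
per_day = log_volume_bytes(1000, 100, SECONDS_PER_DAY)

# One reportable run: 600 s RUNTIME, plus 300 s WARMUP_TIME if
# warmup requests are logged too (an assumption, see above).
per_run_measured = log_volume_bytes(1000, 100, 600)
per_run_with_warmup = log_volume_bytes(1000, 100, 300 + 600)

print(per_second)           # 100000      (100 KB/sec)
print(per_day)              # 8640000000  (about 8.6 GB)
print(per_run_measured)     # 60000000    (60 MB)
print(per_run_with_warmup)  # 90000000    (90 MB)
```

So even with warmup included, a single reportable run produces
tens of megabytes of log data, not gigabytes.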

The busiest Web site that I know of was still somewhere below
100 million requests/day the last time I checked.  At 100 bytes
per request, that works out to at most about 10 GB/day of log
info, but this site spreads it out over lots of machines.
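
As a rough check of that estimate, here is a short Python sketch.
It assumes the same 100 bytes/entry figure used earlier in this
thread, and treats 100 million requests/day as an upper bound; the
variable names are made up for illustration.

```python
REQUESTS_PER_DAY = 100_000_000   # upper bound quoted above
BYTES_PER_ENTRY = 100            # assumed average CLF entry size
SECONDS_PER_DAY = 24 * 60 * 60

daily_log_bytes = REQUESTS_PER_DAY * BYTES_PER_ENTRY
average_rate = REQUESTS_PER_DAY / SECONDS_PER_DAY

print(daily_log_bytes)      # 10000000000 (10 GB)
print(round(average_rate))  # 1157 requests/sec, averaged over the day
```

Averaged over a day, that is a bit more than 1000 requests/sec in
total, which fits with the load being spread over many machines.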

I would guess that they probably do something like compressing
it and storing it offline (I used gzip on a CLF file and got it
down to under 7 bytes/request).  Or they run it through some
software to extract a statistical summary, and then delete the
full log.  Or both.
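
A minimal sketch of both approaches, in Python, using the standard
gzip module.  The log lines here are synthetic (made-up hosts,
paths, and sizes), and the helper names are hypothetical, so the
bytes/request it reports will differ from what a real CLF file
gives; it only illustrates how one might measure the ratio and
pull out a simple per-status summary.

```python
import gzip
from collections import Counter

def make_sample_log(n):
    """Build n synthetic CLF lines (made-up hosts, paths, sizes)."""
    lines = []
    for i in range(n):
        host = "host%d.example.com" % (i % 50)
        path = "/docs/page%d.html" % (i % 200)
        status = 200 if i % 10 else 404   # every tenth request is a 404
        lines.append(
            '%s - - [17/Oct/1996:16:%02d:%02d +0000] "GET %s HTTP/1.0" %d %d'
            % (host, (i // 60) % 60, i % 60, path, status, 1000 + i % 500)
        )
    return lines

def bytes_per_request(lines):
    """Return (raw, gzipped) average bytes per log entry."""
    raw = ("\n".join(lines) + "\n").encode("ascii")
    packed = gzip.compress(raw)
    return len(raw) / len(lines), len(packed) / len(lines)

def status_summary(lines):
    """Count requests per status code (second-to-last CLF field)."""
    return Counter(line.split()[-2] for line in lines)

lines = make_sample_log(10000)
raw_avg, gz_avg = bytes_per_request(lines)
summary = status_summary(lines)

print("raw bytes/request:     %.1f" % raw_avg)
print("gzipped bytes/request: %.1f" % gz_avg)
print(dict(summary))
```

Once a summary like this has been extracted, the full log can be
compressed and archived, or deleted.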

    Is the hit-metering draft that I have seen reference to going to
    address this?

Not if you're referring to the one that Paul Leach and I have
been working on.  We're working on a heavily revised draft, but
it still won't discuss log formats.  Phill Hallam has issued
some drafts on ways for servers to ask proxies for their logs,
and I would imagine that he has considered the costs associated
with retrieving tens or hundreds of megabytes of log info like
this; the hit-metering work that Paul and I are doing is designed
to avoid this kind of thing.

    Is there a more appropriate forum for a question such as this?

I think any discussion of log-exchange over the network had better
face up to the log-volume issue.  As for what a server does, locally,
I think that's probably off-topic for the HTTP-WG.

-Jeff

Received on Thursday, 17 October 1996 16:35:42 UTC