- From: Jeffrey Mogul <mogul@pa.dec.com>
- Date: Tue, 20 Dec 94 16:00:32 PST
- To: Simon E Spero <ses@tipper.oit.unc.edu>
- Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
> Do you have some more info about the data set you ran the traces on? I
> seem to recall you mentioning that the traces were taken from the
> California Election server over a couple of days? Do you have any
> information on the size of the data set, and the relative access
> frequencies of the various documents?

Just before I received your message, I sent this correction to the minutes of the BOF. The minutes said:

> Digital have collected some 9 gigabytes of log data for requested URL
> and the duration the connection was kept open. A paper analysing this
> data was presented at the World Wide Web Fall Conference held in
> October 1994, and can be found in the online proceedings.

This is wrong in several ways:

(1) For the 1994 California Election server, we collected several different kinds of data. We have full logs (including connection durations with 1-msec resolution), but these are nowhere near 9 GB. We also have a few full days of tcpdump traces; these come to 9 GB after compression. (For privacy reasons, we cannot provide these logs or traces to others.)

(2) The paper given at the conference in October obviously could not have been based on this data, which was collected in early November! It is based on simple experiments performed with modified clients and servers.

To answer your questions a bit more specifically:

The data set size varied over time. Before 8pm (PST) on election night, it was mostly "static" pages, such as the text of ballot propositions, candidate statements, campaign finances, photos of faces, etc. After 8pm, we started updating the actual returns every 5 minutes. (The "returns" pages were online before that, but they were only stand-in pages, nothing useful.) I don't have the precise sizes handy, but I suppose I should get someone to do a "find" on that server now.

I haven't gotten around to calculating file size distributions. To do this right, I should probably break it down by file type, or even further. For example, in order to draw bar graphs showing the results, we generated 101 different .gifs representing values from 0% to 100%. (See http://www.election.digital.com/e/returns/returns/page.html for an example.) We sure hoped that most people browsing these pages would end up with most of the bar-graph .gifs cached before too long!

I looked at the logs for the busiest day, and the average number of bytes returned was about 2100-2200 (depending on whether you count 0-byte responses in the average). Of course, when we constructed these pages we knew to expect a heavy load, so we tried to keep things fairly compact.

FYI, not counting the tcpdump traces, we have about 1 gigabyte of log information, although this includes an awful lot of different things. Needless to say, we could be busy for years trying to figure out what it all means, although by then the Web will be quite different.

-Jeff
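The file-size distribution broken down by file type, which the message says had not yet been calculated, could be computed along these lines. A minimal illustrative sketch in Python, assuming a local copy of the document tree at a placeholder path; nothing here reflects the election server's actual layout:

    import os
    from collections import defaultdict

    def sizes_by_type(root="/election/docs"):   # placeholder path, not the real document root
        """Group file sizes (in bytes) by file extension."""
        dist = defaultdict(list)
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                ext = os.path.splitext(name)[1].lower() or "(none)"
                dist[ext].append(os.path.getsize(os.path.join(dirpath, name)))
        return dist

    for ext, sizes in sorted(sizes_by_type().items()):
        print("%-8s  files=%5d  total=%9d  mean=%7.0f bytes"
              % (ext, len(sizes), sum(sizes), sum(sizes) / len(sizes)))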
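The bar-graph trick described in the message keeps the set of images small and fixed (101 .gifs for 0% through 100%), so browsers can reuse cached copies across many returns pages. A sketch of how a page generator might reference them; the file-naming scheme below is a guess, not the one the election server actually used:

    def bar_image_tag(percent):
        """Map a result to one of 101 fixed bar images (bar000.gif .. bar100.gif)."""
        value = max(0, min(100, int(round(percent))))   # clamp to the 0..100 range
        return '<img src="/gifs/bar%03d.gif" alt="%d%%">' % (value, value)

    # One row of a hypothetical returns table:
    print("<tr><td>Candidate A</td><td>%s</td><td>37.4%%</td></tr>"
          % bar_image_tag(37.4))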
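The 2100-2200 byte figure comes from averaging the bytes-returned field of the busiest day's log, with and without 0-byte responses counted. A sketch of that calculation, assuming a Common Log Format-style access log and a placeholder file name (the message does not say what format the election server's logs actually use):

    def average_response_sizes(log_path="access.log"):   # placeholder file name
        """Mean bytes returned per request, with and without 0-byte responses."""
        sizes = []
        with open(log_path) as log:
            for line in log:
                field = line.split()[-1]          # last CLF field: response size in bytes
                sizes.append(0 if field == "-" else int(field))
        with_zero = sum(sizes) / len(sizes)
        nonzero = [s for s in sizes if s > 0]
        without_zero = sum(nonzero) / len(nonzero)
        return with_zero, without_zero

    incl, excl = average_response_sizes()
    print("average incl. 0-byte responses: %.0f bytes" % incl)
    print("average excl. 0-byte responses: %.0f bytes" % excl)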
Received on Tuesday, 20 December 1994 16:09:53 UTC