Re: Metrics questions
From: Jim Pitkow (pitkow@parc.xerox.com)
Date: Thu, Mar 04 1999
Message-Id: <4.1.19990303234324.009a8b30@mailback.parc.xerox.com>
Date: Wed, 3 Mar 1999 23:58:51 PST
To: Johan Hjelm <hjelm@w3.org>, www-wca@w3.org
From: Jim Pitkow <pitkow@parc.xerox.com>
Subject: Re: Metrics questions
At 08:30 AM 2/26/99 , Johan Hjelm wrote:
>Comments about metrics:
>Classification of users (educational, home, ISP, or corporate)
I think these days 'home' and 'ISP' are a bit redundant, with the ISP and
corporate sectors dominating traffic.
>Roles within the classification: Sales, tech support, engineering;
>teenager, parent, toddler, schoolchild; etc. (Note: Do we need to develop a
>standard syntax/vocabulary for things like these?)
We could use SIC business codes, though I'm not sure that this level of
granularity gives us much today in terms of analysis. It may be simpler to
leave this out.
>Domain of server (gTLDs and top domains are all very well, but which
>language is it in? Some languages, e.g. Swedish, span several domains -
>.se, .nu, .com are frequently used for servers. How do we handle
>mixed-language servers? Servers that are in several domains (e.g.
>ericsson.com, is also ericsson.se and ericsson.nl, and the content is the
>same, to give just one example)
I agree that language is an important field to capture, but it can probably be
separated out from domain. For domain, I think listing all domains served
(including vhosting, etc.) is all that is required.
>Cost and other access restrictions (e.g. IP-based access masks, robot.txt,
>etc?)
>Access method of users (LAN, modem, mobile, or wireless)
>Access network (For wireless: GSM, HDSC, CDPD, CDMA, Mobitex, W-CDMA,
>PHS/PIAFS etc). (How do you handle sites with mixed networks?)
>Users, response rate, and attrition rate (this does not sound like a log
>file analysis, rather like a survey?).
Yes, all of the above are important meta-data. The last of these (users,
response rate, and attrition rate) and the ones below can be ascertained by
analyzing the logs directly.
>Pages transfer per user (How do you distinguish a page, if you are using
>frames?)
>[...]
>
>I am assuming we have three different metafiles: A file describing the set
>and the setup, a file describing the site, and a file describing the log
>file. Below, I have tried to divide the metrics into these three:
>
>Meta-Set:
>Location of the log files
Do you mean a URI?
>Location of the metafiles
>Location of the server (site) data (which may be different from the root
>file system)
I don't understand this distinction.
>Periodicity of the analysis: Log files, server file system
Time range and sampling rate are probably more exact.
>Classification of users and user roles
>Access methods of users (and method of generating sessions, e.g based on
>access method)
>Access network (same question)
>Domain of server (Language question)
Language + DNS
>Cost and other access restrictions (e.g. IP-based access masks, robot.txt,
>etc)
>Type of service provider
>Birth and modification history of server (e.g. major revisions of content)
Some of this may be too much to ask. We'll have to see what pushback
we get. It'd be nice to have as lightweight a process as possible.
All the below should be automatically extracted upon submission to the
repository.
>Meta-log:
>Files transferred per user (total)
Assumes we have a common method to define a user. This will take some doing
and will depend upon the level of user tracking allowed, i.e., cookies,
sessions, referrer tracking, domain names, etc. Also, these should be
reported as distributions with mean and variance, with a possible random
subsample used to characterize the distribution.
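As a rough sketch of what reporting a per-user metric as a distribution could
look like, assuming the log has already been reduced to (user, file) pairs by
whatever user-identification method we settle on. The function names, the
input format, and the subsample size are illustrative assumptions only:

```python
import random
import statistics
from collections import Counter

def files_per_user(records):
    """Count files transferred per user from (user_id, url) pairs."""
    return Counter(user for user, _url in records)

def summarize(counts, sample_size=3, seed=0):
    """Report mean, population variance, and a random subsample of the values."""
    values = sorted(counts.values())
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    k = min(sample_size, len(values))
    return {
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),
        "subsample": rng.sample(values, k),
    }

# Hypothetical records, not taken from any real log.
records = [
    ("u1", "/a"), ("u1", "/b"),
    ("u2", "/a"),
    ("u3", "/a"), ("u3", "/b"), ("u3", "/c"),
]
summary = summarize(files_per_user(records))
```

The same summary could be applied to any of the other per-user counts (unique
files, pages, sessions) once the counting step is swapped out.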
>Unique files transferred per user
>Pages transferred per user
>Unique pages transferred per user
>Sites visited per user (assumes longitudinal tracking)
Assumes that the logs are from client or proxy traces as opposed to server
access log files. We need to differentiate each class of logs and the
corresponding metrics.
>Reoccurrence rates for files and pages per user
>Protocol percentage breakdown (e.g. HTTP, SHTTP, Gopher, etc).
>Number of sessions per user
>Length of session per user
>Inter-session time per user (session-to-session time)
>Stack distance per user
>Inter-request time per user (request-to-request time)
>Intra-request time per user (request-to-render time)
>Length of visit per user
>
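To make the session metrics above concrete, here is a minimal sketch that
splits each user's requests into sessions using an inactivity timeout. The
30-minute cutoff, the (user, timestamp) input format, and the function name
are assumptions for illustration, not something agreed in this thread:

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; assumed cutoff, chosen for illustration

def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group (user, timestamp) pairs into per-user lists of sessions.

    A new session starts whenever the gap between two consecutive
    requests from the same user exceeds the timeout.
    """
    by_user = defaultdict(list)
    for user, ts in requests:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for prev, cur in zip(stamps, stamps[1:]):
            if cur - prev > timeout:
                user_sessions.append([cur])
            else:
                user_sessions[-1].append(cur)
        sessions[user] = user_sessions
    return sessions

# Hypothetical requests: (user, seconds since start of trace).
requests = [("u1", 0), ("u1", 60), ("u1", 4000), ("u2", 10)]
sessions = sessionize(requests)

# Length of each session, per user (last request minus first request).
session_lengths = {
    user: [s[-1] - s[0] for s in user_sessions]
    for user, user_sessions in sessions.items()
}
```

Number of sessions per user and inter-session time fall out of the same
structure; how well any of this works still depends on how a user is defined.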
>Meta-site:
>Number of embedded images per page, file type, and size
Again, these are better reported as distributions.
>Mime-type percentage breakdown of site (e.g. HTML, JPEG, PS, etc)
>Hyperlinks per HTTP page
>Site Composition (once per measurement session)
I don't understand what this means.
>Number of users
>Number of files and page requests per user
>Number of search engine hits
>Number of files serviced
>Number of pages serviced
>Number of CGI/dynamic content serviced
>Bytes transferred
>Byte latency
>Total number of files on server
>Documents by Traffic graph ( x% documents account for y% of traffic)
>Growth Rates (once per measurement session)
>Number of users
>Number of files and page requests per user
>Number of files serviced
>Number of pages serviced
>Number of CGI/dynamic content serviced
>Bytes transferred
>Byte latency
>Number of files on server
>Number of bytes on server
>Doubling period for all of the above metrics
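For the doubling period, one simple reading is to assume exponential growth
between two measurement sessions, which gives
period = elapsed * ln(2) / ln(end / start). The sketch below follows that
assumption; the function name, the units, and the sample numbers are
hypothetical:

```python
import math

def doubling_period(v_start, v_end, elapsed_days):
    """Estimate the doubling period in days, assuming exponential growth."""
    if v_start <= 0 or v_end <= v_start:
        raise ValueError("doubling period needs positive, growing values")
    return elapsed_days * math.log(2) / math.log(v_end / v_start)

# Hypothetical example: bytes transferred quadruple over 60 days.
period = doubling_period(1000, 4000, 60)
```

A metric that is flat or shrinking has no doubling period under this reading,
which is why the function rejects those cases rather than returning a number.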
This is shaping up. It'd be nice to release a set of recommendations for
client/proxy and server-side characterization reporting along with some
reference code (I'm close to releasing a preliminary version).
This could help influence reporting within papers and for the
repository.