Re: What is the Web? - effect on metrics
From: Johan Hjelm (hjelm@w3.org)
Date: Mon, Mar 08 1999
Message-Id: <4.1.19990308175427.00c0c690@127.0.0.1>
Date: Mon, 08 Mar 1999 19:33:03 +0100
To: Jim Pitkow <pitkow@parc.xerox.com>
From: Johan Hjelm <hjelm@w3.org>
Cc: "'www-wca@w3.org'" <www-wca@w3.org>
Subject: Re: What is the Web? - effect on metrics
If we limit ourselves to what can be measured from HTTP servers, our task
becomes easier. However, I would think that we should look at other
protocols too. In particular, I am concerned about WAP, but there may be
other protocols we should analyse.
Maybe we should take a reverse view, and say: If the protocol is a
request-response protocol, and the server can generate a log file that
contains a set of metrics as defined below, it can be characterised. Or
something in that vein.
In the light of that, here is a second go at the metrics:
>>Comments about metrics:
>>Classification of users (educational, home, ISP, or corporate)
>I think these days 'home' and 'ISP' are a bit redundant, with the ISP and
>corporate sectors dominating traffic.
It seems to me that this can either be fairly detailed, and give a
reasonably detailed classification, or it can be a rather rough
classification (educational, government, ISP, corporate). Whichever, we
are making the assumption that there are a number of traffic patterns that
match a certain group. In the interest of normalizing the technology,
this is an item which the log file provider would supply themselves, from a
drop-down menu maybe.
See also below about which type of log it is.
>>Roles within the classification: Sales, tech support, engineering;
>>teenager, parent, toddler, schoolchild; etc. (Note: Do we need to develop a
>>standard syntax/vocabulary for things like these?)
>We could use SIC business codes, though I'm not sure that this level of
>granularity gives us much today in terms of analysis. It may be simpler to
>leave this out.
See above
>>Domain of server (gTLDs and top domains are all very well, but which
>>language is it in? Some languages, e.g. Swedish, span several domains -
>>.se, .nu, .com are frequently used for servers. How do we handle
>>mixed-language servers? Servers that are in several domains (e.g.
>>ericsson.com, is also ericsson.se and ericsson.nl, and the content is the
>>same, to give just one example)
>I agree that language is an important field to capture, but can probably be
>separated out from domain. For domain, I think listing all domains served
>(including vhosting, etc) is all that is required.
OK, and then language can be a property of the web pages? If nothing else,
it would be interesting to compare the use (both in absolute figures and
related to the user's location) of the pages in different languages.
>>Cost and other access restrictions (e.g. IP-based access masks, robots.txt,
>>etc?)
>>Access method of users (LAN, modem, mobile, or wireless)
>>Access network (For wireless: GSM, HDSC, CDPD, CDMA, Mobitex, W-CDMA,
>>PHS/PIAFS etc). (How do you handle sites with mixed networks?)
This would also be something the person giving the information would have
to fill in?
>>Users, response rate, and attrition rate (this does not sound like a log
>>file analysis, rather like a survey?).
>Yes, all of the above are important meta-data. The last metrics and the
>ones below can be ascertained by analyzing the logs directly.
>>Pages transfer per user (How do you distinguish a page, if you are using
>>frames?)
>>[...]
>
>>I am assuming we have three different metafiles: A file describing the set
>>and the setup, a file describing the site, and a file describing the log
>>file. Below, I have tried to divide the metrics into these three:
>>
>Meta-Set:
>>Location of the log files
>Do you mean a URI?
No, I meant in the file system. But a URI is better.
>>Location of the metafiles
>>Location of the server (site) data (which may be different from the root
>>file system)
>I don't understand this distinction.
If a "server" is a content area, it makes sense to categorise only the part
of it that is contained in that content area. It may have a location deep
down in the file tree. In the case of virtual hosting, the root of the
server would start at some level below the root of the file system of the
machine. Maybe an unnecessary distinction, but it relates to virtual hosts.
>>Periodicity of the analysis: Log files, server file system
>Time range and sampling rate are probably more exact.
>>Classification of users and user roles
>>Access methods of users (and method of generating sessions, e.g based on
>>access method)
>>Access network (same question)
>>Domain of server (Language question)
>Language + DNS
See above
>>Cost and other access restrictions (e.g. IP-based access masks, robots.txt,
>>etc)
>>Type of service provider
>>Birth and modification history of server (e.g. major revisions of content)
>Some of this may be too much to ask. We'll have to see what push
>back we get. It'd be nice to have as lightweight a process as possible.
Well, it's Jim's list to start with. But I think it is sensible to ask a
series of questions when the analysis is set up (e.g. birth date), and ask
the site owner to enter dates when content is modified (if this does not
happen continuously), since that could explain changes in the log that
depend on new users coming in, or users modifying their access patterns,
when the site was changed.
>>All the below should be automatically extracted upon submission to the
>>repository.
>>Meta-log:
>>Files transferred per user (total)
>Assumes we have a common method to define user. This will take some doing
>and will depend upon the level of user tracking allowed, i.e., cookies,
>sessions, referrer tracking, domain names, etc. Also, these should be
>reported as distributions with mean, and variance - with a possible random
>subsample used to characterize the distribution.
>>Unique files transferred per user
>>Pages transferred per user
>>Unique pages transferred per user
>>Sites visited per user (assumes longitudinal tracking)
>Assumes that the logs are from client or proxy traces as opposed to server
>access log files. We need to differentiate each class of logs and the
>corresponding metrics.
We need to differentiate client, proxy, and origin server logs. Are there
any other distinctions we need to make (cache)?
>>Reoccurrence rates for files and pages per user
>>Protocol percentage breakdown (e.g. HTTP, SHTTP, Gopher, etc).
>>Number of sessions per user
>>Length of session per user
>>Inter-session time per user (session-to-session time)
>>Stack distance per user
>>Inter-request time per user (request-to-request time)
>>Intra-request time per user (request-to-render time)
>>Length of visit per user
>>
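As a rough illustration of reporting a per-user metric as a distribution
(mean, variance, and a random subsample) rather than as a single figure,
here is a minimal Python sketch for "files transferred per user". It
assumes the hard part, deciding what a user is, has already been done, so
each log record carries a user identifier. The function name, the record
layout and the sample size are made up for the example and are not part of
any agreed format.

```python
import random
import statistics
from collections import Counter


def files_per_user_summary(records, sample_size=3, seed=0):
    """Summarise files transferred per user as a distribution.

    records: non-empty iterable of (user_id, path) pairs, one pair per
    transferred file. How user_id is derived (cookies, sessions, host
    names, ...) is outside this sketch.
    """
    # Count how many files each user transferred.
    counts = Counter(user for user, _path in records)
    values = sorted(counts.values())
    if not values:
        raise ValueError("need at least one log record")

    # Seeded generator so the subsample is reproducible between runs.
    rng = random.Random(seed)
    k = min(sample_size, len(values))
    return {
        "users": len(values),
        "mean": statistics.mean(values),
        # Population variance: the log is treated as the whole population.
        "variance": statistics.pvariance(values),
        "subsample": rng.sample(values, k),
    }


# Hypothetical records: u1 fetched three files, u2 one, u3 two.
records = [
    ("u1", "/index.html"), ("u1", "/logo.gif"), ("u1", "/about.html"),
    ("u2", "/index.html"),
    ("u3", "/index.html"), ("u3", "/logo.gif"),
]
summary = files_per_user_summary(records)
```

The same shape would apply to the other per-user metrics in the list
(pages, sessions, inter-request time), with only the counting step changed.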
>>Meta-site:
>>Number of embedded images per page, file type, and size
>Again, these are better reported as distributions.
>>Mime-type percentage breakdown of site (e.g. HTML, JPEG, PS, etc)
>>Hyperlinks per HTTP page
>>Site Composition (once per measurement session)
>I don't understand what this means.
I was thinking about how the site is organised in directories, etc. Maybe
better served as distributions.
>>Number of users
>>Number of files and page requests per user
>>Number of search engine hits
>>Number of files serviced
>>Number of pages serviced
>>Number of CGI/dynamic content serviced
>>Bytes transferred
>>Byte latency
>>Total number of files on server
>>Documents by Traffic graph ( x% documents account for y% of traffic)
>>Growth Rates (once per measurement session)
>>Number of users
>>Number of files and page requests per user
>>Number of files serviced
>>Number of pages serviced
>>Number of CGI/dynamic content serviced
>>Bytes transferred
>>Byte latency
>>Number of files on server
>>Number of bytes on server
>>Doubling period for all of the above metrics
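The doubling period in the last item could be estimated from two
measurements of the same metric, for instance as in the sketch below. It
assumes growth is exponential between the two measurement sessions, which
is my assumption and not something the list above states; the function
name and the example figures are hypothetical.

```python
import math


def doubling_period(v_start, v_end, elapsed_days):
    """Estimate the doubling period, in days, of a growing metric.

    Assumes exponential growth: v_end = v_start * 2 ** (elapsed / period),
    which gives period = elapsed * ln(2) / ln(v_end / v_start).
    """
    if v_start <= 0 or elapsed_days <= 0:
        raise ValueError("start value and elapsed time must be positive")
    if v_end <= v_start:
        raise ValueError("metric did not grow, so no doubling period exists")
    return elapsed_days * math.log(2) / math.log(v_end / v_start)


# Hypothetical example: bytes transferred grew fourfold over 60 days,
# i.e. it doubled twice, so the period comes out at about 30 days.
period = doubling_period(1000, 4000, 60)
```

With more than two measurement sessions, a fit over all the points would
probably be more robust than this two-point estimate.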
************************************************************
Johan HJELM
Ericsson Research, User Applications Group
Currently visiting engineer at the W3C
The World Wide Web Consortium
hjelm@w3.org
http://www.w3.org/People/W3Cpeople.html#Hjelm
Fax +1-617-258 5999, Phone +1-617-263-9630
MIT/LCS, 545 Tech. Sq. Cambridge MA 02139 USA
opinions are personal, always my own,
and not necessarily those of Ericsson or the W3C.
============================================================