Web Characterization Metrics
April 15, 1999
Editor: Brian Lavoie (OCLC)
Every Web metric should be associated with a tuple that specifies the data collection unit and its scope in time and space:
Element: the Web object for which data is collected.
Examples:
User, Web client, Web server, Web site, Web page,
Web collection, the entire Web. See the terminology sheet
(http://www.oclc.org/oclc/research/projects/webstats/currterms.htm) for definitions of these Web objects.
Population Scope: the element population to which the metric applies.
Examples:
Entire Web, Web site, Web page ...
Temporal Scope: the time frame implicit in the metric.
Examples:
static measures, dynamic measures (including
rate of change, doubling period).
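As an illustration of a dynamic measure, the following minimal sketch derives a doubling period from two counts of the same element taken at different times. It assumes constant exponential growth between the two observations; the function name and the sample counts are hypothetical and do not come from any measured data.

```python
import math


def doubling_period(count_start, count_end, elapsed):
    """Return the doubling period, in the same units as `elapsed`.

    Assumption: the count grows exponentially at a constant rate
    between the two observations.
    """
    if count_start <= 0 or count_end <= count_start or elapsed <= 0:
        raise ValueError("need a positive elapsed time and a growing, positive count")
    # Continuous growth rate per unit of time.
    growth_rate = math.log(count_end / count_start) / elapsed
    return math.log(2) / growth_rate


# Hypothetical counts: a population that quadruples over 12 months
# doubles roughly every 6 months.
print(doubling_period(1000, 4000, 12))
```

A static measure, by contrast, would simply be one of the two counts reported for a single point in time.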
Metric property examples:
Metric: User classification
Tuple: <user, all users at W3C, May 1999>
Metric: Mime-type distribution
Tuple: <Web page, pages on W3C Web site, June 1999>
Metric: Growth of the Web
Tuple: <Web site, entire Web, June 1998-June 1999>
Because this tuple exists for every Web metric, it is not necessary to list static and dynamic versions of the same metric separately, or to list the same metric once for each population to which it is applied.
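One way to picture the tuple is as a small record attached to each metric. The sketch below is only an illustration: the class and field names are hypothetical, the first two tuples are taken from the examples above, and the third (the same metric applied to the entire Web) is an invented variation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricScope:
    """The <element, population scope, temporal scope> tuple."""
    element: str
    population: str
    temporal: str


@dataclass(frozen=True)
class Metric:
    name: str
    scope: MetricScope


growth = Metric(
    "Growth of the Web",
    MetricScope("Web site", "entire Web", "June 1998-June 1999"),
)
mime_w3c = Metric(
    "Mime-type distribution",
    MetricScope("Web page", "pages on W3C Web site", "June 1999"),
)
# Hypothetical variation: same metric, different population scope.
mime_web = Metric(
    "Mime-type distribution",
    MetricScope("Web page", "entire Web", "June 1999"),
)

# One metric name, two scopes: no separate metric definition is needed.
print(mime_w3c.name == mime_web.name, mime_w3c.scope == mime_web.scope)
```

Keeping the scope separate from the metric name is what lets a single list entry cover static, dynamic, and population-specific variants.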
The metrics listed below are grouped according to element.
Data Source
Sample data (see "A Methodology for Sampling the World Wide Web",
http://www.oclc.org/oclc/research/publications/review97/oneill/o'neillar980213.htm).
Metrics
Number of Web servers (see "Web Server Taxonomy" section of terminology
sheet for more details).
Number of Web sites
Number of unique Web sites (e.g., filter out Web sites located at multiple IP addresses)
Number of Web pages
Number of Web collections
Number of bytes
Network traffic (e.g., bytes transferred, Web pages accessed, etc.)
Ratio of size of core to size of periphery
Percentage breakdown of protocols across the periphery
Data Source
Survey data
Metrics
User classification (adult, child, professional user, casual user, etc.)
User access method (ISP, dial-up modem, wireless network, etc.)
User response rate and attrition rate
Data filtering imposed by user (i.e., which client filters have been activated by the user; see client-side filtering below)
Files transferred per user
Unique files transferred per user
Pages transferred per user
Unique pages transferred per user
Web sites visited per user
Unique Web sites visited per user
Reoccurrence rates for files, pages, and sites
Sessions per user per time period
Temporal length of sessions per user
Inter-session time per user (session to session time)
Path length of sessions per user
Stack distance per user
Inter-request time per user (request to request time)
Intra-request time per user (request to render time)
Temporal length of visit per site per user
Path length of visit per site per user
Ratio of explicit clicks to implicit clicks, per user per session
Ratio of embedded clicks to user-supplied clicks, per user per session
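Several of the per-user metrics above can be derived from an ordered list of one user's requests. As one example, the sketch below computes stack distance, assuming the common LRU-stack definition: the depth at which a requested item is found in a stack ordered by recency of use. This document does not define the term, so that definition, the function name, and the sample request sequence are all assumptions.

```python
def stack_distances(requests):
    """Return the LRU stack distance of each request, in order.

    A first-time request has no distance and is reported as None.
    A distance of 1 means the item was also the previous request.
    """
    stack = []  # most recently requested item is at index 0
    distances = []
    for item in requests:
        if item in stack:
            depth = stack.index(item) + 1
            stack.remove(item)
        else:
            depth = None
        distances.append(depth)
        stack.insert(0, item)  # the item becomes the most recent
    return distances


# Hypothetical request sequence for a single user.
print(stack_distances(["a", "b", "a", "c", "b", "b"]))
# prints [None, None, 2, None, 3, 1]
```

Small distances indicate that a user tends to revisit recently requested items, which relates to the reoccurrence rates listed above.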
Data Source
Log files
Metrics
Type of client (browser, robot, etc.)
Renderable mime-types
Java-enabled (yes, no)
Click-generation functionality (address window, favorites list, history list, etc.)
HTML fluency (i.e., what is the latest version of HTML recognized by the client?)
Client-side filtering capability (Internet content ratings, certificates,
etc.)
Data Source
Log files
Metrics
Internet node identification (IP address and port)
Domain name (and aliases)
Other Internet nodes mapped to same domain name
HTTP node classification (inaccessible, redirection, accessible; these classifications will be time-sensitive; see volatility metric below)
Top-level domain (e.g., .com, .edu, etc.)
Geographical location
Number of subsites (i.e., single Web site on server, or host site with subsites (virtual hosting))
Server-side filtering (e.g., robots.txt, firewalls, etc.)
Number of files on server
Number of Web pages on server
Files/pages by traffic graph (e.g., what % of files/pages accounts for what % of traffic)
Volatility level (summarizing the accessibility of the server during a given time period)
Ratio of explicit clicks to implicit clicks for server
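The files/pages by traffic metric in the list above can be read as a concentration measure. The following minimal sketch reports the fraction of distinct files that together account for a chosen share of requests. It assumes traffic is counted in requests rather than bytes and that the log has already been reduced to a list of requested paths; the function name and sample paths are hypothetical.

```python
from collections import Counter


def traffic_concentration(requested_files, traffic_share=0.8):
    """Return the fraction of distinct files needed to cover
    at least `traffic_share` of all requests."""
    counts = Counter(requested_files)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no requests to analyze")
    covered = 0
    files_needed = 0
    # Walk the files from most to least requested.
    for _path, hits in counts.most_common():
        covered += hits
        files_needed += 1
        if covered >= traffic_share * total:
            break
    return files_needed / len(counts)


# Hypothetical log: 10 requests spread over 4 files.
log = ["/index.html"] * 6 + ["/a.html"] * 2 + ["/b.html", "/c.html"]
print(traffic_concentration(log, 0.8))  # prints 0.5
```

Evaluating the function over a range of `traffic_share` values would give the points of the graph the metric refers to.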
Discussion:
I don't think that server metrics should include
metrics that relate to content. Content metrics should be confined to the
Web resources (as discussed below). For example, the server metrics list
previously included a "modification of content history" metric; this should
be applied to the specific resource containing the content (e.g., a Web
page).
Data Source
Sample data; log files
Metrics
Web site publisher
North American Industry Classification System (NAICS) code for publisher
Textual description of site's content
Content access scheme (free, pay-per-view, subscription, etc.)
Number of Web pages
Number and type of Web collections
Number of user Web page requests per time period
Number of search engines indexing the site
Number of pages served per time period
Percentage of site devoted to CGI/dynamic content
Bytes transferred per time period
Byte latency
Birth and modification history (major revisions of content - from HTTP header?)
Cookie supplied (yes, no)
Depth (number of levels in site's internal link structure)
Data Source
Sample data; log files
Metrics
Aggregate size of constituent Web resources (in bytes)
Number and type of embedded non-text objects (images, video, streaming data, applets, etc.)
Hyperlinks per page
Percentage breakdown of mime types in hyperlinks (e.g., html, jpg, ps, etc.)
Percentage breakdown of protocols in hyperlinks (e.g., http, shttp, gopher, etc.)
Ratio of internal to external links on page
Textual description of page's content
Content access scheme (free, pay-per-view, subscription, etc.)
Birth and modification history (major revisions of content - from HTTP
header?)
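The hyperlink metrics above can be computed directly from a page's markup. The sketch below counts internal and external links using only the Python standard library. It assumes "internal" means a link that resolves to the same host as the page itself, which this document does not specify; the class name, function name, and sample page are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def link_counts(html, page_url):
    """Return (internal, external) hyperlink counts for one page."""
    parser = LinkCollector()
    parser.feed(html)
    page_host = urlparse(page_url).netloc.lower()
    internal = external = 0
    for href in parser.hrefs:
        # Resolve relative links against the page's own URL.
        host = urlparse(urljoin(page_url, href)).netloc.lower()
        if host == page_host:
            internal += 1
        else:
            external += 1  # includes links with no host, e.g. mailto:
    return internal, external


# Hypothetical page with two internal links and one external link.
page = (
    '<a href="intro.html">Intro</a>'
    '<a href="/about.html">About</a>'
    '<a href="http://www.example.com/">Elsewhere</a>'
)
print(link_counts(page, "http://www.example.org/docs/index.html"))
# prints (2, 1)
```

The ratio of internal to external links is then the first count divided by the second, whenever the second is non-zero; `len(parser.hrefs)` would give the hyperlinks-per-page count.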
Data Source
Sample data; log files
Metrics
Type of collection (online journal, photo gallery, etc.)
Content access scheme (free, pay-per-view, subscription, etc.)
Number of Web pages in collection
Birth and modification history (major revisions of content - from HTTP
header?)