Re: How to certify a log

From: Jim Pitkow (pitkow@parc.xerox.com)
Date: Fri, Jan 29 1999


Message-Id: <4.1.19990129015820.00971a20@mailback.parc.xerox.com>
Date: Fri, 29 Jan 1999 01:58:35 PST
To: www-wca@w3.org
From: Jim Pitkow <pitkow@parc.xerox.com>
Subject: Re: How to certify a log

At 01:33 PM 1/27/99 , Balachander Krishnamurthy wrote:
>. Are the fields within the range they should be 
>	(bogus date/IMS/lmodtime - in the future, invalid response codes..)

This can be tough and computationally expensive for large logs (e.g., to
know that a date field is valid, one may have to parse it against a
specified format using regular expressions, or check it against an
enumerated list of possible values), but it needs to be done.  Should the
accepting repository do this work?  The initial researchers?  Or the first
users of the log from the repository (public service: use it, verify it)?
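
As a rough illustration of the per-record field checks discussed above, here
is a minimal sketch.  It assumes Common Log Format timestamps such as
"29/Jan/1999:01:58:35 -0800" and an English locale for month names; the
function name, the status-code list, and the collected_at cutoff are all
hypothetical choices for the example, not anything agreed on this list.

```python
from datetime import datetime, timezone

# Assumed timestamp layout (Common Log Format); other servers may differ.
CLF_TIME = "%d/%b/%Y:%H:%M:%S %z"

# Illustrative, deliberately incomplete set of acceptable response codes.
VALID_STATUS = {200, 201, 202, 204, 206, 300, 301, 302, 304,
                400, 401, 403, 404, 500, 501, 502, 503}


def check_fields(timestamp, status, collected_at):
    """Return a list of problems found in one record (empty if none).

    collected_at must be a timezone-aware datetime marking when the log
    was closed; any request dated after it is treated as bogus.
    """
    problems = []
    try:
        when = datetime.strptime(timestamp, CLF_TIME)
        if when > collected_at:
            problems.append("date in the future")
    except ValueError:
        problems.append("unparseable date")
    if not (status.isascii() and status.isdigit()) or int(status) not in VALID_STATUS:
        problems.append("invalid response code")
    return problems


cutoff = datetime(1999, 2, 1, tzinfo=timezone.utc)
print(check_fields("29/Jan/1999:01:58:35 -0800", "200", cutoff))
print(check_fields("31/Feb/1999:01:58:35 -0800", "999", cutoff))
```

Parsing with strptime rather than a regular expression also rejects
impossible dates (31 Feb), at some extra cost per record.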

>. Are the individual values clean to facilitate parsing
>	(embedded '/', control characters, reasonable length etc.)

I typically transform the requested URL into a canonical, escaped path
before converting to an id.

>. Sanity across log: 
>	are dates in the range and monotonically increasing
>	distribution of content sizes reasonable
>	% of response codes in expected range? (rare to have non-200/304 > 5%)

Another possibility is to generate descriptive statistics for the fields in
the log (min, max, mean, mode, standard deviation) as well as distributions
where applicable (file size, inter-request time, etc.).  My thinking here is
that this will a) help identify anomalies (this is how I find a bunch of
them) and b) help researchers find logs of particular interest (e.g.,
someone looking for a site with a lot of large images).
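
A small sketch of such a per-field summary follows, using only the Python
standard library.  The function names and the bucket edges are hypothetical,
and the population standard deviation is used on the assumption that the log
is the whole data set rather than a sample.

```python
import statistics
from bisect import bisect_right
from collections import Counter


def describe(values):
    """Summary statistics for one numeric field of a log."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "mode": Counter(values).most_common(1)[0][0],
        "stdev": statistics.pstdev(values),  # population standard deviation
    }


def histogram(values, edges):
    """Count values per bucket, given sorted bucket edges.

    counts[0] holds values below edges[0], counts[i] holds values with
    edges[i-1] <= v < edges[i], and the last entry holds v >= edges[-1].
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return counts


sizes = [100, 200, 200, 1500]          # made-up content sizes in bytes
print(describe(sizes))
print(histogram(sizes, [0, 1000]))
```

The same two functions apply to inter-request times once the timestamps have
been parsed and differenced.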

What do we want to do about spiders?  They can bias results significantly.