Re: How to certify a log
From: Balachander Krishnamurthy (bala@research.att.com)
Date: Fri, Jan 29 1999
Message-Id: <199901290854.DAA39412@raptor.research.att.com>
To: Jim Pitkow <pitkow@parc.xerox.com>
Cc: www-wca@w3.org
Date: Fri, 29 Jan 1999 03:54:44 -0500
[ok, am very confused about which mailing list am supposed to send this to;
jim's note was sent to w3c-wca@w3.org, but that bounced]
jim writes in response to my note
> >. Are the fields within the range they should be
> > (bogus date/IMS/lmodtime - in the future, invalid response codes..)
> This can be tough and computationally expensive for large logs (i.e., to
> know it's a valid date field one may try to parse it according to a
> specified format using regular expressions of an enumerated list of
> possible values, etc), but needs to be done. Should the accepting
> repository do this work? the initial researchers? or the first users of
> the log from the repository (public service - use it, verify it)?
all my suggestions (requirements?) apply before a log gets into a repository
and is used by n others. it is incredibly cheap to clean the logs compared
to dealing with the bogus results one will get otherwise. if people are too
lazy to do this before they check a log into the repository, it should go
in the "other" bin (that was discussed on the phone). am not in favour of
this "other" bin, but am happy to let a part of the repository consist of
random logs, in keeping with a consensus spirit.
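to give a feel for how cheap a per-record check can be, here is a minimal sketch in python. it assumes timestamps in the common log format layout and treats any three-digit code from 100 to 599 as a valid response code; the name field_problems() and both assumptions are illustrative, not something we have agreed on.

```python
import re
from datetime import datetime, timezone

# assumed timestamp layout (common log format), e.g. 29/Jan/1999:03:54:44 -0500
CLF_TIME = "%d/%b/%Y:%H:%M:%S %z"
# assumed notion of a valid response code: three digits, 100 to 599
STATUS = re.compile(r"[1-5][0-9][0-9]")

def field_problems(timestamp, status, now):
    """Return a list of problems found in one record's date and status fields."""
    problems = []
    try:
        when = datetime.strptime(timestamp, CLF_TIME)
        if when > now:
            problems.append("date in the future")
    except ValueError:
        problems.append("unparseable date")
    if not STATUS.fullmatch(status):
        problems.append("invalid response code")
    return problems

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(field_problems("29/Jan/1999:03:54:44 -0500", "200", now))
    print(field_problems("not a date", "999", now))
```

the same pattern would extend to the IMS and lmodtime fields; this is one pass over the log with constant work per record.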
>
> >. Are the individual values clean to facilitate parsing
> > (embedded '/', control characters, reasonable length etc.)
>
> I typically transform the requested URL into a canonical, escaped path
> before converting to an id.
there are many ways to do this and canonicalizing is a good one. if we were
to come up with a library of routines to clean logs, canonicalize() would
be a good routine to include. this can be somewhat non-trivial since i have
run into embedded newlines (!), bogus "http:" in the *middle* of a URL, etc.
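a rough sketch of what such a canonicalize() might look like, in python. this is only one possible design: it drops control characters (including embedded newlines) and percent-escapes the rest, and the set of characters left unescaped is my assumption. the helper has_embedded_scheme() is hypothetical; it only flags a stray "http:" and leaves the decision about what to do with it to the caller.

```python
import re
from urllib.parse import quote, unquote

# control characters, including embedded newlines and tabs
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")

def canonicalize(url):
    """Return an escaped form of url with control characters removed."""
    cleaned = CONTROL_CHARS.sub("", url)
    # unquote first so input that is already escaped is not escaped twice;
    # the safe set below is an assumption, not a standard
    return quote(unquote(cleaned), safe="/:?=&")

def has_embedded_scheme(url):
    """True if "http:" appears anywhere after the start of the url."""
    return url.find("http:", 1) != -1

if __name__ == "__main__":
    print(canonicalize("/a b\n/c"))
    print(has_embedded_scheme("http://x/http://y"))
```

flagging rather than repairing the embedded "http:" case seems safer, since it is not obvious which half of such a URL the client meant.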
> >. Sanity across log:
> > are dates in the range and monotonically increasing
> > distribution of content sizes reasonable
> > % of response codes in expected range? (rare to have non-200/304 > 5%)
>
> Another possibility is to generate descriptive statistics for the fields in
> the log (min, max, mean, mode, stand dev) as well as distributions where
> applicable (file size, inter-request time, etc). My thinking here is that
> a) this will help identify anomalies (this is how I find a bunch) and b)
> help researchers find logs of particular interest (looking for a site with
> a lot of large images).
anja suggested something similar in a private email to me. adding extra
characterization info about a log is good and maybe this can be inserted
into a db to permit quick searches. however, this is separate from certifying
the log.
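the across-log sanity checks and the simple statistics can come out of the same pass. a minimal sketch in python, assuming each record has already been parsed into a (timestamp, status, size) tuple; log_sanity() and the tuple layout are made up for illustration, and "monotonically increasing" is read as non-decreasing since several requests can share a timestamp.

```python
from statistics import mean, pstdev

def log_sanity(records):
    """Summarize a log given as (timestamp, status, size) tuples in log order."""
    if not records:
        raise ValueError("empty log")
    times = [t for t, _, _ in records]
    # count adjacent pairs where the later record has an earlier timestamp
    out_of_order = sum(1 for a, b in zip(times, times[1:]) if b < a)
    other = sum(1 for _, status, _ in records if status not in (200, 304))
    sizes = [size for _, _, size in records]
    return {
        "out_of_order": out_of_order,
        "non_200_304_fraction": other / len(records),
        "size_min": min(sizes),
        "size_max": max(sizes),
        "size_mean": mean(sizes),
        "size_stdev": pstdev(sizes),
    }

if __name__ == "__main__":
    sample = [(1, 200, 100), (2, 304, 0), (2, 404, 50), (1, 200, 300)]
    print(log_sanity(sample))
```

a non-zero out_of_order count or a non_200_304_fraction above 0.05 would be a reason to look closer before certifying; the statistics themselves are the extra characterization info, not the certification.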
> What do we want to do about spiders? They can bias results significantly.
toss 'em out depending on the experiment. in our infocom99 paper we did this
(there we were looking at hint generation for future access and spiders
are not likely to care about hints). but one might want them if one were
doing a performance test on the server (paul might care about this).
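for experiments where spiders should go, a filtering step could look like the sketch below (python). how to recognize a spider is an open question; this uses the crude heuristic that any client that ever asked for /robots.txt is a spider, which is my assumption here and not necessarily what we did in the paper. records are assumed to be (client, url) tuples.

```python
def spider_clients(records):
    """Clients that requested /robots.txt at least once (a crude heuristic)."""
    return {client for client, url in records if url == "/robots.txt"}

def without_spiders(records):
    """Return the records whose client was never seen fetching /robots.txt."""
    spiders = spider_clients(records)
    return [(client, url) for client, url in records if client not in spiders]

if __name__ == "__main__":
    sample = [("a", "/robots.txt"), ("a", "/x"), ("b", "/y")]
    print(without_spiders(sample))
```

keeping this as a separate step, rather than part of cleaning, leaves the spiders in the repository copy for anyone who wants them.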
cheers,
bala