Re: repository requirements

From: Jim Pitkow (pitkow@parc.xerox.com)
Date: Tue, Apr 20 1999


Date: Tue, 20 Apr 1999 13:46:04 PDT
To: "www-wca@w3.org" <www-wca@w3.org>
From: Jim Pitkow <pitkow@parc.xerox.com>
Message-Id: <99Apr20.134621pdt."361693"@louise.parc.xerox.com>
Subject: Re: repository requirements


Marc, thanks for your responses.

At 11:21 AM 4/20/99 , Marc Abrams wrote:
>All log files used by anyone writing a paper should be in the uncertified
>section of the repository.  Why?  If I read someone's paper and question the
>analysis done, I would like to have access to the data used!  

I agree - having access to the logs other researchers report in papers
promotes good scientific rigor.  If they submit the log though, we should
be able to certify it, no?

>Certainly for papers published in the pre-certification days access to the
>(uncertified) logs used is desirable.  (Or do we retroactively certify all
>logs used in the past?)

It'd be nice to do the latter if possible.

>Final point.  Suppose I am a researcher and I want to do a study on X.  Turns
>out there is no certified log in the wca repository on x.  What do I do?  (1)
>Wait until a log gets certified and then do my study, or (2) just do the study
>and try to get the log certified later?  If the world does (2), and the log
>doesn't wind up certified, we again have a paper in the literature for which
>the original trace data is unavailable.

1) The certification process should be fairly quick (on the order of days, since we
are attempting to automate parts of this), no?  If so, it seems to me that
this delay should not make or break anyone's research time frame.

2) I like your thought experiment of pushing to see under what cases logs
will not get certified.  Here are some of the reasons I can think of:
	a) the researcher cannot submit the logs due to their proprietary nature - this
already happens a lot as it is today, so there is no realized gain/loss.
	b) the data is found to be inconsistent/erroneous - if this is the case,
then all results based on this data are suspect at best.
	c) incomplete meta-data, methodological description, etc., which can be
rectified with the submitting researcher.
	d) other reasons?

Another issue is what to do with continuous data feeds, for example the
daily caching logs from NLANR?  Once automated, we should be able to
certify each daily set of logs and provide cleaned versions if necessary.
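
To make the automated part concrete, here is a rough sketch of what a
per-day certification pass might look like.  Everything in it is an
assumption for illustration only: the required metadata field names, the
Submission type, and the "too few fields on a line" test standing in for a
real consistency check are all hypothetical, not something we have agreed on.
It only covers reasons (b) and (c) above; (a) never reaches the repository.

```python
from dataclasses import dataclass, field

# Hypothetical metadata fields; the thread does not define a required set.
REQUIRED_METADATA = ("source", "collection_method", "start_date", "end_date")


@dataclass
class Submission:
    """One submitted log, e.g. a single day's log from a continuous feed."""
    name: str
    metadata: dict
    lines: list = field(default_factory=list)


def certify(sub, min_fields=3):
    """Return (certified, reasons); an empty reasons list means certified."""
    reasons = []

    # Reason (c): incomplete meta-data, fixable with the submitter.
    missing = [k for k in REQUIRED_METADATA if not sub.metadata.get(k)]
    if missing:
        reasons.append("incomplete metadata: " + ", ".join(missing))

    # Reason (b): inconsistent/erroneous data.  As a stand-in check, flag
    # lines with fewer whitespace-separated fields than expected.
    bad = [i for i, line in enumerate(sub.lines, 1)
           if len(line.split()) < min_fields]
    if bad:
        reasons.append("malformed lines: " + ", ".join(map(str, bad)))

    return (not reasons, reasons)


def certify_feed(daily_submissions):
    """Certify each day's log independently, keyed by submission name."""
    return {sub.name: certify(sub) for sub in daily_submissions}
```

The point of returning the reasons rather than a bare yes/no is that a
rejection under (c) can go straight back to the submitting researcher, and
one bad day in a feed does not hold up certification of the other days.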