- From: James Graham <jg307@cam.ac.uk>
- Date: Thu, 16 Aug 2007 14:31:12 +0100
- To: Geoffrey Sneddon <foolistbar@googlemail.com>
- CC: Robert Burns <rob@robburns.com>, public-html@w3.org
> On 16 Aug 2007, at 04:40, Robert Burns wrote:
>
>> A scientific approach would involve several things. It would be
>> conducted with a goal to retrieve unbiased data. That means giving
>> every HTML document an equal probability of selection.

FWIW it is seldom possible to select unbiased data in "scientific studies". Consider astronomy, for example, where surveys of particular types of object are typically limited by the sensitivity of the instrument being used (so you can only detect faint things if they are nearby), by the angular resolution available (so you cannot distinguish objects that are close together on the sky), and by a variety of other factors which depend on what you are trying to measure. Nevertheless it is possible to make progress in our understanding of the universe through analysis of astronomical surveys. All that's required is care in interpreting the data.

>> Genuine scientific statistical research also lays out methodology
>> and is reproducible.

It is actually often surprisingly difficult to reproduce scientific studies. For example, it is considered perfectly permissible to publish scientific studies based on closed-source code running on proprietary hardware. However, there is generally enough methodology documented that someone could in principle reproduce the results by making a similar study with their own code on their own system.

>> From a scientific perspective, saying I searched a cache that I
>> have, that you can't search and I won't even show you the code that
>> produces that cache, would be the same as me saying the following.
>> "I have this 8-ball and when I ask it if we should drop @usamap
>> from |input| it tells me 'not likely'. You may say that sure,
>> 8-balls say that. But the odd part is that it says that every time
>> [cue eerie music]." :-) The point though is that it can't be
>> reproducible at all if it's all based on hidden data and methods.

It's neither based on hidden data nor a hidden method. The data is all publicly accessible webpages. The methodology is: a) spider the webpages, b) run the parsing algorithm in the HTML 5 spec over the resulting files, and c) extract whatever data is of interest. That seems, in principle, pretty straightforward to me, and at least as reproducible as many peer-reviewed scientific studies. Indeed, Philip Taylor has already managed to reproduce the procedure on a smaller dataset and has thus independently verified many of Hixie's results.

--
"Eternity's a terrible thought. I mean, where's it all going to end?"
 -- Tom Stoppard, Rosencrantz and Guildenstern are Dead
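P.S. As a rough sketch of what steps b) and c) might look like (my own illustration, not Hixie's actual code): using Python with html5lib as an implementation of the spec's parsing algorithm, and taking usemap="" on |input| as the datum of interest, with the directory of spidered files assumed:

    # Steps b) and c): run an HTML 5 parser over the files saved in
    # step a) and count <input> elements carrying a usemap attribute.
    # The "spidered/" directory and the attribute surveyed here are
    # illustrative assumptions.
    import glob
    import html5lib

    INPUT_TAG = "{http://www.w3.org/1999/xhtml}input"

    total = with_usemap = 0
    for path in glob.glob("spidered/*.html"):
        with open(path, "rb") as f:
            # Parses per the spec's algorithm; returns an
            # xml.etree Element for the document root.
            root = html5lib.parse(f)
        for el in root.iter(INPUT_TAG):
            total += 1
            if el.get("usemap") is not None:
                with_usemap += 1

    print("input elements: %d, with usemap: %d" % (total, with_usemap))

Anyone can point something like this at their own spider dump, which is exactly the sense in which the procedure is reproducible.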