- From: Geoffrey Sneddon <foolistbar@googlemail.com>
- Date: Thu, 16 Aug 2007 13:59:46 +0100
- To: Robert Burns <rob@robburns.com>
- Cc: public-html@w3.org
On 16 Aug 2007, at 04:40, Robert Burns wrote:

> A scientific approach would involve several things. It would be conducted with the goal of retrieving unbiased data. That means giving every HTML document an equal probability of selection. Right now, you're conducting research based on entries in a Google cache. It's biased toward pages that want Google's attention. Pages behind firewalls or on local drives are completely left out of the research. I don't have any research on this, but I would expect such pages to often pay more attention to detail than the pages fighting for Google's attention. It would be like looking through the emails passing through an email server and concluding that most emails are about penis enlargement or counterfeit watches.

May I ask how you propose to get a better data set than Google's cache? You're highly unlikely to get data from behind firewalls or off local drives.

> Genuine scientific statistical research also lays out its methodology and is reproducible. From a scientific perspective, saying "I searched a cache that I have, that you can't search, and I won't even show you the code that produces that cache" would be the same as me saying the following: "I have this 8-ball, and when I ask it if we should drop @usemap from |input| it tells me 'not likely'. You may say that sure, 8-balls say that, but the odd part is that it says that every time [cue eerie music]." :-) The point, though, is that it can't be reproducible at all if it's all based on hidden data and methods.

Again: how are you going to get a better data set?

- Geoffrey Sneddon
Received on Thursday, 16 August 2007 12:59:58 UTC