- From: James Graham <jg307@cam.ac.uk>
- Date: Thu, 16 Aug 2007 14:31:12 +0100
- To: Geoffrey Sneddon <foolistbar@googlemail.com>
- CC: Robert Burns <rob@robburns.com>, public-html@w3.org
> On 16 Aug 2007, at 04:40, Robert Burns wrote:
>
>> A scientific approach would involve several things. It would be
>> conducted with a goal to retrieve unbiased data. That means giving
>> every HTML document an equal probability of selection.

FWIW it is seldom possible to select unbiased data in "scientific studies". Consider astronomy, for example, where surveys of particular types of object are typically limited by the sensitivity of the instrument being used (so you can only detect faint things if they are nearby), by the angular resolution available (so you cannot distinguish objects that are close together on the sky), and by a variety of other factors which depend on what you are trying to measure. Nevertheless it is possible to make progress in our understanding of the universe through analysis of astronomical surveys. All that's required is care in interpreting the data.

>> Genuine scientific statistical research also lays out methodology
>> and is reproducible.

It is actually often surprisingly difficult to reproduce scientific studies. For example, it is considered perfectly permissible to publish scientific studies based on closed-source code running on proprietary hardware. However, there is generally enough methodology documented that someone could in principle reproduce the results by making a similar study with their own code on their own system.

>> From a scientific perspective, saying I searched a cache that I
>> have, that you can't search and I won't even show you the code that
>> produces that cache, would be the same as me saying the following.
>> "I have this 8-ball and when I ask it if we should drop @usamap
>> from |input| it tells me 'not likely'. You may say that sure,
>> 8-balls say that. But the odd part is that it says that every time
>> [cue eerie music]." :-) The point though is that it can't be
>> reproducible at all if it's all based on hidden data and methods.

It's neither based on hidden data nor a hidden method. The data is all publicly accessible webpages. The methodology is: a) spider the webpages, b) run the parsing algorithm in the HTML 5 spec over the resulting files, and c) extract whatever data is of interest. That seems, in principle, pretty straightforward to me, and at least as reproducible as many peer-reviewed scientific studies. Indeed, Philip Taylor has already managed to reproduce the procedure on a smaller dataset and has thus independently verified many of Hixie's results.

--
"Eternity's a terrible thought. I mean, where's it all going to end?"
 -- Tom Stoppard, Rosencrantz and Guildenstern are Dead
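P.S. As a rough sketch of what steps b) and c) might look like (my own illustration, not Hixie's actual code): using Python with html5lib as an implementation of the spec's parsing algorithm, and taking usemap="" on |input| as the datum of interest, with the directory of spidered files assumed:

    # Steps b) and c): run an HTML 5 parser over the files saved in
    # step a) and count <input> elements carrying a usemap attribute.
    # The "spidered/" directory and the attribute surveyed here are
    # illustrative assumptions.
    import glob
    import html5lib

    INPUT_TAG = "{http://www.w3.org/1999/xhtml}input"

    total = with_usemap = 0
    for path in glob.glob("spidered/*.html"):
        with open(path, "rb") as f:
            # Parses per the spec's algorithm; returns an
            # xml.etree Element for the document root.
            root = html5lib.parse(f)
        for el in root.iter(INPUT_TAG):
            total += 1
            if el.get("usemap") is not None:
                with_usemap += 1

    print("input elements: %d, with usemap: %d" % (total, with_usemap))

Anyone can point something like this at their own spider dump, which is exactly the sense in which the procedure is reproducible.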