Re: Requirements for research (Was: Dropping <input usemap="">) from Robert Burns on 2007-08-16 (public-html@w3.org from August 2007)

From: Robert Burns <rob@robburns.com>
Date: Thu, 16 Aug 2007 08:39:09 -0500
To: Geoffrey Sneddon <foolistbar@googlemail.com>
Cc: public-html@w3.org
Message-Id: <78DE4B55-A90E-4146-A4D7-9298171C7400@robburns.com>
Hi Geoffrey,

On Aug 16, 2007, at 7:59 AM, Geoffrey Sneddon wrote:

> On 16 Aug 2007, at 04:40, Robert Burns wrote:
>
>> A scientific approach would involve several things. It would be  
>> conducted with a goal to retrieve unbiased data. That means giving  
>> every HTML document an equal probability of selection. Right now,  
>> you're conducting research based on entries in a Google cache. It's  
>> biased toward pages that want Google's attention. Those pages  
>> behind firewalls, or on local drives, are completely left out of  
>> the research. I don't have any research on this, but I would  
>> expect such pages to often pay more attention to details than the  
>> pages fighting for Google's attention. It would be like looking  
>> through the emails passing through an email server and concluding  
>> that most emails are about penis enlargement or counterfeit watches.
>
> May I ask how you propose to get a better data set than  
> Google's cache? You're highly unlikely to get data from behind  
> firewalls or local drives.

Scientific evidence ain't cheap. That's part of my point. However,  
there are many reasons to think that going after the low-hanging fruit  
with our statistics is biasing the results substantially (and is  
certainly not a scientific or evidence-producing approach). Outside  
firewalls is that no-man's-land of the internet that we all love. It's  
filled with spammers and porn and all sorts of strange characters.  
What sits behind the firewalls and on our local drives is likely to be  
tamer, and maybe even more standards-focused, markup.

>> Genuine scientific statistical research also lays out methodology  
>> and is reproducible.  From a scientific perspective, saying I  
>> searched a cache that I have, that you can't search and I won't  
>> even show you the code that produces that cache, would be the  
>> same as me saying the following. "I have this 8-ball and when I  
>> ask it if we should drop @usemap from |input| it tells me 'not  
>> likely'. You may say that sure, 8-balls say that. But the odd part  
>> is that it says that every time [cue eerie music]." :-) The point  
>> though is that it can't be reproducible at all if it's all based on  
>> hidden data and methods.
>
> Again: how are you going to get a better data set?

Well, there's always my 8-ball :-). Getting better data would require  
substantial effort. However, it's that substantial effort that leads  
to evidence. Without the effort we don't really have evidence. We  
have someone poking around the Google cache.

With substantial effort we have the evidence that Josh has  
volunteered to produce. With some effort we might enlist a university  
somewhere to help us conduct real scientific statistical research.  
They could generate a comprehensive list or lists (properly weighted  
to ensure equal probabilities of selection), randomly draw  
participants from the list, and then conduct phone, web, mail, or  
in-person interviews. Some participants could be selected to participate  
in a 'bot analysis where their data is sucked through some extraction  
algorithm that leaves the actual content obscured, but lets us see  
all of the tag goodness inside. Would this work? I don't know. It's  
not my field. Is it worth it all? I don't think so, but others seem  
very interested in statistical evidence. However, studies like this  
are conducted every day. And studies like this get results on many  
occasions. It just takes a lot of specialists who know a lot more  
about this stuff than I do to pull it off.
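
To make the idea a little more concrete, here is a rough Python sketch of the two steps described above: drawing documents from a list with equal probability, and running an extraction pass that keeps only element and attribute names while discarding text content and attribute values. It is only an illustration under assumptions. The names `draw_sample`, `TagSkeleton`, and `summarize` are hypothetical, the sample markup is made up, and building a properly weighted list of documents in the first place is the hard part that this does not address.

```python
import random
from collections import Counter
from html.parser import HTMLParser


def draw_sample(documents, k, seed):
    """Draw k documents without replacement, each equally likely.

    A fixed seed makes the draw repeatable by anyone holding the same list.
    """
    return random.Random(seed).sample(documents, k)


class TagSkeleton(HTMLParser):
    """Counts element and attribute names; never stores text or values."""

    def __init__(self):
        super().__init__()
        self.elements = Counter()
        self.attributes = Counter()

    def handle_starttag(self, tag, attrs):
        self.elements[tag] += 1
        for name, _value in attrs:  # the attribute value is deliberately ignored
            self.attributes[(tag, name)] += 1

    # handle_data is not overridden: the base-class method does nothing,
    # so the document's text content is simply dropped.


def summarize(html_text):
    """Return (element counts, attribute counts) for one document."""
    parser = TagSkeleton()
    parser.feed(html_text)
    parser.close()
    return parser.elements, parser.attributes


# Made-up example document; only its tag structure survives the pass.
elements, attributes = summarize(
    '<form><input type="image" usemap="#m" src="secret.png">private text</form>'
)
print(attributes[("input", "usemap")])
```

Publishing the seed, the list, and the script would at least make the counting step reproducible, which is the property the hidden-cache approach lacks.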

Take care,
Rob
Received on Thursday, 16 August 2007 13:39:25 UTC