Re: Requirements for research (Was: Dropping <input usemap="">)

Hi Ian.

On Aug 15, 2007, at 5:22 AM, Ian Hickson wrote:

>
> On Wed, 15 Aug 2007, Robert Burns wrote:
>>
>> If we can get financial backing for real scientific studies (statistical
>> or otherwise) of these things, I'd be all for that. I don't know who's
>> going to come up with the kind of money that would be involved, but it
>> might be a useful evidence gathering exercise.
>
> It would be helpful to find out what exactly you would consider "real".
> Google has in fact been funding a series of studies to support this work
> for several years now, and if I could somehow change our methodology or
> report some information that would change our studies from whatever they
> are now to "real" studies, I would be interested in doing that. It might
> not be possible, as some aspects of our studies' methodology involve our
> proprietary technologies, but I would certainly be interested in seeing
> if it was possible.

I would say that if we want to just poke around the web to see some general trends, what you're doing right now is fine. As an exploratory move it can gather useful information to guide discussion and even help identify other scientific research needs. However, if we really want to base adding or dropping features on statistical research (and I'm not so sure we do), it should be done with a scientific approach and not just someone poking around the web to see what's there.

A scientific approach would involve several things. It would be conducted with a goal to retrieve unbiased data. That means giving every HTML document an equal probability of selection. Right now, you're conducting research based on entries in a Google cache. It's biased toward pages that want Google's attention. Pages behind firewalls, or on local drives, are completely left out of the research. I don't have any research on this, but I would expect such pages to often pay more attention to details than the pages fighting for Google's attention. It would be like looking through the emails passing through an email server and concluding that most emails are about penis enlargement or counterfeit watches.
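To make the frame bias concrete, here's a rough sketch in Python. The population, the 60% indexing rate, and the usage rates are all invented purely for illustration; the point is only that when inclusion in the frame is correlated with the thing being measured, the estimate comes out wrong no matter how many pages you look at:

    # Sketch of sampling-frame bias: each "page" either uses a feature
    # or not, and usage is correlated with being indexed. All numbers
    # here are invented for the example.
    import random

    random.seed(0)
    pages = []
    for _ in range(100000):
        indexed = random.random() < 0.6   # suppose 60% of pages are crawlable
        # invented correlation: firewalled/local pages use the feature
        # three times as often as public, attention-seeking pages
        uses_feature = random.random() < (0.03 if indexed else 0.09)
        pages.append((indexed, uses_feature))

    true_rate = sum(u for _, u in pages) / len(pages)
    indexed_pages = [u for i, u in pages if i]
    cache_rate = sum(indexed_pages) / len(indexed_pages)
    print(f"true usage rate:      {true_rate:.3%}")   # ~5.4%
    print(f"cache-only estimate:  {cache_rate:.3%}")  # ~3.0%, understated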

Stating hypotheses is just the first step. Then it requires thinking through the process of how to reach the population of the study so that each member of the population has that equal probability of selection. Trying to conduct a census of all pages is actually counter-productive: it goes for the low-hanging fruit and gives up on the difficult HTML documents. It's the same as adding a poll to a blog and then trying to extrapolate from the results of the poll to a broader population. It just doesn't work. Sure, you got responses from 100% of the population responding to the poll, but in what relation does that population stand to the broader population you want to extrapolate those results to? Also, most of the basis for statistical approaches relies on sample sizes that are very small compared to the population. So that's a problem too, because if your sample size becomes a significant proportion of the population, then the usual methods for calculating confidence intervals, standard errors and the like no longer apply as-is. Much of the power of scientific statistical analysis lies in the property that when you sample the same population many times, the distribution of the sample means becomes approximately normal regardless of the distribution of the population (the central limit theorem). When the sample size approaches the population size you tend to lose that property.
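A quick sketch in Python of both points; the exponential "population" and all the sizes here are made up for the example:

    # (1) Sample means from a skewed population are roughly normal
    #     around the true mean (central limit theorem).
    # (2) When the sample is a large fraction of the population, the
    #     usual standard-error formula needs the finite population
    #     correction sqrt((N - n) / (N - 1)).
    import math
    import random
    import statistics

    random.seed(42)
    N = 100000
    population = [random.expovariate(1.0) for _ in range(N)]  # heavily skewed

    n = 50
    means = [statistics.mean(random.sample(population, n)) for _ in range(2000)]
    print("population mean:       ", statistics.mean(population))  # ~1.0
    print("mean of sample means:  ", statistics.mean(means))       # ~1.0
    print("spread of sample means:", statistics.stdev(means))      # ~1/sqrt(50)

    sigma = statistics.pstdev(population)
    for n in (50, 10000, 50000, 99000):
        fpc = math.sqrt((N - n) / (N - 1))   # -> 0 as n -> N
        print(n, "corrected standard error:", sigma / math.sqrt(n) * fpc)

The corrected error shrinks toward zero as n approaches N, which is exactly where the textbook formulas stop being meaningful.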

Genuine scientific statistical research also lays out its methodology and is reproducible. From a scientific perspective, saying "I searched a cache that I have, that you can't search, and I won't even show you the code that produces that cache" would be the same as me saying the following: "I have this 8-ball, and when I ask it if we should drop @usemap from |input| it tells me 'not likely'. You may say that, sure, 8-balls say that. But the odd part is that it says that every time [cue eerie music]." :-) The point, though, is that it can't be reproducible at all if it's all based on hidden data and methods.

Second, I think the research should state up front what its expectations are. As a WG, we would need to lay out a hypothesis and decide what criteria that hypothesis should affect once we collect the results. Again, I'm not so sure we should be letting statistical research guide our decisions about the recommendation, but if we are, then we need to put our money where our mouth is. Then we need to state up front what impact certain statistical results would have on the recommendation. Otherwise we're just rationalizing after the fact. It's not genuine evidence applied to principles; it's just a twisting of evidence to fit the principles.

I'm not a statistician, and I expect there are some on this list who know more about this field than I do. I have some experience with statistics and many colleagues who are immersed in it, so it's hard for some of this not to rub off on me. However, I feel statistics should be kept in its place. It can shape the discussion, but I don't think it should rule the conversation. If it's to play a large role, then it really needs to be done scientifically. Just poking around web caches is not something I would call 'evidence'. 'Interesting information' sometimes, but not 'evidence'.

Take care,
Rob

Received on Thursday, 16 August 2007 03:40:24 UTC