- From: Robert Burns <rob@robburns.com>
- Date: Thu, 16 Aug 2007 08:39:09 -0500
- To: Geoffrey Sneddon <foolistbar@googlemail.com>
- Cc: public-html@w3.org
Hi Geoffrey,

On Aug 16, 2007, at 7:59 AM, Geoffrey Sneddon wrote:

> On 16 Aug 2007, at 04:40, Robert Burns wrote:
>
>> A scientific approach would involve several things. It would be
>> conducted with a goal to retrieve unbiased data. That means giving
>> every HTML document an equal probability of selection. Right now,
>> you're conducting research based on entries in a Google cache. It's
>> biased toward pages that want Google's attention. Those pages
>> behind firewalls, or on local drives, are completely left out of
>> the research. I don't have any research on this, but I would
>> expect such pages to often pay more attention to details than the
>> pages fighting for Google's attention. It would be like looking
>> through the emails passing through an email server and concluding
>> that most emails are about penis enlargement or counterfeit watches.
>
> May I ask how you propose to get a better data set than
> Google's cache? You're highly unlikely to get data from behind
> firewalls or local drives.

Scientific evidence ain't cheap. That's part of my point. However,
there are many reasons to think that going after the low-hanging fruit
with our statistics is biasing the results substantially (and is
certainly not a scientific or evidence-producing approach). Outside
firewalls is that no-man's land of the internet that we all love. It's
filled with spammers and porn and all sorts of strange characters.
Behind the firewalls and on our local drives is likely to be tamer and
maybe even more standards-focussed markup.

>> Genuine scientific statistical research also lays out methodology
>> and is reproducible. From a scientific perspective, saying I
>> searched a cache that I have, that you can't search and I won't
>> even show you the code that produces that cache, would be the
>> same as me saying the following. "I have this 8-ball and when I
>> ask it if we should drop @usemap from |input| it tells me 'not
>> likely'. You may say that sure, 8-balls say that. But the odd part
>> is that it says that every time [cue eerie music]." :-) The point
>> though is that it can't be reproducible at all if it's all based on
>> hidden data and methods.
>
> Again: how are you going to get a better data set?

Well, there's always my 8-ball :-). Getting better data would require
substantial effort. However, it's that substantial effort that leads to
evidence. Without the effort we don't really have evidence. We have
someone poking around the Google cache. With substantial effort we have
the evidence that Josh has volunteered to produce.

With some effort we might enlist a university somewhere to help us
conduct real scientific statistical research. They could generate a
comprehensive list or lists (properly weighted to ensure equal
probabilities of selection), randomly draw participants from the list,
and then conduct phone, web, mail, or in-person interviews. Some
participants could be selected to participate in a 'bot analysis where
their data is sucked through some extraction algorithm that leaves the
actual content obscured, but lets us see all of the tag goodness inside.

Would this work? I don't know. It's not my field. Is it worth it all? I
don't think so, but others seem very interested in statistical
evidence. However, studies like this are conducted every day. And
studies like this get results on many occasions. It just takes a lot of
specialists who know a lot more about this stuff than I do to pull it
off.

Take care,
Rob
Received on Thursday, 16 August 2007 13:39:25 UTC