Re: code, samp, kbd, var from Ian Hickson on 2007-05-15 (www-html@w3.org from May 2007)

From: Ian Hickson <ian@hixie.ch>
Date: Tue, 15 May 2007 07:53:39 +0000 (UTC)
To: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
Cc: www-html@w3.org
Message-ID: <Pine.LNX.4.62.0705150728350.11553@dhalsim.dreamhost.com>

On Tue, 15 May 2007, Jukka K. Korpela wrote:
> > > > 
> > > > Sample size: several billion pages.
> > > 
> > > It's hardly a sample. (See Statistics 101.)
> > 
> > It's a sample, though the Web provides us with a somewhat unique 
> > situation in that there's an infinite number of pages, and we have to 
> > somehow pick a relevant subset from that.
> 
> No, it's not a sample. You seem to be even uncertain about what might be 
> the population, so how could you draw a sample?

It's a sample from a very specific sampling frame (that I know, though, as 
mentioned, I can't tell you what it is exactly). The sampling frame itself 
is a biased subset of the Web, biased by what Google has algorithmically 
established would be most "interesting" to its potential users, which are 
themselves pretty much the same set of people that the HTML specification 
is targeting. Thus, the pages scanned are a sample from a sampling frame 
that's a subset of pages chosen to be the most interesting to the people 
for whom the data is indirectly being collected.

> > The pages I scanned for this study are a small subset of those Google 
> > knows about.
> 
> Which in turn are not the same thing as the set of all web pages, no 
> matter exactly how you define "web page".

Right. That's what I said in my earlier e-mail. There's an infinite number 
of Web pages.

> > Unfortunately for business reasons I can't reveal much about the 
> > methodology used for picking the sample.
> 
> Then the data you present is worthless in a discussion and especially in 
> a debate.

I didn't present it. Lachlan did. As I keep saying, I strongly encourage 
other people to do the same kinds of studies. Personally I explicitly 
sought employment at a company that was capable of providing me with the 
tools to do this kind of research, specifically so that I could use this 
data to make the specifications I was writing better. I would be extremely 
happy to see other people do the same thing. Reproducing research is one 
of the basic ways real scientific research works. (My research is not 
scientific, since I can't tell you how I did it.)

I know that the data is representative enough to draw solid conclusions to 
design the language from. I have no way to convince you of that. Some 
people (e.g. Lachlan) have decided to trust me and have made good use of 
the data I have collected, but I would not encourage this. My analysis 
could be deeply flawed, my numbers could be lies. However, as the editor 
of the HTML 5 specification I'm taking all the feedback I have into 
account, including the research I've done. It would be stupid of me to 
ignore solid numbers, especially those that I know are representative.

Cheers,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Tuesday, 15 May 2007 07:53:51 UTC