- From: Ian Hickson <ian@hixie.ch>
- Date: Tue, 15 May 2007 07:53:39 +0000 (UTC)
- To: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
- Cc: www-html@w3.org
On Tue, 15 May 2007, Jukka K. Korpela wrote:
> > > >
> > > > Sample size: several billion pages.
> > >
> > > It's hardly a sample. (See Statistics 101.)
> >
> > It's a sample, though the Web provides us with a somewhat unique
> > situation in that there's an infinite number of pages, and we have to
> > somehow pick a relevant subset from that.
>
> No, it's not a sample. You seem to be even uncertain about what might be
> the population, so how could you draw a sample?

It's a sample from a very specific sampling frame (that I know, though,
as mentioned, I can't tell you what it is exactly). The sampling frame
itself is a biased subset of the Web, biased by what Google has
algorithmically established would be most "interesting" to its potential
users, which are themselves pretty much the same set of people that the
HTML specification is targeting. Thus, the pages scanned are a sample
from a sampling frame that's a subset of pages chosen to be the most
interesting to the people for whom the data is indirectly being
collected.

> > The pages I scanned for this study are a small subset of those Google
> > knows about.
>
> Which in turn are not the same thing as the set of all web pages, no
> matter exactly how you define "web page".

Right. That's what I said in my earlier e-mail. There's an infinite
number of Web pages.

> > Unfortunately for business reasons I can't reveal much about the
> > methodology used for picking the sample.
>
> Then the data you present is worthless in a discussion and especially in
> a debate.

I didn't present it. Lachlan did.

As I keep saying, I strongly encourage other people to do the same kinds
of studies. Personally I explicitly sought employment at a company that
was capable of providing me with the tools to do this kind of research,
specifically so that I could use this data to make the specifications I
was writing better. I would be extremely happy to see other people do
the same thing.
Reproducing research is one of the basic ways real scientific research
works. (My research is not scientific, since I can't tell you how I did
it.)

I know that the data is representative enough to draw solid conclusions
to design the language from. I have no way to convince you of that. Some
people (e.g. Lachlan) have decided to trust me and have made good use of
the data I have collected, but I would not encourage this. My analysis
could be deeply flawed, my numbers could be lies.

However, as the editor of the HTML 5 specification I'm taking all the
feedback I have into account, including the research I've done. It would
be stupid of me to ignore solid numbers, especially those that I know
are representative.

Cheers,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Tuesday, 15 May 2007 07:53:51 UTC