Re: Web page stats

On Tue, 3 Oct 2006, Karl Dubost wrote:
> > > 
> > > This is why I have asked Ian Hickson for more details, because I really 
> > > think this is as important as the derived statistics which have 
> > > been published in the [previous survey][1]. When the sample is not 
> > > given or clearly identified it is really difficult to draw 
> > > meaningful conclusions.
> > 
> > This is absolutely true. This is why the survey(s) haven't been 
> > published formally; due to the nature of the way in which the results 
> > were obtained, I can't write a scientific report.
> 
> 1. True to "we can't draw meaningful conclusions". It is not a suitable 
> scientific report.

It's not suitable for publishing as a scientific report because I can't 
describe the methodology: it relies on Google-proprietary mechanisms that 
are confidential and material to the company's business.


> > The data was collected for the purposes of helping WHATWG's spec 
> > development work
> 
> 2. Google has created the survey for helping WHATWG.

I suppose you could look at it that way. I wouldn't phrase it that way, 
though; Google is just an inanimate legal construct (a company and its 
associated services) and WHATWG is just a Web page and a mailing list.


> > (I think all specifications should be written based on solid research 
> > of authoring practices, etc), and I consider the data to be suitably 
> > representative for that purpose.
> 
> 3. The survey is a "solid research of authoring practices"

In my opinion. But I can't back up that opinion in public, since I can't 
release the methodology.


> > For other purposes, the data probably isn't useful as anything other 
> > than an idle curiosity, and I would not recommend treating it as 
> > anything but that.
> 
> I have a hard time connecting 1, 2 and 3 in a logical way.

The person who did the research is me. The person writing the WHATWG spec 
is me. I know what the methodology was and I'm convinced that it was sound 
and suitable for the purpose I'm using it for. This doesn't mean that I 
can convince anyone else of the soundness of the data.


> > If you would like a more formal survey of the Web, I recommend
> > commissioning your own. :-)
> 
> It is a good idea. Maybe I should ask TV Raman at Google whether Google 
> would agree to help us do that.

I can't speak for TV; I don't know if that is something he would be 
interested in doing. If you are indeed interested in commissioning such 
research, that might well be a possible solution for you.


> > > - DOCTYPE
> > 
> > I'm not sure how you would define this; take this document, for instance:
> > 
> >    http://damowmow.com/playground/html-or-xml.html
> > What's the DOCTYPE?
> > How about this one:
> >    http://damowmow.com/playground/html-or-xml.xml
> 
> Do you mean there are plenty of these documents on the Web?

No. I mean that I need to know exactly what parsing algorithm I should use 
to determine what the DOCTYPE is.


> How many documents with this kind of structure have you found on the 
> Web?

I haven't checked.


> > What's the DOCTYPE?
> > If your answer was different for the two pages, then why was it different?
> > The two pages are byte-for-byte identical. If your answer was the same,
> > then why were they the same? Browsers treat the two very differently.
> 
> Your document is sent as text/xml
>          and then as application/xhtml+xml
>          and then as text/html if the first is not understood.
> plus the problem of encoding.

The third case is a separate case that really is neither here nor there. 
The question is what the parsing algorithm should be for unambiguously 
determining the DOCTYPE.
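
To illustrate why the choice of algorithm matters, here is a minimal Python 
sketch in which the MIME type selects the parser and the same bytes yield 
two different DOCTYPE names. Everything in it is an assumption made for 
illustration: the function name doctype_name, the SAMPLE bytes (which are 
not the contents of the two damowmow.com pages), and the use of the Python 
standard library's html.parser, which is not the HTML5 parsing algorithm.

```python
from html.parser import HTMLParser
from xml.parsers import expat


class _FirstDoctype(HTMLParser):
    """Records the name from the first <!DOCTYPE ...> the HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.name = None

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">", e.g. "DOCTYPE html".
        parts = decl.split()
        if self.name is None and len(parts) > 1 and parts[0].lower() == "doctype":
            self.name = parts[1]


def doctype_name(body: bytes, content_type: str):
    """Hypothetical helper: DOCTYPE name as seen by the parser the MIME type selects."""
    mime = content_type.split(";")[0].strip().lower()
    if mime == "text/html":
        html_parser = _FirstDoctype()
        html_parser.feed(body.decode("utf-8", errors="replace"))
        html_parser.close()
        return html_parser.name
    if mime in ("text/xml", "application/xml") or mime.endswith("+xml"):
        found = []
        xml_parser = expat.ParserCreate()
        # expat calls this with (name, system_id, public_id, has_internal_subset).
        xml_parser.StartDoctypeDeclHandler = lambda name, *rest: found.append(name)
        xml_parser.Parse(body, True)  # raises ExpatError if not well-formed
        return found[0] if found else None
    raise ValueError(f"no parsing rules chosen for {mime!r}")


# Made-up document: an HTML tokenizer ends the "<?pi" construct at the first
# ">", while an XML parser ends the processing instruction at "?>".
SAMPLE = b"<?pi ><!DOCTYPE one><!-- ?><!DOCTYPE two><!-- --><two/>"

print(doctype_name(SAMPLE, "text/html"))              # one
print(doctype_name(SAMPLE, "application/xhtml+xml"))  # two
```

The bytes are identical in both calls; only the parsing rules differ. A 
survey that reports "the DOCTYPE" therefore has to state which rules it 
applied.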


> > (This is why my survey mostly ignored the DOCTYPE and instead just 
> > assumed HTML5 parsing rules.)
> 
> Then Google has created a "WebApps 1.0 parser" for the purpose of the 
> survey?

"Google" is just a company. A Google employee (me) implemented the HTML5 
parser described in the Web Apps 1.0 specification for the purposes of the 
survey, yes. The main reason was to make sure the parser specification was 
free of logic bugs and to identify potential problems.


> Is the code accessible somewhere?
> Was it a crawler?
> Was it a parser working on files outside of their HTTP context?

As I mentioned previously, I can't answer these questions without 
potentially releasing information on Google's proprietary infrastructure. 
If I could, then the research could be released formally, and we wouldn't 
be having this conversation.

HTH,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Tuesday, 3 October 2006 05:45:32 UTC