Re: Web page stats from Karl Dubost on 2006-10-03 (www-qa@w3.org from October 2006)

From: Karl Dubost <karl@w3.org>
Date: Tue, 3 Oct 2006 13:36:03 +0900
To: Ian Hickson <ian@hixie.ch>
Cc: www-qa@w3.org
Message-Id: <938BD98F-82F7-4782-B6F4-19534743BFD5@w3.org>
On 30 Sept. 2006, at 05:56, Ian Hickson wrote:
> On Fri, 29 Sep 2006, Karl Dubost wrote:
>>
>> That is why I asked Ian Hickson for more details, because I really
>> think they are as important as the derived statistics which have been
>> published in the [previous survey][1]. When the sample is not given or
>> clearly identified, it is really difficult to draw meaningful
>> conclusions.
>
> This is absolutely true. This is why the survey(s) haven't been published
> formally; due to the nature of the way in which the results were obtained,
> I can't write a scientific report.

1. True: "we can't draw meaningful conclusions". The data is not suitable
for a scientific report.

> The data was collected for the purposes
> of helping WHATWG's spec development work

2. Google created the survey to help the WHATWG.

> (I think all specifications
> should be written based on solid research of authoring practices, etc),
> and I consider the data to be suitably representative for that purpose.

3. The survey is "solid research of authoring practices".

> For other purposes, the data probably isn't useful as anything other than
> an idle curiosity, and I would not recommend treating it as anything but
> that.

I have a hard time connecting 1, 2 and 3 in a logical way.

>
> If you would like a more formal survey of the Web, I recommend
> commissioning your own. :-)

It is a good idea.
Maybe I should ask TV Raman at Google whether Google would agree to help
us do that.


>>    - DOCTYPE
>
> I'm not sure how you would define this; take this document, for instance:
>
>    http://damowmow.com/playground/html-or-xml.html
> What's the DOCTYPE?
> How about this one:
>    http://damowmow.com/playground/html-or-xml.xml

Do you mean there are plenty of these documents on the Web?
Or are these just corner cases that have been created to identify
potential problems?

Using http://web-sniffer.net/, I sent these requests:

GET /playground/html-or-xml.xml HTTP/1.1[CRLF]
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5[CRLF]

GET /playground/html-or-xml.html HTTP/1.1[CRLF]
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5[CRLF]

GET /playground/html-or-xml HTTP/1.1[CRLF]
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5[CRLF]
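
To make the preference order in that Accept header explicit, here is a minimal Python sketch that ranks the media types by their q-value. The function name rank_accept is mine, and the sketch ignores the specificity rules that real content negotiation also applies; it only sorts by q, assuming q defaults to 1 when absent.

```python
def rank_accept(header):
    """Return (media_type, q) pairs sorted by descending q (ties keep header order)."""
    ranked = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        q = 1.0  # a media range without a q parameter counts as q=1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        ranked.append((parts[0], q))
    # sorted() is stable, so equal q-values stay in their original order
    return sorted(ranked, key=lambda pair: -pair[1])


# The Accept header sent in the three requests above, joined on one line.
ACCEPT = ("text/xml,application/xml,application/xhtml+xml,"
          "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5")

for media_type, q in rank_accept(ACCEPT):
    print(f"{q:.1f}  {media_type}")
```

With this header the XML media types rank ahead of text/html, so a server that negotiates on Accept alone would be free to answer with an XML type.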


I have put the source here:
################
<?test ><!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0//EN">
<html><?test ><!-- ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<?test --><?test?>
  <head>
   <title>HTML or XML?</title>
  </head>
  <body>
   <p>Is this file HTML or XML?</p>
   <p>Why, it's <?test > HTML <!-- ?> XHTML <?test --> <?test ?> of course!</p>

  </body>
</html>
################

How many documents with this kind of structure have you found on the Web?

> What's the DOCTYPE?
> If your answer was different for the two pages, then why was it different?
> The two pages are byte-for-byte identical. If your answer was the same,
> then why were they the same? Browsers treat the two very differently.

Your document is sent as text/xml,
              then as application/xhtml+xml,
              and then as text/html if the first is not understood.
On top of that, there is the problem of encoding.
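
To illustrate why the media type matters for the DOCTYPE question, here is a rough Python sketch that feeds the same source to two different parsers, chosen by media type. It is only an approximation: Python's html.parser stands in for a text/html parser (it is not a browser and not a WebApps 1.0 parser), xml.dom.minidom stands in for an XML parser, and the names SOURCE, Collector and inspect are mine.

```python
from html.parser import HTMLParser
from xml.dom import minidom

# The source quoted above, unchanged apart from rejoining the wrapped line.
SOURCE = '''<?test ><!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0//EN">
<html><?test ><!-- ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<?test --><?test?>
  <head>
   <title>HTML or XML?</title>
  </head>
  <body>
   <p>Is this file HTML or XML?</p>
   <p>Why, it's <?test > HTML <!-- ?> XHTML <?test --> <?test ?> of course!</p>
  </body>
</html>
'''


class Collector(HTMLParser):
    """Record every DOCTYPE declaration and all character data seen."""

    def __init__(self):
        super().__init__()
        self.doctypes = []
        self.text = []

    def handle_decl(self, decl):
        self.doctypes.append(decl)

    def handle_data(self, data):
        self.text.append(data)


def squash(text):
    """Collapse runs of whitespace so the two results are easy to compare."""
    return " ".join(text.split())


def xml_text(node):
    """Concatenate the text nodes below a DOM node, skipping PIs and comments."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(xml_text(child) for child in node.childNodes)


def inspect(source, media_type):
    """Return (doctype, visible text) as seen by the parser chosen for media_type."""
    if media_type == "text/html":
        collector = Collector()
        collector.feed(source)
        collector.close()
        return collector.doctypes[0], squash("".join(collector.text))
    # Anything else is treated as XML in this sketch.
    doc = minidom.parseString(source)
    return doc.doctype.publicId, squash(xml_text(doc.documentElement))


for media_type in ("text/html", "application/xhtml+xml"):
    doctype, text = inspect(SOURCE, media_type)
    print(media_type, "|", doctype, "|", text)
```

The difference comes from where each parser ends a processing instruction: the HTML parser stops at the first ">", while the XML parser stops at "?>", so each one sees a different DOCTYPE and hides a different word in the last paragraph.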


> (This is why my survey mostly ignored the DOCTYPE and instead just assumed
> HTML5 parsing rules.)

So did Google create a "WebApps 1.0 parser" for the purpose of the survey?
Is the code accessible somewhere?
Was it a crawler?
Was it a parser working on files outside of their HTTP context?



-- 
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager, QA Activity Lead
   QA Weblog - http://www.w3.org/QA/
      *** Be Strict To Be Cool ***
Received on Tuesday, 3 October 2006 04:36:24 UTC