Re: measuring popularity of various markup idioms? [tagSoupIntegration ISSUE-54] from Karl Dubost on 2008-04-10 (www-tag@w3.org from April 2008)

From: Karl Dubost <karl@w3.org>
Date: Thu, 10 Apr 2008 09:41:18 +0900
To: Dan Connolly <connolly@w3.org>
Cc: www-tag <www-tag@w3.org>
Message-Id: <F0ABD11F-BAD3-43B0-B6CE-A19F66CF8BD4@w3.org>

On 10 Apr 2008, at 07:14, Dan Connolly wrote:
> In the HTML WG, Ian Hickson sometimes does statistical measurements
> of HTML usage using google's index of the web. These studies are
> really nice, but they would be even nicer if we could confirm
> them from independent sources.

+1, but there are non-trivial issues.

> A few weeks back, we had a discussion of this...
>  "Can we get access to tools that determine how often markup is used
>  on the web?"
>  http://www.w3.org/2008/02/21-html-wg-minutes#item05


There are a few issues, and it depends on what is considered "tools".

* The URI sets (which are different from the document sets)
   Ian Hickson relies on Google and can access billions of URIs, with
   filters against spam pages. How these documents are stored in
   Google's archive is not clear to me. Just plain data?

   - The age of the data
     Some URIs never change. Some have wrong caching, ETag, or date
     information. Some pages are highly dynamic and change often.
     The data is good for knowing how the Web should be read, i.e. the
     content available now.
     It is difficult to say anything about current authoring practices:
     there is no reliable way to know whether a URI (and its potential
     content) was created after 1995, after 2000, or after 2007.

* The collector
   When a user agent requests a specific URI from a server, the data
   stream it collects depends on the Accept, User-Agent, etc. headers
   it sends. Servers do user agent sniffing and content negotiation.
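
   A minimal sketch of how one might check this for a single URI: request
   it under several header profiles and group the profiles that received
   byte-identical responses. The profile names, the User-Agent strings,
   and the helper names (fetch, compare_variants) are all made up for
   illustration; they are not what any existing survey uses.

```python
import hashlib
import urllib.request

# Hypothetical header profiles; the User-Agent strings are invented.
PROFILES = {
    "desktop-browser": {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
        "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
    },
    "xhtml-preferring": {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
        "Accept": "application/xhtml+xml,text/html;q=0.5",
    },
    "bare-crawler": {
        "User-Agent": "example-survey-bot/0.1",
        "Accept": "*/*",
    },
}


def fetch(url, headers):
    """Return (Content-Type, body bytes) for url requested with headers."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get("Content-Type", ""), response.read()


def compare_variants(url, profiles=PROFILES, fetch=fetch):
    """Group profile names by the response they received.

    Two profiles land in the same group only if the server sent the same
    Content-Type and the same bytes to both.
    """
    groups = {}
    for name, headers in profiles.items():
        content_type, body = fetch(url, headers)
        key = (content_type, hashlib.sha256(body).hexdigest())
        groups.setdefault(key, []).append(name)
    return list(groups.values())
```

   More than one group suggests the server varies its output with the
   request headers. A highly dynamic page will also differ between two
   identical requests, so a result like that is only a hint and would
   need repeated fetches to confirm.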

* The data analysis
   Once you have received the data stream, the content analysis and
   statistics gathering depend on the parsing algorithm used for the
   data stream.
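
   To illustrate the parsing dependency, here is a small sketch that
   counts start tags in the same byte stream in two ways: with Python's
   standard html.parser tokenizer, and with a naive regular expression.
   The sample markup and the function names are invented for the
   example, and neither counter is meant to stand in for the parsing
   used in any actual study.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags as seen by the standard library tokenizer."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        self.counts[tag] += 1


def count_with_parser(markup):
    parser = TagCounter()
    parser.feed(markup)
    parser.close()
    return parser.counts


def count_with_regex(markup):
    # Naive: anything that looks like "<name" counts, even inside comments.
    names = re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)", markup)
    return Counter(name.lower() for name in names)


# Invented sample: a commented-out table followed by unclosed paragraphs.
SAMPLE = "<!-- <table><tr><td>old layout</td></tr></table> --><P>one<p>two<br>"

if __name__ == "__main__":
    print("parser:", dict(count_with_parser(SAMPLE)))
    print("regex: ", dict(count_with_regex(SAMPLE)))
```

   On this sample the two counters agree on p and br, but only the
   regular expression reports a table, because it also matches markup
   inside the comment. The same document therefore yields different
   "usage" figures depending on the analysis.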

All these parameters modify the statistics, and it is hard to evaluate
the statistical deviations without establishing a protocol for
analyzing them. Still, it gives interesting data, but we should be
careful about the conclusions drawn from these data.

-- 
Karl Dubost - W3C
http://www.w3.org/QA/
Be Strict To Be Cool

Received on Thursday, 10 April 2008 00:41:54 UTC