- From: Karl Dubost <karl@w3.org>
- Date: Thu, 10 Apr 2008 09:41:18 +0900
- To: Dan Connolly <connolly@w3.org>
- Cc: www-tag <www-tag@w3.org>
On 10 Apr 2008, at 07:14, Dan Connolly wrote:
> In the HTML WG, Ian Hickson sometimes does statistical measurements
> of HTML usage using Google's index of the web. These studies are
> really nice, but they would be even nicer if we could confirm
> them from independent sources.

+1, but there are nontrivial issues.

> A few weeks back, we had a discussion of this...
> "Can we get access to tools that determine how often markup is used on
> the web?"
> http://www.w3.org/2008/02/21-html-wg-minutes#item05

There are a few issues, and it depends on what is considered "tools".

* The URI sets (which are different from the document sets)
  Ian Hickson relies on Google and can access billions of URIs, with
  filters against spam pages. How these documents are kept in Google's
  archive is not clear to me. Just plain data?

  - The age of the data
    Some URIs never change. Some have wrong caching, ETag, or date
    information. Some pages are highly dynamic and change often.
    This is good for knowing how the Web should be read, i.e. the
    content available now. It is difficult to tell anything about
    current authoring practices. There is no way to know reliably
    whether the URI (and its potential content) was created after
    1995, after 2000, or after 2007.

* The collector
  When reaching a server with a user agent for a specific URI, the
  data stream you collect from this URI depends on the Accept,
  User-Agent, etc. headers. Servers do user agent sniffing and
  content negotiation.

* The data analysis
  Once you have received the data stream, the content analysis and
  statistics gathering depend on the parsing algorithm used for the
  data stream.

All these parameters modify the statistics, and it is hard to evaluate
the statistical deviations without establishing a protocol for
analyzing them. Still, it gives interesting data, but we should be
careful about the conclusions drawn from these data.

-- 
Karl Dubost - W3C
http://www.w3.org/QA/
Be Strict To Be Cool
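
To make the data-analysis point concrete, here is a minimal sketch of how the same byte stream can yield different element counts depending on how it is parsed. It is only an illustration, not the method used in any of the studies mentioned above. The sample markup and the names `SAMPLE`, `TagCounter`, `count_with_parser`, and `count_with_regex` are hypothetical; the only library used is Python's standard `html.parser`.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical sample page: one <p> is commented out and one is
# only present as a string inside a script.
SAMPLE = """<html><body>
<!-- <p>old paragraph, commented out</p> -->
<p>first
<p>second
<script>document.write("<p>generated</p>");</script>
</body></html>"""


class TagCounter(HTMLParser):
    """Counts start tags as seen by the standard-library HTML parser."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_with_parser(markup):
    # Comments and script contents are not reported as start tags.
    counter = TagCounter()
    counter.feed(markup)
    counter.close()
    return counter.counts


def count_with_regex(markup):
    # Naive approach: anything that looks like "<name" is counted,
    # including text inside comments and scripts.
    names = re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)", markup)
    return Counter(name.lower() for name in names)


if __name__ == "__main__":
    print("parser:", count_with_parser(SAMPLE)["p"])
    print("regex: ", count_with_regex(SAMPLE)["p"])
```

On this sample the two approaches disagree on the number of `p` elements, which is the kind of deviation that is hard to evaluate unless the analysis protocol is stated alongside the numbers.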
Received on Thursday, 10 April 2008 00:41:54 UTC