- From: Karl Dubost <karl@w3.org>
- Date: Thu, 10 Apr 2008 09:41:18 +0900
- To: Dan Connolly <connolly@w3.org>
- Cc: www-tag <www-tag@w3.org>
On 10 Apr 2008, at 07:14, Dan Connolly wrote:
> In the HTML WG, Ian Hickson sometimes does statistical measurements
> of HTML usage using google's index of the web. These studies are
> really nice, but they would be even nicer if we could confirm
> them from independent sources.
+1, but there are non-trivial issues.
> A few weeks back, we had a discussion of this...
> "Can we get access to tools that determine how often markup is used
> on the web?"
> http://www.w3.org/2008/02/21-html-wg-minutes#item05
There are a few issues, and much depends on what counts as "tools".
* The URI sets (which are different from the document sets)
Ian Hickson relies on Google and can access billions of URIs, with
filters against spam pages. How these documents are stored in
Google's archive is not clear to me. Just plain data?
* The age of the data
Some URIs never change. Some have wrong caching, etag, date
information. Some pages are highly dynamic and change often.
Such data is good for knowing how the Web should be read, i.e. the
content available now. It is difficult to tell anything about current
authoring practices from it. There is no way to know reliably whether
the URI (and its potential content) was created after 1995, after
2000, or after 2007.
* The collector
When reaching a server with a user agent for a specific URI, the
data stream you collect from this URI depends on the Accept,
User-Agent, etc. request headers. Servers do user agent sniffing and
content negotiation.
* The data analysis
Once you have received the data stream, the content analysis and
statistics gathering depend on the parsing algorithm used for the
data stream.
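
To illustrate the data analysis point, here is a minimal sketch in
Python (my choice of language, standard library only). The sample
markup and the two counting functions are made up for illustration;
this is not the method used in any of the studies mentioned. It counts
"p" start tags once with a naive, case-sensitive regular expression and
once with the html.parser tokenizer.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample document: an uppercase tag, a commented-out tag,
# and tag-like text inside a script element.
SAMPLE = """<html><body>
<P>first paragraph
<p class="x">second paragraph</p>
<!-- <p>commented out</p> -->
<script>var s = "<p>inside a script</p>";</script>
</body></html>"""


def count_with_regex(markup, tag):
    """Naive count: every case-sensitive '<tag' opener in the raw text."""
    return len(re.findall(r"<%s[\s>]" % re.escape(tag), markup))


class TagCounter(HTMLParser):
    """Counts start tags as reported by the html.parser tokenizer."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase.
        if tag == self.tag:
            self.count += 1


def count_with_parser(markup, tag):
    counter = TagCounter(tag)
    counter.feed(markup)
    counter.close()
    return counter.count


if __name__ == "__main__":
    print("regex :", count_with_regex(SAMPLE, "p"))
    print("parser:", count_with_parser(SAMPLE, "p"))
```

The two counts differ on the same bytes: the regular expression misses
the uppercase tag and counts the openers inside the comment and the
script, while the tokenizer does the opposite. A third algorithm (an
HTML5-style tree builder, for instance) could give yet another number.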
All these parameters affect the statistics, and it is hard to evaluate
the statistical deviations without establishing a protocol for
analyzing them. It still gives interesting data, but we should be
careful about the conclusions drawn from these data.
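
The collector dependency can be sketched in the same way. The toy
server below is entirely hypothetical (the handler, the "Mobile" check
and the crawler names are assumptions for illustration only); it runs
on localhost and sniffs the User-Agent header, so that two collectors
asking for the same URI store different markup.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SniffingHandler(BaseHTTPRequestHandler):
    """Toy server: returns different markup depending on User-Agent."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if "Mobile" in agent:
            body = b"<html><body><p>lite page</p></body></html>"
        else:
            body = (b"<html><body><table><tr><td>full page</td></tr>"
                    b"</table></body></html>")
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Vary", "User-Agent")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch(url, user_agent):
    """Fetch a URI with the given User-Agent, bypassing any proxy."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with opener.open(request) as response:
        return response.read()


def collect_variants(user_agents):
    """Start the toy server and fetch the same URI once per user agent."""
    server = HTTPServer(("127.0.0.1", 0), SniffingHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_port
        return {agent: fetch(url, agent) for agent in user_agents}
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    agents = ["DesktopCrawler/1.0", "MobileCrawler/1.0 Mobile"]
    for agent, body in collect_variants(agents).items():
        print(agent, "->", body.decode("ascii"))
```

One URI, two stored documents: a study counting table elements would
report one for the first collector and zero for the second, which is
why the request headers need to be part of any measurement protocol.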
--
Karl Dubost - W3C
http://www.w3.org/QA/
Be Strict To Be Cool
Received on Thursday, 10 April 2008 00:41:54 UTC