Re: HTML/XHTML usage survey from olivier Thereaux on 2005-01-11 (public-evangelist@w3.org from January 2005)

From: olivier Thereaux <ot@w3.org>
Date: Tue, 11 Jan 2005 14:37:55 +0900
To: 'public-evangelist@w3.org' <public-evangelist@w3.org>
Cc: Karl Dubost <karl@w3.org>
Message-Id: <F19FB73D-6392-11D9-9C98-000393A80896@w3.org>

On Jan 6, 2005, at 4:51, Karl Dubost wrote:
> * Statistical Quantitative analysis  (automatic)
> - Which HTML elements are used in Web pages?
> - Which frequency ?
> - Are valid Web pages richer than non-valid ones. (bigger varieties of 
> HTML element)
> - The same for attribute

I am willing to test the idea of a statistical analysis module for the 
LogValidator [1], and wonder if anyone would be interested in working 
with me on this.

[1] http://www.w3.org/QA/Tools/LogValidator/

This might not exactly perform the large-scale study that Karl is 
thinking of, but it could be a start.

At the moment I am thinking of having this module provide:
- a rapid summary of element usage over a list of documents
- a list of the n most popular documents without a "real" title 
("Welcome to GoLive" does not qualify ;)
- the ratio of empty versus filled alt attributes
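
As a rough sketch of what these three outputs could look like (Python 
is used purely for illustration; the names UsageStats, has_real_title 
and summarize, the placeholder-title list, and the (url, hits, html) 
input shape are all assumptions, not existing LogValidator code):

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical list of titles that do not count as "real" ones.
PLACEHOLDER_TITLES = ("untitled", "welcome to golive")


class UsageStats(HTMLParser):
    """Collects element counts, alt-attribute counts and the title text."""

    def __init__(self):
        super().__init__()
        self.elements = Counter()
        self.alt_empty = 0
        self.alt_filled = 0
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self.elements[tag] += 1
        if tag == "title":
            self._in_title = True
        for name, value in attrs:
            if name == "alt":
                # alt="" and a bare alt (value None) both count as empty.
                if value and value.strip():
                    self.alt_filled += 1
                else:
                    self.alt_empty += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def has_real_title(title):
    """True unless the title is missing or starts with a placeholder."""
    normalized = " ".join(title.split()).lower()
    return bool(normalized) and not normalized.startswith(PLACEHOLDER_TITLES)


def summarize(pages, n=10):
    """pages: iterable of (url, hits, html) tuples (an assumed input shape)."""
    elements = Counter()
    alt_empty = alt_filled = 0
    untitled = []
    for url, hits, html in pages:
        stats = UsageStats()
        stats.feed(html)
        stats.close()
        elements.update(stats.elements)
        alt_empty += stats.alt_empty
        alt_filled += stats.alt_filled
        if not has_real_title(stats.title):
            untitled.append((hits, url))
    untitled.sort(reverse=True)  # most popular first
    return {
        "elements": elements,
        "alt_empty": alt_empty,
        "alt_filled": alt_filled,
        "untitled": [url for hits, url in untitled[:n]],
    }
```

Calling summarize() on a list of pages and then 
result["elements"].most_common(10) would give the most used elements; 
the hit counts are assumed to come from the server logs.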

Someone recently gave me the idea that the ratio of words over markup 
is a decent metric for the "richness" of a page or, if low, a good 
indicator of a badly written site. Given that an image is worth a 
thousand words, I assume our formula would be something like 
(number_of_words+(number_images*1000)) / (number_html_elements).
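
A minimal sketch of that formula, again in Python for illustration 
only. The names RichnessCounter and richness are hypothetical, and two 
details are my assumptions rather than part of the idea as given: text 
inside script and style is not counted as words, and a document with 
no elements at all scores 0.

```python
from html.parser import HTMLParser


class RichnessCounter(HTMLParser):
    """Counts words, images and elements for the words-over-markup ratio."""

    def __init__(self):
        super().__init__()
        self.words = 0
        self.images = 0
        self.elements = 0
        self._in_code = False  # inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.elements += 1
        if tag == "img":
            self.images += 1
        elif tag in ("script", "style"):
            self._in_code = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._in_code = False

    def handle_data(self, data):
        # Whitespace-separated tokens are taken as words (title text included).
        if not self._in_code:
            self.words += len(data.split())


def richness(html):
    """(number_of_words + number_images * 1000) / number_html_elements."""
    counter = RichnessCounter()
    counter.feed(html)
    counter.close()
    if counter.elements == 0:
        return 0.0  # assumption: no markup at all scores zero
    return (counter.words + counter.images * 1000) / counter.elements
```

For example, a fragment with two p elements, one img and three words 
has 3 elements in total and scores (3 + 1000) / 3.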

Implementation-wise, does anyone have a recommendation on the library 
to use?

Please drop me a line if you wish to participate in this development.
-- 
olivier

Received on Tuesday, 11 January 2005 05:37:58 UTC