Re: Web page stats

On Thu, 28 Sep 2006, Karl Dubost wrote:
> 
> 1. Are there any plans for releasing a new version of the survey made by 
> Google?

Google does not comment on future plans.


> 2. Could you give the approximate ratio of Web pages for this.
>    - Not valid but well-formed.
>    - Not valid and not well-formed.

Approximately 78% of pages have syntax errors more serious than missing or 
incorrect DOCTYPEs and bogus trailing "/" characters in start tags. The 
parser I used didn't check for validity (e.g. it didn't check that <p> 
elements weren't inside <a> elements); it basically only tested for 
syntactic correctness according to the HTML5 parser spec, ignoring the 
DOCTYPE requirements and the trailing "/" error (as in "<foo/>").

Over 13% of pages had duplicate IDs (multiple elements with the same value 
on the "id" attribute; I didn't check case-insensitively, nor did I check 
for collisions with the "name" attribute, both of which would be required 
for strict HTML4 compliance).
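
As a rough illustration of the duplicate-ID check described above, here is 
a minimal sketch using Python's standard html.parser module. This is not 
the parser used for the survey (html.parser is not an HTML5 parser), and 
the names IdCollector and duplicate_ids are made up for this example. Like 
the survey, it compares values case-sensitively and ignores the "name" 
attribute.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Counts every "id" attribute value seen in start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()

    def handle_starttag(self, tag, attrs):
        # Attribute names arrive lowercased; values keep their original case,
        # so "a" and "A" are treated as different IDs.
        for name, value in attrs:
            if name == "id" and value is not None:
                self.ids[value] += 1


def duplicate_ids(markup):
    """Return the sorted list of id values used on more than one element."""
    collector = IdCollector()
    collector.feed(markup)
    collector.close()
    return sorted(value for value, count in collector.ids.items() if count > 1)


sample = '<p id="a"></p><div id="a"></div><span id="A"></span><a name="a"></a>'
dupes = duplicate_ids(sample)
```

With this sample, only "a" is reported: "A" differs in case and the 
name="a" anchor is not considered.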

The average (median) page had fifteen syntax errors according to the rules 
for finding syntax errors described in the HTML5 parser specification.
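
To show why the median rather than the mean is the useful "average" for a 
figure like this, here is a small sketch. The per-page counts below are 
invented purely for illustration and are not survey data.

```python
from statistics import mean, median

# Hypothetical per-page syntax error counts (made up for this example).
errors_per_page = [0, 0, 3, 15, 15, 22, 40, 900]

typical = median(errors_per_page)  # middle of the sorted counts
skewed = mean(errors_per_page)     # pulled upward by the single extreme page
```

A single pathological page moves the mean a long way but leaves the median 
where it was.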

The most common error (after DOCTYPE-related errors and bogus trailing 
slash errors) was the use of "</" in CDATA sections. The next most common 
error was incorrectly placed content in <table> elements. The third most 
common error was misnesting of <form> elements.
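
The most common error can be illustrated with a crude sketch, assuming 
that "CDATA sections" here means the contents of <script> and <style> 
elements. The regular expression and the function name etago_in_raw_text 
are hypothetical, and a regex is only an approximation of what a real 
parser does.

```python
import re

# Matches a <script> or <style> element and captures its name and contents.
# Regex-based matching is an approximation, not a substitute for a parser.
RAW_TEXT = re.compile(r"<(script|style)\b[^>]*>(.*?)</\1\s*>", re.I | re.S)


def etago_in_raw_text(markup):
    """Return the names of script/style elements whose content contains "</"."""
    hits = []
    for match in RAW_TEXT.finditer(markup):
        if "</" in match.group(2):
            hits.append(match.group(1).lower())
    return hits


sample = '<script>document.write("</b>");</script><style>p{}</style>'
flagged = etago_in_raw_text(sample)
```

Here the script element is flagged because of the "</b>" inside the 
string literal, while the style element is not.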

The sample in question was very large (the page count runs to 10 digits, 
i.e. at least a billion pages), so these numbers should be fairly 
representative.


> 3. Could you give the approximate ratio of declared HTML 4, XHTML 1.0, 
> XHTML 1.1 documents?

In my sample, the number of pages labelled as text/html outweighed the 
number of pages marked application/xhtml+xml by a factor so large that the 
ratio is probably not statistically meaningful.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Thursday, 28 September 2006 18:41:28 UTC