
valid HTML statistics wrt http://www.w3.org/QA/2002/04/Web-Quality

From: Ben Meadowcroft <cee.plus@virgin.net>
Date: Tue, 10 Jun 2003 09:53:23 +0100
Message-ID: <002d01c32f2d$c08aedf0$cec486d9@BensPC>
To: <public-evangelist@w3.org>

In the document available at http://www.w3.org/QA/2002/04/Web-Quality it is
stated that

"Most of the Web sites on the Web are not valid. We may assume that this is
the case for 99% of the Web pages, but there are no statistics to support
this. It would be interesting to run a survey to prove that this case is
indeed true."

It is true that a large number of websites are invalid. A recent thesis,
"How to cope with incorrect HTML", deals with the nature of errors in HTML
documents and strategies for overcoming them. As part of the thesis, the
author investigated how many documents are invalid and what types of errors
they contain.

The results are in the thesis (urn:isbn:82-8088-088-7), which is available
from http://www.ub.uib.no/elpub/2001/h/413001/

I have summarised the results in an entry on my weblog.

The sample size was 2,398,226 documents, of which 14,563 were valid HTML
documents. Taking factors such as unknown DTDs into account, the proportion
of tested documents that were valid was 0.71%.
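
As a rough check on these figures, the short Python sketch below computes
the unadjusted share of valid documents from the two counts above, which
comes out at about 0.61%. It also estimates the denominator that the 0.71%
figure would imply (roughly 2.05 million documents). That second number rests
on an assumption of mine, not a statement from the thesis: that the
adjustment for unknown DTDs and similar factors works only by excluding
documents from the total. The function and variable names are illustrative.

```python
def percent(part: float, whole: float) -> float:
    """Return `part` expressed as a percentage of `whole`."""
    return 100.0 * part / whole


SAMPLE_SIZE = 2_398_226   # documents tested (figure quoted in this message)
VALID_DOCS = 14_563       # documents that were valid HTML (same source)
REPORTED_PERCENT = 0.71   # adjusted figure quoted in this message

# Unadjusted share: valid documents over the full sample.
raw_percent = percent(VALID_DOCS, SAMPLE_SIZE)

# ASSUMPTION: the adjustment only removes documents (e.g. those with an
# unknown DTD) from the denominator. Under that assumption, this is the
# approximate number of documents that remained in the total. The input
# 0.71 is itself rounded, so treat the result as a ballpark value.
implied_denominator = VALID_DOCS / (REPORTED_PERCENT / 100.0)

print(f"raw share of valid documents: {raw_percent:.2f}%")
print(f"denominator implied by {REPORTED_PERCENT}%: about {implied_denominator:,.0f}")
```

Either way, the share of valid documents is well under 1%, which is
consistent with the 99% estimate quoted above.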

I hope this information is of some interest.
Ben Meadowcroft
Received on Tuesday, 10 June 2003 04:53:29 UTC
