Re: The use of W3C standards in Denmark Part II -- some other data from Olle Olsson on 2004-03-06 (public-evangelist@w3.org from March 2004)

From: Olle Olsson <olleo@sics.se>
Date: Sat, 06 Mar 2004 11:16:36 +0100
To: Soren Johannessen <hal@ae35-unit.dk>
Cc: "'Karl Dubost'" <karl@w3.org>, public-evangelist@w3.org
Message-ID: <4049A504.9060301@sics.se>
Some statistics about correctness of web pages.

... which confirms some of the data presented in this thread.

Two years ago I made a minor survey of error frequency in Swedish web
pages. The aim was to get some initial data about how well standards are
adhered to.

I only looked at commercial websites, and only at the home page of each
site. The commercial web sites were taken from a list of companies
registered on the stock exchange, a list that provides information for
financial analysts.

The investigation was, for simplicity, restricted to automatic validity
checking, with some manual intervention in the selection and filtering
process.

The W3C HTML validator was the tool used, and no attempt was made to obtain a
fine-grained classification of the types of errors found -- something that is
inherently difficult, if one tries an approach based on automated
analysis. At that point in time, the only quick-and-dirty way of accessing
the results of the validator was to extract information from the HTML page
returned by the validator. Not a nice way, but it was doable. There was some
discussion at that point that a more programmatic interface would be useful,
e.g. "W3C Validator as a Web Service". But I had to make do with what was
available, hence the indirect way of dissecting the page generated by the
Validator.

There were a number of situations that had to be clearly identified, if the
outcome was to be trusted. E.g., to be able to separate out those cases where
no real checking was made by the Validator, I had to identify at least
those occurrences where the Validator complained that it could not check the
page for some reason.

Some statistics
---------------

The initial list of companies (company web sites) that was used ...

 - number of initially selected pages: 330 pages

But some sites were down or returned something that was not HTML, so...

 - number of actually investigated pages: 280 pages

The sizes of the pages varied significantly:

 - minimum page size: 89 characters
 - maximum page size: 129 832 characters

DOCTYPE was a big problem:

 - percentage of pages that did not declare DOCTYPE: 76 %

... or at least no DOCTYPE was recognised by the validator!

No page passed W3C Validator checking! The span in error remarks was:
 - minimum number of errors: 1
 - maximum number of errors: 591

As some pages triggered avalanches of errors, and some pages were extreme in
size, they could create bad effects on the statistics. So "extreme" pages
were filtered out, and statistics were only retained for pages that fulfilled
the conditions:

 - size of page: 1,000 -- 50,000 characters
 - number of error remarks on page: < 200

The reason for excluding "small" pages is obvious -- the size of a
correct (and reasonable) "hello world" page is on the order of 200
characters. Small pages are also found when FRAMESETs are used, and such
pages were not further studied.

This resulted in an effective set of pages ...

 - number of pages used in final statistics: 226
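The filtering conditions above can be sketched in a few lines of Python. This
is only an illustration, not the script used in the survey: the PageResult
record and the function names are hypothetical, and treating the size bounds
as inclusive is an assumption (the text only says "1,000 -- 50,000").

```python
from dataclasses import dataclass


@dataclass
class PageResult:
    """Hypothetical record for one validated home page."""
    url: str
    size_chars: int    # page size in characters
    error_count: int   # number of error remarks reported by the validator


MIN_SIZE = 1_000     # assumed inclusive lower bound
MAX_SIZE = 50_000    # assumed inclusive upper bound
MAX_ERRORS = 200     # exclusive: the condition is "< 200"


def keep_for_statistics(page: PageResult) -> bool:
    """True if the page is neither extreme in size nor in error count."""
    in_size_range = MIN_SIZE <= page.size_chars <= MAX_SIZE
    return in_size_range and page.error_count < MAX_ERRORS


def filter_pages(pages: list[PageResult]) -> list[PageResult]:
    """Return only the pages that go into the final statistics."""
    return [p for p in pages if keep_for_statistics(p)]
```

A tiny page such as a FRAMESET stub, a huge page, or a page with 200 or more
error remarks is dropped by this filter.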

This set was partitioned into two groups -- IT-companies and other
companies. The reason for this partitioning was that one would think that
IT-companies (e.g. IT-vendors or IT consultants) would be better at
constructing correct pages than other companies (e.g. household appliance
vendors or transport companies) ...

 - number of IT companies: 76
 - number of other companies: 150

The result turned out to be that no clear difference, w.r.t. errors, could be
detected between these two types of companies, which was not what one would
expect.

Documentation:
-------------
There is a write-up (in Swedish ;-) ) of these things on:
  http://www.w3c.se/resources/office/papers/memo1/memo1.html

On that page there are also some diagrams that describe the
correlation between

 * size-of-page vs numbers-of-errors
 * size-of-page vs numbers-of-errors-per-1000-characters-of-page

Each green dot represents one of the 226 pages investigated.

For those that might want to look at the diagrams, the texts in these can be
translated as:

 * "antalet" = "number"
 * "sidstorlek" = "page size"
 * "anmärkningar" = "error remarks"
 * "antalet anmärkningar för sida" = "number of error remarks per page"
 * "antalet anmärkningar per 1000 tkn i sida" = "number of error remarks per 1000 chars in page"
 * "anmärkningar/1000 tkn" = "error remarks/1000 chars"
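
The quantity "error remarks/1000 chars" used in the second diagram is just the
error count normalised by page size. A minimal sketch, with a hypothetical
function name, might look like this:

```python
def errors_per_1000_chars(error_count: int, size_chars: int) -> float:
    """Error remarks per 1000 characters of page text."""
    if size_chars <= 0:
        raise ValueError("page size must be positive")
    return 1000.0 * error_count / size_chars
```

For example, a page of 5,000 characters with 25 error remarks has a density of
5 remarks per 1000 characters.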


The last two diagrams on that page portray the relationship between size of
pages and the number of pages of that size. Here we aggregate pages into
groups of pages (0-2500, 2501-5000, 5001-7500, ... page size in characters)
and do the correlation of the size represented by each group to the number of
pages in each group. This is nicely described as a Zipf distribution.
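
The grouping by page size can be sketched as follows. This is an illustrative
reconstruction, not the original analysis code: the function names are
hypothetical, and it assumes groups of 2,500 characters whose upper bounds are
inclusive (0-2500, 2501-5000, and so on).

```python
from collections import Counter

BIN_WIDTH = 2500  # characters per size group


def size_bin(size_chars: int) -> int:
    """Index of the size group: 0 for 0-2500, 1 for 2501-5000, ..."""
    return max(0, (size_chars - 1) // BIN_WIDTH)


def bin_label(index: int) -> str:
    """Human-readable range for a group index, e.g. '2501-5000'."""
    low = 0 if index == 0 else index * BIN_WIDTH + 1
    high = (index + 1) * BIN_WIDTH
    return f"{low}-{high}"


def pages_per_bin(sizes: list[int]) -> dict[str, int]:
    """Count how many pages fall into each size group, smallest group first."""
    counts = Counter(size_bin(s) for s in sizes)
    return {bin_label(i): counts[i] for i in sorted(counts)}
```

Plotting the counts returned by pages_per_bin against the group sizes gives
the kind of diagram described above.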

=================================

/olle

-- 
------------------------------------------------------------------
Olle Olsson   olleo@sics.se   Tel: +46 8 633 15 19  Fax: +46 8 751 72 30
	[Svenska W3C-kontoret: olleo@w3.org]
SICS [Swedish Institute of Computer Science]
Box 1263
SE - 164 29 Kista
Sweden
------------------------------------------------------------------
Received on Saturday, 6 March 2004 05:16:44 UTC