RE: The use of W3C standards in Denmark Part II -- some other data from Soren Johannessen on 2004-03-06 (public-evangelist@w3.org from March 2004)

From: Soren Johannessen <hal@ae35-unit.dk>
Date: Sat, 6 Mar 2004 12:07:36 +0100
To: "'Olle Olsson'" <olleo@sics.se>
Cc: "'Karl Dubost'" <karl@w3.org>, <public-evangelist@w3.org>
Message-ID: <000901c4036b$3be390d0$0300000a@ae35>
Hi Olle

One question does the Swedish government also recommend/advocate W3C
standards in  Swedish  governmental/national/municipal authorities home
page? It could be interesting to compare Denmark and Sweden 
(also Norway & Finland if they have same politics for use of W3C
Standards)

/Soren




Some statistics about correctness of web pages.

... which confirms some of the data presented in this thread.

Two years ago I made a minor survey of error frequency in Swedish web
pages. The aim was to get some initial data about how well standards are
adhered to.

I only looked at commercial websites, and only on the home page of these
sites. The commercial web sites selected were fetched from a list
companies registered on the stock exchange, a list that provides
information for financial analysts.

The investigation was, for simplicity restricted to automatic validity
checking, with some manual intervention in the selection and filtering
process.

The W3C HTML validator was the tool used, and no attempt was made to 
obtain a
finegrained classification of the types of errors found -- something
that is inherently difficult, if one tries an approach based on
automated analysis. At that point in time, the only quick-and-dirty way
of accessing the results of the validator was to extract ínformation
from the HTML page returned by the validator. Not a nice way, but it was
doable. There was some discussion at that point that a more programmatic
interface would be useful, e.g. "W3C Validator as a Web Service". But I
had to make do with what was available, hence the indirect way of
dissecting the page generated by the Validator.

There were a number of situations that had to be clearly identified, if
the outcome was to be trusted.  E.g., to be able to separate those cases

where no
real checking was made made by the Validator, I had to identify at least
those occurences where the Validator complained that it could not check
the page for some reason.

Some statistics
---------------

The initial list of companies (company web sites) that was used ...

 - number of initially selected pages: 330 pages

But some sites were down or returned something that was not HTML, so...

 - number of actually investigated pages: 280 pages

The sizes of the pages varied significantly:

 - minimum page size: 89 characters
 - maximum page size: 129 832 characters

DOCTYPE was a big problem:

 - percentage of pages that did not declare DOCTYPE: 76 %

... or at least no DOCTYPE was recognised by the validator!

No page passed W3C Validator checking! The span in error remarks was
 - minimum number of errors: 1
 - maximum number of errors: 591

As some pages triggered avalanches of errors, and some pages were
extreme in size, they could create bad effects on the statistics. So
"extreme" pages were filtered out, and statistics was only retained for
pages that fulfilled the conditions:

 - size of page: 1,000 -- 50,000 characters
 - number of error remarks on page: < 200

As to the reason for excluding "small" pages is obvious -- the size of a
correct (and reasonable) "hello world" page is on the order of 200
characters. Small pages are also found when FRAMESETs are used, and such
pages were not further studied.

This resulted in an effective set of pages ...

 - number of pages used in final statistics: 226

This set was partitioned into two groups -- IT-companies and other
companies. The reason for this partitioning was that one would think
that IT-companies (e.g. IT-vendors or IT consultants) would be better at
constructing correct pages than other companies (e.g. household
appliance vendors or transport companies) ...

 - number of IT companies: 76
 - number of other companies: 150

The result turned out to be that no clear difference, w.r.t. errors, 
could be
detected between these two types of companies, which was not what one
would expect.

Documentation:
-------------
There is a write-up (in Swedish ;-) ) of these things on:
  http://www.w3c.se/resources/office/papers/memo1/memo1.html

On that page there are also some diagrams that describe the correlation 
between

 * size-of-page vs numbers-of-errors
 * size-of-page vs numbers-of-errors-per-1000-characters-of-page

Each green dot represents one of the 226 pages investigated.

For those that might want to look at the diagrams, the texts in these
can be translated as:

 * "antalet" = "number"
 * "sidstorlek" = "page size"
 * "anmärkningar" = "error remarks"
 * "antalet anmärkningar för sida" = "number of error remarks per page"
 * "antalet anmärkningar per 1000 tkn i sida" = "number of error remarks

per 1000 chars in page"
 * "anmärkningar/1000 tkn" = "error remarks/1000 chars"


The two last diagrams on that page portray the relationship between size
of pages and the number of pages of that size. Here we aggregate pages
into groups of pages (0-2500, 2501-5000, 5000-7500, ... page size in
characters) and do the correlation of the size represented by each group
to the 
number of
pages in each group. This is nicely described as a Zipf distribution.

=================================

/olle

-- 
------------------------------------------------------------------
Olle Olsson   olleo@sics.se   Tel: +46 8 633 15 19  Fax: +46 8 751 72 30
	[Svenska W3C-kontoret: olleo@w3.org]
SICS [Swedish Institute of Computer Science]
Box 1263
SE - 164 29 Kista
Sweden
------------------------------------------------------------------
Received on Saturday, 6 March 2004 06:09:57 UTC