- From: Olle Olsson <olleo@sics.se>
- Date: Sat, 06 Mar 2004 11:16:36 +0100
- To: Soren Johannessen <hal@ae35-unit.dk>
- Cc: "'Karl Dubost'" <karl@w3.org>, public-evangelist@w3.org
Some statistics about correctness of web pages. ... which confirms some of the data presented in this thread. Two years ago I made a minor survey of error frequency in Swedish web pages. The aim was to get some initial data about how well standards are adhered to. I only looked at commercial websites, and only on the home page of these sites. The commercial web sites selected were fetched from a list companies registered on the stock exchange, a list that provides information for financial analysts. The investigation was, for simplicity restricted to automatic validity checking, with some manual intervention in the selection and filtering process. The W3C HTML validator was the tool used, and no attempt was made to obtain a finegrained classification of the types of errors found -- something that is inherently difficult, if one tries an approach based on automated analysis. At that point in time, the only quick-and-dirty way of accessing the results of the validator was to extract ínformation from the HTML page returned by the validator. Not a nice way, but it was doable. There was some discussion at that point that a more programmatic interface would be useful, e.g. "W3C Validator as a Web Service". But I had to make do with what was available, hence the indirect way of dissecting the page generated by the Validator. There were a number of situations that had to be clearly identified, if the outcome was to be trusted. E.g., to be able to separate those cases where no real checking was made made by the Validator, I had to identify at least those occurences where the Validator complained that it could not check the page for some reason. Some statistics --------------- The initial list of companies (company web sites) that was used ... - number of initially selected pages: 330 pages But some sites were down or returned something that was not HTML, so... - number of actually investigated pages: 280 pages The sizes of the pages varied significantly: - minimum page size: 89 characters - maximum page size: 129 832 characters DOCTYPE was a big problem: - percentage of pages that did not declare DOCTYPE: 76 % ... or at least no DOCTYPE was recognised by the validator! No page passed W3C Validator checking! The span in error remarks was - minimum number of errors: 1 - maximum number of errors: 591 As some pages triggered avalanches of errors, and some pages were extreme in size, they could create bad effects on the statistics. So "extreme" pages were filtered out, and statistics was only retained for pages that fulfilled the conditions: - size of page: 1,000 -- 50,000 characters - number of error remarks on page: < 200 As to the reason for excluding "small" pages is obvious -- the size of a correct (and reasonable) "hello world" page is on the order of 200 characters. Small pages are also found when FRAMESETs are used, and such pages were not further studied. This resulted in an effective set of pages ... - number of pages used in final statistics: 226 This set was partitioned into two groups -- IT-companies and other companies. The reason for this partitioning was that one would think that IT-companies (e.g. IT-vendors or IT consultants) would be better at constructing correct pages than other companies (e.g. household appliance vendors or transport companies) ... - number of IT companies: 76 - number of other companies: 150 The result turned out to be that no clear difference, w.r.t. errors, could be detected between these two types of companies, which was not what one would expect. Documentation: ------------- There is a write-up (in Swedish ;-) ) of these things on: http://www.w3c.se/resources/office/papers/memo1/memo1.html On that page there are also some diagrams that describe the correlation between * size-of-page vs numbers-of-errors * size-of-page vs numbers-of-errors-per-1000-characters-of-page Each green dot represents one of the 226 pages investigated. For those that might want to look at the diagrams, the texts in these can be translated as: * "antalet" = "number" * "sidstorlek" = "page size" * "anmärkningar" = "error remarks" * "antalet anmärkningar för sida" = "number of error remarks per page" * "antalet anmärkningar per 1000 tkn i sida" = "number of error remarks per 1000 chars in page" * "anmärkningar/1000 tkn" = "error remarks/1000 chars" The two last diagrams on that page portray the relationship between size of pages and the number of pages of that size. Here we aggregate pages into groups of pages (0-2500, 2501-5000, 5000-7500, ... page size in characters) and do the correlation of the size represented by each group to the number of pages in each group. This is nicely described as a Zipf distribution. ================================= /olle -- ------------------------------------------------------------------ Olle Olsson olleo@sics.se Tel: +46 8 633 15 19 Fax: +46 8 751 72 30 [Svenska W3C-kontoret: olleo@w3.org] SICS [Swedish Institute of Computer Science] Box 1263 SE - 164 29 Kista Sweden ------------------------------------------------------------------
Received on Saturday, 6 March 2004 05:16:44 UTC