- From: Gerald Oskoboiny <gerald@w3.org>
- Date: Sat, 29 Sep 2001 04:41:51 -0400
- To: www-qa@w3.org, www-validator@w3.org
- Message-ID: <20010929044151.B13029@w3.org>
Attached is a script we use at W3C to improve the quality of our site, by checking to make sure the most visited pages on our site are valid HTML. At the bottom of this message is sample output from the script.

Although we would like our site to be 100% valid, we have hundreds of thousands of documents on our site including many that are there for historical interest (some of them predating formal HTML specifications), and it isn't practical for us to make them all valid HTML.

Many sites would just remove these old documents instead of leaving them up there basically unmaintained, but I think we generally feel it's better to have this information online in an invalid form than not have it available at all. Ideally, we would have the manpower or technology to be able to go through and make them all valid, but in practice we have too many other things competing for our time to make that feasible.

Also, for some of the documents we have a policy of not changing them at all once they are published -- for example, our policy is not to change dated versions of documents in our TR space [1], so people can be sure they are discussing the same thing when talking about a particular version of a specification without having to compare checksums or something.

So in order to help us focus our effort where it will do the most good, I wrote a script to check the most-visited pages on our site for HTML validity, using the online HTML validator [2]; this script is run once a week, and sends mail to the whole W3C team notifying us of the top invalid documents on our site. (it sends mail to the whole team rather than individuals partly to act as a form of peer pressure: most people don't want their areas to show up in this report week after week.)

When we first ran this report against the documents on www.w3.org in September 1999, 20 of the top 53 documents on our site were invalid.
In the most recent report (below), 20 of the top 449 documents are invalid, and we know that at least 49.92% of the page views delivered to users are valid HTML.

That's quite an improvement, and most sites should be able to do much better than that; I think we have an unusually large number of documents that are there for historical interest.

This script won't be directly usable by other sites, because it relies on a bunch of things that are specific to our site (our log rotation scheme, our weird log file format, all the various DocumentRoots on our servers, etc.). But it might be useful as an example for other sites.

There are a number of obvious improvements that could be made, for example: keep a local cache of last-modified dates for each document and don't revalidate them each week if they haven't changed.

[1] http://www.w3.org/TR/
[2] http://validator.w3.org/

----- Forwarded message from Gerald Oskoboiny <gerald@w3.org> -----

Date: Wed, 26 Sep 2001 05:25:08 -0400 (EDT)
From: Gerald Oskoboiny <gerald@w3.org>
Subject: Most popular documents on our site, 2001-09-25
To: bar@example.org

Here are the most frequently-requested HTML documents on our site that do not validate, with their overall rank and the number of times they were requested from our site in the last 4 days.

  Rank    Hits  URI
------ -------  ---------------------------------------------------------
    80    2165  http://www.w3.org/TR/wsdl.html
   105    1708  http://www.w3.org/Security/Faq/www-security-faq.html
   132    1423  http://www.w3.org/Security/Faq/
   204    1007  http://www.w3.org/2001/XMLSchema.html
   240     898  http://www.w3.org/P3P/compliant_sites.php3
   263     831  http://www.w3.org/2000/01/sw/
   266     822  http://www.w3.org/MarkUp/html-spec/html-spec_8.html
   269     812  http://www.w3.org/TR/NOTE-VML.html
   278     796  http://www.w3.org/TR/REC-png.html
   320     685  http://www.w3.org/PICS/iacwcv2.htm
   335     642  http://www.w3.org/TR/NOTE-datetime.html
   355     618  http://www.w3.org/TR/1998/NOTE-XML-data-0105/
   366     603  http://www.w3.org/TR/1998/NOTE-compactHTML-19980209/
   382     585  http://www.w3.org/RDF/Validator/
   402     563  http://www.w3.org/TR/WD-logfile.html
   406     562  http://www.w3.org/MarkUp/html-spec/html-spec_5.html
   420     549  http://www.w3.org/TR/smil-animation/
   429     539  http://www.w3.org/Style/XSL/WhatIsXSL.html
   438     532  http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2001Sep/0273.html
   449     519  http://www.w3.org/TR/voicexml/

I checked a total of 449 documents to find the 20 above that didn't validate.

Among the top 449 documents served, 98.70% of the page views were valid HTML. These documents account for 49.92% of the page views on our site, so 49.92% is a minimum bound on the amount of our traffic which is valid HTML.

(all the numbers above exclude Apache's invalid DirectoryIndexes.)

Here is a list of the top 100 documents overall, valid or invalid:

  Rank    Hits  URI
------ -------  ---------------------------------------------------------
[...]

Last week's report was: mid:20010924184224.tid-18643@w3.org

This message was generated automatically by /u1/stats/bin/top-invalid-docs on www15.w3.org, last modified $Date: 2000/09/28 04:37:10 $.

If you have any questions about this report, please contact <baz@example.org>.

----- End forwarded message -----

--
Gerald Oskoboiny                   http://www.w3.org/People/Gerald/
World Wide Web Consortium (W3C)    http://www.w3.org/
tel:+1-613-261-6630                mailto:gerald@w3.org

Attachments
- text/plain attachment: top-invalid-docs
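
The message suggests one improvement to the attached script: keep a local cache of last-modified dates and skip revalidation for documents that have not changed. The following is a minimal Python sketch of that idea only; it is not taken from the attached script. The names `find_invalid`, `get_last_modified`, `validate`, and the cache layout are all hypothetical, and the two callables are stand-ins for whatever fetches a document's Last-Modified value and whatever asks the validator for a verdict.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical cache layout: URI -> (last-modified value, was it valid?)
Cache = Dict[str, Tuple[str, bool]]


def find_invalid(
    uris: Iterable[str],
    get_last_modified: Callable[[str], str],
    validate: Callable[[str], bool],
    cache: Cache,
) -> List[str]:
    """Return the invalid URIs, revalidating only new or changed documents."""
    invalid = []
    for uri in uris:
        stamp = get_last_modified(uri)
        cached = cache.get(uri)
        if cached is not None and cached[0] == stamp:
            # Unchanged since the previous run: reuse the stored verdict.
            is_valid = cached[1]
        else:
            # New or modified document: validate again and remember the result.
            is_valid = validate(uri)
            cache[uri] = (stamp, is_valid)
        if not is_valid:
            invalid.append(uri)
    return invalid
```

Passing the same `cache` dictionary to each weekly run (and saving it to disk in between) would mean `validate` is called only for documents whose last-modified value differs from the one recorded the week before. Injecting the two callables keeps the sketch independent of any particular HTTP library or validator interface.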
Received on Saturday, 29 September 2001 04:42:25 UTC