useful QA gizmo: check popular docs for HTML validity

Attached is a script we use at W3C to improve the quality of our
site, by checking to make sure the most visited pages on our site
are valid HTML. At the bottom of this message is sample output
from the script.

Although we would like our site to be 100% valid, we have
hundreds of thousands of documents on our site including many
that are there for historical interest (some of them predating
formal HTML specifications), and it isn't practical for us to
make them all valid HTML.

Many sites would just remove these old documents instead of
leaving them up there basically unmaintained, but I think we
generally feel it's better to have this information online in
an invalid form than not have it available at all. Ideally, we
would have the manpower or technology to be able to go through
and make them all valid, but in practice we have too many other
things competing for our time to make that feasible.

Also, for some of the documents we have a policy of not changing
them at all once they are published -- for example, our policy is
not to change dated versions of documents in our TR space [1],
so people can be sure they are discussing the same thing when
talking about a particular version of a specification without
having to compare checksums or something.

So in order to help us focus our effort where it will do the
most good, I wrote a script to check the most-visited pages on
our site for HTML validity, using the online HTML validator [2];
this script is run once a week, and sends mail to the whole
W3C team notifying us of the top invalid documents on our site.
(It sends mail to the whole team rather than individuals partly
to act as a form of peer pressure: most people don't want their
areas to show up in this report week after week.)
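
For anyone who wants to try the same idea, here is a minimal sketch
in Python. It is not the actual script, which depends on our own log
setup. The function names are hypothetical, the list of requested
URIs is assumed to have been pulled out of the access logs already,
and `is_valid` is a stand-in for whatever asks the validator about a
document.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Row = Tuple[int, int, str]  # (rank, hits, uri)


def rank_documents(requested_uris: Iterable[str]) -> List[Row]:
    """Return (rank, hits, uri) rows, most-requested first."""
    counts = Counter(requested_uris)
    # Sort by hits descending, then by URI so ties rank deterministically.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, hits, uri)
            for rank, (uri, hits) in enumerate(ordered, start=1)]


def top_invalid(ranked: List[Row],
                is_valid: Callable[[str], bool],
                wanted: int = 20) -> Tuple[List[Row], int]:
    """Walk the ranking until `wanted` invalid documents are found.

    Returns the invalid rows and the number of documents checked.
    """
    invalid: List[Row] = []
    checked = 0
    for rank, hits, uri in ranked:
        if len(invalid) >= wanted:
            break
        checked += 1
        if not is_valid(uri):
            invalid.append((rank, hits, uri))
    return invalid, checked


def format_report(invalid: List[Row]) -> str:
    """Lay the invalid rows out as a Rank / Hits / URI table."""
    lines = ["   Rank   Hits    URI",
             "  ------ ------- " + "-" * 57]
    for rank, hits, uri in invalid:
        lines.append(f"  {rank:>5}  {hits:>7}  {uri}")
    return "\n".join(lines)
```

Passing the validity check in as a function keeps the ranking logic
independent of how the validator is actually queried, and makes it
easy to try the loop out with a fake check first.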

When we first ran this report
in September 1999, 20 of the top 53 documents on our site were
invalid. In the most recent report (below), only 20 of the top 449
documents are invalid, and we know that about half of the page
views delivered to users are valid HTML. That's quite an
improvement, and most sites should be able to do much better than
that; I think we have an unusually large number of documents that
are there for historical interest.
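
The percentages in the report combine in a simple way, sketched here
in Python. The function and its parameter names are hypothetical, and
I'm assuming that the 98.70% figure is measured within the checked
documents while the 49.92% figure is those documents' share of all
page views. On that reading, the page views known to be valid are the
product of the two.

```python
def valid_share(checked_hits: int, invalid_hits: int, site_hits: int):
    """Return (valid share within the checked documents,
    checked documents' share of all page views,
    lower bound on the valid share of all page views)."""
    valid_within_checked = (checked_hits - invalid_hits) / checked_hits
    checked_of_site = checked_hits / site_hits
    return (valid_within_checked,
            checked_of_site,
            valid_within_checked * checked_of_site)


# Using the two percentages quoted in the report below:
lower_bound = 0.9870 * 0.4992
print(f"{lower_bound:.2%}")  # prints 49.27%
```

Documents outside the checked set are simply unknown, so this is only
a lower bound; the true figure could be considerably higher.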

This script won't be directly usable by other sites, because it
relies on a bunch of things that are specific to our site (our
log rotation scheme, our weird log file format, all the various
DocumentRoots on our servers, etc.). But it might be useful as an
example for other sites.

There are a number of obvious improvements that could be made,
for example: keep a local cache of last-modified dates for each
document and don't revalidate them each week if they haven't
changed.
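
The caching idea could look something like the following Python
sketch. None of this exists in the script today: the class name, the
JSON cache file, and the `validate` callback are all assumptions made
for illustration, and the last-modified value is assumed to come from
wherever is convenient (a file's mtime, say).

```python
import json
import os
from typing import Callable, Dict


class ValidationCache:
    """Remember each document's last-modified time and last result."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.entries: Dict[str, dict] = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.entries = json.load(fh)

    def check(self, uri: str, last_modified: float,
              validate: Callable[[str], bool]) -> bool:
        """Return the cached result if the document is unchanged,
        otherwise revalidate it and update the cache."""
        entry = self.entries.get(uri)
        if entry is not None and entry["last_modified"] == last_modified:
            return entry["valid"]
        valid = validate(uri)
        self.entries[uri] = {"last_modified": last_modified, "valid": valid}
        return valid

    def save(self) -> None:
        """Write the cache back to disk for next week's run."""
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump(self.entries, fh)
```

A weekly run would load the cache, call `check` for each document in
the ranking, and call `save` at the end, so only new or modified
documents are sent to the validator.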


----- Forwarded message from Gerald Oskoboiny <> -----

Date: Wed, 26 Sep 2001 05:25:08 -0400 (EDT)
From: Gerald Oskoboiny <>
Subject: Most popular documents on our site, 2001-09-25

Here are the most frequently-requested HTML documents on our site
that do not validate, with their overall rank and the number
of times they were requested from our site in the last 4 days.

   Rank   Hits    URI
  ------ ------- ---------------------------------------------------------
    80     2165
   105     1708
   132     1423
   204     1007
   240      898
   263      831
   266      822
   269      812
   278      796
   320      685
   335      642
   355      618
   366      603
   382      585
   402      563
   406      562
   420      549
   429      539
   438      532
   449      519

I checked a total of 449 documents to find the 20 above that didn't validate.
Among the top 449 documents served, 98.70% of the page views were valid HTML.

These documents account for 49.92% of the page views on our site, so 49.92%
is a minimum bound on the amount of our traffic which is valid HTML.

(all the numbers above exclude Apache's invalid DirectoryIndexes.)

Here is a list of the top 100 documents overall, valid or invalid:

   Rank    Hits   URI
  ------ ------- ---------------------------------------------------------

Last week's report was:

This message was generated automatically by /u1/stats/bin/top-invalid-docs
on, last modified $Date: 2000/09/28 04:37:10 $.
If you have any questions about this report, please contact <>.

----- End forwarded message -----

Gerald Oskoboiny
World Wide Web Consortium (W3C)

Received on Saturday, 29 September 2001 04:42:25 UTC