- From: Gerald Oskoboiny <gerald@w3.org>
- Date: Sat, 29 Sep 2001 04:41:51 -0400
- To: www-qa@w3.org, www-validator@w3.org
- Message-ID: <20010929044151.B13029@w3.org>
Attached is a script we use at W3C to improve the quality of our
site, by checking to make sure the most visited pages on our site
are valid HTML. At the bottom of this message is sample output
from the script.
Although we would like our site to be 100% valid, we have
hundreds of thousands of documents on our site including many
that are there for historical interest (some of them predating
formal HTML specifications), and it isn't practical for us to
make them all valid HTML.
Many sites would just remove these old documents instead of
leaving them up there basically unmaintained, but I think we
generally feel it's better to have this information online in
an invalid form than not have it available at all. Ideally, we
would have the manpower or technology to be able to go through
and make them all valid, but in practice we have too many other
things competing for our time to make that feasible.
Also, for some of the documents we have a policy of not changing
them at all once they are published -- for example, our policy is
not to change dated versions of documents in our TR space [1],
so people can be sure they are discussing the same thing when
talking about a particular version of a specification without
having to compare checksums or something.
So in order to help us focus our effort where it will do the
most good, I wrote a script to check the most-visited pages on
our site for HTML validity, using the online HTML validator [2];
this script is run once a week, and sends mail to the whole
W3C team notifying us of the top invalid documents on our site.
(It sends mail to the whole team rather than individuals partly
to act as a form of peer pressure: most people don't want their
areas to show up in this report week after week.)
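As a rough illustration of the selection step described above, here is a minimal sketch in Python. It is not the attached script. The function name, the `want` parameter, and the caller-supplied `is_valid` callback (which would wrap whatever validator you use) are all assumptions made for the example. It assumes the script walks documents from most to least requested and stops once it has found enough invalid ones.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Row = Tuple[int, int, str]  # (rank, hits, uri)


def top_invalid(requests: Iterable[str],
                is_valid: Callable[[str], bool],
                want: int = 20) -> Tuple[List[Row], int]:
    """Collect the `want` most-requested documents that fail validation.

    `requests` is one URI per logged request; `is_valid` is a hypothetical
    callback supplied by the caller. Returns (rows, checked), where
    `checked` is the number of documents examined before stopping.
    """
    hits = Counter(requests)
    rows: List[Row] = []
    checked = 0
    # most_common() yields (uri, count) pairs, highest count first.
    for rank, (uri, count) in enumerate(hits.most_common(), start=1):
        checked += 1
        if not is_valid(uri):
            rows.append((rank, count, uri))
            if len(rows) == want:
                break
    return rows, checked
```

The `checked` count is returned separately so that a report can state how many documents had to be examined to fill the list.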
When we first ran this report against the documents on www.w3.org
in September 1999, 20 of the top 53 documents on our site were
invalid. In the most recent report (below), 20 of the top 449
documents are invalid, and we know that roughly half of the page
views delivered to users are valid HTML. That's quite an
improvement, and most sites should be able to do much better than
that; I think we have an unusually large number of documents that
are there for historical interest.
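To make the page-view arithmetic concrete, here is a small sketch in Python using the two percentages from the report quoted below. It assumes the figures combine multiplicatively: the share of all page views that fall within the checked documents, times the share of those views that were valid. The function name is invented for the example.

```python
def known_valid_share(coverage: float, valid_within: float) -> float:
    """Fraction of all page views known to be valid HTML.

    coverage     -- fraction of all page views going to the checked documents
    valid_within -- fraction of those page views that went to valid documents
    """
    return coverage * valid_within


# Figures taken from the report: 49.92% coverage, 98.70% valid within it.
share = known_valid_share(0.4992, 0.9870)
print(f"{share:.2%}")  # prints 49.27%
```

Under that assumption the known-valid share comes out slightly below the coverage figure itself. Documents outside the checked set are simply unknown, so the result is a lower bound.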
This script won't be directly usable by other sites, because it
relies on a bunch of things that are specific to our site (our
log rotation scheme, our weird log file format, all the various
DocumentRoots on our servers, etc.). But it might be useful as an
example for other sites.
There are a number of obvious improvements that could be made,
for example: keep a local cache of last-modified dates for each
document and don't revalidate them each week if they haven't
changed.
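The caching idea could look something like the following sketch in Python. Nothing here comes from the attached script: the cache layout, the function name, and the `validate` callback are assumptions, and how the last-modified value is obtained (file mtime, HTTP header, etc.) is left to the caller.

```python
from typing import Callable, Dict, Tuple

# Hypothetical cache layout: uri -> (last_modified, was_valid)
Cache = Dict[str, Tuple[str, bool]]


def check_with_cache(uri: str,
                     last_modified: str,
                     cache: Cache,
                     validate: Callable[[str], bool]) -> bool:
    """Return the validity of `uri`, revalidating only if it has changed.

    `last_modified` is whatever change marker the caller has for the
    document; `validate` is a caller-supplied function that does the
    actual (expensive) validation.
    """
    cached = cache.get(uri)
    if cached is not None and cached[0] == last_modified:
        return cached[1]  # unchanged since last run: reuse the old result
    result = validate(uri)
    cache[uri] = (last_modified, result)
    return result
```

Persisting the dictionary between weekly runs (for example as a small file on disk) would be needed for this to save any work.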
[1] http://www.w3.org/TR/
[2] http://validator.w3.org/
----- Forwarded message from Gerald Oskoboiny <gerald@w3.org> -----
Date: Wed, 26 Sep 2001 05:25:08 -0400 (EDT)
From: Gerald Oskoboiny <gerald@w3.org>
Subject: Most popular documents on our site, 2001-09-25
To: bar@example.org
Here are the most frequently-requested HTML documents on our site
that do not validate, with their overall rank and the number
of times they were requested from our site in the last 4 days.
  Rank    Hits URI
------ ------- ---------------------------------------------------------
    80    2165 http://www.w3.org/TR/wsdl.html
   105    1708 http://www.w3.org/Security/Faq/www-security-faq.html
   132    1423 http://www.w3.org/Security/Faq/
   204    1007 http://www.w3.org/2001/XMLSchema.html
   240     898 http://www.w3.org/P3P/compliant_sites.php3
   263     831 http://www.w3.org/2000/01/sw/
   266     822 http://www.w3.org/MarkUp/html-spec/html-spec_8.html
   269     812 http://www.w3.org/TR/NOTE-VML.html
   278     796 http://www.w3.org/TR/REC-png.html
   320     685 http://www.w3.org/PICS/iacwcv2.htm
   335     642 http://www.w3.org/TR/NOTE-datetime.html
   355     618 http://www.w3.org/TR/1998/NOTE-XML-data-0105/
   366     603 http://www.w3.org/TR/1998/NOTE-compactHTML-19980209/
   382     585 http://www.w3.org/RDF/Validator/
   402     563 http://www.w3.org/TR/WD-logfile.html
   406     562 http://www.w3.org/MarkUp/html-spec/html-spec_5.html
   420     549 http://www.w3.org/TR/smil-animation/
   429     539 http://www.w3.org/Style/XSL/WhatIsXSL.html
   438     532 http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2001Sep/0273.html
   449     519 http://www.w3.org/TR/voicexml/
I checked a total of 449 documents to find the 20 above that didn't validate.
Among the top 449 documents served, 98.70% of the page views were valid HTML.
These documents account for 49.92% of the page views on our site, so 49.92%
is a minimum bound on the amount of our traffic which is valid HTML.
(all the numbers above exclude Apache's invalid DirectoryIndexes.)
Here is a list of the top 100 documents overall, valid or invalid:
Rank Hits URI
------ ------- ---------------------------------------------------------
[...]
Last week's report was: mid:20010924184224.tid-18643@w3.org
This message was generated automatically by /u1/stats/bin/top-invalid-docs
on www15.w3.org, last modified $Date: 2000/09/28 04:37:10 $.
If you have any questions about this report, please contact <baz@example.org>.
----- End forwarded message -----
--
Gerald Oskoboiny http://www.w3.org/People/Gerald/
World Wide Web Consortium (W3C) http://www.w3.org/
tel:+1-613-261-6630 mailto:gerald@w3.org
Attachments
- text/plain attachment: top-invalid-docs
Received on Saturday, 29 September 2001 04:42:25 UTC