Re: useful QA gizmo: check popular docs for HTML validity

On Sat, Sep 29, 2001 at 04:41:51AM -0400, Gerald Oskoboiny wrote:
> Attached is a script we use at W3C to improve the quality of our
> site, by checking to make sure the most visited pages on our site
> are valid HTML. At the bottom of this message is sample output
> from the script.

Here is a bit of commentary on the code itself... some of it must
seem pretty obscure.

> # top-invalid-docs: generate and mail a report of the most popular
> # invalid HTML documents

> # [slightly altered version of
> # $Id: top-invalid-docs,v 1.13 2000/09/28 04:37:10 gerald Exp $ ]

fyi, the main things I altered were the email addresses, to avoid
us getting extra spam to these addresses, and to avoid getting
reports from sites that might happen to copy the code and forget
to change the email addresses.

> $days   = 4;

I use 4 days' worth of logs rather than a whole week because the
logs are so large that I think 7 days would kill the machine
running this script: the big "sort | uniq | sort" pipeline would
run out of memory.
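
A rough sketch of one possible way around that memory limit, assuming
the number of distinct URIs is much smaller than the number of log
lines: tally the URIs in a Perl hash rather than sorting every line.
This is not what top-invalid-docs does; the sub names count_uris and
top_uris and the sample data are made up for illustration, and it has
not been tried against the real logs.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from top-invalid-docs: tally one URI per
# input line in a hash, so memory grows with the number of distinct URIs
# rather than with the number of log lines.
sub count_uris {
    my ($fh) = @_;
    my %hits;
    while ( my $uri = <$fh> ) {
        chomp $uri;
        next unless length $uri;
        $hits{$uri}++;
    }
    return \%hits;
}

# Return up to $limit [uri, hits] pairs, most requested first;
# ties are broken alphabetically so the order is stable.
sub top_uris {
    my ( $hits, $limit ) = @_;
    my @ranked = sort { $hits->{$b} <=> $hits->{$a} or $a cmp $b } keys %$hits;
    splice @ranked, $limit if @ranked > $limit;
    return map { [ $_, $hits->{$_} ] } @ranked;
}

# Made-up sample data standing in for the URIs pulled from the logs.
my $sample = "/\n/TR/\n/\n/MarkUp/\n/TR/\n/\n";
open( my $fh, '<', \$sample ) or die "can't open in-memory sample: $!";
printf "%6d %s\n", $_->[1], $_->[0] for top_uris( count_uris($fh), 2 );
close $fh;
```

Whether this actually fits in memory depends on how many distinct
URIs show up in the period being counted.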

> while (<LOG>) {
>     chomp;
>     ($server) = (/([^ ]+)$/);		# grab the last word of the line

>     @f = split;
>     next unless $f[1] =~ /^2/;
>     next unless $f[11] =~ /text\/html/i;

This stuff parses our weird log format, which is created with
this entry in our Apache httpd.conf (wrapped for readability):

    W3C_CustomLog logs/complete_log "%{%Y-%m-%dT%H:%M:%SZ}t %s %b
    %T %f %h %u \"%r\" \"%{Referer}i\" \"%{Content-Type}o\"
    \"%{Last-Modified}o\" \"%{User-agent}i\" \"%{Host}i\""

hmm... I'm not sure why that's W3C_CustomLog instead of just
CustomLog; I think it's there because we hacked our Apache to log
an extra item not provided by the regular module (the amount of
time it took to serve each request.) Here's documentation on the
regular CustomLog directive:

    http://httpd.apache.org/docs/mod/mod_log_config.html#customlog
    http://httpd.apache.org/docs/mod/mod_log_config.html#formats

And here's a sample entry from our logs (wrapped again):

    2001-09-29T04:56:20Z 200 2414 na
    /usr/local/validator/htdocs/images/vxhtml10.png 64.230.72.87 -
    "GET /images/vxhtml10 HTTP/1.0" "http://validator.w3.org/"
    "image/png" "Fri, 14 Sep 2001 02:00:46 GMT"
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4) Gecko/20010913"
    "validator.w3.org"

(normally it would be a violation of our privacy policy to
release this info, but this log entry was made by me so I'm
pretty sure that's okay. :)
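
To make the field numbers in the parsing code quoted earlier concrete,
here is a small sketch that runs the same split over the sample entry.
The sub names parse_entry and wanted are made up for illustration, and
stripping the quotes from the Host value is an assumption; the quoted
snippet doesn't show what the real script does with $server.

```perl
use strict;
use warnings;

# The sample entry from above, joined back onto one line.
my $line = join ' ',
    '2001-09-29T04:56:20Z 200 2414 na',
    '/usr/local/validator/htdocs/images/vxhtml10.png 64.230.72.87 -',
    '"GET /images/vxhtml10 HTTP/1.0" "http://validator.w3.org/"',
    '"image/png" "Fri, 14 Sep 2001 02:00:46 GMT"',
    '"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4) Gecko/20010913"',
    '"validator.w3.org"';

# Hypothetical helper: pull out the fields the report cares about.
sub parse_entry {
    my ($entry) = @_;
    my ($server) = $entry =~ /([^ ]+)$/;    # last word: the quoted Host header
    $server =~ tr/"//d;                     # assumption: drop the quotes
    my @f = split ' ', $entry;
    return {
        server => $server,
        status => $f[1],     # HTTP status code
        path   => $f[8],     # request path, between the method and protocol
        type   => $f[11],    # Content-Type, still with its leading quote
    };
}

# Same two tests as the quoted loop: a 2xx status and an HTML content type.
sub wanted {
    my ($e) = @_;
    return $e->{status} =~ /^2/ && $e->{type} =~ /text\/html/i ? 1 : 0;
}

my $e = parse_entry($line);
print "server=$e->{server} status=$e->{status} path=$e->{path} type=$e->{type}\n";
print wanted($e) ? "counted\n" : "skipped (not a successful text/html response)\n";
```

Note that the indexes only line up when the user field and the
Referer contain no spaces; an entry with a space in either would shift
everything after it by one or more positions.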

> if ( length( $prevmsgid ) ) {
>     $references = "\nIn-Reply-To: <$prevmsgid>\nReferences: <$prevmsgid>";
>     $last_week_text = "\nLast week's report was: mid:$prevmsgid\n";
> }

This msgid business is here so the messages will be part of the
same email thread. (probably not necessary.)

> print MAIL <<"EOHD";
> Subject: Most popular documents on our site, $nice_ymd

Originally the subject was "most popular invalid documents..."
and it didn't include the list of top documents overall, but I
bowed to peer pressure and started including that info as well,
then changed the subject accordingly.

>     open( WGET, "$lynx -source http://validator.w3.org/check\?uri=$uri | " ) ||
> 	warn "couldn't open pipe to lynx for URI $uri! $!";

>     if(grep(/Sorry/,@results)) {

Obviously, this 'grep Sorry' stuff isn't a very precise way to
see if the document is valid or not; we hope to add some kind of
mode to the W3C HTML validator to provide results in a machine-
readable format. (XML, RDF, or EARL or something; suggestions
welcome.)
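
Until such a mode exists, one option is to keep the heuristic in a
single sub so it is easy to swap out later. The sketch below is only
that: the name looks_invalid is made up, the sample lines are invented
stand-ins rather than real validator output, and treating empty output
as "unknown" is an assumption about what would be useful. With the
test exactly as quoted above, an empty @results would contain no
"Sorry" and so would not be flagged.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the 'grep Sorry' test. Returns:
#   undef  if there was no output at all (fetch failed, so we can't tell)
#   1      if any line contains "Sorry" (treated as invalid)
#   0      otherwise (treated as valid)
sub looks_invalid {
    my (@results) = @_;
    return undef unless @results;
    return ( grep { /Sorry/ } @results ) ? 1 : 0;
}

# Invented stand-ins for validator output, for illustration only.
my @cases = (
    [ 'no output' => [] ],
    [ 'has Sorry' => ["<h2>Sorry, made-up failure text</h2>\n"] ],
    [ 'no Sorry'  => ["<h2>made-up success text</h2>\n"] ],
);

for my $case (@cases) {
    my ( $label, $lines ) = @$case;
    my $verdict = looks_invalid(@$lines);
    printf "%-10s %s\n", $label,
        defined $verdict ? ( $verdict ? 'invalid' : 'valid' ) : 'unknown';
}
```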

> 	# uncomment this if/when directoryindexes are included in the report
>         # $uri =~ s,/$,/  [ Apache DirectoryIndex ],;

Hmm... it's kind of a long story why I decided to exclude Apache
directoryindexes from the report; basically, I don't think that
particular form of invalidity is very harmful, nobody really
wanted to invest the time necessary to make these indexes valid,
and some of the early reports would just be cluttered up with
info about directoryindexes we didn't plan to fix... so I decided
to exclude them from the reports (and stats :/) entirely. It would
be good for Someone to get around to fixing that one day. (that is,
if it hasn't been fixed already in the most recent Apaches.)

-- 
Gerald Oskoboiny     http://www.w3.org/People/Gerald/
World Wide Web Consortium (W3C)    http://www.w3.org/
tel:+1-613-261-6630             mailto:gerald@w3.org

Received on Saturday, 29 September 2001 05:21:40 UTC