Re: useful QA gizmo: check popular docs for HTML validity

From: Gerald Oskoboiny (gerald@w3.org)
Date: Sat, Sep 29 2001

    Date: Sat, 29 Sep 2001 05:21:01 -0400
    From: Gerald Oskoboiny <gerald@w3.org>
    To: www-qa@w3.org, www-validator@w3.org
    Message-ID: <20010929052101.C13029@w3.org>
    Subject: Re: useful QA gizmo: check popular docs for HTML validity
    
    On Sat, Sep 29, 2001 at 04:41:51AM -0400, Gerald Oskoboiny wrote:
    > Attached is a script we use at W3C to improve the quality of our
    > site, by checking to make sure the most visited pages on our site
    > are valid HTML. At the bottom of this message is sample output
    > from the script.
    
    Here is a bit of commentary on the code itself... some of it must
    seem pretty obscure.
    
    > # top-invalid-docs: generate and mail a report of the most popular
    > # invalid HTML documents
    
    > # [slightly altered version of
    > # $Id: top-invalid-docs,v 1.13 2000/09/28 04:37:10 gerald Exp $ ]
    
    FYI, the main things I altered were the email addresses, both to
    avoid attracting extra spam to them and to avoid getting reports
    from sites that happen to copy the code and forget to change the
    addresses.
    
    > $days   = 4;
    
    I use 4 days' worth of logs rather than a whole week because the
    logs are so large that I think trying to use 7 days would kill
    the machine running this script: the big "sort | uniq | sort"
    pipeline would run out of memory.
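
    For readers wondering what that pipeline is for: it ranks URIs by
    how often they appear in the logs. The sketch below does the same
    tally with a Perl hash. It is not what the script does, only an
    illustration of the computation; the helper name top_uris and the
    sample data are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: tally how often each URI appears and return
# [uri, count] pairs, most popular first (ties broken alphabetically).
sub top_uris {
    my ($limit, @uris) = @_;
    my %hits;
    $hits{$_}++ for @uris;
    my @ranked = sort { $hits{$b} <=> $hits{$a} || $a cmp $b } keys %hits;
    splice(@ranked, $limit) if @ranked > $limit;   # keep only the top $limit
    return map { [ $_, $hits{$_} ] } @ranked;
}

# Made-up sample data, not real log contents.
my @sample = qw(/ /TR/ / /MarkUp/ /TR/ /);
printf "%5d %s\n", $_->[1], $_->[0] for top_uris(2, @sample);
```

    A hash keeps one counter per distinct URI, so its size depends on
    the number of different documents requested rather than on the
    number of log lines.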
    
    > while (<LOG>) {
    >     chomp;
    >     ($server) = (/([^ ]+)$/);		# grab the last word of the line
    
    >     @f = split;
    >     next unless $f[1] =~ /^2/;
    >     next unless $f[11] =~ /text\/html/i;
    
    This stuff parses our weird log format, which is created with
    this entry in our Apache httpd.conf (wrapped for readability):
    
        W3C_CustomLog logs/complete_log "%{%Y-%m-%dT%H:%M:%SZ}t %s %b
        %T %f %h %u \"%r\" \"%{Referer}i\" \"%{Content-Type}o\"
        %%\"%{Last-Modified}o\" \"%{User-agent}i\" \"%{Host}i\""
    
    Hmm... I'm not sure why that's W3C_CustomLog instead of just
    CustomLog; I think it's there because we hacked our Apache to log
    an extra item not provided by the regular module (the amount of
    time it took to serve each request). Here's documentation on the
    regular CustomLog directive:
    
        http://httpd.apache.org/docs/mod/mod_log_config.html#customlog
        http://httpd.apache.org/docs/mod/mod_log_config.html#formats
    
    And here's a sample entry from our logs (wrapped again):
    
        2001-09-29T04:56:20Z 200 2414 na
        /usr/local/validator/htdocs/images/vxhtml10.png 64.230.72.87 -
        "GET /images/vxhtml10 HTTP/1.0" "http://validator.w3.org/"
        "image/png" "Fri, 14 Sep 2001 02:00:46 GMT"
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4) Gecko/20010913"
        "validator.w3.org"
    
    (normally it would be a violation of our privacy policy to
    release this info, but this log entry was made by me so I'm
    pretty sure that's okay. :)
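
    To make the field positions concrete, here is a small sketch that
    runs the same three extractions from the quoted loop over the
    sample entry above. The helper name parse_entry is hypothetical;
    the regex, the split, and the field indexes come from the script.

```perl
use strict;
use warnings;

# The sample entry from above, joined back into one line.
my $line = join ' ',
    '2001-09-29T04:56:20Z 200 2414 na',
    '/usr/local/validator/htdocs/images/vxhtml10.png 64.230.72.87 -',
    '"GET /images/vxhtml10 HTTP/1.0" "http://validator.w3.org/"',
    '"image/png" "Fri, 14 Sep 2001 02:00:46 GMT"',
    '"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4) Gecko/20010913"',
    '"validator.w3.org"';

# Hypothetical helper mirroring the extractions in the quoted loop.
sub parse_entry {
    my ($entry) = @_;
    my ($server) = $entry =~ /([^ ]+)$/;   # last word: the quoted Host header
    my @f = split ' ', $entry;             # whitespace-separated fields
    return { server => $server, status => $f[1], type => $f[11] };
}

my $e = parse_entry($line);
print "server=$e->{server} status=$e->{status} type=$e->{type}\n";
```

    For this entry the status field is 200, so it passes the /^2/
    test, but field 11 is "image/png", so the text/html test skips it.
    Note that the fixed index 11 only lines up when the request line
    has exactly three words and the Referer contains no spaces.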
    
    > if ( length( $prevmsgid ) ) {
    >     $references = "\nIn-Reply-To: <$prevmsgid>\nReferences: <$prevmsgid>";
    >     $last_week_text = "\nLast week's report was: mid:$prevmsgid\n";
    > }
    
    This msgid business is here so the messages will be part of the
    same email thread. (probably not necessary.)
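
    As a sketch of how that threading works: mail readers group
    messages by following the In-Reply-To and References headers back
    to an earlier Message-ID. The helper name thread_headers and the
    Message-ID below are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the extra header lines that thread this
# week's report under last week's, or an empty string when there is
# no previous Message-ID to point at.
sub thread_headers {
    my ($prevmsgid) = @_;
    return '' unless defined $prevmsgid && length $prevmsgid;
    return "\nIn-Reply-To: <$prevmsgid>\nReferences: <$prevmsgid>";
}

# Made-up Message-ID, for illustration only.
print "Subject: weekly report", thread_headers('report-1@example.org'), "\n";
```

    Starting the string with a newline lets it be appended directly
    after another header line, and an empty result leaves the header
    block unchanged for the very first report.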
    
    > print MAIL <<"EOHD";
    > Subject: Most popular documents on our site, $nice_ymd
    
    Originally the subject was "most popular invalid documents..."
    and the report didn't include the list of top documents overall,
    but I bowed to peer pressure and started including that info as
    well, then changed the subject accordingly.
    
    >     open( WGET, "$lynx -source http://validator.w3.org/check\?uri=$uri | " ) ||
    > 	warn "couldn't open pipe to lynx for URI $uri! $!";
    
    >     if(grep(/Sorry/,@results)) {
    
    Obviously, this 'grep Sorry' stuff isn't a very precise way to
    see if the document is valid or not; we hope to add some kind of
    mode to the W3C HTML validator to provide results in a machine-
    readable format. (XML, RDF, or EARL or something; suggestions
    welcome.)
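
    A minimal sketch of the check as it stands, wrapped in a function
    so the imprecision is easy to see. The helper name looks_invalid
    and the sample result lines are made up; they are not real
    validator output.

```perl
use strict;
use warnings;

# Hypothetical helper: guess validity from the validator's HTML output.
# Assumption taken from the quoted script: a result page containing the
# word "Sorry" means the document did not validate.
sub looks_invalid {
    my (@results) = @_;
    my $matches = grep { /Sorry/ } @results;   # count of matching lines
    return $matches ? 1 : 0;
}

# Made-up result lines, for illustration only.
my @bad  = ("<h2>Results</h2>\n", "Sorry, this document does not validate.\n");
my @good = ("<h2>Results</h2>\n", "No errors found.\n");
printf "bad=%d good=%d\n", looks_invalid(@bad), looks_invalid(@good);
```

    Any page that contains the word for some other reason would be
    counted as invalid too, which is why a machine-readable result
    format would be a better basis for this test.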
    
    > 	# uncomment this if/when directoryindexes are included in the report
    >         # $uri =~ s,/$,/  [ Apache DirectoryIndex ],;
    
    Hmm... it's kind of a long story why I decided to exclude Apache
    directoryindexes from the report; basically, I don't think that
    particular form of invalidity is very harmful, nobody really
    wanted to invest the time necessary to make these indexes valid,
    and some of the early reports would just be cluttered up with
    info about directoryindexes we didn't plan to fix... so I decided
    to exclude them from the reports (and stats :/) entirely. It would
    be good for Someone to get around to fixing that one day. (that is,
    if it hasn't been fixed already in the most recent Apaches.)
    
    -- 
    Gerald Oskoboiny     http://www.w3.org/People/Gerald/
    World Wide Web Consortium (W3C)    http://www.w3.org/
    tel:+1-613-261-6630             mailto:gerald@w3.org