- From: Gerald Oskoboiny <gerald@w3.org>
- Date: Sat, 29 Sep 2001 05:21:01 -0400
- To: www-qa@w3.org, www-validator@w3.org
On Sat, Sep 29, 2001 at 04:41:51AM -0400, Gerald Oskoboiny wrote:
> Attached is a script we use at W3C to improve the quality of our
> site, by checking to make sure the most visited pages on our site
> are valid HTML. At the bottom of this message is sample output
> from the script.

Here is a bit of commentary on the code itself... some of it must
seem pretty obscure.

> # top-invalid-docs: generate and mail a report of the most popular
> # invalid HTML documents
> # [slightly altered version of
> # $Id: top-invalid-docs,v 1.13 2000/09/28 04:37:10 gerald Exp $ ]

fyi, the main things I altered were the email addresses, to avoid
us getting extra spam to these addresses, and to avoid getting
reports from sites that might happen to copy the code and forget
to change the email addresses.

> $days = 4;

I use 4 days worth of logs rather than a whole week because the
logs are so large that I think trying to use 7 days would kill the
machine running this script due to lack of enough memory for the
big "sort | uniq | sort" pipeline.

> while (<LOG>) {
>     chomp;
>     ($server) = (/([^ ]+)$/); # grab the last word of the line
>     @f = split;
>     next unless $f[1] =~ /^2/;
>     next unless $f[11] =~ /text\/html/i;

This stuff parses our weird log format, which is created with this
entry in our Apache httpd.conf (wrapped for readability):

    W3C_CustomLog logs/complete_log "%{%Y-%m-%dT%H:%M:%SZ}t %s %b %T
        %f %h %u \"%r\" \"%{Referer}i\" \"%{Content-Type}o\"
        %%\"%{Last-Modified}o\" \"%{User-agent}i\" \"%{Host}i\""

hmm... I'm not sure why that's W3C_CustomLog instead of just
CustomLog; I think it's there because we hacked our Apache to log
an extra item not provided by the regular module (the amount of
time it took to serve each request.)
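As a rough illustration of why those field indices work (a standalone
sketch, not part of the script): splitting a line in this format on
whitespace puts the status code in field 1 and the quoted Content-Type
in field 11, as long as the user and Referer fields contain no spaces.
The log line, the parse_line helper, and the quote-stripping step below
are all made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical log line in the format described above; every value
# here is invented for illustration.
my $line = '2001-09-29T05:00:00Z 200 5120 na /var/www/index.html 192.0.2.1 - '
         . '"GET /index.html HTTP/1.0" "http://www.example.org/" "text/html" '
         . '"Fri, 14 Sep 2001 02:00:46 GMT" "ExampleBrowser/1.0" "www.example.org"';

# Return a hash ref for successful HTML responses, or nothing otherwise.
sub parse_line {
    my ($entry) = @_;
    chomp $entry;

    # The last word of the line is the quoted Host header.
    my ($server) = $entry =~ /([^ ]+)$/;
    return unless defined $server;
    $server =~ s/"//g;    # assumption: strip the surrounding quotes

    # Whitespace split: 0 = timestamp, 1 = status, 8 = request path,
    # 11 = Content-Type (still wrapped in quotes).
    my @f = split ' ', $entry;
    return unless @f > 11;
    return unless $f[1]  =~ /^2/;             # 2xx responses only
    return unless $f[11] =~ /text\/html/i;    # HTML documents only

    return { server => $server, status => $f[1], path => $f[8] };
}

my $hit = parse_line($line);
print "$hit->{server}$hit->{path}\n" if $hit;
```

Counting server-plus-path strings like the one printed here is one
plausible way to rank the most visited documents, though the real
script may build its keys differently.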

Here's documentation on the regular CustomLog directive:

    http://httpd.apache.org/docs/mod/mod_log_config.html#customlog
    http://httpd.apache.org/docs/mod/mod_log_config.html#formats

And here's a sample entry from our logs (wrapped again):

    2001-09-29T04:56:20Z 200 2414 na
        /usr/local/validator/htdocs/images/vxhtml10.png 64.230.72.87 -
        "GET /images/vxhtml10 HTTP/1.0" "http://validator.w3.org/"
        "image/png" "Fri, 14 Sep 2001 02:00:46 GMT"
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4) Gecko/20010913"
        "validator.w3.org"

(normally it would be a violation of our privacy policy to release
this info, but this log entry was made by me so I'm pretty sure
that's okay. :)

> if ( length( $prevmsgid ) ) {
>     $references = "\nIn-Reply-To: <$prevmsgid>\nReferences: <$prevmsgid>";
>     $last_week_text = "\nLast week's report was: mid:$prevmsgid\n";
> }

This msgid business is here so the messages will be part of the
same email thread. (probably not necessary.)

> print MAIL <<"EOHD";
> Subject: Most popular documents on our site, $nice_ymd

Originally the subject was "most popular invalid documents..." and
it didn't include the list of top documents overall, but I bowed
to peer pressure and started including that info as well, then
changed the subject accordingly.

> open( WGET, "$lynx -source http://validator.w3.org/check\?uri=$uri | " ) ||
>     warn "couldn't open pipe to lynx for URI $uri! $!";
> if(grep(/Sorry/,@results)) {

Obviously, this 'grep Sorry' stuff isn't a very precise way to see
if the document is valid or not; we hope to add some kind of mode
to the W3C HTML validator to provide results in a machine-readable
format. (XML, RDF, or EARL or something; suggestions welcome.)

> # uncomment this if/when directoryindexes are included in the report
> # $uri =~ s,/$,/ [ Apache DirectoryIndex ],;

Hmm...
it's kind of a long story why I decided to exclude Apache
directoryindexes from the report; basically, I don't think that
particular form of invalidity is very harmful, nobody really
wanted to invest the time necessary to make these indexes valid,
and some of the early reports would just be cluttered up with info
about directoryindexes we didn't plan to fix... so I decided to
exclude them from the reports (and stats :/) entirely. It would be
good for Someone to get around to fixing that one day. (that is,
if it hasn't been fixed already in the most recent Apaches.)

--
Gerald Oskoboiny       http://www.w3.org/People/Gerald/
World Wide Web Consortium (W3C)    http://www.w3.org/
tel:+1-613-261-6630             mailto:gerald@w3.org
Received on Saturday, 29 September 2001 05:21:40 UTC