useful QA gizmo: check popular docs for HTML validity

From: Gerald Oskoboiny (gerald@w3.org)
Date: Sat, Sep 29 2001

    Date: Sat, 29 Sep 2001 04:41:51 -0400
    From: Gerald Oskoboiny <gerald@w3.org>
    To: www-qa@w3.org, www-validator@w3.org
    Message-ID: <20010929044151.B13029@w3.org>
    Subject: useful QA gizmo: check popular docs for HTML validity
    
    
    Attached is a script we use at W3C to improve the quality of our
    site, by checking to make sure the most visited pages on our site
    are valid HTML. At the bottom of this message is sample output
    from the script.
    
    Although we would like our site to be 100% valid, we have
    hundreds of thousands of documents on our site including many
    that are there for historical interest (some of them predating
    formal HTML specifications), and it isn't practical for us to
    make them all valid HTML.
    
    Many sites would just remove these old documents instead of
    leaving them up there basically unmaintained, but I think we
    generally feel it's better to have this information online in
    an invalid form than not have it available at all. Ideally, we
    would have the manpower or technology to be able to go through
    and make them all valid, but in practice we have too many other
    things competing for our time to make that feasible.
    
    Also, for some of the documents we have a policy of not changing
    them at all once they are published -- for example, our policy is
    not to change dated versions of documents in our TR space [1],
    so people can be sure they are discussing the same thing when
    talking about a particular version of a specification without
    having to compare checksums or something.
    
    So in order to help us focus our effort where it will do the
    most good, I wrote a script to check the most-visited pages on
    our site for HTML validity, using the online HTML validator [2];
    this script is run once a week, and sends mail to the whole
    W3C team notifying us of the top invalid documents on our site.
    (It sends mail to the whole team rather than individuals partly
    to act as a form of peer pressure: most people don't want their
    areas to show up in this report week after week.)
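
    As a rough illustration of the approach described above (not the
    attached script itself), here is a minimal Python sketch. It
    assumes Common Log Format, which our real logs do not use, and
    takes the validity check as a caller-supplied function, so how the
    validator [2] is actually queried is left out. The names
    rank_documents, top_invalid and format_report are hypothetical.
    Stopping after 20 invalid documents is also an assumption, inferred
    from the shape of the report below.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# Assumption: Common Log Format; only successful (200) GET requests are counted.
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/[\d.]+" 200 ')

Entry = Tuple[int, int, str]  # (rank, hits, uri)


def rank_documents(log_lines: Iterable[str], base: str) -> List[Entry]:
    """Count hits per URI and return entries ordered most-requested first."""
    hits: Counter = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            hits[base + match.group(1)] += 1
    # Sort by descending hit count, then by URI so ties are deterministic.
    ordered = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, count, uri) for rank, (uri, count) in enumerate(ordered, start=1)]


def top_invalid(ranked: List[Entry],
                is_valid: Callable[[str], bool],
                limit: int = 20) -> Tuple[List[Entry], int]:
    """Walk down the ranking until `limit` invalid documents have been found.

    Returns the invalid entries and the number of documents checked.
    """
    invalid: List[Entry] = []
    checked = 0
    for rank, count, uri in ranked:
        checked += 1
        if not is_valid(uri):
            invalid.append((rank, count, uri))
            if len(invalid) == limit:
                break
    return invalid, checked


def format_report(invalid: List[Entry], checked: int) -> str:
    """Render the invalid entries as a plain-text table for the weekly mail."""
    lines = ["   Rank   Hits    URI", "  ------ ------- " + "-" * 40]
    for rank, count, uri in invalid:
        lines.append(f"  {rank:>5}  {count:>7}   {uri}")
    lines.append("")
    lines.append(f"I checked a total of {checked} documents to find the "
                 f"{len(invalid)} above that didn't validate.")
    return "\n".join(lines)
```

    Taking is_valid as a parameter keeps the ranking logic independent
    of the validator, which also makes it easy to try out against a
    stub before pointing it at a real site.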
    
    When we first ran this report against the documents on www.w3.org
    in September 1999, 20 of the top 53 documents on our site were
    invalid. In the most recent report (below), only 20 of the top 449
    documents are invalid, and we know that at least 49.92% of the page
    views delivered to users are valid HTML. That's quite an
    improvement, and most sites should be able to do much better than
    that; I think we have an unusually large number of documents that
    are there for historical interest.
    
    This script won't be directly usable by other sites, because it
    relies on a bunch of things that are specific to our site (our
    log rotation scheme, our weird log file format, all the various
    DocumentRoots on our servers, etc.). But it might be useful as an
    example for other sites.
    
    There are a number of obvious improvements that could be made,
    for example: keep a local cache of last-modified dates for each
    document and don't revalidate them each week if they haven't
    changed.
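
    The caching idea could look something like the following Python
    sketch. This is not part of the current script. The function name
    check_with_cache and the cache layout are made up for illustration,
    and it assumes the caller already has a last-modified value for
    each document (for example from the file's modification time or an
    HTTP Last-Modified header).

```python
from typing import Callable, Dict, Tuple

# Hypothetical cache layout: uri -> (last_modified, was_valid)
Cache = Dict[str, Tuple[str, bool]]


def check_with_cache(uri: str,
                     last_modified: str,
                     cache: Cache,
                     validate: Callable[[str], bool]) -> bool:
    """Return the validity of `uri`, revalidating only if it has changed.

    `validate` is the (expensive) real check; it is called only when the
    document is new to the cache or its last-modified value differs from
    the one recorded at the previous check.
    """
    cached = cache.get(uri)
    if cached is not None and cached[0] == last_modified:
        return cached[1]  # unchanged since last week: reuse the old result
    result = validate(uri)
    cache[uri] = (last_modified, result)
    return result
```

    To be useful from week to week the cache would have to be saved to
    disk between runs; that part is omitted here.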
    
    [1] http://www.w3.org/TR/
    [2] http://validator.w3.org/
    
    ----- Forwarded message from Gerald Oskoboiny <gerald@w3.org> -----
    
    Date: Wed, 26 Sep 2001 05:25:08 -0400 (EDT)
    From: Gerald Oskoboiny <gerald@w3.org>
    Subject: Most popular documents on our site, 2001-09-25
    To: bar@example.org
    
    Here are the most frequently-requested HTML documents on our site
    that do not validate, with their overall rank and the number
    of times they were requested from our site in the last 4 days.
    
       Rank   Hits    URI
      ------ ------- ---------------------------------------------------------
        80     2165   http://www.w3.org/TR/wsdl.html
       105     1708   http://www.w3.org/Security/Faq/www-security-faq.html
       132     1423   http://www.w3.org/Security/Faq/
       204     1007   http://www.w3.org/2001/XMLSchema.html
       240      898   http://www.w3.org/P3P/compliant_sites.php3
       263      831   http://www.w3.org/2000/01/sw/
       266      822   http://www.w3.org/MarkUp/html-spec/html-spec_8.html
       269      812   http://www.w3.org/TR/NOTE-VML.html
       278      796   http://www.w3.org/TR/REC-png.html
       320      685   http://www.w3.org/PICS/iacwcv2.htm
       335      642   http://www.w3.org/TR/NOTE-datetime.html
       355      618   http://www.w3.org/TR/1998/NOTE-XML-data-0105/
       366      603   http://www.w3.org/TR/1998/NOTE-compactHTML-19980209/
       382      585   http://www.w3.org/RDF/Validator/
       402      563   http://www.w3.org/TR/WD-logfile.html
       406      562   http://www.w3.org/MarkUp/html-spec/html-spec_5.html
       420      549   http://www.w3.org/TR/smil-animation/
       429      539   http://www.w3.org/Style/XSL/WhatIsXSL.html
       438      532   http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2001Sep/0273.html
       449      519   http://www.w3.org/TR/voicexml/
    
    I checked a total of 449 documents to find the 20 above that didn't validate.
         
    Among the top 449 documents served, 98.70% of the page views were valid HTML.
    
    These documents account for 49.92% of the page views on our site, so 49.92%
    is a minimum bound on the amount of our traffic which is valid HTML.
    
    (all the numbers above exclude Apache's invalid DirectoryIndexes.)
    
    Here is a list of the top 100 documents overall, valid or invalid:
    
       Rank    Hits   URI
      ------ ------- ---------------------------------------------------------
    [...]
    
    Last week's report was: mid:20010924184224.tid-18643@w3.org
    
    This message was generated automatically by /u1/stats/bin/top-invalid-docs
    on www15.w3.org, last modified $Date: 2000/09/28 04:37:10 $.
     
    If you have any questions about this report, please contact <baz@example.org>.
      
    
    ----- End forwarded message -----
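
    For anyone wondering how summary figures like the ones in the
    report can be derived, here is one plausible way, sketched in
    Python. The script's actual arithmetic is not shown in this
    message, so treat this as an assumption; the function name
    validity_summary and the numbers in the usage note are
    hypothetical. The "minimum bound" is a lower bound because the
    documents that were never checked are counted as if none of them
    were valid.

```python
from typing import Iterable, Set, Tuple


def validity_summary(checked: Iterable[Tuple[str, int]],
                     invalid_uris: Set[str],
                     total_site_views: int) -> Tuple[float, float, float]:
    """Summarise validity by page views rather than by document count.

    checked          -- (uri, hits) for every document that was validated
    invalid_uris     -- the subset of checked URIs that failed validation
    total_site_views -- page views for the whole site, checked or not

    Returns three percentages:
      1. share of the checked documents' page views that were valid
      2. share of all site page views covered by the checked documents
      3. lower bound on the share of all site page views that were valid
    """
    checked = list(checked)
    checked_views = sum(hits for _, hits in checked)
    valid_views = sum(hits for uri, hits in checked if uri not in invalid_uris)
    pct_valid_among_checked = 100.0 * valid_views / checked_views
    pct_coverage = 100.0 * checked_views / total_site_views
    # Unchecked documents contribute nothing here, hence "minimum bound".
    pct_min_valid_site = 100.0 * valid_views / total_site_views
    return pct_valid_among_checked, pct_coverage, pct_min_valid_site
```

    With made-up numbers: if three checked documents received 60, 30
    and 10 page views, the one with 10 is invalid, and the whole site
    served 200 page views, then 90% of the checked page views were
    valid, the checked documents cover 50% of the site's traffic, and
    at least 45% of all page views were valid HTML.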
    
    -- 
    Gerald Oskoboiny     http://www.w3.org/People/Gerald/
    World Wide Web Consortium (W3C)    http://www.w3.org/
    tel:+1-613-261-6630             mailto:gerald@w3.org