Building a test corpus for ER tools from Nick Kew on 2002-03-06 (w3c-wai-er-ig@w3.org from March 2002)

From: Nick Kew <nick@webthing.com>
Date: Wed, 6 Mar 2002 21:59:03 +0000 (GMT)
To: <w3c-wai-er-ig@w3.org>
Message-ID: <20020306210051.B1308-100000@fenris.webthing.com>

It occurs to me that this has been discussed a little on IRC, but that
posting here will reach a wider readership.  So here goes.  I hope
this makes sense, though it may very well not :-)

Some of us have been using the W3C Annotea server to store EARL.
I recently completed a posting agent that enables Valet to post
directly to the server.  One thought that occurred to me was
to make this operational, so that Page Valet would save its own
results and use a cron job to submit to Annotea.  This could
rapidly build a corpus of test results that could be matched
against evaluations of the same pages by other agents, in
particular human agents.

Now, EARL and Annotea are different, and as things stand, embedding
EARL in Annotea is a clumsy hack.  Furthermore, ericP (of the Annotea
server) doesn't think it's very sensible, and I take his point
insofar as lots of EARL evaluations might be seen as "noise" in
his database.  OTOH the server and client technologies developed
for Annotea could certainly be adapted for EARL, and become the
basis for building a corpus.  The exercise of doing so will of course
also be valuable real-life experience with EARL in a larger-scale
project than has hitherto been undertaken.

If I were to propose a corpus-building project along the following
lines, what support might it expect?

 * Set up a new EARL server, based on the Annotea server.  Ensure that
   it properly tracks changes in pages, as this is more important to
   us than to Annotea.
 * Adapt existing Annotea clients to work with it.
 * Document the differences between our server and Annotea.
 * Use automated processes (such as Page Valet users, and/or spiders)
   to assemble a corpus.  There is an issue here of unrepresentative
   sampling, but that's probably not important.
 * Use this infrastructure to collect human evaluations, either "blind"
   (just evaluate a page) or in the context of existing evaluations.
 * Develop a methodology for evaluating tools against the same corpus.
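
As a rough illustration of the change tracking and tool-versus-human
comparison in the list above, here is a minimal sketch.  It is not
taken from Valet, Annotea or EARL: the record layout, the use of a
content hash to identify a page version, and every name in it
(Evaluation, fingerprint, disagreements) are assumptions made purely
for the example.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    """One test result (hypothetical layout, not the EARL schema)."""
    page_url: str
    content_hash: str  # identifies the page version that was evaluated
    checkpoint: str    # illustrative label, e.g. a WCAG checkpoint number
    evaluator: str
    is_human: bool
    passed: bool


def fingerprint(page_bytes: bytes) -> str:
    """Assumed change-tracking scheme: hash the page content."""
    return hashlib.sha256(page_bytes).hexdigest()


def disagreements(evaluations):
    """Return (tool, human) pairs that judged the same page version and
    checkpoint differently.  Results for different versions of a page
    are never compared with each other."""
    groups = defaultdict(list)
    for ev in evaluations:
        groups[(ev.page_url, ev.content_hash, ev.checkpoint)].append(ev)

    pairs = []
    for group in groups.values():
        tools = [ev for ev in group if not ev.is_human]
        humans = [ev for ev in group if ev.is_human]
        for tool_ev in tools:
            for human_ev in humans:
                if tool_ev.passed != human_ev.passed:
                    pairs.append((tool_ev, human_ev))
    return pairs
```

Each pair returned by disagreements() is a candidate for closer
inspection: either the tool is wrong, or the human reading of the
guideline differs from the tool's.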


I believe this will help developers like me to improve our tools, but
the more important result will be new insights into WCAG itself.
The latter might arise, for example, where human evaluators disagree
with Valet, but it is clear that Valet is right according to WCAG.


-- 
Nick Kew

Site Valet - the mark of Quality on the Web.
<URL:http://valet.webthing.com/>

Received on Wednesday, 6 March 2002 16:59:09 UTC