Confidence Values in Accessibility Evaluation

Some weeks ago I took an action to write a note about confidence
values, noting that it was going to be a while before I'd have
time to write anything.

As I recollect it, the gist of the discussion was how
confidences (High/Medium/Low) relate to outcomes
(Pass/Fail/CannotTell), and why a low-confidence fail
needs to be distinct from a pass at any confidence level.

The crucial point here is that accessibility analysis happens
on different levels.  Different levels of analysis call for
different vocabularies, with of course a common core.

Firstly, to deal with the confidences themselves.  They are
largely arbitrary, but are designed to capture differences
in the likelihood of an individual test indicating a violation
of the guidelines.  Note that this is entirely orthogonal to
the importance (A/AA/AAA) of any given violation.

To take a few examples, going from high to low confidence
(sketched in code after the list):
 * An IMG with no ALT is a violation.  There is no doubt, and
   the tool can say so with certainty.  An IMG with ALT=SPACER
   is almost certainly a violation, but might be correct in
   exceptional cases.
 * A BLOCKQUOTE may or may not be a violation.  Since BLOCKQUOTE
   is widely abused for indentation, any particular use of the
   element is at quite a high risk of being an abuse, unless
   the tool can infer otherwise.  If it has a cite=... attribute
   then the tool can indeed infer that the usage is correct.
   If it doesn't, the tool should ask the evaluator to verify it.
 * A STYLE=... attribute to any element might possibly be used to
   convey vital information that becomes inaccessible without it.
   But this is rare in practice.  If the tool cannot tell whether
   a style is safe, it should flag up a note just to alert the
   evaluator.
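
By way of illustration, here is a minimal Python sketch of how a
tool might assign such per-test confidences.  It is not taken from
any real tool: the ConfidenceChecker class, the outcome strings and
the three-level scale are all made up for the example, and only the
three tests above are modelled.

    from html.parser import HTMLParser

    HIGH, MEDIUM, LOW = "High", "Medium", "Low"   # illustrative scale

    class ConfidenceChecker(HTMLParser):
        """Collects (element, outcome, confidence) for the examples above."""

        def __init__(self):
            super().__init__()
            self.warnings = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "img":
                if "alt" not in attrs:
                    # No ALT at all: a violation beyond doubt.
                    self.warnings.append(("img", "Fail", HIGH))
                elif attrs["alt"] and attrs["alt"].lower() == "spacer":
                    # ALT=SPACER: almost certainly wrong, not quite certain.
                    self.warnings.append(("img", "Fail", MEDIUM))
            elif tag == "blockquote" and "cite" not in attrs:
                # Widely abused for indentation: ask the evaluator to verify.
                self.warnings.append(("blockquote", "CannotTell", MEDIUM))
            elif "style" in attrs:
                # A style is only rarely load-bearing; just alert the evaluator.
                self.warnings.append((tag, "CannotTell", LOW))

    checker = ConfidenceChecker()
    checker.feed('<img src="a.png"><blockquote>quoted text</blockquote>'
                 '<p style="color:red">styled text</p>')
    print(checker.warnings)

Run on that snippet, it reports the missing ALT as a certain Fail,
and the other two as CannotTell at medium and low confidence
respectively.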

Now when evaluating a page, a tool may apply thousands of such tests,
and the tool developer's most difficult task is to find a middle way
between omitting important detail and overwhelming the user with
mostly-irrelevant detail (valet has been criticised simultaneously
for offering too much and too little detail, and deals with the
problem primarily by offering the user a choice of presentations
to meet differing needs and expectations).
Given thousands of individual tests, most of which a page passes,
it is certainly not helpful to present the user with every
irrelevant result.  Instead of recording every test passed, it
should refer the user to general documentation, which will
point out, for example, that the tool tests all images for ALT
attributes.  If no warnings were generated, the user can infer
that all images have them.
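
In code terms that filtering is trivial; the Result tuple and the
test names below are hypothetical, for illustration only:

    from collections import namedtuple

    Result = namedtuple("Result", "test outcome confidence")

    def detailed_report(results):
        # Passes are implied by the general documentation of what
        # the tool tests; only Fail and CannotTell merit entries.
        return [r for r in results if r.outcome != "Pass"]

    results = [
        Result("img-alt", "Pass", "High"),
        Result("blockquote-cite", "CannotTell", "Medium"),
        Result("deprecated-markup", "Fail", "High"),
    ]
    print(detailed_report(results))   # only the two non-Pass entries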

That means that *every* test result within a detailed page report
is a Fail or CannotTell.  If the tests themselves are implemented
as binary pass/fail, then it is always a Fail of the tool's test
that we report.  The tool may designate some tests as certain and
others as uncertain, but that's a crude distinction and barely
more helpful than old-fashioned Bobby's manual checks - which
consistently get ignored - or indeed the ultimate reduction of
including the entire guidelines as manual checks in every report!

This is where confidence levels are helpful.  Though arbitrary
and imperfect, they provide a much finer and more useful distinction
than simply a page full of "cannottell" results.  Every confidence
reported represents the tool's confidence that the test failed.

They may also be used in compiling whole-page results from the
individual warnings.  A page containing deprecated markup is
an automatic fail at WCAG-AA or higher, whereas a valid/strict
page containing lower-confidence warnings gets flagged up as
CannotTell - or some variant on that.  A page that has been
verified by the evaluator ticking every automatic warning as
"condition satisfied" or "not applicable" - or indeed a page
that generates no warnings whatsoever - is flagged as a Pass.
This whole-page level, and upwards to Sites and Applications,
calls for the Pass/Fail/CannotTell vocabulary that is
irrelevant within a detailed page analysis.
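
A sketch of that compilation step, again with hypothetical field
names ('confidence', 'verified') rather than anything from a real
tool:

    def page_outcome(warnings):
        """Whole-page Pass/Fail/CannotTell from individual warnings.

        Each warning carries a 'confidence' of High/Medium/Low and a
        'verified' flag, set once the evaluator has ticked it off as
        "condition satisfied" or "not applicable".
        """
        unresolved = [w for w in warnings if not w.get("verified")]
        if not unresolved:
            # No warnings at all, or every warning verified: Pass.
            return "Pass"
        if any(w["confidence"] == "High" for w in unresolved):
            # e.g. deprecated markup: automatic fail at WCAG-AA or higher.
            return "Fail"
        # Only lower-confidence warnings remain: needs human judgement.
        return "CannotTell"

    print(page_outcome([]))                                    # Pass
    print(page_outcome([{"confidence": "High"}]))              # Fail
    print(page_outcome([{"confidence": "Low"}]))               # CannotTell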

-- 
Nick Kew