Re: Confidences in accessibility evaluation

I think that a deeper insight into the underlying model might be
useful. Let me suggest, once again, that we go back to probabilities.

First of all, tests might have a non-binary applicability. 

For example, a test like "any given IMG should have a syntactically
valid ALT attribute" would have a certain (100%) applicability; a
test like "if a given TABLE represents a data table, then its first
row and first column should be made of TH cells" will probably have
an applicability below 100% (unless the tool uses a perfect criterion
for the distinction, which as far as I know is not available yet).
Applicability might be defined as "the probability that the test
applies correctly to the given element of the page".
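
To make the idea concrete, here is a minimal sketch (plain Python,
purely illustrative; the class and field names are my own invention,
not EARL vocabulary) of a test carrying an applicability estimate:

    from dataclasses import dataclass

    @dataclass
    class Test:
        description: str
        applicability: float  # prob. that the test applies correctly to an element

    img_alt = Test("IMG has a syntactically valid ALT attribute", 1.0)
    data_table_th = Test("first row/column of a data table are TH cells", 0.5)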

Second, whatever the representation of a test's outcomes (e.g.
pass/fail), I think the outcome should also be bound to probabilities.
For example, we could talk about prob{outcome of test T applied to
element E is "fail"} = 0.6.

Third, the applicability of the test to an element and the probability
of the test yielding fail should be combined (by multiplying them). In
this way a test (for example the data-table test mentioned above) might
have applicability 0.5 and might yield fail with probability 0.8 when
it does apply. The overall probability of the result being fail is then
0.5 * 0.8 = 0.4, the probability of it being success is 0.5 * 0.2 = 0.1,
and the remaining 0.5 accounts for the test not applying at all.
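
As a worked sketch of this combination (again purely illustrative,
with made-up variable names):

    # applicability of the data-table test to a given TABLE
    applicability = 0.5
    # probability of "fail" given that the test does apply
    fail_given_applicable = 0.8

    p_fail    = applicability * fail_given_applicable        # 0.5 * 0.8 = 0.4
    p_success = applicability * (1 - fail_given_applicable)  # 0.5 * 0.2 = 0.1
    # the remaining 0.5 is the probability that the test does not apply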

If the tool user has decided on some thresholds (or other criteria) to
sift these probabilities, then one can derive the desired outcomes
(success/fail/cannottell). For example, if the overall probability of
the outcome being fail is above 0.7 then output=FAIL; if the
probability of the outcome being success is above 0.7 then
output=SUCCESS; else output=CANNOTTELL. Notice that it is the tool
user, or perhaps the user of the EARL report produced by the tool, who
might want to define these thresholds/criteria.
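
A hypothetical thresholding rule along these lines might look like the
following (the 0.7 cutoff and the function name are just examples, to
be chosen by the tool user or by the consumer of the report):

    def discrete_outcome(p_fail, p_success, threshold=0.7):
        if p_fail > threshold:
            return "FAIL"
        if p_success > threshold:
            return "SUCCESS"
        return "CANNOTTELL"

    # with the data-table example above (p_fail=0.4, p_success=0.1):
    print(discrete_outcome(0.4, 0.1))   # -> CANNOTTELL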

How these probabilities are computed or determined is another story
(for example, one could run experiments with the tool and sample the
number of times it produced the wrong output for each of the tests).
Or one could assign values of 100% or 0% to hand-made evaluations.
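
For instance, one rough way to estimate such a probability empirically
(a sketch assuming hand-made evaluations are available as a gold
standard; the function name is hypothetical):

    def estimated_error_rate(tool_verdicts, human_verdicts):
        # fraction of sampled elements where the tool's verdict for one
        # test disagrees with the hand-made evaluation
        disagreements = sum(1 for t, h in zip(tool_verdicts, human_verdicts)
                            if t != h)
        return disagreements / len(tool_verdicts)

    # e.g. tool: ["fail", "fail", "pass"], humans: ["fail", "pass", "pass"]
    # gives an estimated error rate of 1/3 for that test
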
But in any case, I think EARL should allow one to state the
applicability of a test and the outcome of a test in terms of
probabilities. Having a clearly defined underlying model helps in
deciding how to represent information, and therefore in deciding how
to use it. And having a detailed model does not mean that every user
of the EARL language is required to use that detail. If appropriate
abstractions can be defined in the language, then the underlying model
might go unnoticed by most users.

My best,

-- 
        Giorgio Brajnik
______________________________________________________________________
Dip. di Matematica e Informatica   | voice: +39 (0432) 55.8445
Università di Udine                | fax:   +39 (0432) 55.8499
Via delle Scienze, 206             | email: giorgio@dimi.uniud.it
Loc. Rizzi -- 33100 Udine -- ITALY | http://www.dimi.uniud.it/giorgio


On 7/12/05, Nick Kew <nick@webthing.com> wrote:
> 
> Confidence Values in Accessibility Evaluation
> 
> Some weeks ago I took an action to write a note about confidence
> values, noting that it was going to be a while before I'd have
> time to write anything.
> 
> As I recollect it, the gist of the discussion was how
> confidences (High/Medium/Low) relate to outcomes
> (Pass/Fail/CannotTell), and why a low-confidence fail
> needs to be distinct from a pass at any confidence level.
> 
> The crucial point here is that accessibility analysis happens
> on different levels.  Different levels of analysis call for
> different vocabularies, with of course a common core.
> 
> Firstly, to deal with the confidences themselves.  They are
> largely arbitrary, but are designed to deal with differences
> in the likelihood of an individual test indicating a violation
> of the guidelines.  Note that this is totally orthogonal to
> the importance (A/AA/AAA) of any given violation.
> 
> To take a few examples, going from high to low confidence:
>  * An IMG with no ALT is a violation.  There is no doubt, and
>    the tool can say so with certainty.  An IMG with ALT=SPACER
>    is almost certainly a violation, but might be correct in
>    exceptional cases.
>  * A BLOCKQUOTE may or may not be a violation.  Since BLOCKQUOTE
>    is widely abused for indentation, any particular use of the
>    element is at quite a high risk of being an abuse, unless
>    the tool can infer otherwise.  If it has a cite=... attribute
>    then the tool can indeed infer that the usage is correct.
>    If it doesn't, the tool should ask the evaluator to verify it.
>  * A STYLE=... attribute to any element might possibly be used to
>    convey vital information that becomes inaccessible without it.
>    But this is rare in practice.  If the tool cannot tell whether
>    a style is safe, it should flag up a note just to alert the
>    evaluator.
> 
> Now when evaluating a page, a tool may apply thousands of such tests,
> and the tool developer's most difficult task is to find a middle way
> between omitting important detail and overwhelming the user with
> mostly-irrelevant detail (valet has been criticised simultaneously
> for offering too much or too little detail, and deals with the
> problem primarily by offering the user a choice of presentations
> to meet differing needs and expectations).
> Given thousands of individual tests, most of which a page passes,
> it is certainly not helpful to present the user with every
> irrelevant result.  Instead of recording every test passed, it
> should refer the user to general documentation, which will
> point out for example that the tool tests all images for ALT
> attributes.  If no warnings were generated you can infer that
> all images have them.
> 
> That means that *every* test result within a detailed page report
> is a Fail or CannotTell.  If the tests themselves are implemented
> as binary pass/fail, then it is always a Fail of the tool's test
> that we report.  The tool may designate some tests as certain and
> others as uncertain, but that's a crude distinction and barely
> more helpful than old-fashioned Bobby's manual checks - which
> consistently get ignored - or indeed the ultimate reduction of
> including the entire guidelines as manual checks in every report!
> 
> This is where confidence levels are helpful.  Though arbitrary
> and imperfect, they provide a much finer and more useful distinction
> than simply a page full of "cannottell" results.  Every confidence
> reported represents the tool's confidence that the test failed.
> 
> They may also be used in compiling whole-page results from the
> individual warnings.  A page containing deprecated markup is
> an automatic fail at WCAG-AA or higher, whereas a valid/strict
> page containing lower-confidence warnings gets flagged up as
> CannotTell - or some variant on that.  A page that has been
> verified by the evaluator ticking every automatic warning as
> "condition satisfied" or "not applicable" - or indeed a page
> that generates no warnings whatsoever - is flagged as a Pass.
> This whole-page level, and upwards to Sites and Applications,
> calls for the Pass/Fail/CannotTell vocabulary that is
> irrelevant within a detailed page analysis.
> 
> --
> Nick Kew
> 
>

Received on Tuesday, 12 July 2005 18:04:09 UTC