Re: Confidences in accessibility evaluation

Giorgio Brajnik wrote:
> I think that a deeper insight into the underlying model might be
> useful. Let me suggest again to go back to probabilities.

In principle I'd be happy with that.  But in practice, I don't
see how we can usefully work with probabilities other than as
very fuzzy numerical versions of High/Medium/Low.  That potentially
takes us back to 1980s AI.
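To make "fuzzy numerical versions" concrete: in practice any numeric scale
would just get read through cut-offs along these lines (a minimal sketch in
Python; the bands are invented, not a proposal):

    # Band cut-offs are made up purely for illustration; not a proposal.
    def confidence_band(p):
        """Collapse a numeric probability into a descriptive confidence level."""
        if p >= 0.9:
            return "High"
        if p >= 0.6:
            return "Medium"
        return "Low"

    print(confidence_band(0.95))  # High
    print(confidence_band(0.65))  # Medium
    print(confidence_band(0.3))   # Low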

> First of all, tests might have a non-binary applicability. 
> 
> For example, a test like "any given IMG should have a syntactically
> valid ALT attribute" would have a certain (100%) applicability; a
> test like "if a given TABLE represents a data-table, then the first
> row and first column should be made of TH" will probably have an
> applicability that is below 100% (unless the tool uses a perfect
> criterion for distinction, which as far as I know is not available
> yet). Applicability might be defined as "the probability that the test
> applies correctly to the given element of the page".

Indeed.  But that's part of the test spec.  A related issue is
that a test may be applicable to more than one guideline: currently
we have no satisfactory way to express this, except by grouping
guidelines together.

I think that may be an argument for introducing new terms into EARL
(at the cost of added complexity).  I don't think it has a bearing
on confidence levels as currently used.

> Second, whatever the representation of the outcomes of a test (e.g.
> pass/fail) I think this outcome should also be bound to probabilities.
> For example we could talk about the probability prob{outcome of test T
> applied to element E is "fail"}=0.6.

Are you saying 60% of all elements E will fail test T?

If so, that may be useful information about the *test*, but it's not
part of the *report*.  What the report is concerned with is the
probability that E fails *given the outcome of T in this instance*.
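To make that distinction concrete, here is a toy calculation in which every
number is invented: the error rates characterise the test itself, while the
posterior is what a report about this particular element would need.

    # Toy Bayes sketch; all rates are invented for illustration.
    prior_fail     = 0.10  # assumed base rate of genuinely failing elements
    p_flag_if_fail = 0.90  # test says "fail" when the element really fails
    p_flag_if_ok   = 0.05  # test says "fail" when the element is actually fine

    p_flag = p_flag_if_fail * prior_fail + p_flag_if_ok * (1 - prior_fail)
    p_fail_given_flag = p_flag_if_fail * prior_fail / p_flag

    print(round(p_fail_given_flag, 2))  # 0.67: neither the 0.90 nor the base rate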

But probabilities are very hard to estimate.  Descriptive confidence
values are less open to misinterpretation and abuse than numerical
probability values.

> Third, applicability of the test to an element and probability of the
> test giving fail should be combined (by multiplying them). In this way
> a test (for example the one with data tables mentioned above) might
> have applicability=0.5 and it might yield fail with prob. 0.8. This
> means that the overall probability (of the result being fail) is 0.4
> and the prob. of success is 0.1.

How is that relevant to an assessment?  If the test has been applied
then there is an outcome.  If it was not applicable then the outcome
may be meaningless, but that's an assertion about the report, not
about the subject of the report.
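For the record, the arithmetic behind the quoted example comes out as follows
(the figures are Giorgio's own, purely illustrative); note that the remaining
0.5 is the probability that the test did not apply at all:

    # Giorgio's worked figures, spelled out; they are illustrative values only.
    applicability        = 0.5  # chance the data-table test applies at all
    p_fail_if_applicable = 0.8

    p_fail           = applicability * p_fail_if_applicable        # 0.4
    p_success        = applicability * (1 - p_fail_if_applicable)  # 0.1
    p_not_applicable = 1 - applicability                           # 0.5

    print(round(p_fail, 2), round(p_success, 2), round(p_not_applicable, 2))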

> If the tool user has decided on some thresholds (or other criteria) to
> sift these probabilities, then one could get the desired outcomes
> (success/fail/cannottell). For example, if the overall probability of
> outcome being fail is above .7 then output=FAIL; if probab. of outcome
> being success is above .7 then output=SUCCESS; else output=CANNOTTELL.
> Notice that it is the tool user, or perhaps the user of the EARL
> report produced by the tool, that might want to define these
> thresholds/criteria.

If you want to work with numerical values, that's fine by me.
I'll remain sceptical about this approach until and unless you can
demonstrate it.  I say that based on quite a lot of work in the
field of statistical evaluations, mostly for national government
clients (and as external assessor for PhD work).
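To be clear about what I'm sceptical of: the thresholding itself is trivial; a
minimal sketch is below, using Giorgio's 0.7 cut-off and otherwise illustrative
figures. The hard part is justifying the probabilities fed into it.

    # Minimal sketch of the quoted threshold scheme; 0.7 is Giorgio's example cut-off.
    def classify(p_fail, p_success, threshold=0.7):
        if p_fail > threshold:
            return "FAIL"
        if p_success > threshold:
            return "SUCCESS"
        return "CANNOTTELL"

    print(classify(p_fail=0.4, p_success=0.1))    # CANNOTTELL (the data-table example)
    print(classify(p_fail=0.85, p_success=0.05))  # FAIL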

> 
> How these probabilities are computed/determined is another story (for
> example, one could do some experiments using the tool and sampling the
> number of times that the tool produced the wrong output with each of
> the tests). Or one could associate values 100% or 0% to hand-made
> evaluations.

That might be useful to help tune confidence values (are you
volunteering for the work?).  But sampling for it is a big issue.
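The calculation itself would be no more than the sketch below, with invented
element names and hand-checked verdicts; it's the sample that takes the work.

    # Invented mini-sample for illustration: compare tool output with hand checks.
    tool_output = {"img-1": "fail", "img-2": "fail", "table-1": "pass", "table-2": "fail"}
    hand_check  = {"img-1": "fail", "img-2": "pass", "table-1": "pass", "table-2": "fail"}

    agreements = sum(1 for el, outcome in tool_output.items() if hand_check[el] == outcome)
    p_correct  = agreements / len(tool_output)
    print(p_correct)  # 0.75 on this absurdly small sample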

> But in any case, I think EARL should allow one to state applicability
> of a test and outcome of a test in terms of probabilities. Having a
> clearly defined underlying model helps in deciding how to represent
> information, and therefore in deciding how to use it. And having a
> detailed model does not mean that every user of the EARL language is
> required to use that detail. If appropriate abstractions can be
> defined in the language, then the underlying model might go unnoticed
> by most users.

That's fine.  Were you on the call where we discussed this?  I think
we were in favour of allowing each tool its own choice over how to
express confidence values.

As regards applicability, do you have any proposals on how to
apply^H^H^H express this in EARL?

Regards,

-- 
Nick Kew

Received on Tuesday, 12 July 2005 23:31:32 UTC