Re: Confidences in accessibility evaluation

On 7/13/05, Nick Kew <nick@webthing.com> wrote:
> 
> Giorgio Brajnik wrote:
> > I think that a deeper insight into the underlying model might be
> > useful. Let me suggest again to go back to probabilities.
> 
> In principle I'd be happy with that.  But in practice, I don't
> see how we can usefully work with probabilities other than as
> very fuzzy numerical versions of High/Medium/Low.  That potentially
> takes us to 1980s AI.

Sorry, but I don't agree: probabilities were invented long ago, and they are a concept used in every branch of science and technology. Before throwing them away, I'd like to be sure that they really are not useful here.
My point is to have a model rooted in probabilities, and then to define appropriate abstractions suited to particular tasks. Otherwise any symbol like "High" is doomed to lead to confusion and ambiguity, and eventually to the failure of the language.

For example, suppose that for the same table on the same page tool A says that it requires TH and tool B says that it does not. If you are going to combine their outputs somehow, you need a method for combining your High/Med/Low values, and unless you have a well-defined underlying model you won't be able to do that soundly.
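
To make this concrete, here is a minimal sketch (Python, purely illustrative, and certainly not a proposal for EARL syntax) of how the two tools' verdicts could be combined once each verdict carries a probability of being correct. The reliability numbers and the independence assumption are mine:

    def combine(prior, verdicts):
        # verdicts: list of (says_fail, p_correct) pairs, one per tool;
        # assumes the tools err independently of each other.
        odds = prior / (1.0 - prior)
        for says_fail, p_correct in verdicts:
            lr = p_correct / (1.0 - p_correct)      # likelihood ratio of one verdict
            odds *= lr if says_fail else 1.0 / lr
        return odds / (1.0 + odds)

    # Tool A says the table needs TH (right 80% of the time),
    # tool B says it does not (right 70% of the time):
    print(combine(0.5, [(True, 0.8), (False, 0.7)]))    # about 0.63

Nothing in EARL needs to expose this arithmetic; the point is only that the labels must be grounded in something like it before they can be combined.
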
> 
> > First of all, tests might have a non-binary applicability.
> >
> > For example, a test like "any given IMG should have a syntactically
> > valid ALT attribute" would have a certain (100%) applicability ; a
> > test like "if a given TABLE represents a data-table, then the first
> > row and first column should be made of TH" will probably have an
> > applicability that is below 100% (unless the tool uses a perfect
> > criterion for distinction, which as far as I know is not available
> > yet). Applicability might be defined like "probability that the test
> > applies correctly to the given element of the page".
> 
> Indeed.  But that's part of the test spec.  A related issue is
> that a test may be applicable to more than one guideline: currently
> we have no satisfactory way to express this, except by grouping
> guidelines together.
> 
> I think that may be an argument for introducing new terms into EARL
> (at the cost of added complexity).  I don't think it has a bearing
> on confidence levels as currently used.

I think it has: a test might yield "dontknow" either because it applies heuristics to determine the elements to which it applies, or because it applies heuristics to determine whether those elements satisfy some property. To me these are different reasons, and it may be the case that EARL should represent both of them. For example, one could query an EARL repository with "give me all the pages that contain data tables".
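
As a sketch of what such a query could look like once applicability is recorded separately from the tested property (all field names and the 0.5 cut-off below are invented for illustration):

    assertions = [
        {"page": "a.html", "element": "table#1",
         "applies": ("data-table", 0.8), "outcome": ("fail", 0.7)},
        {"page": "b.html", "element": "table#3",
         "applies": ("data-table", 0.3), "outcome": ("pass", 0.9)},
    ]

    data_table_pages = {a["page"] for a in assertions
                        if a["applies"][0] == "data-table" and a["applies"][1] >= 0.5}
    print(data_table_pages)                             # {'a.html'}
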
> 
> > Second, whatever is the representation of the outcomes of a test (eg
> > pass/fail) I think this outcome should also be bound to probabilities.
> > For example we could talk about the probability prob{outcome of test T
> > applied to element E is "fail"}=0.6.
> 
> Are you saying 60% of all elements E will fail test T?

For example, that in 60% of the cases where the data-table test says "FAIL" it is correct.
> 
> If so, that may be useful information about the *test*, but it's not
> part of the *report*.  What the report is concerned with is the
> probability that E fails *given the outcome of T in this instance*.
> 
> But probabilities are very hard to estimate.  Descriptive confidence
> values are less open to misinterpretation and abuse than numerical
> probability values.

As I said, even if the underlying model is based on probabilities, we are not required to specify numbers. But at least we would know how to assign a meaning to the words, and how to combine them (see below).
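
For instance (this is only one possible convention, not a proposal), the words could name probability intervals, and a combination rule could be derived from that:

    LABELS = {"Low": (0.0, 0.4), "Medium": (0.4, 0.7), "High": (0.7, 1.0)}

    def combine_labels(a, b):
        lo = LABELS[a][0] * LABELS[b][0]
        hi = LABELS[a][1] * LABELS[b][1]
        mid = (lo + hi) / 2.0                           # midpoint of the product interval
        return next(name for name, (l, h) in LABELS.items() if l <= mid <= h)

    print(combine_labels("High", "Medium"))             # -> Medium
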
> 
> > Third, applicability of the test to an element and probability of the
> > test giving fail should be combined (by multiplying them). In this way
> > a test (for example the one with data tables mentioned above) might
> > have applicability=0.5 and it might yield fail with prob. 0.8. This
> > means that the overall probability (of the result being fail) is 0.4
> > and prob. of being success is 0.1.
> 
> How is that relevant to an assessment?  If the test has been applied
> then there is an outcome.  If it was not applicable then the outcome
> may be meaningless, but that's an assertion about the report, not
> about the subject of the report.

You're assuming that a test is either applicable or not applicable, whereas more advanced tests are very often applicable only to a certain degree. Determining whether a table is a data table, whether a piece of JavaScript opens a new window, whether a CSS property should be relative, or whether an image is purely decorative are all examples of tests whose applicability is not binary.

I'm in fact proposing that probabilities be part of statements about other statements, and I don't see why they should not be included in EARL reports. Isn't CANNOTTELL a statement about the report? It's definitely not about the web page being tested.

Let's assume that the data-table test mentioned above has an applicability of 80% (meaning that it is correct in 80% of the cases where it says that a TABLE element is a data table); further assume that it determines that the table lacks appropriate scope/id/headers/axis information linking its cells, and that it highlights some cells. The probability of doing this correctly might be 70%.
Then the probability of the complete outcome of the test (that a page contains a data table that lacks proper markup) being correct is 56%.

On another page the same test (perhaps guided by a configuration file) has an applicability of 100%; on a similar table the outcome would have a probability of 70%.
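
The arithmetic of these two examples, spelled out as a sketch (treating the two error sources, element detection and property checking, as independent is an assumption):

    def overall(p_applies, p_property):
        # probability that the complete outcome is correct, assuming the
        # element-detection and property-checking errors are independent
        return p_applies * p_property

    print(round(overall(0.8, 0.7), 2))                  # 0.56 - heuristic detection
    print(round(overall(1.0, 0.7), 2))                  # 0.7  - detection fixed by a config file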

Separating applicability from the property being tested might also be useful for "manual tests". A test on data tables might warn the user that a table (supposedly a data table, with probability 80%) requires additional appropriate header cells to be marked up. In this case it would be useful to have in the EARL report some data (perhaps abstracted into LOW/HIGH/CERTAIN) characterizing the strength of such a warning.
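
The abstraction could be as simple as a mapping from the underlying probability to a label; the cut-offs below are arbitrary and only illustrative:

    def warning_strength(p):
        # arbitrary, illustrative cut-offs
        if p >= 0.95:
            return "CERTAIN"
        if p >= 0.7:
            return "HIGH"
        return "LOW"

    print(warning_strength(0.8))                        # HIGH  (probable data table)
    print(warning_strength(0.8 * 0.7))                  # LOW   (complete outcome)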

> 
> > If the tool user has decided some thresholds (or other criteria) to
> > sift these probabilities, then one could get the desired outcomes
> > (success/fail/cannottell). For example, if the overall probability of
> > outcome being fail is above .7 then output=FAIL; if probab. of outcome
> > being success is above .7 then output=SUCCESS; else output=CANNOTTELL.
> > Notice that it is the tool user, or perhaps the user of the EARL
> > report produced by the tool, that might want to define these
> > thresholds/criteria.
> 
> If you want to work with numerical values, that's fine by me.
> I'll remain sceptical about this approach until and unless you can
> demonstrate it.  I say that based on quite a lot of work in the
> field of statistical evaluations, mostly for national government
> clients (and as external assessor for PhD work).
> 
> >
> > How these probabilities are computed/determined is another story (for
> > example, one could do some experiments using the tool and sampling the
> > number of times that the tool produced the wrong output with each of
> > the tests). Or one could associate values 100% or 0% to hand-made
> > evaluations.
> 
> That might be useful to help tune confidence values (are you
> volunteering for the work?).  But sampling for it is a big issue.

I've done some work on one tool. Obviously similar assessments should be done for every tool that produces EARL, if we want an overall repository of statistical data for a collection of tools. Tool manufacturers could include this information in the EARL reports produced by their tools.
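
The kind of assessment I mean can be as simple as counting, on a hand-checked sample, how often the tool's decision was right; a sketch, with made-up data:

    def estimated_reliability(sample):
        # sample: (tool_flagged_as_data_table, really_is_data_table) pairs,
        # where the second element comes from a hand-made check
        flagged = [truth for said, truth in sample if said]
        return sum(flagged) / len(flagged) if flagged else None

    sample = [(True, True), (True, False), (True, True), (False, False), (True, True)]
    print(estimated_reliability(sample))                # 0.75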

> 
> > But in any case, I think EARL should allow one to state applicability
> > of a test and outcome of a test in terms of probabilities. Having a
> > clearly defined underlying model helps in deciding how to represent
> > information, and therefore in deciding how to use them. And having a
> > detailed model does not require that any user of the EARL language is
> > required to use that detail. If appropriate abstractions can be
> > defined in the language, then the underlying model might get unnoticed
> > by most users.
> 
> That's fine.  Were you on the call where we discussed this?  I think
> we were in favour of allowing each tool its own choice over how to
> express confidence values.
> 
> As regards applicability, do you have any proposals on how to
> apply^H^H^H express this in EARL?

No: among other things, I'm definitely not an RDF expert. But I would simply think of properties like "probability that the test applies correctly", "probability that the test is correct in assessing some property", and "probability that the outcome of the test is correct" (i.e. the product of the other two probabilities).
In the coming months, if I can, I will do some work on this, which of course I will share with the group.
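
Just to illustrate the idea (the property names below are invented and are not existing EARL vocabulary):

    assertion = {
        "test": "data-table-headers",                   # invented test id
        "subject": "http://www.example.org/page.html#table1",
        "probAppliesCorrectly": 0.8,    # the element really is what the test targets
        "probAssessmentCorrect": 0.7,   # the property is judged correctly, given that
    }
    assertion["probOutcomeCorrect"] = (assertion["probAppliesCorrectly"]
                                       * assertion["probAssessmentCorrect"])    # 0.56
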
-- 
        Giorgio Brajnik

Received on Wednesday, 13 July 2005 10:32:39 UTC