From: Nick Kew <nick@webthing.com>

Date: Wed, 13 Jul 2005 00:32:49 +0100

Message-ID: <42D45321.20001@webthing.com>

To: public-wai-ert@w3.org

Giorgio Brajnik wrote:

> I think that a deeper insight into the underlying model might be
> useful. Let me suggest again to go back to probabilities.

In principle I'd be happy with that. But in practice, I don't see how we
can usefully work with probabilities other than as very fuzzy numerical
versions of High/Medium/Low. That potentially takes us to 1980s AI.

> First of all, tests might have a non-binary applicability.
>
> For example, a test like "any given IMG should have a syntactically
> valid ALT attribute" would have a certain (100%) applicability; a
> test like "if a given TABLE represents a data-table, then the first
> row and first column should be made of TH" will probably have an
> applicability that is below 100% (unless the tool uses a perfect
> criterion for distinction, which as far as I know is not available
> yet). Applicability might be defined like "probability that the test
> applies correctly to the given element of the page".

Indeed. But that's part of the test spec. A related issue is that a test
may be applicable to more than one guideline: currently we have no
satisfactory way to express this, except by grouping guidelines
together. I think that may be an argument for introducing new terms into
EARL (at the cost of added complexity). I don't think it has a bearing
on confidence levels as currently used.

> Second, whatever is the representation of the outcomes of a test (e.g.
> pass/fail) I think this outcome should also be bound to probabilities.
> For example we could talk about the probability prob{outcome of test T
> applied to element E is "fail"} = 0.6.

Are you saying 60% of all elements E will fail test T? If so, that may
be useful information about the *test*, but it's not part of the
*report*. What the report is concerned with is the probability that E
fails *given the outcome of T in this instance*. But probabilities are
very hard to estimate.
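To make the "fuzzy numerical versions of High/Medium/Low" point concrete, here is a minimal sketch of what mapping numeric probability estimates back onto descriptive confidence levels might look like. The cut-off values (0.8 and 0.5) are purely illustrative assumptions, not anything defined by EARL:

```python
# Illustrative only: the band boundaries below are hypothetical,
# not taken from EARL or any WAI-ERT decision.
def confidence_band(p):
    """Map a numeric probability estimate to a descriptive level."""
    if p >= 0.8:
        return "High"
    if p >= 0.5:
        return "Medium"
    return "Low"

print(confidence_band(0.9))   # High
print(confidence_band(0.6))   # Medium
print(confidence_band(0.2))   # Low
```

If the numbers only ever get consumed through bands like these, the extra precision of the raw probabilities buys little, which is the scepticism expressed above.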
Descriptive confidence values are less open to misinterpretation and
abuse than numerical probability values.

> Third, applicability of the test to an element and probability of the
> test giving fail should be combined (by multiplying them). In this way
> a test (for example the one with data tables mentioned above) might
> have applicability = 0.5 and it might yield fail with prob. 0.8. This
> means that the overall probability (of the result being fail) is 0.4
> and prob. of being success is 0.1.

How is that relevant to an assessment? If the test has been applied then
there is an outcome. If it was not applicable then the outcome may be
meaningless, but that's an assertion about the report, not about the
subject of the report.

> If the tool user has decided some thresholds (or other criteria) to
> sift these probabilities, then one could get the desired outcomes
> (success/fail/cannottell). For example, if the overall probability of
> outcome being fail is above .7 then output=FAIL; if probab. of outcome
> being success is above .7 then output=SUCCESS; else output=CANNOTTELL.
> Notice that it is the tool user, or perhaps the user of the EARL
> report produced by the tool, that might want to define these
> thresholds/criteria.

If you want to work with numerical values, that's fine by me. I'll
remain sceptical about this approach until and unless you can
demonstrate it. I say that based on quite a lot of work in the field of
statistical evaluations, mostly for national government clients (and as
external assessor for PhD work).

> How these probabilities are computed/determined is another story (for
> example, one could do some experiments using the tool and sampling the
> number of times that the tool produced the wrong output with each of
> the tests). Or one could associate values 100% or 0% to hand-made
> evaluations.

That might be useful to help tune confidence values (are you
volunteering for the work?). But sampling for it is a big issue.
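For the record, the combine-then-threshold scheme described in the quoted text above can be sketched in a few lines. The numbers (applicability 0.5, fail probability 0.8, threshold 0.7) come from Giorgio's own example; the function name and structure are hypothetical, not anything in the EARL schema:

```python
# Hypothetical sketch of the scheme quoted above; names and structure
# are invented for illustration, only the numbers come from the email.
def sifted_outcome(applicability, p_fail, threshold=0.7):
    """Multiply applicability by the test's outcome probabilities,
    then sift the combined values against a user-chosen threshold."""
    overall_fail = applicability * p_fail        # e.g. 0.5 * 0.8 = 0.4
    overall_pass = applicability * (1 - p_fail)  # e.g. 0.5 * 0.2 = 0.1
    if overall_fail > threshold:
        return "FAIL"
    if overall_pass > threshold:
        return "SUCCESS"
    return "CANNOTTELL"

# The data-table example: neither 0.4 nor 0.1 clears the 0.7 threshold.
print(sifted_outcome(0.5, 0.8))   # CANNOTTELL
print(sifted_outcome(1.0, 0.9))   # FAIL
```

Note that with applicability below the threshold, such a scheme can only ever return CANNOTTELL, which illustrates why the choice of threshold belongs to the tool user or report consumer rather than the language.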
> But in any case, I think EARL should allow one to state applicability
> of a test and outcome of a test in terms of probabilities. Having a
> clearly defined underlying model helps in deciding how to represent
> information, and therefore in deciding how to use them. And having a
> detailed model does not require that any user of the EARL language is
> required to use that detail. If appropriate abstractions can be
> defined in the language, then the underlying model might get unnoticed
> by most users.

That's fine. Were you on the call where we discussed this? I think we
were in favour of allowing each tool its own choice over how to express
confidence values. As regards applicability, do you have any proposals
on how to apply^H^H^H express this in EARL?

Regards,

--
Nick Kew

Received on Tuesday, 12 July 2005 23:31:32 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 6 January 2015 20:55:53 UTC