Re: Another comment about confidence value. from Giorgio Brajnik on 2005-04-20 (public-wai-ert@w3.org from April 2005)

From: Giorgio Brajnik <giorgio@dimi.uniud.it>
Date: Wed, 20 Apr 2005 14:09:04 +0200
To: Paul Walsh <paul.walsh@segalamtest.com>
CC: 'Charles McCathieNevile' <charles@sidar.org>, 'Nils Ulltveit-Moe' <nils@u-moe.no>, public-wai-ert@w3.org
Message-ID: <42664660.4090102@dimi.uniud.it>
Paul,
consider these examples of how to derive confidence factors (CF), 
assuming they are probabilities, for tests that are performed manually 
or automatically:

1) you run a tool on a website; you collect its results; you ask one of 
your teams to check a sample of the issues found by the tool, and find 
out how many of them were wrong. Then, if you restrict this analysis to 
(say) checkpoint 1.1 tests, you can derive these probabilities.

2) you do a manual assessment by 3 teams (or 3 evaluators); and for each 
issue where consensus is less than 100%, you assign a confidence factor 
of "medium" or "low".
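
The two examples above could be sketched roughly as follows (in Python). 
The function names, the sample counts, the "high" label for full 
consensus and the majority threshold separating "medium" from "low" are 
all assumptions made for illustration; none of them is defined by EARL.

```python
from collections import Counter


def cf_from_sample(sampled, wrong):
    """Example 1: estimate the probability that an issue reported by the
    tool is real, from a manually checked sample of its reports."""
    if sampled <= 0:
        raise ValueError("need at least one sampled issue")
    if not 0 <= wrong <= sampled:
        raise ValueError("wrong must be between 0 and sampled")
    return (sampled - wrong) / sampled


def cf_from_consensus(verdicts):
    """Example 2: map the agreement among evaluators to an ordinal CF.
    Thresholds are illustrative assumptions."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    # Size of the largest group of evaluators giving the same verdict.
    top_count = Counter(verdicts).most_common(1)[0][1]
    if top_count == len(verdicts):
        return "high"      # full consensus (assumed label)
    if top_count * 2 > len(verdicts):
        return "medium"    # a majority agrees
    return "low"           # no majority


# Hypothetical numbers: 40 sampled issues for checkpoint 1.1, 6 were wrong.
print(cf_from_sample(sampled=40, wrong=6))
# Hypothetical verdicts from 3 evaluators on one issue.
print(cf_from_consensus(["fail", "fail", "pass"]))
```

With 3 evaluators this gives "medium" when two agree and "low" when all 
three disagree.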

I agree that these examples are somewhat fictional, at the moment. But I 
don't see any reasons why including CFs in the EARL ontology is a bad 
thing. I'd guess that, in a less formal way, CFs are also discussed when 
the teams you mention set out to write a report: that will be the case 
whenever you have any doubt about what you state. And obviously 
EARL reports will always be written through some sort of tool, which 
could provide a nice user interface for assigning CFs.

In addition I believe that some of the data that testing tools provide 
via their user interfaces could be made much clearer if they'd use 
the CF idea. For example this would allow a user to easily select and 
filter results based also on how certain they are.
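
A minimal sketch of such filtering, in Python, could look like this. The 
Result class, its field names and the sample data are hypothetical; the 
default of 1 for a missing CF follows the suggestion quoted further down 
in this thread and is likewise only an assumption.

```python
from dataclasses import dataclass


@dataclass
class Result:
    checkpoint: str
    outcome: str
    confidence: float = 1.0  # assumed default when no CF is reported


def at_least(results, threshold):
    """Keep only the results whose CF is at or above the threshold."""
    return [r for r in results if r.confidence >= threshold]


# Invented sample data, for illustration only.
results = [
    Result("1.1", "fail", 0.85),
    Result("3.2", "fail", 0.40),
    Result("5.1", "pass"),
]

for r in at_least(results, 0.8):
    print(r.checkpoint, r.outcome, r.confidence)
```

A user interface could expose the threshold as a slider, so that the 
less certain results can be hidden or shown on demand.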

Finally, consider that CFs assigned to EARL statements will usually be 
self-reported; i.e. it is the author of the EARL report who states 
his/her confidence in some statement, and therefore the CF can be 
totally subjective and far from the truth.  But here we're discussing 
the language to use to describe test results, not their trustworthiness.

Regarding the formalization of CF, I understand that there might be the 
need for different scales (at least numeric: [0,1], {10%,20%,...,100%}, 
...; and ordinal: {low,med,high}, ...) and therefore means for relating 
them.
But it is one thing to settle on a certain meaning (that CFs are 
probabilities) and another to determine how to represent these 
probabilities.

I see the following open questions that need a well-engineered approach 
in order to be answered:
a) what do we gain from including CFs in EARL? (Although I tried to 
explain my view above, at the moment there is no system or 
representation (like EARL) that uses them and that could serve as a 
proof-of-concept. It also looks like some people are working on this 
issue, and we should soon see more evidence of the usefulness of CFs.)

b) how to represent CFs? This probably depends on two processes: how 
CFs are used (for example in the user interface of tools reading EARL 
reports) and how CFs are produced (example 1 above will probably 
produce values \in [0,1], while example 2 might produce values \in 
{low,med,high}).

c) how are different scales to be compared? This could be addressed by 
using the underlying [0,1] scale as the most refined one, with the other 
scales mappable to it, either by specifying, in each single report, 
what a symbol means (e.g. low == p \leq 0.33) or by defaulting to an 
EARL-defined ordinal scale. The choice of [0,1] would allow an easy way 
of comparing all possible CFs, as they are first canonicalized into a 
well-defined scale. The appropriateness of each solution probably 
depends on why we need to compare CFs that use different scales, and 
why they have been generated using these scales.
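
As a rough sketch of point c), in Python: every CF is canonicalized to 
an interval within [0,1], using either a mapping supplied with the 
report or a default ordinal scale. Only "low == p \leq 0.33" comes from 
the text above; the bounds for "med" and "high", the function name and 
the interval representation are assumptions for illustration, not 
anything EARL defines.

```python
# Default ordinal scale (assumed). Each symbol maps to (lower, upper).
DEFAULT_ORDINAL = {
    "low": (0.0, 0.33),
    "med": (0.33, 0.66),
    "high": (0.66, 1.0),
}


def to_unit_interval(value, scale=None):
    """Canonicalize a CF to (lower, upper) bounds in [0,1].

    value may be a number in [0,1], a percentage string such as "20%",
    or an ordinal symbol looked up in `scale` (a per-report mapping) or,
    failing that, in DEFAULT_ORDINAL.
    """
    if isinstance(value, str):
        if value.endswith("%"):
            p = float(value[:-1]) / 100.0
        else:
            table = scale if scale is not None else DEFAULT_ORDINAL
            return table[value]
    else:
        p = float(value)
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability out of range: %r" % (value,))
    return (p, p)


print(to_unit_interval(0.85))
print(to_unit_interval("20%"))
print(to_unit_interval("low"))
# A report that defines its own meaning for "low" (hypothetical).
print(to_unit_interval("low", scale={"low": (0.0, 0.25)}))
```

Keeping an interval rather than a single number avoids pretending that 
an ordinal value is more precise than it is; two CFs can then be 
compared only when their intervals do not overlap.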

regards,
-- 
         Giorgio Brajnik
______________________________________________________________________
Dip. di Matematica e Informatica   | voice: +39 (0432) 55.8445
Università di Udine                | fax:   +39 (0432) 55.8499
Via delle Scienze, 206             | email: giorgio@dimi.uniud.it
Loc. Rizzi -- 33100 Udine -- ITALY | http://www.dimi.uniud.it/giorgio



Paul Walsh wrote:
> You're correct, it's no clearer :)
> 
> You have provided examples of where I believe this process should be
> used so we're in total agreement. Perhaps you can provide examples
> surrounding web site accessibility?
> 
> Cheers
> Paul
> 
> -----Original Message-----
> From: Charles McCathieNevile [mailto:charles@sidar.org] 
> Sent: 19 April 2005 18:54
> To: Paul Walsh; 'Nils Ulltveit-Moe'
> Cc: 'Giorgio Brajnik'; public-wai-ert@w3.org
> Subject: Re: Another comment about confidence value.
> 
> On Tue, 19 Apr 2005 11:32:01 +0200, Paul Walsh  
> <paul.walsh@segalamtest.com> wrote:
> 
> (I think this bit was Nils - CMN)
> 
>>I appreciate that. With such a profile your testers would most probably
>>be quite confident in their decisions, and if you are 100% confident
>>that an accessibility issue is real, then the extra confidence value is
>>not needed. (i.e. the default value for confidence, if it is left out,
>>is 1).
> 
> 
>>[PW] Every 'validation' company needs to follow the same process
>>irrespective of experience. That way, the output of the 'team' will be
>>100% confident in their interpretation of the checkpoint passing or
>>failing. If they are not, then you have an issue with that company's
>>capabilities and/or understanding of the checkpoints.
> 
> 
> This is why I want to have a variety of confidence datatypes. In principle
> you would have one per test process, but in practice there are going to be
> lots of overlaps - for example if 100 different tests, run according to
> Nils' process, give probability results accurate to 2 significant figures,
> then it is probably OK to use the same datatype for all of them
> 
> On the other hand if I use a different process for a similar test, and its
> results are different, I should use a different datatype. That way it is
> possible to compare the results more accurately if I know more about the
> differences in how the confidence is generated. The sort of examples that
> spring to mind are to do with the accuracy of meters, or of labelling on
> resistors, not WCAG conformance. For WCAG I think these comparisons, and
> for that matter many confidence level sets, are going to be based on
> smaller sets - High medium low, integer from 1 to 7, etc.
> 
> I suspect I still haven't made this very clear. Any hints?
> 
> cheers
> 
> Chaals
>
Received on Wednesday, 20 April 2005 12:09:17 UTC