- From: Giorgio Brajnik <giorgio@dimi.uniud.it>
- Date: Wed, 20 Apr 2005 14:09:04 +0200
- To: Paul Walsh <paul.walsh@segalamtest.com>
- CC: 'Charles McCathieNevile' <charles@sidar.org>, 'Nils Ulltveit-Moe' <nils@u-moe.no>, public-wai-ert@w3.org
Paul, consider these examples of how to derive confidence factors (CFs), assuming they are probabilities, for test results that are produced manually or automatically:

1) You run a tool on a website and collect its results. You then sample some of the issues found by the tool, ask one of your teams to review them, and find out how many issues were wrong. If you restrict this analysis to (say) checkpoint 1.1 tests, you can derive these probabilities.

2) You do a manual assessment with 3 teams (or 3 evaluators), and for each issue where consensus is less than 100%, you assign a confidence factor of "medium" or "low".

I agree that these examples are somewhat fictional at the moment, but I don't see any reason why including CFs in the EARL ontology is a bad thing. I'd guess that, in a less formal way, CFs are also discussed when the teams you mention sit down to write a report: that is bound to happen whenever you have any doubt about what you state. And since EARL reports will always be written through some sort of tool, those tools could provide a nice user interface for assigning CFs.

In addition, I believe that some of the data that testing tools provide via their user interfaces could be made much clearer if they used the CF idea. For example, this would allow a user to easily select and filter results based also on how certain they are.

Finally, consider that CFs assigned to EARL statements will usually be self-reported; i.e., it is the author of the EARL report who states his/her confidence in some statement, and it can therefore be totally subjective and far from the truth. But here we are discussing the language used to describe test results, not their trustworthiness.

Regarding the formalization of CFs, I understand that there might be a need for different scales (at least numeric: [0,1], {10%,20%,...,100%}, ...; and ordinal: {low,med,high}, ...) and therefore a means for relating them.
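Purely as an illustration, here is a minimal Python sketch of how examples 1 and 2 above could yield CFs on different scales, and how an ordinal scale might be related to the underlying [0,1] one. All function names, thresholds, and the low/medium/high band boundaries are invented for this sketch; none of this is defined by EARL.

```python
# Hypothetical sketch of confidence factors (CFs) as probabilities.
# Names and thresholds are illustrative only; EARL defines none of this.

def cf_from_sampling(confirmed: int, sampled: int) -> float:
    """Example 1: sample a tool's reported issues for one checkpoint,
    have a team review them, and use the fraction confirmed as a CF
    in [0, 1]."""
    if sampled <= 0:
        raise ValueError("need at least one sampled issue")
    return confirmed / sampled

def cf_from_consensus(agreeing: int, evaluators: int) -> str:
    """Example 2: map inter-evaluator consensus on an issue to an
    ordinal CF (bands chosen arbitrarily for this sketch)."""
    ratio = agreeing / evaluators
    if ratio == 1.0:
        return "high"      # full consensus
    elif ratio >= 0.5:
        return "medium"
    return "low"

# A default ordinal scale could be mapped onto the underlying [0, 1]
# scale, e.g. low == p <= 0.33, by taking the upper bound of each band
# as the canonical value.
ORDINAL_TO_UNIT = {"low": 0.33, "medium": 0.66, "high": 1.0}

def canonicalize(cf) -> float:
    """Map any CF (numeric or ordinal) onto the [0, 1] scale so that
    CFs produced on different scales become comparable."""
    if isinstance(cf, str):
        return ORDINAL_TO_UNIT[cf]
    return float(cf)

if __name__ == "__main__":
    # Checkpoint 1.1: 42 of 50 sampled tool reports confirmed by a team.
    print(cf_from_sampling(42, 50))                      # 0.84
    # 2 of 3 evaluators agreed that an issue is real.
    print(cf_from_consensus(2, 3))                       # medium
    # Both become comparable after canonicalization.
    print(canonicalize("medium") < canonicalize(0.84))   # True
```

The canonicalization step is where the design choice bites: comparing an ordinal "medium" with a numeric 0.84 only makes sense once the report (or a default scale) says what "medium" means in [0,1].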
But it is one thing to stick to a certain meaning (that CFs are probabilities) and another to determine how to represent these probabilities. I see the following open questions, which need a well-engineered approach in order to be answered:

a) What do we gain from including CFs in EARL? (Although I tried to explain my view above, at the moment there is no system or representation (like EARL) that uses them and plays the role of a proof of concept. It also looks as though some people are working on this issue, and we should soon see more evidence of the usefulness of CFs.)

b) How should CFs be represented? This probably depends on two processes: how CFs are used (for example, in the user interface of tools that read EARL reports) and how CFs are produced (example 1 above will probably produce values \in [0,1], while example 2 might produce values \in {low,med,high}).

c) How are different scales to be compared? This could be addressed by using the underlying [0,1] scale as the most refined one, with the other scales mappable to it: either by specifying, in each single report, the meaning of each symbol (e.g. low == p \leq 0.33), or by defaulting to an EARL-defined ordinal scale. The choice of [0,1] would allow an easy way of comparing all possible CFs, as they would first be canonicalized into a well-defined scale. The appropriateness of these solutions probably depends on why we need to compare CFs that use different scales, and why they were generated using those scales.

regards,

-- 
Giorgio Brajnik
______________________________________________________________________
Dip. di Matematica e Informatica   | voice: +39 (0432) 55.8445
Università di Udine                | fax:   +39 (0432) 55.8499
Via delle Scienze, 206             | email: giorgio@dimi.uniud.it
Loc. Rizzi -- 33100 Udine -- ITALY | http://www.dimi.uniud.it/giorgio

Paul Walsh wrote:
> You're correct, it's no clearer :)
>
> You have provided examples of where I believe this process should be
> used, so we're in total agreement. Perhaps you can provide examples
> surrounding web site accessibility?
>
> Cheers
> Paul
>
> -----Original Message-----
> From: Charles McCathieNevile [mailto:charles@sidar.org]
> Sent: 19 April 2005 18:54
> To: Paul Walsh; 'Nils Ulltveit-Moe'
> Cc: 'Giorgio Brajnik'; public-wai-ert@w3.org
> Subject: Re: Another comment about confidence value.
>
> On Tue, 19 Apr 2005 11:32:01 +0200, Paul Walsh
> <paul.walsh@segalamtest.com> wrote:
>
> (I think this bit was Nils - CMN)
>
>> I appreciate that. With such a profile your testers would most
>> probably be quite confident in their decisions, and if you are 100%
>> confident that an accessibility issue is real, then the extra
>> confidence value is not needed. (i.e. the default value for
>> confidence, if it is left out, is 1).
>
>> [PW] Every 'validation' company needs to follow the same process
>> irrespective of experience. That way, the output of the 'team' will
>> be 100% confident in their interpretation of the checkpoint passing
>> or failing. If they are not, then you have an issue with that
>> company's capabilities and/or understanding of the checkpoints.
>
> This is why I want to have a variety of confidence datatypes. In
> principle you would have one per test process, but in practice there
> are going to be lots of overlaps - for example, if 100 different
> tests, run according to Nils' process, give probability results
> accurate to 2 significant figures, then it is probably OK to use the
> same datatype for all of them.
>
> On the other hand, if I use a different process for a similar test,
> and its results are different, I should use a different datatype.
> That way it is possible to compare the results more accurately if I
> know more about the differences in how the confidence is generated.
> The sort of examples that spring to mind are to do with the accuracy
> of meters, or of labelling on resistors, not WCAG conformance. For
> WCAG I think these comparisons, and for that matter many confidence
> level sets, are going to be based on smaller sets - high/medium/low,
> integer from 1 to 7, etc.
>
> I suspect I still haven't made this very clear. Any hints?
>
> cheers
>
> Chaals
Received on Wednesday, 20 April 2005 12:09:17 UTC