RE: Another comment about confidence value.

Teaching granny how to suck eggs springs to mind. I know confidence
levels are different to priority levels - didn't you get the bit about
the experience at the end of the email?

Kind regards,
Paul

-----Original Message-----
From: Nils Ulltveit-Moe [mailto:nils@u-moe.no] 
Sent: 19 April 2005 10:16
To: Paul Walsh
Cc: 'Charles McCathieNevile'; 'Giorgio Brajnik'; public-wai-ert@w3.org
Subject: RE: Another comment about confidence value.

Hi Paul,

On Mon, 18 Apr 2005 at 23:30 +0100, Paul Walsh wrote:
> Ok I've sat back and read most people's thoughts on this subject and
> would now like to ask a question of those who believe we should
> include a confidence level. I personally still think this is a bad
> idea for all the same reasons I stated in my original email. I feel
> priority and/or severity levels are the most widely used and
> understood mandatory fields in a defect tracking tool and even then
> they are almost always misused at least once on any given project when
> working with external parties outside of your control - especially by
> 'developers' who think they have the aptitude of a test analyst, but
> do not. Introducing a confidence level will simply make defect report
> writing and evaluation more time consuming. You can argue until the
> pigs come home, but we will not use a confidence level in our
> reporting. 

I agree that EARL should be able to convey priority levels. Priority
levels would indicate how important different checkpoints are, and for
accessibility tests, that would be related to how big an impact a
failed test has on a disabled user.

For priority levels, we need to write an RDF/OWL schema that defines
the scale used for the priority in an unambiguous way, i.e. using a W3C
convention for priority scales that identifies which is the high and
which is the low priority. In a textual representation this would be
something like: the WCAG priority scale consists of 3 levels,
wcag-priority1, wcag-priority2 and wcag-priority3, where wcag-priority1
is the highest priority. If needed, numeric values may also be tied to
the priorities; alternatively, the priorities may be defined as numbers
(i.e. 1, 2 and 3, where 1 is the highest priority).
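
As a minimal sketch of what such a schema could look like in Turtle
(the wp: namespace and the class and property names are placeholders I
have made up for illustration, not an agreed vocabulary):

  @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
  @prefix wp:   <http://example.org/wcag-priority#> .  # hypothetical namespace

  wp:PriorityLevel a rdfs:Class ;
      rdfs:comment "A level on the WCAG priority scale." .

  wp:rank a rdf:Property ;
      rdfs:comment "Numeric rank of a level; 1 is the highest priority." ;
      rdfs:range xsd:integer .

  wp:wcag-priority1 a wp:PriorityLevel ; wp:rank 1 .
  wp:wcag-priority2 a wp:PriorityLevel ; wp:rank 2 .
  wp:wcag-priority3 a wp:PriorityLevel ; wp:rank 3 .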

One challenge with priority levels is that different groups of
disabled users have different viewpoints on which tests are important.
For a blind user, checkpoint 1.1 on alternative text is important.
Checkpoint 1.1 is not important for a deaf user; however, missing
captioning of a video clip may be a barrier for the deaf user, which
that user would consider important.

So this means that we should co-operate with organisations of disabled
users on defining a set of priorities for each group. Since we should
not discriminate between disabled users, the priority of a checkpoint
would be related to the priority assigned by the group that prioritises
that checkpoint highest; i.e. if cp1.1 is priority 1 for a blind user
and priority 3 for a sighted user, then the checkpoint should be
measured as priority 1, in order not to discriminate against the blind
user in favour of the sighted user.
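
To illustrate that rule with data (again with made-up property names),
the per-group priorities and the derived overall priority could be
recorded like this; the overall level is simply the one with the lowest
rank, i.e. the highest priority, among the groups:

  @prefix wp: <http://example.org/wcag-priority#> .    # hypothetical namespace
  @prefix cp: <http://example.org/wcag-checkpoint#> .  # hypothetical namespace

  # Per-group priorities for checkpoint 1.1 (alternative text):
  cp:cp1-1
      wp:priorityForGroup [ wp:group wp:BlindUsers ;   wp:level wp:wcag-priority1 ] ;
      wp:priorityForGroup [ wp:group wp:SightedUsers ; wp:level wp:wcag-priority3 ] ;
      # The overall priority is the highest (numerically lowest rank) of
      # the per-group levels, so cp1.1 must be measured as priority 1:
      wp:overallPriority wp:wcag-priority1 .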

Confidence values, however, are another thing entirely, and are not
related to priorities. A confidence value shows how confident an
automatic assessment tool is that a checkpoint can be categorised as
Pass, Fail, CannotTell, NotApplicable etc. This is especially useful
for knowledge-based systems that have learnt to categorise
accessibility issues by example. If the system comes across an issue it
has been taught is a Fail several times, it will be quite confident
that this is a real issue when that pattern occurs again. If a similar
but slightly different pattern occurs, the system may still say that
this is a Fail, but with less confidence. You have the same situation
with manual assessments: an inexperienced accessibility tester will
perform tests with less confidence than an experienced tester,
especially in cases where the tester is in doubt or has not seen the
issue before.

Confidence values should be defined as an optional parameter, since I
appreciate that not all vendors will want to make use of it. We plan to
use it for automatic assessments, however, and may also experiment with
using it for manual assessments, and I think it is important that EARL,
as a machine-readable format, is able to convey that information.
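
As a sketch of how this could look in an assertion (the earl: names
follow the terms we have been discussing, and the ex:confidence
property is of course hypothetical, since it is exactly what is being
proposed):

  @prefix earl: <http://www.w3.org/ns/earl#> .   # namespace illustrative
  @prefix ex:   <http://example.org/terms#> .    # hypothetical extension
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

  [] a earl:Assertion ;
     earl:assertedBy <http://example.org/tools/autochecker> ;
     earl:subject    <http://example.org/page.html> ;
     earl:test       <http://example.org/wcag-checkpoint#cp1-1> ;
     earl:result     [ a earl:TestResult ;
                       earl:outcome  earl:Fail ;
                       # optional; defaults to 1.0 when omitted:
                       ex:confidence "0.85"^^xsd:double
                     ] .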

> Q:
> 
> For companies who use disabled users; how do you suggest they measure
> the confidence level of their output? (I'm not assuming you can cover
> every disability across every project or even one project - but it's
> compulsory in my opinion to 'try', using the usual quality triangle to
> ensure testing is cost/time effective).

As I said above, the confidence level is not a priority rating of the
importance of a checkpoint. It is an indication of how certain or
uncertain the auditor is in his/her/its decision ("its" in the case of
automatic assessment).

> Situation: Dyslexic user is provided with high-level test case
> scenarios, where auditors drill down further with detailed documented
> test scripts using both manual and automated methods. The dyslexic
> user has a problem with the complexity of the copy in two areas of the
> website. This type of defect is not picked up by the auditor or the
> tool, nor is it appreciated by the auditor. How do you measure the
> confidence level of those two defects?

This would mean that the tool and the auditor had chosen #Pass for a
checkpoint that is indeed a problem. If the auditor or machine is going
to learn from this fault, they need feedback from the user who
experienced the problem: they would have to learn the case the dyslexic
person describes. If an automatic system or a manual assessor got such
feedback, they would learn from it, lower their confidence in the
decision they took, and eventually switch over to #Fail if the issue
came up several times (i.e. the assessor would become confident that
the issue is a real problem).
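
In EARL terms, with the same hypothetical ex:confidence property as
above, the learning step could show up as a change between two
assertions over time:

  @prefix earl: <http://www.w3.org/ns/earl#> .   # namespace illustrative
  @prefix ex:   <http://example.org/terms#> .    # hypothetical extension
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

  # Before the feedback: the assessor passes the page, fully confident.
  [] a earl:Assertion ;
     earl:result [ earl:outcome earl:Pass ; ex:confidence "1.0"^^xsd:double ] .

  # After repeated feedback from the dyslexic user: the outcome flips,
  # at first with only moderate confidence.
  [] a earl:Assertion ;
     earl:result [ earl:outcome earl:Fail ; ex:confidence "0.6"^^xsd:double ] .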

> We have some of the most highly skilled and experienced test analysts
> and developers who have worked for companies such as AOL since 1994
> and were responsible for the entire test management and execution and
> International beta coordination of all new client software and
> technology for the UK and Sweden whilst providing ongoing support to
> Germany and France - trust me when I say they are more experienced
> than most when it comes to 'testing' Internet technologies.

I appreciate that. With such a profile your testers would most probably
be quite confident in their decisions, and if you are 100% confident
that an accessibility issue is real, then the extra confidence value is
not needed. (i.e. the default value for confidence, if it is left out,
is 1).

> We use both manual and automated testing methods where the former
> outweighs the latter by a long way. If someone is less than certain
> about the output of their test they will always seek a second opinion
> from their colleagues. This is why it's absolutely necessary to have a
> team of auditors on any project. Each person's interpretation of an
> outcome is debated until they come to an agreement. The combined
> interpretation may not be 100% accurate if compared to that of a
> disabled user (or even someone outside the company), but at least they
> are 100% confident in the recorded defect.  Anything less than this is
> not good enough.

Yes, and this explains why you do not need the confidence parameter,
since it defaults to a probability of 1 (or 100%).

We are doing quite different measurements: we will be running automatic
assessments of a large number of sites (several thousand) on a regular
basis. Some manual testing will still be needed, but our tests will be
based largely on automatic assessments. To make this feasible, we need
to rely on probability theory and best practice in statistics to reach
numbers that approximate the perceived accessibility over a large
number of assessments.
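
To give one concrete, and purely illustrative, example of what I mean:
with N automatic checks, outcomes x_i (1 = fail, 0 = pass) and
confidences c_i, a confidence-weighted estimate of the failure rate
would be

  p = sum(c_i * x_i) / sum(c_i)

so that low-confidence results pull less weight than high-confidence
ones. With confidence left at its default of 1 everywhere, this reduces
to the plain failure rate.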

Regards,
-- 
Nils Ulltveit-Moe <nils@u-moe.no>

Received on Tuesday, 19 April 2005 09:27:30 UTC