RE: [Techs] Definition of "Reliably human testable"

In the message below, I cited a 2003 study that points to *poor* inter-rater reliability in usability studies that use methods like cognitive walkthrough, heuristic evaluation, and think-aloud protocols.

The article recommends using multiple evaluators to solve this problem.

And I should have added that, since our success criteria are written as testable statements of functional outcome, they should (!) support a higher degree of inter-rater reliability.
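 
To make the 80% figure concrete: inter-rater agreement here is simply the share of knowledgeable evaluators who reach the same conclusion about the same content for the same success criterion. A rough Python sketch with invented verdicts (the function name and data are only illustrative, not part of any proposal):

# Illustration only: percent agreement on a single pass/fail judgment.
# The evaluator verdicts below are made up for the example.
def percent_agreement(verdicts):
    """Share of evaluators giving the most common verdict."""
    most_common = max(set(verdicts), key=verdicts.count)
    return verdicts.count(most_common) / len(verdicts)

verdicts = ["pass", "pass", "fail", "pass", "pass"]   # five evaluators, same content, same SC
agreement = percent_agreement(verdicts)
print("agreement = %.0f%%" % (agreement * 100))       # 80%
print("meets the proposed threshold?", agreement >= 0.80)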

John


"Good design is accessible design." 
John Slatin, Ph.D.
Director, Accessibility Institute
University of Texas at Austin
FAC 248C
1 University Station G9600
Austin, TX 78712
ph 512-495-4288, f 512-495-4524
email jslatin@mail.utexas.edu
web http://www.utexas.edu/research/accessibility/


 



-----Original Message-----
From: w3c-wai-gl-request@w3.org [mailto:w3c-wai-gl-request@w3.org] On Behalf Of John M Slatin
Sent: Thursday, April 28, 2005 9:20 am
To: boland@nist.gov; w3c-wai-gl@w3.org
Subject: RE: [Techs] Definition of "Reliably human testable"



Tim wrote:
<blockquote>
The "test environment" for the human evaluators may be important.  If the 
testing is done with human evaluators "in isolation", as opposed to being 
in "focus groups", undue influence among evaluators may be avoided, and the 80% 
figure may be more credible. (maybe this factor has been considered in 
the "literature"?) 
</blockquote>

Yes, it has.

A typical procedure might go like this:
- Two informed evaluators independently review the item in question
- The results are compared. If they fall within a given range (which varies from situation to situation), the scores are averaged and that's that; or the two might meet to work through their differences and arrive at an agreed-upon score
- If the results fall outside the acceptable range, either the evaluators meet to work through their differences or a third informed evaluator comes in, etc. (a rough sketch of this loop follows below)
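 
A minimal sketch of that reconciliation loop, with an invented score scale and tolerance (the names and numbers are only illustrative, not any particular published protocol):

# Sketch of the paired-evaluation procedure described above.
# Scores and the acceptable range are invented for illustration.
def reconcile(score_a, score_b, acceptable_range=1.0):
    """Average two independent scores if they are close enough;
    otherwise flag the item for discussion or a third evaluator."""
    if abs(score_a - score_b) <= acceptable_range:
        return (score_a + score_b) / 2, "agreed"
    return None, "evaluators meet, or a third evaluator is brought in"

print(reconcile(4.0, 4.5))   # -> (4.25, 'agreed')
print(reconcile(2.0, 4.5))   # -> (None, 'evaluators meet, or a third evaluator is brought in')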

This sort of paired evaluation helps to reduce the impact of what one study calls the "evaluator effect" in usability studies:

<blockquote cite="http://www.leaonline.com/doi/abs/10.1207%2FS15327590IJHC1501_14?cookieSet=1">
Computer professionals have a need for robust, easy-to-use usability evaluation methods (UEMs) to help them systematically improve the usability of computer artifacts. However, cognitive walkthrough (CW), heuristic evaluation (HE), and thinking-aloud study (TA) - 3 of the most widely used UEMs - suffer from a substantial evaluator effect in that multiple evaluators evaluating the same interface with the same UEM detect markedly different sets of problems. A review of 11 studies of these 3 UEMs reveals that the evaluator effect exists for both novice and experienced evaluators, for both cosmetic and severe problems, for both problem detection and severity assessment, and for evaluations of both simple and complex systems. The average agreement between any 2 evaluators who have evaluated the same system using the same UEM ranges from 5% to 65%, and no 1 of the 3 UEMs is consistently better than the others. Although evaluator effects of this magnitude may not be surprising for a UEM as informal as HE, it is certainly notable that a substantial evaluator effect persists for evaluators who apply the strict procedure of CW or observe users thinking out loud. Hence, it is highly questionable to use a TA with 1 evaluator as an authoritative statement about what problems an interface contains. Generally, the application of the UEMs is characterized by (a) vague goal analyses leading to variability in the task scenarios, (b) vague evaluation procedures leading to anchoring, or (c) vague problem criteria leading to anything being accepted as a usability problem, or all of these. The simplest way of coping with the evaluator effect, which cannot be completely eliminated, is to involve multiple evaluators in usability evaluations.

</blockquote>
Source:
Morten Hertzum (Centre for Human-Machine Interaction, Risø National Laboratory, Denmark) and Niels Ebbe Jacobsen (Nokia Mobile Phones, Denmark), "The Evaluator Effect: A Chilling Fact About Usability Evaluation Methods," International Journal of Human-Computer Interaction, 2003, Vol. 15, No. 1, pp. 183-204.
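 
As I understand it, the "average agreement between any 2 evaluators" in this literature is typically computed per pair as the problems both evaluators found divided by all problems the pair found, then averaged over every pair. A rough Python sketch with invented problem sets:

# Illustration only: average any-two-evaluator agreement on detected problems.
# The problem sets below are made up; real data would come from an actual evaluation.
from itertools import combinations

def any_two_agreement(problem_sets):
    """Average over all pairs of (shared problems) / (all problems found by the pair)."""
    ratios = [len(a & b) / len(a | b) for a, b in combinations(problem_sets, 2)]
    return sum(ratios) / len(ratios)

evaluators = [
    {"p1", "p2", "p3"},        # evaluator 1's problem set
    {"p2", "p3", "p4", "p5"},  # evaluator 2's problem set
    {"p1", "p5"},              # evaluator 3's problem set
]
print("average any-two agreement: %.0f%%" % (100 * any_two_agreement(evaluators)))   # about 28%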

"Good design is accessible design." 
John Slatin, Ph.D.
Director, Accessibility Institute
University of Texas at Austin
FAC 248C
1 University Station G9600
Austin, TX 78712
ph 512-495-4288, f 512-495-4524
email jslatin@mail.utexas.edu
web http://www.utexas.edu/research/accessibility/


 



-----Original Message-----
From: w3c-wai-gl-request@w3.org [mailto:w3c-wai-gl-request@w3.org] On Behalf Of boland@nist.gov
Sent: Thursday, April 28, 2005 8:19 am
To: w3c-wai-gl@w3.org
Subject: Re: [Techs] Definition of "Reliably human testable"



The "test environment" for the human evaluators may be important.  If the 
testing is done with human evaluators "in isolation", as opposed to being 
in "focus groups", undue influence among evaluators may be avoided, and the 80% 
figure may be more credible. (maybe this factor has been considered in 
the "literature"?) 

  Quoting John M Slatin <john_slatin@austin.utexas.edu>:

> On the Techniques call today we discussed the proposed definition of
> the term "reliably human testable":
>  
> <proposed>
> [Definition: Reliably Human Testable: The technique can be tested by
> human inspection and it is believed that at least 80% of knowledgeable 
> human evaluators would agree on the conclusion. Tests done by people 
> who understand the guidelines should get the same results testing the 
> same content for the same success
> criteria. The use of probabilistic machine algorithms may facilitate the
> human testing process but this does not make it machine testable.]
> 
> </proposed>
>  
> Someone on the call asked whether the 80 percent figure represented an
> arbitrary number.  I took an action item to find out and report back. 
> With terrific help from David Macdonald, I've got an answer:
>  
> The literature seems to support the 80 percent figure.  In fact, 80%
> inter-rater reliability (percentage of agreement among multiple people 
> rating the same items) is considered "adequate," but 85% is considered 
> better. I think we're safe in using the 80 percent figure. If we go 
> lower than that it will be difficult to claim reliability.
>  
> John
>  
> 
> "Good design is accessible design."
> John Slatin, Ph.D.
> Director, Accessibility Institute
> University of Texas at Austin
> FAC 248C
> 1 University Station G9600
> Austin, TX 78712
> ph 512-495-4288, f 512-495-4524
> email jslatin@mail.utexas.edu
> web http://www.utexas.edu/research/accessibility/
> <http://www.utexas.edu/research/accessibility/>
> 
> 
>  
> 
>  
> 

Received on Thursday, 28 April 2005 14:25:53 UTC