- From: John M Slatin <john_slatin@austin.utexas.edu>
- Date: Thu, 28 Apr 2005 09:19:31 -0500
- To: <boland@nist.gov>, <w3c-wai-gl@w3.org>
Tim wrote:

<blockquote>
The "test environment" for the human evaluators may be important. If the testing is done with human evaluators "in isolation", as opposed to being in "focus groups", undue influence among evaluators may be avoided, and the 80% figure may be more credible. (maybe this factor has been considered in the "literature"?)
</blockquote>

Yes, it has. A typical procedure might go like this:

- Two informed evaluators independently review the item in question.
- The results are compared. If they fall within a given range (which varies from situation to situation), the scores are averaged and that's that; or the two might meet to work through their differences and arrive at an agreed-upon score.
- If the results fall outside the acceptable range, either the evaluators meet to work through their differences or a third informed evaluator comes in, and so on.

(A small sketch of this reconciliation step appears just before my signature, below.)

This sort of paired evaluation helps to reduce the impact of what one study calls the "evaluator effect" in usability studies:

<blockquote cite="http://www.leaonline.com/doi/abs/10.1207%2FS15327590IJHC1501_14?cookieSet=1">
Computer professionals have a need for robust, easy-to-use usability evaluation methods (UEMs) to help them systematically improve the usability of computer artifacts. However, cognitive walkthrough (CW), heuristic evaluation (HE), and thinking-aloud study (TA), 3 of the most widely used UEMs, suffer from a substantial evaluator effect in that multiple evaluators evaluating the same interface with the same UEM detect markedly different sets of problems. A review of 11 studies of these 3 UEMs reveals that the evaluator effect exists for both novice and experienced evaluators, for both cosmetic and severe problems, for both problem detection and severity assessment, and for evaluations of both simple and complex systems. The average agreement between any 2 evaluators who have evaluated the same system using the same UEM ranges from 5% to 65%, and no 1 of the 3 UEMs is consistently better than the others. Although evaluator effects of this magnitude may not be surprising for a UEM as informal as HE, it is certainly notable that a substantial evaluator effect persists for evaluators who apply the strict procedure of CW or observe users thinking out loud. Hence, it is highly questionable to use a TA with 1 evaluator as an authoritative statement about what problems an interface contains. Generally, the application of the UEMs is characterized by (a) vague goal analyses leading to variability in the task scenarios, (b) vague evaluation procedures leading to anchoring, or (c) vague problem criteria leading to anything being accepted as a usability problem, or all of these. The simplest way of coping with the evaluator effect, which cannot be completely eliminated, is to involve multiple evaluators in usability evaluations.
</blockquote>

Source: The Evaluator Effect: A Chilling Fact About Usability Evaluation Methods
By Morten Hertzum, Centre for Human-Machine Interaction, Risø National Laboratory, Denmark
and Niels Ebbe Jacobsen, Nokia Mobile Phones, Denmark
In International Journal of Human-Computer Interaction, 2003, Vol. 15, No. 1, Pages 183-204
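Here is a minimal sketch, in Python, of the paired-evaluation idea described above. The ratings, the tolerance, the helper names, and the example numbers are illustrative assumptions, not values or procedures taken from the Hertzum and Jacobsen study; the 0.80 cutoff simply mirrors the 80% figure under discussion.

# Illustrative sketch only: ratings, tolerance, and helper names are assumptions.

def reconcile(score_a, score_b, tolerance=1.0):
    """Average the two scores if they fall within the acceptable range;
    otherwise flag the item so the evaluators meet or a third rater is added."""
    if abs(score_a - score_b) <= tolerance:
        return (score_a + score_b) / 2.0, "agreed"
    return None, "escalate: evaluators meet, or bring in a third evaluator"

def percent_agreement(ratings_a, ratings_b, tolerance=1.0):
    """Inter-rater reliability as simple percentage agreement between two
    evaluators who have rated the same items."""
    matches = sum(1 for a, b in zip(ratings_a, ratings_b)
                  if abs(a - b) <= tolerance)
    return matches / len(ratings_a)

# Hypothetical ratings of ten checkpoints on a 1-5 scale.
rater_1 = [5, 4, 3, 5, 2, 4, 4, 3, 5, 1]
rater_2 = [5, 4, 2, 5, 2, 3, 4, 3, 4, 3]

score, status = reconcile(4, 5)   # -> (4.5, "agreed")

agreement = percent_agreement(rater_1, rater_2)
print("Percent agreement: {:.0%}".format(agreement))   # 90% with these numbers
print("Reliable" if agreement >= 0.80 else "Below the 80% threshold")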
"Good design is accessible design."
John Slatin, Ph.D.
Director, Accessibility Institute
University of Texas at Austin
FAC 248C
1 University Station G9600
Austin, TX 78712
ph 512-495-4288, f 512-495-4524
email jslatin@mail.utexas.edu
web http://www.utexas.edu/research/accessibility/

-----Original Message-----
From: w3c-wai-gl-request@w3.org [mailto:w3c-wai-gl-request@w3.org] On Behalf Of boland@nist.gov
Sent: Thursday, April 28, 2005 8:19 am
To: w3c-wai-gl@w3.org
Subject: Re: [Techs] Definition of "Reliably human testable"

The "test environment" for the human evaluators may be important. If the testing is done with human evaluators "in isolation", as opposed to being in "focus groups", undue influence among evaluators may be avoided, and the 80% figure may be more credible. (maybe this factor has been considered in the "literature"?)

Quoting John M Slatin <john_slatin@austin.utexas.edu>:

> On the Techniques call today we discussed the proposed definition of
> the term "reliably human testable":
>
> <proposed>
> [Definition: Reliably Human Testable: The technique can be tested by
> human inspection and it is believed that at least 80% of knowledgeable
> human evaluators would agree on the conclusion. Tests done by people
> who understand the guidelines should get the same results testing the
> same content for the same success criteria. The use of probabilistic
> machine algorithms may facilitate the human testing process but this
> does not make it machine testable.]
> </proposed>
>
> Someone on the call asked whether the 80 percent figure represented an
> arbitrary number. I took an action item to find out and report back.
> With terrific help from David Macdonald, I've got an answer:
>
> The literature seems to support the 80 per cent figure. In fact, 80%
> inter-rater reliability (percentage of agreement among multiple people
> rating the same items) is considered "adequate," but 85% is considered
> better. I think we're safe in using the 80 percent figure. If we go
> lower than that it will be difficult to claim reliability.
>
> John
>
> "Good design is accessible design."
> John Slatin, Ph.D.
> Director, Accessibility Institute
> University of Texas at Austin
> FAC 248C
> 1 University Station G9600
> Austin, TX 78712
> ph 512-495-4288, f 512-495-4524
> email jslatin@mail.utexas.edu
> web http://www.utexas.edu/research/accessibility/
> <http://www.utexas.edu/research/accessibility/>
Received on Thursday, 28 April 2005 14:19:36 UTC