
FW: optimum number of usability testing subjects

From: John M Slatin <john_slatin@austin.utexas.edu>
Date: Thu, 3 Jun 2004 14:16:29 -0500
Message-ID: <C46A1118E0262B47BD5C202DA2490D1A02E8FFA3@MAIL02.austin.utexas.edu>
To: <w3c-wai-gl@w3.org>

The article below may be indirectly relevant to our threads on the
definition of testability and inter-rater reliability.

"Good design is accessible design."

Dr. John M. Slatin, Director 
Accessibility Institute
University of Texas at Austin 
FAC 248C 
1 University Station G9600 
Austin, TX 78712 
ph 512-495-4288, fax 512-495-4524 
email jslatin@mail.utexas.edu 
Web http://www.utexas.edu/research/accessibility 

-----Original Message-----
From: Jim Allan [mailto:jimallan@tsbvi.edu] 
Sent: Wednesday, June 02, 2004 5:20 PM
To: Kay Lewis; John M Slatin
Subject: optimum number of usability testing subjects

User Interface Design Update Newsletter - May, 2004

Every month HFI reviews the most useful developments in
UI research from major conferences and publications.

In this issue:

Dr. Kath Straub revisits the topic of the optimum number of usability
testing subjects.

The Pragmatic Ergonomist, Dr. Eric Schaffer, gives practical advice.

Kath Straub, Ph.D., CUA, Chief Scientist of HFI.

Enough is enough... but five probably isn't. Evaluating the
"test-five-users" guideline.

Death. Taxes. And the "how-many-users?" debate.

There are a few things that seem inevitable. Death. Taxes. The
"Are-Five-Users-Enough?" panel discussion that occurs at every usability
conference.

Every conference.
Every year.

These panels are legend. People get excited. Speakers get hyperbolic.
Listeners get frustrated.


Listeners get frustrated because the debate rages with the same opinions
and no new and compelling data. The answer to the "how-many-users"
question is important. However entertaining the debate, the lack of
resolution frustrates practitioners who need to know how to justify the
choice to test five (6? 10? 90? 150?) users to their management.
Understanding the "right" answer (and why it is right) is particularly
important for individuals institutionalizing their usability practice.
They need to make critical decisions on how to prioritize activities
with limited staff time and within a limited budget and a short window
to build credibility. So, really... This year they will tell us, right?
How many users?

Is so...

For years we have heard that, using the law of diminishing returns, five
users will uncover approximately 80% of the usability problems in a
product (Virzi, 1992).

In support of this claim, Nielsen (Landauer and Nielsen, 1993; Nielsen,
1993) presents a meta-analysis of 13 studies, calculating confidence
intervals to derive the now famous formula:

     Problems found = N(1-(1-L)^n)

N = number of known problems
L = the probability of any given user finding any given problem
n = # of users tested

Since this function nears its ceiling rapidly, at around five
participants, practitioners typically interpret the formula as advising
that five is enough.
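
The shape of the curve is easy to check by evaluating the formula
directly. The sketch below is only an illustration: the values N = 100
and L = 0.31 are assumptions chosen for the example (neither is given in
this newsletter), and the function name is made up.

```python
def problems_found(total_problems: float, detect_prob: float, users: int) -> float:
    """Expected number of problems found: N * (1 - (1 - L)^n)."""
    return total_problems * (1.0 - (1.0 - detect_prob) ** users)


# Assumed, illustrative values; not taken from the newsletter.
N = 100   # number of known problems
L = 0.31  # probability that any given user finds any given problem

for n in (1, 3, 5, 10, 15):
    print(f"{n:>2} users: {problems_found(N, L, n):5.1f} of {N} problems")
```

Under these assumed values, five users are expected to find roughly 84
of the 100 problems, and each additional user adds less than the one
before. Note that the formula yields an expected value only; it says
nothing about how far a single test may fall from that value.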

Is not...

There are two broad approaches to arguing against the five-user
guideline. One approach is to deconstruct the claim on statistical
methods. Researchers who take this approach argue that inappropriate
calculations were used or that the underlying assumptions are faulty or
not met (Grosvenor, 1999; Woolrych and Cockton, 2001).

Others take a more empirical approach. Spool and Schroeder (2001) report
that testing the first five users revealed only 35% of problems identified by
the larger test set. Perfetti and Landesman (2002) show that
participants 6-18 (of 18) each identified five or more problems that
were not uncovered within the first five user tests.

Do you read the fine print?

In fairness, both Virzi and Nielsen place qualifications on the
five-user guideline. Nielsen carefully describes the confidence part of
confidence intervals. Virzi warns that "[s]ubjects should be run until
the number of new problems uncovered drops to an acceptable level."

This leaves unsuspecting readers either to wade through the philosophy
of confidence intervals or test until they've tested to an (unspecified
but) "acceptable" level. It's no wonder that practitioners blink at the
caveats and remember number five.
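
One way to make Virzi's caveat concrete is a simple stopping rule: keep
testing until several sessions in a row turn up almost nothing new. The
sketch below is an assumption-laden illustration, not Virzi's method.
The threshold (max_new), the number of quiet sessions required
(patience), the function name, and the sample sessions are all
hypothetical, since the source leaves "acceptable" unspecified.

```python
def users_until_plateau(found_by_user, max_new=1, patience=2):
    """Return how many sessions were run before `patience` consecutive
    sessions each revealed at most `max_new` previously unseen problems."""
    seen = set()
    quiet = 0  # consecutive sessions with few or no new problems
    for count, problems in enumerate(found_by_user, start=1):
        new = set(problems) - seen
        seen |= new
        quiet = quiet + 1 if len(new) <= max_new else 0
        if quiet >= patience:
            return count
    return len(found_by_user)  # never plateaued within the sessions given


# Hypothetical sessions: each set holds the problems one participant hit.
sessions = [{"a", "b", "c"}, {"b", "d"}, {"a", "e", "f"}, {"c"}, {"f", "g"}, {"a"}]
print(users_until_plateau(sessions))
```

The result depends entirely on the threshold chosen, which is exactly
the decision the caveat leaves to the practitioner.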

A new way to decide

Faulkner (2003) buttresses the old empirical evaluation with a
statistical sampling approach to arrive at a novel way to determine
if five is really enough. She evaluated the five-user guideline in a
two-phase experiment.

First, she evaluated the usability of a Web-based time sheet application
by observing deviations from the optimal path over 60 participants.
Then, she used a sampling algorithm to randomly draw smaller sets of
individual users' results from the full dataset for independent
analysis. Set sizes corresponded to the number of users 'tested' in that
simulation. In the course of her experiment she ran 100 simulations for
each user group size: 5, 10, 20, 30, 40, 50 and 60 users.
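
The resampling procedure described above can be sketched in a few lines.
This is a rough illustration under stated assumptions, not Faulkner's
code or data: the 60 "users" are synthetic, each finding each of 40
made-up problems with an assumed probability of 0.3, and the group sizes
printed are chosen only for the example.

```python
import random


def simulate(found_by_user, group_size, runs=100, seed=0):
    """Draw `runs` random groups of `group_size` users and return, for each
    group, the percentage of all known problems that the group uncovered."""
    rng = random.Random(seed)
    all_problems = set().union(*found_by_user)
    results = []
    for _ in range(runs):
        group = rng.sample(found_by_user, group_size)  # draw without replacement
        found = set().union(*group)
        results.append(100.0 * len(found) / len(all_problems))
    return results


# Synthetic stand-in data (an assumption, not the study's results):
# 60 users, 40 problems, each problem found with probability 0.3.
data_rng = random.Random(42)
users = [{p for p in range(40) if data_rng.random() < 0.3} for _ in range(60)]

for size in (5, 10, 20, 60):
    pct = simulate(users, size)
    mean = sum(pct) / len(pct)
    print(f"{size:>2} users: mean {mean:5.1f}%  min {min(pct):5.1f}%  max {max(pct):5.1f}%")
```

Reporting the minimum and maximum alongside the mean is the point of the
exercise: it shows how much a single small test can differ from the
average one.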

She found that, on average, Nielsen's prediction is right. Over 100
simulated tests, testing five users revealed an average of 85% of the
usability problems identified with the larger group.

Averages are good, but for day-to-day practitioners, the range of
problems identified is a more critical figure. The range was not so
promising. Over the 100 simulated tests, the percentage of usability
problems found when testing five participants ranged from nearly 100%
down to only 55%. As any good freshman statistics student could predict,
there is a large variation in outcomes between trials with small
samples. Extrapolating from Faulkner's findings, usability test
designers relying on any single set of five users run the risk that
nearly half the problems could be missed.

Increasing the number of participants, however, improves the reliability
of the findings quickly. Drawing 10 participants instead of five, the
simulation uncovered 95% of the problems on average with a lower bound
of 82% of problems identified over 100 simulations. With 15
participants, 97% of the identified problems were uncovered on average,
with a lower bound of 90% found.

So? How many then?

So what's the answer? As always in usability, the answer is "It
depends." The key to effective usability testing is recruiting a truly
representative sample of the target population. Often the test
population will need to represent more than one user group.

That aside, Faulkner's work strongly indicates that a single usability
test with five participants is not enough.

References for this newsletter are posted at:

The Pragmatic Ergonomist, Dr. Eric Schaffer

So for a routine usability test, run 12 people for each segment. For an
important one where the stakes are high, run 30. If resources are really
tight, you can drop to five or six per segment, but this is bad.

Remember I said "FOR EACH SEGMENT." If you are designing a time
reporting system for health care workers, government employees, lawyers,
and forestry workers, you are making a big mistake if you test just
three in each group. That would be 12 people tested, but the groups are
quite diverse and you need more people from each segment to be
confident.
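
The per-segment arithmetic is worth spelling out, because the totals
grow quickly. A small sketch using the four segments from the example
above; the function name and the labels for each level are made up for
illustration.

```python
def total_participants(segments, per_segment):
    """Total recruits needed when every segment gets `per_segment` people."""
    return len(segments) * per_segment


segments = ["health care workers", "government employees", "lawyers", "forestry workers"]

# Per-segment counts taken from the advice above (five is the low end of "five or six").
for label, per_segment in (("tight resources", 5), ("routine", 12), ("high stakes", 30)):
    total = total_participants(segments, per_segment)
    print(f"{label:>15}: {per_segment} per segment, {total} in total")
```

By the same arithmetic, 12 people spread over four segments is only
three per segment, which is the mistake described above.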

Suggestions, comments, questions?
HFI editors at mailto:hfi@humanfactors.com.

Want past issues? http://www.humanfactors.com/downloads/pastissues.asp

Subscribe? - http://www.humanfactors.com/downloads/subscribe.asp

Received on Thursday, 3 June 2004 15:16:43 GMT