RE: testing with a "High degree of confidence" from David MacDonald on 2013-04-02 (public-wai-evaltf@w3.org from April 2013)

From: David MacDonald <david100@sympatico.ca>
Date: Tue, 2 Apr 2013 14:41:52 -0400
To: "'Gregg Vanderheiden'" <gv@trace.wisc.edu>, "'Alex Li'" <alli@microsoft.com>
CC: "'Katie Haritos-Shea'" <ryladog@earthlink.net>, "'Loretta Guarino Reid'" <lorettaguarino@google.com>, "'Michael Cooper'" <cooper@w3.org>, "'WCAG WG'" <w3c-wai-gl@w3.org>, "'Eval TF'" <public-wai-evaltf@w3.org>
Message-ID: <BLU0-SMTP594CD10E570262DC8BB7A6FEDF0@phx.gbl>
Hi Gregg, 

 

Do you know where that statement, or whatever it got morphed into, can be
found in the documents?

 

Cheers

David MacDonald

 

CanAdapt Solutions Inc.

  Adapting the web to all users

            Including those with disabilities

 <http://www.can-adapt.com/> www.Can-Adapt.com

 

From: Gregg Vanderheiden [mailto:gv@trace.wisc.edu] 
Sent: April-02-13 1:57 PM
To: Alex Li
Cc: David MacDonald; Katie Haritos-Shea; Loretta Guarino Reid; Michael
Cooper; WCAG WG; Eval TF; GreggVan GVAN
Subject: Re: testing with a "High degree of confidence"

 

Hi Alex, David, all

 

 

A few comments on this and the other posts. 

 

Alex is right that research involving disability has much higher variance
than regular research and needs a higher N (number of subjects with a
disability).

 

However, this didn't have to do with subjects with disabilities - it had to
do with experts deciding whether a particular web page met a particular
success criterion.  So variance of people with disabilities is not relevant.


 

Also 8 of 10 was a ratio - not a number.  It didn't mean testing with 10 and
seeing what 8 said.  It meant 80%.   So 10 as a number isn't relevant either.
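
A minimal sketch of what "a ratio, not a number" could look like in practice. The function names and the 0.8 default are illustrative assumptions, not anything defined in WCAG: the point is only that the same 80% criterion applies whether 5, 10 or 50 evaluators are involved.

```python
from collections import Counter


def agreement_ratio(verdicts):
    """Return the fraction of evaluators who gave the most common verdict.

    `verdicts` is a hypothetical list of per-evaluator results for one
    success criterion on one page, e.g. ["pass", "pass", "fail"].
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    # Count of the single most frequent verdict.
    most_common_count = Counter(verdicts).most_common(1)[0][1]
    return most_common_count / len(verdicts)


def meets_threshold(verdicts, threshold=0.8):
    """True when the agreement ratio reaches the (assumed) threshold."""
    return agreement_ratio(verdicts) >= threshold


if __name__ == "__main__":
    # The panel size changes; the ratio being checked does not.
    for panel in (["pass"] * 4 + ["fail"] * 1,
                  ["pass"] * 8 + ["fail"] * 2,
                  ["pass"] * 40 + ["fail"] * 10):
        print(len(panel), agreement_ratio(panel), meets_threshold(panel))
```

The ratio says nothing about how many evaluators are needed to estimate it reliably; that is a separate question.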

 

 

 

Alex's last comment is closest to the reason.  The 80% was just a number
used during the discussion of how reliable it had to be.  

 

 However saying it was arbitrary is overstating it -- or seems to.  All
significance thresholds used in science are actually arbitrary.   

 

8 of 10 can be a very scientific number-- or it can be a number
scientifically tested.    It is in fact simply a criterion.

 

.01 and .001 are also arbitrary significance levels.  There is nothing
scientific about them - they are just probabilities that we have
traditionally decided were "good enough" to report.  There is also .005.
And since there are two (or three) values - one might ask "What is the
scientific reason one is chosen vs another?"  It all has to do with how
"confident" you want to be in the results.  Do you want to be 99% or 99.5%
or 99.9% sure that you don't have a false positive conclusion (incorrect
rejection of the null hypothesis)?  But the researcher decides which -- or
the community decides which -- or the reader decides which is the one they
want to use for this or that type of research.  But in the end -- it is
someone's or some group's opinion - as to what the criterion should be for
a study or category of study or..
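
As a rough illustration of the point about thresholds, the sketch below checks one made-up p-value against the three significance levels mentioned above. The value 0.004 and the helper names are assumptions chosen for the example. The same result counts as "significant" or not depending only on which criterion was picked beforehand.

```python
ALPHAS = (0.01, 0.005, 0.001)  # the three levels named in the message


def confidence_level(alpha):
    """Confidence corresponding to a significance level (1 - alpha)."""
    return 1.0 - alpha


def is_significant(p_value, alpha):
    """Conventional decision rule: reject the null when p < alpha."""
    return p_value < alpha


def verdicts(p_value, alphas=ALPHAS):
    """Map each significance level to the decision it would produce."""
    return {alpha: is_significant(p_value, alpha) for alpha in alphas}


if __name__ == "__main__":
    hypothetical_p = 0.004  # made-up result, for illustration only
    for alpha, significant in verdicts(hypothetical_p).items():
        print(f"alpha={alpha}: confidence={confidence_level(alpha):.1%}, "
              f"significant={significant}")
```

With this made-up value the result passes at .01 and .005 but not at .001, which is the sense in which the choice of level is a convention rather than a measurement.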

 

Back to WCAG

When talking about agreement of experts - a number of different numbers were
tossed around.    9 of 10    8 of 10  etc.    In the end we decided that if
you took 10 experts and had them evaluate a page it is unlikely that any
would have the exact same eval as another -- with lots of nuance even on
individual SC.    And the number actually had nothing to do with compliance.
It only had to do with what the working group was using as its criterion for
inclusion.   And that had LOTS of variables, one of which was testability.
Since the WG wasn't going to actually run a test with 10 or 50 or any number
of experts each time we created or edited a success criterion, it didn't
make sense to name a number.   It was done based on the evaluation of the WG
- with feedback from the public during review. 

 

So in the end "testable" was defined by what the Working Group intended
it to mean - and what they were trying to use as their criterion.   And that
is the language that was put into Understanding WCAG 2.0.

 

Does this help?

 

Gregg

 

 

 

 

 

On Apr 2, 2013, at 11:56 AM, Alex Li <alli@microsoft.com> wrote:





David,

 

I don't exactly recall all the reasons, but these are probably most of them.

 

First, WCAG 2.0 covers many disabilities.  Even if 10 human subjects is
statistically adequate (see the third reason), you need 10 times an
undetermined number of disabilities to cover WCAG 2.0. 

 

Second, it is hard to establish the knowledge threshold of your human
subjects.  Do you get 10 experts, complete novices, a mix, or just random
human subjects?  Add to the mix that the degree of specialty of the site
matters.  Does the human subject have to be financially literate to test a
stock trading site, for example?  How do you measure the degree of financial
literacy relative to the task?  This issue is a challenge in terms of
experimental design.  The difference in results would be dramatic depending
on how the experiment was designed.

 

Third, 10 is simply inadequate by any statistical measure.  Just a couple of
"off-target" human subjects will throw your analysis way off course.  A
sample size of 100 is the bare minimum by rule of thumb.
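
One way to see the sample-size concern is to put an interval around an observed 80% agreement. The sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96). That is a standard textbook method chosen here for illustration; it is not something the thread or WCAG prescribes, and the function name is an assumption.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes: number of subjects giving the majority verdict
    n:         total number of subjects
    z:         normal quantile (1.96 is roughly 95% confidence)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / n + z * z / (4 * n * n)
    ) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Same observed ratio (80%), two different sample sizes.
    for successes, n in ((8, 10), (80, 100)):
        low, high = wilson_interval(successes, n)
        print(f"{successes}/{n}: {low:.3f} to {high:.3f}")
```

For 8 of 10 the interval runs from roughly 0.49 to 0.94, while for 80 of 100 it narrows to roughly 0.71 to 0.87. This only quantifies sampling uncertainty; it does not settle whether a sampling study is the right model for expert agreement on a success criterion.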

 

Lastly, 80% is plucked out of the air, so to speak.  (I don't remember if we
talked about 8 out of 10, but I'm using your number per this mail.) Why
shouldn't it be 75% or 84.37%?  There is nothing scientifically significant
about the number 80%.

 

Bottom line is that such an approach is at best anecdotal and certainly not
scientific.  The degree of confidence of the approach would be unacceptably
low.  In general, conducting analysis with a small sample size is more
suitable for qualitative analyses like focus groups and the like, which
generally do not give you a pass/fail result.  (BTW, 10 is still too small
for most focus groups.) Hope that helps.

 

All best,

Alex

 

From: David MacDonald [mailto:david100@sympatico.ca] 
Sent: Tuesday, April 02, 2013 9:11 AM
To: ryladog@earthlink.net; 'Loretta Guarino Reid'; 'Michael Cooper'; 'Gregg
Vanderheiden'
Cc: 'WCAG WG'; 'Eval TF'
Subject: RE: testing with a "High degree of confidence"

 

Thanks Katie

 

Can you remember where the vestiges of it ended up in the WCAG
documents....if at all?

I'm just looking to see if we require a high correlation among experts... or
simply high level of confidence...perhaps they are not necessarily the same
thing.

 

Cheers

David MacDonald

 


 

From: Katie Haritos-Shea EARTHLINK [mailto:ryladog@earthlink.net]
Sent: April-02-13 11:11 AM
To: 'David MacDonald'; 'Loretta Guarino Reid'; 'Michael Cooper'; 'Gregg
Vanderheiden'
Cc: 'WCAG WG'; 'Eval TF'
Subject: RE: testing with a "High degree of confidence"

 

David,

 

I remember these discussions back when - I recall Gregg providing the 8 out
of 10 - and I brought this up for the WCAG Evaluation Methodology working
group - for their purposes they wanted an algorithmic reference - not human
judgment.   Am not sure they found one.

 

Katie

 

From: David MacDonald [mailto:david100@sympatico.ca]
Sent: Monday, April 01, 2013 3:50 PM
To: Loretta Guarino Reid; Michael Cooper; Gregg Vanderheiden
Cc: WCAG WG; Eval TF
Subject: testing with a "High degree of confidence"

 

I remember early drafts of WCAG, when discussing human testing we said it
was dependable human testing if "8 of 10 testers would come to the same
conclusions... " or something like that...we later changed it to something
like "most testers would come to the same conclusions" because we thought
the 8 out of 10 rule was a bit prescriptive.

 

I've been looking for that in WCAG 2, Understanding Conformance,
Understanding WCAG etc... and didn't find it.

 

The closest I could find was this, but it seems to be more related to
automatic testing...

Does anyone remember the history of that line about "most experts would
agree..." and where it is now?

 

"The Success Criteria can be tested by a combination of machine and human
evaluation as long as it is possible to determine whether a Success
Criterion has been satisfied with a high level of confidence."

 

 
http://www.w3.org/TR/UNDERSTANDING-WCAG20/conformance.html#uc-accessibility-support-head

 

Cheers

David MacDonald

 


 

Received on Tuesday, 2 April 2013 18:42:32 UTC