Re: testing with a "High degree of confidence"

Hi Alex, David, all


A few comments on this and the other posts. 

Alex is right that disability research has much higher variance than typical research and therefore needs a higher N (number of subjects with a disability).

However, this didn’t have to do with subjects with disabilities - it had to do with experts deciding whether a particular web page met a particular success criterion.  So the variance among people with disabilities is not relevant here.

Also, 8 of 10 was a ratio, not a count.  It didn’t mean testing with 10 people and seeing what 8 said; it meant 80%.   So 10 as a sample size isn't relevant either.
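To make the ratio-versus-count point concrete, here is a minimal sketch (the verdicts are hypothetical, not from any actual WCAG evaluation): the 80% criterion can be applied to a panel of any size, not just 10.

```python
# Hypothetical verdicts from 12 evaluators on one success criterion.
# "8 of 10" read as a ratio means we check for >= 80% agreement,
# regardless of how many evaluators there actually are.
verdicts = ["pass", "pass", "fail", "pass", "pass",
            "pass", "fail", "pass", "pass", "pass",
            "pass", "pass"]  # 12 evaluators, not 10

agreement = verdicts.count("pass") / len(verdicts)  # 10/12 ~ 0.833
meets_criterion = agreement >= 0.80                 # the 80% threshold

print(f"{agreement:.0%} agreement -> meets criterion: {meets_criterion}")
```

With 10 of 12 evaluators agreeing, the ratio (about 83%) clears the 80% bar even though the panel is not 10 people.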



Alex's last comment is closest to the reason.  The 80% was just a number used during the discussion of how reliable it had to be.  

However, saying it was arbitrary overstates it -- or seems to.  All significance thresholds used in science are, in the same sense, arbitrary.

8 of 10 can be treated as a scientific number -- or it can be a number that is scientifically tested.    But in the end it is simply a criterion.

.01 and .001 are also arbitrary significance levels.  There is nothing scientific about them - they are just probabilities that we have traditionally decided were "good enough" to report.    There is also .005.    And since there are two (or three) values, one might ask "What is the scientific reason one is chosen over another?"   It all has to do with how "confident" you want to be in the results.    Do you want to be 99%, 99.5%, or 99.9% sure that you don't have a false positive conclusion (a mistaken rejection of the null hypothesis)?   The researcher decides which -- or the community decides -- or the reader decides which one they want to use for this or that type of research.  But in the end it is someone's, or some group's, opinion as to what the criterion should be for a study or category of study.
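A tiny sketch of the point (the p-value here is invented for illustration): the same experimental result is "significant" or not depending purely on which arbitrary threshold the researcher adopts.

```python
# One hypothetical p-value, judged against the three conventional
# significance levels mentioned above. The data don't change;
# only the (arbitrary) criterion does.
p_value = 0.007

for alpha in (0.01, 0.005, 0.001):
    decision = "reject null" if p_value < alpha else "fail to reject null"
    print(f"alpha = {alpha}: {decision}")
```

The same p-value of 0.007 rejects the null at the .01 level but not at .005 or .001 -- which threshold is "right" is exactly the kind of judgment call the working group was making with 80%.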

Back to WCAG
When talking about agreement of experts, a number of different figures were tossed around: 9 of 10, 8 of 10, etc.    In the end we decided that if you took 10 experts and had them evaluate a page, it is unlikely that any would produce exactly the same evaluation as another -- with lots of nuance even on individual success criteria.    And the number actually had nothing to do with compliance.    It only had to do with what the working group was using as its criterion for inclusion.   And that had LOTS of variables, one of which was testability.  Since the WG wasn’t going to actually run a test with 10 or 50 or any number of experts each time we created or edited a success criterion, it didn’t make sense to name a number.   It was done based on the evaluation of the WG, with feedback from the public during review.

So in the end "testable" was defined by what the Working Group intended it to mean - and what they were trying to use as their criterion.   And that is the language that was put into Understanding WCAG 2.0.

Does this help?

Gregg





On Apr 2, 2013, at 11:56 AM, Alex Li <alli@microsoft.com> wrote:

> David,
>  
> I don’t exactly recall all the reasons, but these are probably most of it.
>  
> First, WCAG 2.0 covers many disabilities.  Even if 10 human subjects is statistically adequate (see the third reason), you need 10 times an undetermined number of disabilities to cover WCAG 2.0. 
>  
> Second, it is hard to establish the knowledge threshold of your human subjects.  Do you get 10 experts, complete novices, a mix, or just random human subjects?  Add to the mix that the degree of specialty of the site matters.  Does the human subject have to be financially literate to test a stock trading site, for example?  How do you measure the degree of financial literacy in correspondence to the task?  This issue is a challenge in terms of design of experiment.  The difference in result would be dramatic depending on how the experiment was designed.
>  
> Third, 10 is simply inadequate by any statistical measure.  Just a couple of “off-target” human subjects will throw your analysis way off course.  A sample size of 100 is the bare minimum by rule of thumb.
>  
> Lastly, 80% is plugged out of the air, so to speak.  (I don’t remember if we talked about 8 out of 10, but I’m using your number per this mail.) Why shouldn’t it be 75% or 84.37%?  There is nothing scientifically significant about the number 80%.
>  
> Bottom line is that such an approach is at best anecdote and certainly not scientific.  The degree of confidence of the approach would be unacceptably low.  In general, conducting analysis with a small sample size is more suitable for qualitative analyses like focus groups and the like, which generally do not give you a pass/fail result.  (BTW, 10 is still too small for most focus groups.) Hope that helps.
>  
> All best,
> Alex
>  
> From: David MacDonald [mailto:david100@sympatico.ca] 
> Sent: Tuesday, April 02, 2013 9:11 AM
> To: ryladog@earthlink.net; 'Loretta Guarino Reid'; 'Michael Cooper'; 'Gregg Vanderheiden'
> Cc: 'WCAG WG'; 'Eval TF'
> Subject: RE: testing with a "High degree of confidence"
>  
> Thanks Katie
>  
> Can you remember where the vestiges of it ended up in the WCAG documents....if at all?
> I’m just looking to see if we require a high correlation among experts... or simply high level of confidence...perhaps they are not necessarily the same thing.
>  
> Cheers
> David MacDonald
>  
> CanAdapt Solutions Inc.
>   Adapting the web to all users
>             Including those with disabilities
> www.Can-Adapt.com
>  
> From: Katie Haritos-Shea EARTHLINK [mailto:ryladog@earthlink.net] 
> Sent: April-02-13 11:11 AM
> To: 'David MacDonald'; 'Loretta Guarino Reid'; 'Michael Cooper'; 'Gregg Vanderheiden'
> Cc: 'WCAG WG'; 'Eval TF'
> Subject: RE: testing with a "High degree of confidence"
>  
> David,
>  
> I remember these discussions back when – I recall Gregg providing the 8 out of 10  – and – I brought this up for the WCAG Evaluation Methodology working group – for their purposes they wanted an algorithmic reference – not human judgment.   Am not sure they found one.
>  
> Katie
>  
> From: David MacDonald [mailto:david100@sympatico.ca] 
> Sent: Monday, April 01, 2013 3:50 PM
> To: Loretta Guarino Reid; Michael Cooper; Gregg Vanderheiden
> Cc: WCAG WG; Eval TF
> Subject: testing with a "High degree of confidence"
>  
> I remember early drafts of WCAG, when discussing human testing we said it was dependable human testing if “8 of 10 testers would come to the same conclusions... ” or something like that...we later changed it to something like “most testers would come to the same conclusions” because we thought the 8 out of 10 rule was a bit prescriptive.
>  
> I’ve been looking for that in the WCAG 2, or the Understanding conformance, Understanding WCAG etc... and didn’t find it.
>  
> The closest I could find was this, but it seems to be more related to automatic testing...
> Does anyone remember the history of that line about “most experts would agree...” and where it is now?
>  
> “The Success Criteria can be tested by a combination of machine and human evaluation as long as it is possible to determine whether a Success Criterion has been satisfied with a high level of confidence.”
>  
> http://www.w3.org/TR/UNDERSTANDING-WCAG20/conformance.html#uc-accessibility-support-head
>  
> Cheers
> David MacDonald
>  
> CanAdapt Solutions Inc.
>   Adapting the web to all users
>             Including those with disabilities
> www.Can-Adapt.com
>  

Received on Tuesday, 2 April 2013 17:57:21 UTC