RE: testing with a "High degree of confidence"

Thanks Christophe

That sounds right.

Cheers
David MacDonald

CanAdapt Solutions Inc.
  Adapting the web to all users
            Including those with disabilities
www.Can-Adapt.com


-----Original Message-----
From: Christophe Strobbe [mailto:strobbe@hdm-stuttgart.de] 
Sent: April-03-13 11:43 AM
To: 'WCAG WG'; 'Eval TF'
Subject: RE: testing with a "High degree of confidence"



Am Mi, 3.04.2013, 17:19 schrieb Christophe Strobbe:
> Hi David, All,
>
> Am Di, 2.04.2013, 20:41 schrieb David MacDonald:
>> Hi Gregg,
>>
>>
>> Do you know where that is, or where the statement it morphed into can
>> be found in the documents?
>>
>
> The term you should look for is "high inter-rater reliability" (HIRR).
> The first place I looked was "Requirements for WCAG 2.0" (e.g.
> <http://www.w3.org/TR/2006/NOTE-wcag2-req-20060425/>) but it isn't listed
> there.
> The term can be found in some discussions from August 2002 (e.g.
> <http://www.w3.org/WAI/GL/meeting-highlights.html>) and in the conformance
> sections of some older WCAG 2.0 drafts (like
> <http://www.w3.org/TR/2006/WD-WCAG20-20060427/conformance.html>).
> I can't find documents with a requirement of 80% HIRR.

I looked a bit harder.
The April 2006 draft I referred to contains the following statement: "When
people who understand WCAG 2.0 test the same content using the same
success criteria, the same results should be obtained with high
inter-rater reliability."

There were a few comments about this, like LC-1267 from Andrew Arch
<http://www.w3.org/WAI/GL/WCAG20/issue-tracking/viewdata_individual.php?id=1267>
and LC-1212 from Al Gilman
<http://www.w3.org/WAI/GL/WCAG20/issue-tracking/viewdata_guidelines.php#1212>.

The phrases "people who understand WCAG" and "inter-rater reliability"
were removed from the May 2007 draft:
<http://www.w3.org/TR/2007/WD-WCAG20-20070517/#overview-sc>. So it became:
"The same results should be obtained with a high level of confidence when
people who understand how people with different types of disabilities use
the Web test the same content."

Later drafts (I did not check when) removed that statement and just said
that the success criteria are testable statements.
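To make "inter-rater reliability" and the "8 of 10" threshold concrete, here is a minimal sketch, assuming a simple percent-agreement measure (the verdicts below are hypothetical, not taken from any actual WCAG evaluation):

```python
from collections import Counter

def percent_agreement(verdicts):
    """Fraction of raters who gave the modal (most common) verdict."""
    most_common_count = Counter(verdicts).most_common(1)[0][1]
    return most_common_count / len(verdicts)

# Hypothetical pass/fail verdicts from 10 evaluators on one success criterion.
verdicts = ["pass", "pass", "fail", "pass", "pass",
            "pass", "pass", "pass", "fail", "pass"]

agreement = percent_agreement(verdicts)
print(agreement)         # 0.8
print(agreement >= 0.8)  # True: meets an "8 of 10" style threshold
```

More rigorous measures such as Cohen's or Fleiss' kappa also correct for chance agreement, which simple percent agreement does not.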

Best regards,

Christophe

>
>>
>>
>> From: Gregg Vanderheiden [mailto:gv@trace.wisc.edu]
>> Sent: April-02-13 1:57 PM
>> To: Alex Li
>> Cc: David MacDonald; Katie Haritos-Shea; Loretta Guarino Reid; Michael
>> Cooper; WCAG WG; Eval TF; GreggVan GVAN
>> Subject: Re: testing with a "High degree of confidence"
>>
>>
>>
>> Hi Alex, David, all
>>
>>
>>
>>
>>
>> A few comments on this and the other posts.
>>
>>
>>
>> Alex is right that disability research has much higher variance than
>> regular research and needs a higher N (number of subjects with a
>> disability).
>>
>>
>>
>> However, this didn't have to do with subjects with disabilities; it
>> had to do with experts deciding whether a particular web page met a
>> particular success criterion.  So the variance of people with
>> disabilities is not relevant.
>>
>>
>>
>>
>> Also, 8 of 10 was a ratio, not a number.  It didn't mean testing with
>> 10 and seeing what 8 said; it meant 80%.  So 10 as a number isn't
>> relevant either.
>>
>>
>>
>>
>>
>>
>>
>> Alex's last comment is closest to the reason.  The 80% was just a number
>> used during the discussion of how reliable it had to be.
>>
>>
>>
>> However, saying it was arbitrary is overstating it -- or seems to.  All
>> significance thresholds used in science are actually arbitrary.
>>
>>
>>
>> 8 of 10 can be a very scientific number -- or it can be a number
>> scientifically tested.  It is in fact simply a criterion.
>>
>>
>>
>> .01 and .001 are also arbitrary significance levels.  There is nothing
>> scientific about them; they are just probabilities that we have
>> traditionally decided were "good enough" to report.  There is also
>> .005.  And since there are two (or three) values, one might ask, "What
>> is the scientific reason one is chosen vs. another?"  It all has to do
>> with how "confident" you want to be in the results.  Do you want to be
>> 99%, 99.5%, or 99.9% sure that you don't have a false positive
>> conclusion (rejection of the null hypothesis)?  The researcher decides
>> which -- or the community decides which -- or the reader decides which
>> is the one they want to use for this or that type of research.  But in
>> the end it is someone's or some group's opinion as to what the
>> criterion should be for a study, or category of study, or...
>>
>>
>>
>> Back to WCAG
>>
>> When talking about agreement of experts, a number of different figures
>> were tossed around: 9 of 10, 8 of 10, etc.  In the end we decided that
>> if you took 10 experts and had them evaluate a page, it is unlikely
>> that any would produce exactly the same evaluation as another -- with
>> lots of nuance even on individual SC.  And the number actually had
>> nothing to do with compliance.  It only had to do with what the working
>> group was using as its criterion for inclusion.  And that had LOTS of
>> variables, one of which was testability.  Since the WG wasn't going to
>> actually run a test with 10 or 50 or any number of experts each time we
>> created or edited a success criterion, it didn't make sense to name a
>> number.  It was done based on the evaluation of the WG, with feedback
>> from the public during review.
>>
>>
>>
>> So in the end "testable" was defined by what the Working Group
>> intended it to mean, and what they were trying to use as their
>> criterion.  And that is the language that was put into Understanding
>> WCAG 2.0.
>>
>>
>>
>> Does this help?
>>
>>
>>
>> Gregg
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Apr 2, 2013, at 11:56 AM, Alex Li <alli@microsoft.com> wrote:
>>
>>
>>
>>
>>
>> David,
>>
>>
>>
>> I don't exactly recall all the reasons, but these are probably most of
>> them.
>>
>>
>>
>> First, WCAG 2.0 covers many disabilities.  Even if 10 human subjects
>> were statistically adequate (see the third reason), you would need 10
>> times an undetermined number of disabilities to cover WCAG 2.0.
>>
>>
>>
>> Second, it is hard to establish the knowledge threshold of your human
>> subjects.  Do you get 10 experts, complete novices, a mix, or just
>> random human subjects?  Add to the mix that the degree of specialty of
>> the site matters.  Does the human subject have to be financially
>> literate to test a stock trading site, for example?  How do you measure
>> the degree of financial literacy relative to the task?  This issue is a
>> challenge in terms of experimental design.  The results would differ
>> dramatically depending on how the experiment was designed.
>>
>>
>>
>> Third, 10 is simply inadequate by any statistical measure.  Just a
>> couple of "off-target" human subjects will throw your analysis way off
>> course.  A sample size of 100 is the bare minimum by rule of thumb.
>>
>>
>>
>> Lastly, 80% is plucked out of the air, so to speak.  (I don't remember
>> if we talked about 8 out of 10, but I'm using your number per this
>> mail.)  Why shouldn't it be 75% or 84.37%?  There is nothing
>> scientifically significant about the number 80%.
>>
>>
>>
>> Bottom line: such an approach is at best anecdotal and certainly not
>> scientific.  The degree of confidence in the approach would be
>> unacceptably low.  In general, analysis with a small sample size is
>> more suitable for qualitative methods like focus groups and the like,
>> which generally do not give you a pass/fail result.  (BTW, 10 is still
>> too small for most focus groups.)  Hope that helps.
>>
>>
>>
>> All best,
>>
>> Alex
>>
>>
>>
>> From: David MacDonald [mailto:david100@sympatico.ca]
>> Sent: Tuesday, April 02, 2013 9:11 AM
>> To: ryladog@earthlink.net; 'Loretta Guarino Reid'; 'Michael Cooper';
>> 'Gregg
>> Vanderheiden'
>> Cc: 'WCAG WG'; 'Eval TF'
>> Subject: RE: testing with a "High degree of confidence"
>>
>>
>>
>> Thanks Katie
>>
>>
>>
>> Can you remember where the vestiges of it ended up in the WCAG
>> documents....if at all?
>>
>> I'm just looking to see if we require a high correlation among
>> experts... or simply a high level of confidence... perhaps they are not
>> necessarily the same thing.
>>
>>
>>
>> Cheers
>>
>> David MacDonald
>>
>>
>>
>> CanAdapt Solutions Inc.
>>
>>   Adapting the web to all users
>>
>>             Including those with disabilities
>>
>>  www.Can-Adapt.com
>>
>>
>>
>> From: Katie Haritos-Shea EARTHLINK [mailto:ryladog@earthlink.net]
>> Sent: April-02-13 11:11 AM
>> To: 'David MacDonald'; 'Loretta Guarino Reid'; 'Michael Cooper'; 'Gregg
>> Vanderheiden'
>> Cc: 'WCAG WG'; 'Eval TF'
>> Subject: RE: testing with a "High degree of confidence"
>>
>>
>>
>> David,
>>
>>
>>
>> I remember these discussions back when -- I recall Gregg providing the
>> 8 out of 10 -- and I brought this up for the WCAG Evaluation
>> Methodology working group; for their purposes they wanted an
>> algorithmic reference, not human judgment.  I am not sure they found
>> one.
>>
>>
>>
>> Katie
>>
>>
>>
>> From: David MacDonald [mailto:david100@sympatico.ca]
>> Sent: Monday, April 01, 2013 3:50 PM
>> To: Loretta Guarino Reid; Michael Cooper; Gregg Vanderheiden
>> Cc: WCAG WG; Eval TF
>> Subject: testing with a "High degree of confidence"
>>
>>
>>
>> I remember early drafts of WCAG, when discussing human testing, said
>> it was dependable human testing if "8 of 10 testers would come to the
>> same conclusions..." or something like that... we later changed it to
>> something like "most testers would come to the same conclusions"
>> because we thought the 8 out of 10 rule was a bit prescriptive.
>>
>>
>>
>> I've been looking for that in WCAG 2, Understanding Conformance,
>> Understanding WCAG, etc., and didn't find it.
>>
>>
>>
>> The closest I could find was this, but it seems to be more related to
>> automatic testing...
>>
>> Does anyone remember the history of that line about "most experts
>> would agree..." and where it is now?
>>
>>
>>
>> "The Success Criteria can be tested by a combination of machine and
>> human
>> evaluation as long as it is possible to determine whether a Success
>> Criterion has been satisfied with a high level of confidence."
>>
>>
>>
>>
>> <http://www.w3.org/TR/UNDERSTANDING-WCAG20/conformance.html#uc-accessibility-support-head>
>>
>>
>>
>> Cheers
>>
>> David MacDonald
>>
>
> --
> Christophe Strobbe
> Akademischer Mitarbeiter
> Adaptive User Interfaces Research Group
> Hochschule der Medien
> Nobelstraße 10
> 70569 Stuttgart
> Tel. +49 711 8923 2749
>
>
>


-- 
Christophe Strobbe
Akademischer Mitarbeiter
Adaptive User Interfaces Research Group
Hochschule der Medien
Nobelstraße 10
70569 Stuttgart
Tel. +49 711 8923 2749

Received on Wednesday, 3 April 2013 16:17:20 UTC