- From: Christophe Strobbe <strobbe@hdm-stuttgart.de>
- Date: Wed, 3 Apr 2013 17:19:41 +0200
- To: "'WCAG WG'" <w3c-wai-gl@w3.org>, "'Eval TF'" <public-wai-evaltf@w3.org>
Hi David, All,

On Tue, 2 Apr 2013, at 20:41, David MacDonald wrote:
> Hi Gregg,
>
> Do you know where that is, or where the statement it got morphed into is found in the documents?

The term you should look for is "high inter-rater reliability" (HIRR). The first place I looked was "Requirements for WCAG 2.0" (e.g. <http://www.w3.org/TR/2006/NOTE-wcag2-req-20060425/>), but it isn't listed there. The term can be found in some discussions from August 2002 (e.g. <http://www.w3.org/WAI/GL/meeting-highlights.html>) and in the conformance sections of some older WCAG 2.0 drafts (like <http://www.w3.org/TR/2006/WD-WCAG20-20060427/conformance.html>). I can't find documents with a requirement of 80% HIRR.

Best regards,
Christophe

> Cheers
>
> David MacDonald
>
> From: Gregg Vanderheiden [mailto:gv@trace.wisc.edu]
> Sent: April-02-13 1:57 PM
> To: Alex Li
> Cc: David MacDonald; Katie Haritos-Shea; Loretta Guarino Reid; Michael Cooper; WCAG WG; Eval TF; GreggVan GVAN
> Subject: Re: testing with a "High degree of confidence"
>
> Hi Alex, David, all
>
> A few comments on this and the other posts.
>
> Alex is right about disability having much higher variance than regular research and needing a higher N (number of subjects with a disability).
>
> However, this didn't have to do with subjects with disabilities - it had to do with experts deciding whether a particular web page met a particular success criterion. So the variance of people with disabilities is not relevant.
>
> Also, 8 of 10 was a ratio - not a number. It didn't mean testing with 10 and seeing what 8 said. It meant 80%. So 10 as a number isn't relevant either.
>
> Alex's last comment is closest to the reason. The 80% was just a number used during the discussion of how reliable it had to be.
>
> However, saying it was arbitrary is overstating it -- or seems to be. All significance thresholds used in science are actually arbitrary.
>
> 8 of 10 can be a very scientific number -- or it can be a number scientifically tested. It is in fact simply a criterion.
>
> .01 and .001 are also arbitrary significance levels. There is nothing scientific about them - they are just probabilities that we have traditionally decided were "good enough" to report. There is also .005. And since there are two (or three) values, one might ask "What is the scientific reason one is chosen vs another?" It all has to do with how "confident" you want to be in the results. Do you want to be 99%, 99.5%, or 99.9% sure that you don't have a false positive conclusion (rejection of the null hypothesis)? The researcher decides which -- or the community decides which -- or the reader decides which one they want to use for this or that type of research. But in the end it is someone's, or some group's, opinion as to what the criterion should be for a study or category of study or...
>
> Back to WCAG.
>
> When talking about agreement of experts, a number of different numbers were tossed around: 9 of 10, 8 of 10, etc. In the end we decided that if you took 10 experts and had them evaluate a page, it is unlikely that any would have the exact same evaluation as another -- with lots of nuance even on individual SC. And the number actually had nothing to do with compliance. It only had to do with what the working group was using as its criterion for inclusion. And that had LOTS of variables, one of which was testability.
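To make the "ratio, not a number" point concrete, here is a minimal Python sketch: percent agreement among a panel of expert evaluators on one success criterion, checked against the 80% figure taken as an assumed threshold. The verdicts, names, and threshold are all hypothetical, not anything the working group actually ran.

    # Minimal sketch, assuming "8 of 10" is read as an 80% agreement ratio
    # among however many evaluators there are, not as a literal panel of ten.
    # All verdicts and names below are hypothetical.
    from collections import Counter

    AGREEMENT_THRESHOLD = 0.80  # the "8 of 10" figure, treated as a ratio

    def agreement_ratio(verdicts):
        """Share of evaluators giving the most common verdict for one SC."""
        most_common_count = Counter(verdicts).most_common(1)[0][1]
        return most_common_count / len(verdicts)

    # Hypothetical pass/fail judgements from a panel of evaluators.
    verdicts = ["pass", "pass", "pass", "fail", "pass", "pass",
                "pass", "pass", "fail", "pass"]

    ratio = agreement_ratio(verdicts)
    print(f"agreement: {ratio:.0%}; "
          f"meets the {AGREEMENT_THRESHOLD:.0%} criterion: {ratio >= AGREEMENT_THRESHOLD}")

Whatever value AGREEMENT_THRESHOLD takes, it remains a chosen criterion, not something the arithmetic dictates -- which is the same point Gregg makes about .01, .005, and .001.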
> Since the WG wasn't going to actually run a test with 10 or 50 or any number of experts each time we created or edited a success criterion, it didn't make sense to name a number. It was done based on the evaluation of the WG, with feedback from the public during review.
>
> So in the end "testable" was defined by what the Working Group intended it to mean, and what they were trying to use as their criterion. And that is the language that was put into Understanding WCAG 2.0.
>
> Does this help?
>
> Gregg
>
> On Apr 2, 2013, at 11:56 AM, Alex Li <alli@microsoft.com> wrote:
>
> David,
>
> I don't exactly recall all the reasons, but these are probably most of it.
>
> First, WCAG 2.0 covers many disabilities. Even if 10 human subjects were statistically adequate (see the third reason), you would need 10 times an undetermined number of disabilities to cover WCAG 2.0.
>
> Second, it is hard to establish the knowledge threshold of your human subjects. Do you get 10 experts, complete novices, a mix, or just random human subjects? Add to that, the degree of specialty of the site matters. Does the human subject have to be financially literate to test a stock trading site, for example? How do you measure the degree of financial literacy in correspondence to the task? This issue is a challenge in terms of design of experiment. The difference in result would be dramatic depending on how the experiment was designed.
>
> Third, 10 is simply inadequate by any statistical measure. Just a couple of "off-target" human subjects will throw your analysis way off course. A sample size of 100 is the bare minimum by rule of thumb.
>
> Lastly, 80% is plucked out of the air, so to speak. (I don't remember if we talked about 8 out of 10, but I'm using your number per this mail.) Why shouldn't it be 75% or 84.37%? There is nothing scientifically significant about the number 80%.
>
> Bottom line is that such an approach is at best anecdotal and certainly not scientific. The degree of confidence of the approach would be unacceptably low. In general, conducting analysis with a small sample size is more suitable for qualitative analyses like focus groups and the like, which generally do not give you a pass/fail result. (BTW, 10 is still too small for most focus groups.) Hope that helps.
>
> All best,
>
> Alex
>
> From: David MacDonald [mailto:david100@sympatico.ca]
> Sent: Tuesday, April 02, 2013 9:11 AM
> To: ryladog@earthlink.net; 'Loretta Guarino Reid'; 'Michael Cooper'; 'Gregg Vanderheiden'
> Cc: 'WCAG WG'; 'Eval TF'
> Subject: RE: testing with a "High degree of confidence"
>
> Thanks Katie,
>
> Can you remember where the vestiges of it ended up in the WCAG documents... if at all?
>
> I'm just looking to see if we require a high correlation among experts... or simply a high level of confidence... perhaps they are not necessarily the same thing.
>
> Cheers
>
> David MacDonald
>
> CanAdapt Solutions Inc.
> Adapting the web to all users
> Including those with disabilities
> www.Can-Adapt.com <http://www.can-adapt.com/>
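Alex's third and fourth points can be made concrete with another minimal Python sketch, again with hypothetical numbers: a 95% Wilson score interval around an observed 80% agreement rate shows how little a sample of 10 pins down compared with a sample of 100.

    # Minimal sketch, assuming agreement is modelled as a binomial proportion.
    # The sample sizes and the observed 80% rate are hypothetical.
    from math import sqrt

    def wilson_interval(successes, n, z=1.96):
        """Approximate 95% confidence interval for a binomial proportion."""
        p = successes / n
        denom = 1 + z**2 / n
        centre = (p + z**2 / (2 * n)) / denom
        halfwidth = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return centre - halfwidth, centre + halfwidth

    for n in (10, 100):
        low, high = wilson_interval(round(0.8 * n), n)
        print(f"n = {n:3d}: observed 80% agreement, 95% CI approx. {low:.2f} to {high:.2f}")

With these inputs the interval runs from roughly 0.49 to 0.94 at n = 10, versus roughly 0.71 to 0.87 at n = 100 -- one way of quantifying the "off-target human subjects will throw your analysis way off course" concern.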
> From: Katie Haritos-Shea EARTHLINK [mailto:ryladog@earthlink.net]
> Sent: April-02-13 11:11 AM
> To: 'David MacDonald'; 'Loretta Guarino Reid'; 'Michael Cooper'; 'Gregg Vanderheiden'
> Cc: 'WCAG WG'; 'Eval TF'
> Subject: RE: testing with a "High degree of confidence"
>
> David,
>
> I remember these discussions back when - I recall Gregg providing the 8 out of 10 - and I brought this up for the WCAG Evaluation Methodology working group - for their purposes they wanted an algorithmic reference, not human judgment. Am not sure they found one.
>
> Katie
>
> From: David MacDonald [mailto:david100@sympatico.ca]
> Sent: Monday, April 01, 2013 3:50 PM
> To: Loretta Guarino Reid; Michael Cooper; Gregg Vanderheiden
> Cc: WCAG WG; Eval TF
> Subject: testing with a "High degree of confidence"
>
> I remember that in early drafts of WCAG, when discussing human testing, we said it was dependable human testing if "8 of 10 testers would come to the same conclusions..." or something like that... We later changed it to something like "most testers would come to the same conclusions" because we thought the 8 out of 10 rule was a bit prescriptive.
>
> I've been looking for that in WCAG 2, or in Understanding Conformance, Understanding WCAG, etc., and didn't find it.
>
> The closest I could find was this, but it seems to be more related to automatic testing...
>
> Does anyone remember the history of that line about "most experts would agree..." and where it is now?
>
> "The Success Criteria can be tested by a combination of machine and human evaluation as long as it is possible to determine whether a Success Criterion has been satisfied with a high level of confidence."
>
> <http://www.w3.org/TR/UNDERSTANDING-WCAG20/conformance.html#uc-accessibility-support-head>
>
> Cheers
>
> David MacDonald

--
Christophe Strobbe
Academic Staff Member
Adaptive User Interfaces Research Group
Hochschule der Medien
Nobelstraße 10
70569 Stuttgart
Tel. +49 711 8923 2749
Received on Wednesday, 3 April 2013 15:20:29 UTC