- From: Christophe Strobbe <strobbe@hdm-stuttgart.de>
- Date: Wed, 3 Apr 2013 17:19:41 +0200
- To: "'WCAG WG'" <w3c-wai-gl@w3.org>, "'Eval TF'" <public-wai-evaltf@w3.org>
Hi David, All,

On Tue, 2 Apr 2013, at 20:41, David MacDonald wrote:
> Hi Gregg,
>
> Do you know where that is, or where the statement it got morphed into is found in the documents?

The term you should look for is "high inter-rater reliability" (HIRR). The first place I looked was "Requirements for WCAG 2.0" (e.g. <http://www.w3.org/TR/2006/NOTE-wcag2-req-20060425/>), but it isn't listed there. The term can be found in some discussions from August 2002 (e.g. <http://www.w3.org/WAI/GL/meeting-highlights.html>) and in the conformance sections of some older WCAG 2.0 drafts (like <http://www.w3.org/TR/2006/WD-WCAG20-20060427/conformance.html>). I can't find documents with a requirement of 80% HIRR.

Best regards,
Christophe

> Cheers
>
> David MacDonald
>
> From: Gregg Vanderheiden [mailto:gv@trace.wisc.edu]
> Sent: April-02-13 1:57 PM
> To: Alex Li
> Cc: David MacDonald; Katie Haritos-Shea; Loretta Guarino Reid; Michael Cooper; WCAG WG; Eval TF; GreggVan GVAN
> Subject: Re: testing with a "High degree of confidence"
>
> Hi Alex, David, all
>
> A few comments on this and the other posts.
>
> Alex is right about disability having much higher variance than regular research and needing a higher N (number of subjects with a disability).
>
> However, this didn't have to do with subjects with disabilities - it had to do with experts deciding whether a particular web page met a particular success criterion. So the variance of people with disabilities is not relevant.
>
> Also, 8 of 10 was a ratio - not a number. It didn't mean testing with 10 and seeing what 8 said. It meant 80%. So 10 as a number isn't relevant either.
>
> Alex's last comment is closest to the reason. The 80% was just a number used during the discussion of how reliable it had to be.
>
> However, saying it was arbitrary is overstating it -- or seems to be. All significance thresholds used in science are actually arbitrary.
>
> 8 of 10 can be a very scientific number -- or it can be a number scientifically tested. It is in fact simply a criterion.
>
> .01 and .001 are also arbitrary significance levels. There is nothing scientific about them - they are just probabilities that we have traditionally decided were "good enough" to report. There is also .005. And since there are two (or three) values, one might ask "What is the scientific reason one is chosen vs another?" It all has to do with how "confident" you want to be in the results. Do you want to be 99%, 99.5%, or 99.9% sure that you don't have a false positive conclusion (rejection of the null hypothesis)? The researcher decides which -- or the community decides which -- or the reader decides which one they want to use for this or that type of research. But in the end it is someone's, or some group's, opinion as to what the criterion should be for a study or category of study or...
>
> Back to WCAG.
>
> When talking about agreement of experts, a number of different numbers were tossed around: 9 of 10, 8 of 10, etc. In the end we decided that if you took 10 experts and had them evaluate a page, it is unlikely that any would have the exact same evaluation as another -- with lots of nuance even on individual SC. And the number actually had nothing to do with compliance. It only had to do with what the working group was using as its criterion for inclusion. And that had LOTS of variables, one of which was testability.
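To make the "ratio, not a number" point concrete, here is a minimal Python sketch: percent agreement among a panel of expert evaluators on one success criterion, checked against the 80% figure taken as an assumed threshold. The verdicts, names, and threshold are all hypothetical, not anything the working group actually ran.

    # Minimal sketch, assuming "8 of 10" is read as an 80% agreement ratio
    # among however many evaluators there are, not as a literal panel of ten.
    # All verdicts and names below are hypothetical.
    from collections import Counter

    AGREEMENT_THRESHOLD = 0.80  # the "8 of 10" figure, treated as a ratio

    def agreement_ratio(verdicts):
        """Share of evaluators giving the most common verdict for one SC."""
        most_common_count = Counter(verdicts).most_common(1)[0][1]
        return most_common_count / len(verdicts)

    # Hypothetical pass/fail judgements from a panel of evaluators.
    verdicts = ["pass", "pass", "pass", "fail", "pass", "pass",
                "pass", "pass", "fail", "pass"]

    ratio = agreement_ratio(verdicts)
    print(f"agreement: {ratio:.0%}; "
          f"meets the {AGREEMENT_THRESHOLD:.0%} criterion: {ratio >= AGREEMENT_THRESHOLD}")

Whatever value AGREEMENT_THRESHOLD takes, it remains a chosen criterion, not something the arithmetic dictates -- which is the same point Gregg makes about .01, .005, and .001.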
> Since the WG wasn't going to actually run a test with 10 or 50 or any number of experts each time we created or edited a success criterion, it didn't make sense to name a number. It was done based on the evaluation of the WG, with feedback from the public during review.
>
> So in the end "testable" was defined by what the Working Group intended it to mean, and what they were trying to use as their criterion. And that is the language that was put into Understanding WCAG 2.0.
>
> Does this help?
>
> Gregg
>
> On Apr 2, 2013, at 11:56 AM, Alex Li <alli@microsoft.com> wrote:
>
> David,
>
> I don't exactly recall all the reasons, but these are probably most of it.
>
> First, WCAG 2.0 covers many disabilities. Even if 10 human subjects were statistically adequate (see the third reason), you would need 10 times an undetermined number of disabilities to cover WCAG 2.0.
>
> Second, it is hard to establish the knowledge threshold of your human subjects. Do you get 10 experts, complete novices, a mix, or just random human subjects? Add to that, the degree of specialty of the site matters. Does the human subject have to be financially literate to test a stock trading site, for example? How do you measure the degree of financial literacy in correspondence to the task? This issue is a challenge in terms of design of experiment. The difference in result would be dramatic depending on how the experiment was designed.
>
> Third, 10 is simply inadequate by any statistical measure. Just a couple of "off-target" human subjects will throw your analysis way off course. A sample size of 100 is the bare minimum by rule of thumb.
>
> Lastly, 80% is plucked out of the air, so to speak. (I don't remember if we talked about 8 out of 10, but I'm using your number per this mail.) Why shouldn't it be 75% or 84.37%? There is nothing scientifically significant about the number 80%.
>
> Bottom line is that such an approach is at best anecdotal and certainly not scientific. The degree of confidence of the approach would be unacceptably low. In general, conducting analysis with a small sample size is more suitable for qualitative analyses like focus groups and the like, which generally do not give you a pass/fail result. (BTW, 10 is still too small for most focus groups.) Hope that helps.
>
> All best,
>
> Alex
>
> From: David MacDonald [mailto:david100@sympatico.ca]
> Sent: Tuesday, April 02, 2013 9:11 AM
> To: ryladog@earthlink.net; 'Loretta Guarino Reid'; 'Michael Cooper'; 'Gregg Vanderheiden'
> Cc: 'WCAG WG'; 'Eval TF'
> Subject: RE: testing with a "High degree of confidence"
>
> Thanks Katie,
>
> Can you remember where the vestiges of it ended up in the WCAG documents... if at all?
>
> I'm just looking to see if we require a high correlation among experts... or simply a high level of confidence... perhaps they are not necessarily the same thing.
>
> Cheers
>
> David MacDonald
>
> CanAdapt Solutions Inc.
> Adapting the web to all users
> Including those with disabilities
> www.Can-Adapt.com <http://www.can-adapt.com/>
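Alex's third and fourth points can be made concrete with another minimal Python sketch, again with hypothetical numbers: a 95% Wilson score interval around an observed 80% agreement rate shows how little a sample of 10 pins down compared with a sample of 100.

    # Minimal sketch, assuming agreement is modelled as a binomial proportion.
    # The sample sizes and the observed 80% rate are hypothetical.
    from math import sqrt

    def wilson_interval(successes, n, z=1.96):
        """Approximate 95% confidence interval for a binomial proportion."""
        p = successes / n
        denom = 1 + z**2 / n
        centre = (p + z**2 / (2 * n)) / denom
        halfwidth = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return centre - halfwidth, centre + halfwidth

    for n in (10, 100):
        low, high = wilson_interval(round(0.8 * n), n)
        print(f"n = {n:3d}: observed 80% agreement, 95% CI approx. {low:.2f} to {high:.2f}")

With these inputs the interval runs from roughly 0.49 to 0.94 at n = 10, versus roughly 0.71 to 0.87 at n = 100 -- one way of quantifying the "off-target human subjects will throw your analysis way off course" concern.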
> From: Katie Haritos-Shea EARTHLINK [mailto:ryladog@earthlink.net]
> Sent: April-02-13 11:11 AM
> To: 'David MacDonald'; 'Loretta Guarino Reid'; 'Michael Cooper'; 'Gregg Vanderheiden'
> Cc: 'WCAG WG'; 'Eval TF'
> Subject: RE: testing with a "High degree of confidence"
>
> David,
>
> I remember these discussions back when - I recall Gregg providing the 8 out of 10 - and I brought this up for the WCAG Evaluation Methodology working group - for their purposes they wanted an algorithmic reference, not human judgment. Am not sure they found one.
>
> Katie
>
> From: David MacDonald [mailto:david100@sympatico.ca]
> Sent: Monday, April 01, 2013 3:50 PM
> To: Loretta Guarino Reid; Michael Cooper; Gregg Vanderheiden
> Cc: WCAG WG; Eval TF
> Subject: testing with a "High degree of confidence"
>
> I remember that in early drafts of WCAG, when discussing human testing, we said it was dependable human testing if "8 of 10 testers would come to the same conclusions..." or something like that... We later changed it to something like "most testers would come to the same conclusions" because we thought the 8 out of 10 rule was a bit prescriptive.
>
> I've been looking for that in WCAG 2, or in Understanding Conformance, Understanding WCAG, etc., and didn't find it.
>
> The closest I could find was this, but it seems to be more related to automatic testing...
>
> Does anyone remember the history of that line about "most experts would agree..." and where it is now?
>
> "The Success Criteria can be tested by a combination of machine and human evaluation as long as it is possible to determine whether a Success Criterion has been satisfied with a high level of confidence."
>
> <http://www.w3.org/TR/UNDERSTANDING-WCAG20/conformance.html#uc-accessibility-support-head>
>
> Cheers
>
> David MacDonald

--
Christophe Strobbe
Academic Staff Member
Adaptive User Interfaces Research Group
Hochschule der Medien
Nobelstraße 10
70569 Stuttgart
Tel. +49 711 8923 2749
Received on Wednesday, 3 April 2013 15:20:29 UTC