Re: evaltf-ISSUE-6 (Shadi): Improving objectivity and reliability of the Methodology [WCAG-EM - Website Accessibility Conformance Evaluation Methodology 1.0] from Kerstin Probiesch on 2012-08-23 (public-wai-evaltf@w3.org from August 2012)

From: Kerstin Probiesch <k.probiesch@gmail.com>
Date: Thu, 23 Aug 2012 10:15:31 +0200
To: "'WCAG 2.0 Evaluation Methodology Task Force'" <public-wai-evaltf@w3.org>
Message-ID: <5035e671.5181cc0a.5c59.4277@mx.google.com>
Hi all,

As already written: the three main goodness criteria (sometimes also called "quality criteria") of evaluation methodologies are internationally agreed and already defined. Considerations about ensuring and improving reliability, objectivity and validity are essential parts of evaluations. They are fundamental and indispensable when a methodology is claimed to be "standardized". And: if someone uses the term "standardized" for a methodology, it means that those criteria are met.

Therefore: as long as we don't explicitly address them, I don't agree with using the term "standardized", because it implies that the criteria are met in a "sufficient" way, that we have a value for the reliability coefficient, and that this value is sufficient. We don't have this value and don't know it yet; a controlled test phase of the methodology will show it.

Nevertheless, theses about the goodness criteria and how to improve them are part of the development of an evaluation methodology. Whether those theses are true or false will also be shown by a test phase with the methodology in a controlled setting, for example testing the same website with the same methodology and with different evaluators.
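
As an illustration of what a reliability coefficient from such a controlled test could look like, here is a minimal sketch. It assumes Cohen's kappa as the coefficient (an assumption; no specific coefficient has been chosen for the methodology) and two evaluators who each record a pass/fail verdict for the same checks on the same website. The verdict lists are invented purely for illustration.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who judged the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must judge the same non-empty item list")

    n = len(ratings_a)
    # Share of items on which both evaluators gave the same verdict.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement from each evaluator's own verdict frequencies.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both evaluators used one and the same verdict throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts of two evaluators for the same eight checks.
evaluator_1 = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
evaluator_2 = ["pass", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]

kappa = cohens_kappa(evaluator_1, evaluator_2)
print(f"kappa = {kappa:.3f}")
```

Which value would count as "sufficient" is exactly the open question; the sketch only shows how a value could be obtained once several evaluators have tested the same website.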

Independent of this: I believe that we are well on the way to addressing the goodness criteria, so mentioning them explicitly is for me just the next logical step, also for feedback from the public:

Theses:

Sampling: The starting point of an evaluation is not the test itself but the sampling (if needed). The fewer pages and elements we have in a sample, the less reliable and objective the results will be. So the number of pages and elements in the sample influences the results of an evaluation.

Possible errors which have or can have an impact:

# Especially for huge websites with probably tens or even hundreds of different editors (for example government websites), we should take care that an evaluator is not just testing content that was edited by one or two editors. Here I think it is not necessary to check whole pages again and again, because I don't expect any surprises in, for example, the navigation (we have already checked this with our sample of pages, which I believe could be sufficient). But: the fewer elements (tables, headings and so on) one tests, the less reliable the results are (thesis), because another evaluator would probably choose elements on other pages, probably edited by other editors. We address this issue, but I believe we should address it more, for example like this: test two tables in every section of the website; test 1.3.1 on two pages of every section (here just the articles are meant, because navigation bars are already tested).
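
The "two elements of every section" idea above could be sketched roughly as follows. This is only an illustration under assumptions: the record fields `section`, `editor` and `url`, the function name `sample_per_section`, and the preference for elements from different editors are all hypothetical and not part of the methodology. A fixed random seed is used so that the drawn sample can be reproduced.

```python
import random
from collections import defaultdict


def sample_per_section(elements, per_section=2, seed=0):
    """Pick up to `per_section` elements from every section of a website.

    Within a section, elements from different editors are preferred, so
    that the sample is not dominated by the work of one or two editors.
    """
    rng = random.Random(seed)  # fixed seed: the same input gives the same sample

    by_section = defaultdict(list)
    for element in elements:
        by_section[element["section"]].append(element)

    sample = {}
    for section, candidates in by_section.items():
        shuffled = candidates[:]
        rng.shuffle(shuffled)

        chosen = []
        seen_editors = set()
        # First pass: take at most one element per editor.
        for element in shuffled:
            if len(chosen) == per_section:
                break
            if element["editor"] not in seen_editors:
                chosen.append(element)
                seen_editors.add(element["editor"])
        # Second pass: fill up if the section has too few distinct editors.
        for element in shuffled:
            if len(chosen) == per_section:
                break
            if not any(element is picked for picked in chosen):
                chosen.append(element)

        sample[section] = chosen
    return sample


# Hypothetical inventory of tables found on a website.
tables = [
    {"section": "news", "editor": "A", "url": "/news/1"},
    {"section": "news", "editor": "A", "url": "/news/2"},
    {"section": "news", "editor": "B", "url": "/news/3"},
    {"section": "services", "editor": "C", "url": "/services/1"},
]

sample = sample_per_section(tables)
for section, picked in sample.items():
    print(section, [element["url"] for element in picked])
```

In practice the editor of an element is often not known to an external evaluator; in that case only the per-section part of the sketch applies.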

Also, an evaluator is not perfect (I have already mentioned some possible errors in other mails): not enough time, one likes the design or not, one likes the developers or not (which need not be conscious). There can be effects like: "we know that the agency is making good stuff, therefore we give them some credit in advance" (these, I believe, are often unconscious). One may think that this is uncontrollable, but I believe that there are several possibilities to control errors like this:

- Number of sampled elements.
- A second independent tester who doesn't belong to the same testing organization, which is essential for improving reliability and objectivity. Problem: the costs will then be higher.
- Pass/fail, which we have already chosen. I think this not only improves reliability and objectivity more than a methodology that works with graded results, but also ensures that our methodology is, on this point, valid against what WCAG 2.0 says in "Conformance level".

Therefore I plead for a section where we explicitly address/mention the goodness criteria, as the next step after the implicit actions we have already taken. These lines could probably be a first step for such a section (with better English than mine ;-)). It could end with a sentence like: The TF welcomes comments on this and comments about how to improve the goodness criteria further.

Best

Kerstin

> -----Original Message-----
> From: WCAG 2.0 Evaluation Methodology Task Force Issue Tracker
> [mailto:sysbot+tracker@w3.org]
> Sent: Friday, 17 August 2012 14:59
> To: public-wai-evaltf@w3.org
> Subject: evaltf-ISSUE-6 (Shadi): Improving objectivity and reliability
> of the Methodology [WCAG-EM - Website Accessibility Conformance
> Evaluation Methodology 1.0]
> 
> evaltf-ISSUE-6 (Shadi): Improving objectivity and reliability of the
> Methodology [WCAG-EM - Website Accessibility Conformance Evaluation
> Methodology 1.0]
> 
> http://www.w3.org/WAI/ER/2011/eval/track/issues/6
> 
> Raised by: Kerstin Probiesch
> On product: WCAG-EM - Website Accessibility Conformance Evaluation
> Methodology 1.0
> 
> The request is to "include goodness criteria like objectivity and
> reliability", to ensure that the Methodology itself provides a sense of
> objectivity and reliability. However, will including such criteria
> really improve the methodology?
> 
> Note: We will be running the Methodology in practice when we have a
> completed version (~December 2012), to get feedback on how it performs
> in practice. This includes vagueness regarding objectivity and
> reliability.
>
Received on Thursday, 23 August 2012 08:37:33 UTC