Direction of methodology development

Hi all,

The teleconference on 4 November has changed my perception of what our 
evaluation methodology is likely to accomplish.

There seems to be consensus that the test procedure itself (stepping 
through the WCAG success criteria at the chosen level and checking web 
content against them) is not going to be covered in any detail. 
References to the quickref and the techniques provided under it will be 
used for that instead.

Reason: the WCAG 2.0 techniques already contain tests with pass/fail 
conditions in sufficient detail; this work should not be replicated or 
aggregated, both to avoid versioning and consistency issues and to save 
maintenance effort.

The methodology will instead focus on other aspects such as page 
sampling and setting the scope of conformance claims. It will also 
(probably?) propose a concept of tolerance metrics for deciding whether 
web content under test should pass or fail a success criterion. This 
would address questions like whether it should be possible for content 
to pass an SC even in the case of minor violations. Just as an example: 
content may still pass SC 1.3.1 (Info and Relationships) if a short list 
in some text content on a page is not properly marked up as a list. The 
test would tolerate such minor violations. (Before you lunge at this: I 
am not suggesting it should; this is just to sketch a possible outcome 
of tolerance metrics.)
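
To make the idea a bit more tangible, here is a minimal sketch (Python, 
purely illustrative; the severity labels, the 5% threshold and all names 
are my own assumptions, not anything EVAL TF has discussed) of what a 
per-SC tolerance rule could look like:

from dataclasses import dataclass

# One recorded check of a success criterion against one instance of
# content (e.g. one list, one image, one form control). Hypothetical
# structure, just to illustrate the idea.
@dataclass
class Finding:
    sc: str          # e.g. "1.3.1"
    passed: bool
    critical: bool   # severity judgement supplied by the tester

def sc_outcome(findings, tolerance=0.05):
    """Pass the SC if no critical violation was found and the share of
    non-critical violations stays at or below an (arbitrary) tolerance."""
    failures = [f for f in findings if not f.passed]
    if any(f.critical for f in failures):
        return "fail"
    if not findings:
        return "not applicable"
    return "pass" if len(failures) / len(findings) <= tolerance else "fail"

# The SC 1.3.1 example above: one short list not marked up as a list,
# among twenty structural elements checked on the page.
findings = [Finding("1.3.1", True, False) for _ in range(19)]
findings.append(Finding("1.3.1", False, False))
print(sc_outcome(findings))  # -> "pass" under this (made-up) rule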

Subsequent discussions on the list indicated that the methodology might 
also deal with the issue of managing and rating multiple failures, i.e. 
web content that simultaneously fails several success criteria.

For me, one consequence of the consensus sketched above is that I no 
longer think it necessary to separate a document covering the evaluation 
procedure from a document covering the context (rationale, references, 
glossary, qualification of testers, etc.).

Why? In all likelihood, actual testers won't have our methodology on 
their laps. They will be using separate hands-on tools to guide their 
evaluation at the level of individual success criteria, including the 
option to enter relevant comments about problems and violations.

Such a hands-on tool could be a web application like BITV-Test (perhaps 
suitably modified to address all three WCAG levels), or it could be an 
Excel-based document like the access-for-all spreadsheet (Checkliste 
für barrierefreies Webdesign 2.0, i.e. "Checklist for accessible web 
design 2.0", http://url.ie/di0d).

The methodology, as I now see it, will act as a framework or spec for 
hands-on tools based on it.

What will be absent, though, is any tangible advice regarding the actual 
rating of content under test. Hands-on tools may provide further help 
here, but that help will no longer be guided by our methodology, or only 
in broad terms.

Now, let's look at the feasibility of tolerance metrics without the 
hands-on bit.


Presumably, a case-by-case assessment of the severity of SC violations 
would have to feed into any tolerance metric before results are 
aggregated at the level of the conformance claim. It now seems that our 
methodology will avoid delving into success criteria and WCAG techniques 
and failures altogether. By the same token, it cannot give any concrete 
guidance for assessing actual web content against success criteria.

What's left of tolerance metrics, then? What can a generic section on 
tolerance metrics achieve? If it states, for example, that it is 
generally acceptable for content to have a share X of non-critical 
violations and still pass the SC, this leaves it to the tester to 
determine whether or not a given violation qualifies as non-critical. 
Also, in many cases, a quantitative judgement of the share of violations 
against successful implementations is simply not possible.
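
To illustrate why this worries me, a second sketch along the same lines 
(again, all numbers and names are my own assumptions): whatever share X 
a generic tolerance section fixes, the decisive input is still the 
tester's criticality judgement, and the rule presumes a countable 
denominator of checked instances, which often does not exist. Two 
testers who record the same violations can then reach opposite outcomes:

# The same three violations on a page, classified by two different testers.
violations = ["unlabelled search field",
              "list not marked up as list",
              "layout table with th cells"]

tester_a = {"unlabelled search field": True,    # judged critical
            "list not marked up as list": False,
            "layout table with th cells": False}
tester_b = {v: False for v in violations}       # all judged non-critical

def page_outcome(violations, severity, checked_instances=40, tolerance=0.1):
    """A generic 'share X' rule: fail on any critical violation, otherwise
    tolerate violations up to 10% of the instances checked (both numbers
    are arbitrary)."""
    if any(severity[v] for v in violations):
        return "fail"
    return "pass" if len(violations) / checked_instances <= tolerance else "fail"

print(page_outcome(violations, tester_a))  # -> "fail"
print(page_outcome(violations, tester_b))  # -> "pass" (3/40 = 0.075)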

How can this issue be solved? If the outcome of recognising the 
difficulty of judging the criticality or degree of violations is that 
EVAL TF simply decides that *any* violation should fail a success 
criterion, we would end up with hardly any site passing.

In my view, a methodology that avoids the 'on the ground' procedure of 
trailing through success criteria and their related failures and 
techniques can certainly define some points that have not been 
sufficiently elaborated so far (e.g. setting the scope of a claim, and 
sampling pages). I am resigned to thinking this can be useful.

Such a methodology will not, however, solve the fundamental problem of 
assessing usually less-than-perfect web content: deciding on the 
criticality of violations and determining whether, on the whole, a page 
should pass or fail even with minor violations. Since that level of 
analysis is simply not covered, it will be up to testers to determine 
all that on their own. And all those judgements will aggregate upwards.

Goodbye, then, reliability. And say hello to your stern sister, 
replicability.




Am 14.11.2011 09:19, schrieb Shadi Abou-Zahra:
> Eval TF,
>
> Please find the minutes for the teleconference on 10 November 2011:
> - <http://www.w3.org/2011/11/10-eval-minutes.html>
>
> Next meeting: Thursday 17 November 2011
>
>
> Regards,
> Shadi
>


-- 
---------------------------------------------------------------
Detlev Fischer PhD
DIAS GmbH - Daten, Informationssysteme und Analysen im Sozialen
Management: Thomas Lilienthal, Michael Zapp

Phone: +49-40-43 18 75-25
Mobile: +49-157 7-170 73 84
Fax: +49-40-43 18 75-19
E-mail: fischer@dias.de

Address: Schulterblatt 36, D-20357 Hamburg
Hamburg District Court (Amtsgericht), HRB 58 167
Managing directors: Thomas Lilienthal, Michael Zapp
---------------------------------------------------------------

Received on Monday, 14 November 2011 15:06:14 UTC