Re: [Disposition of Comments] Working Draft, 30 January 2014

Hi group,

I went through the latest disposition of comments (DoC):
- http://www.w3.org/WAI/ER/conformance/comments-20140130
Here are my general thoughts, followed by my take on the questions listed by Shadi.

Most suggestions in the DoC look quite sensible; many can be implemented and will improve the text. The elephant in the room (as I see it) is still the complete and utter absence in WCAG-EM of any specific instructions as to *how to test* the sampled web content, but this is simply the result of the strategic decision to refer to the WCAG Techniques and not to specify beyond that how evaluators will check content and arrive at pass/fail judgements.

Marijke von Grafhorst has made a point which we at BITV-Test, having carried out hundreds of web site evaluations, cannot underline too strongly: 

   "A year of practice showed that it is impossible to get a 100% result, so this 
   methodology requires something that can't be realized." (ID 47)

I would actually put that somewhat differently: it is indeed possible to design a 100% pass site, but this creature is *very* rare on the web out there. The 'margin of error' that Marijke asks for is, as we have seen in countless discussions, anathema to the purists among us who fear that accepting marginal errors is fraught with problems: that it carries too much subjective judgement, that it is the thin end of a wedge because it signals to site owners that they may neglect some aspects and still get away with it, etc. All this is true, of course, but the perennial issue remains: faced with generally good but less than perfect content, the evaluator has to decide whether to call it a pass (allowing for a margin of error without being explicit about it) or, judging strictly, call it a fail (often leading to the problem also noted by Marijke in ID 59 that "it is possible to get a lower score with a more accessible website, because the Success Criteria that are not met are less essential").

We have suggested that one solution to the problem of overly lenient vs. overly harsh judgement in the face of the typical "less than perfect" content out there is a different approach to rating (graded rating); see this paper: http://www.w3.org/WAI/RD/2011/metrics/paper7/

We believe that this approach leads to metrics that are less sensitive to individual testers' judgements because it allows for intermediate and thereby often more accurate ratings. This also speaks to Miranda Mafe's observation in ID 13:

   "Once two websites pass all the success criteria, they both have the same (highest) 
   score regardless of whether one is an extremely accessible website, while the other 
   has only done the bare minimum of what is necessary."

This is a rare case, but the sameness would not occur in a rating approach where one could distinguish between 'pass' and, say, 'near pass'. Much more frequent is the case where many small, low-impact issues fail many Success Criteria and therefore produce a worse result in terms of "SC passed" than a site with very significant or critical issues affecting fewer SCs (as noted by Marijke). In a graded rating approach, the low-impact issues might receive a 'near pass' which, aggregated, would weigh less than a few outright 'fail' judgements.
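
To make that concrete, here is a toy calculation - not the BITV-Test procedure; the two fictitious sites, the 0.8 weight for 'near pass' and the number of SCs are entirely my own assumptions - comparing a strict pass/fail tally with a graded one (Python):

# Toy comparison of binary pass/fail vs. graded aggregation per SC.
# Ratings per SC: 1.0 = pass, 0.8 = 'near pass' (minor issues), 0.0 = fail.
# All numbers are invented for illustration only.

def binary_score(ratings):
    # Share of SCs rated a strict pass (anything below 1.0 counts as a fail).
    return sum(1 for r in ratings if r == 1.0) / len(ratings)

def graded_score(ratings):
    # Mean of the graded ratings, so 'near pass' still earns partial credit.
    return sum(ratings) / len(ratings)

# Site A: many small, low-impact issues spread across 10 of 20 SCs.
site_a = [0.8] * 10 + [1.0] * 10
# Site B: only 3 SCs affected, but each with a critical failure.
site_b = [0.0] * 3 + [1.0] * 17

for name, ratings in [("Site A", site_a), ("Site B", site_b)]:
    print(name, "binary:", binary_score(ratings), "graded:", round(graded_score(ratings), 2))

# Binary: Site A 0.5 vs. Site B 0.85 -> the site with critical failures 'wins'.
# Graded: Site A 0.9 vs. Site B 0.85 -> the ranking flips.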

Another huge and completely unacknowledged problem in WCAG-EM is that evaluators get no guidance on how to approach the severe granularity problem of some Success Criteria, especially SC 1.3.1 "Info and Relationships" but also SC 1.1.1 "Non-text Content". The issue is that a large part of all accessibility problems typically found in evaluations belongs to these two umbrella Success Criteria.

WCAG-EM makes no attempt to explain how to separately address the numerous aspects to be looked at in an umbrella SC like 1.3.1 Info and Relationships. If some issues are found, as is likely on any moderately complex site, does that mean that SC 1.3.1 should be rated a fail altogether, despite, say, a perfect heading hierarchy, beautifully marked-up data tables, etc.? Technically, one paragraph indented with blockquote for presentational reasons would be enough to fail SC 1.3.1 for that page (see F43). This would be perverse, as many will agree; but if so, where exactly should evaluators draw the line in their ratings? The example illustrates that there is at least as much subjective judgement in using pass/fail as there is in using a graded approach.

Objectivity - a concept missed by Kerstin Probiesch in ID 73 - is only to be had through a method that spells out, for all (or most) foreseeable cases, how to rate a certain type of implementation a 'pass' or a 'fail', based on some consensus that takes impact into account. Is a skipped heading level, or an inconsistency in the hierarchy, enough to fail a page? Unless there is a procedure that spells out such common cases and advises on how they should be rated, I see little chance of evaluators miraculously arriving at the same pass/fail judgement. And I am not making these up as tricky edge cases for applying the methodology - these are bread-and-butter questions that come up in nearly every website evaluation.

As I see it, WCAG-EM helps with setting the scope and identifying the sample - so far, so good. Step 4, which covers the actual evaluation and rating of content, is just an empty shell.
The different evaluators and organisations engaged in EVAL TF will apply their own hands-on approaches. I am curious to see how closely their results will match.

Now some quick input to Shadi's questions:


> Please also note some of the highlighted questions for discussion:
> * Mobile websites and applications - How well do we address these?
As Step 4 refers to the available WCAG Techniques, which do not specifically cover mobile yet, it depends on the experience and procedures evaluators have at hand to test mobile content. Not much can be added within the current WCAG-EM design that would improve matters.

> * Aggregation score - pros, cons, and possible ways forward?
The scores described are implicit in the pass/fail ratings across the sample - they add no magic beyond the bare sums per page or per site. As the "unit-less coefficients" Jason calls them in ID 27, they work better than nothing. I don't mind all manner of warnings around them. If they were dropped, anyone could base their own statistics on the per-page and per-SC results, so it does not really matter whether they stay in.
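
For what it's worth, a minimal sketch of the kind of statistics anyone could derive themselves from the raw results (the data structure, page names and ratings below are hypothetical, just to show the arithmetic):

# Hypothetical raw results: per sampled page, a pass/fail rating per SC.
results = {
    "home":    {"1.1.1": "fail", "1.3.1": "fail", "2.4.4": "pass"},
    "contact": {"1.1.1": "pass", "1.3.1": "fail", "2.4.4": "pass"},
    "search":  {"1.1.1": "pass", "1.3.1": "pass", "2.4.4": "pass"},
}

pages = list(results)
criteria = sorted({sc for page in results.values() for sc in page})

# Per-page score: share of SCs passed on that page.
for page in pages:
    ratings = results[page]
    print("page", page, round(sum(r == "pass" for r in ratings.values()) / len(ratings), 2))

# Per-SC score: share of sampled pages on which that SC passed.
for sc in criteria:
    print("SC", sc, round(sum(results[p][sc] == "pass" for p in pages) / len(pages), 2))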

> * Sampling sizes - can we provide any guidance or indications?
Some people in the DoC warn that sample sizes get too large; others want to promote checking all pages if possible. In our experience, clients want a manageable number of pages for cost reasons and are not willing to pay for dozens. With higher numbers there is more redundancy and evaluator tedium; it is a case of diminishing returns. The risk for a methodology prescribing or leading to a very high number of pages in the sample is that it may be ignored; also, clients who would go for a test were it not overly expensive might be put off altogether, which would be a shame. The important thing about a11y testing is to involve clients so that they begin to see it as an aspect *they* want to take seriously. A lower barrier in terms of cost and sample size therefore has an advantage. I see it as a trade-off between keeping things manageable and low-cost, and risking that some issues may not be covered in an evaluation.
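
A toy model of the diminishing returns (the prevalence figures are pure assumptions on my part): if an issue type occurs on a share p of all pages, the chance that a random sample of n pages surfaces it at least once is 1 - (1 - p)^n.

# Toy model: chance of catching an issue type at least once in a sample of n pages,
# assuming it occurs independently on a share p of all pages. Numbers are invented.
prevalences = {"very common issue": 0.6, "occasional issue": 0.2, "rare issue": 0.05}

for n in (5, 10, 20, 40):
    coverage = {name: round(1 - (1 - p) ** n, 2) for name, p in prevalences.items()}
    print(n, coverage)

# Going from 5 to 10 pages helps noticeably; going from 20 to 40 adds little
# except for the rarest issues, at roughly double the cost.
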
> * Inter-linking - can we merge sections 2 and 3 to avoid overlap?
I see no comments that specifically address overlap?

> * Documentation - how can we improve the guidance we provide?
See my long rant above. I have understood that WCAG-EM on its own will not provide the granularity and advice needed to carry out an evaluation. It assumes that evaluators will already know how to check and how to rate. It is in that sense not self-contained. I am not sure myself if there is any meaningful place for more guidance on how to conduct the numerous checks that will make up an evaluation. 
> 
> Feel free to send comments and thoughts to the list.
> 
> Regards,
>  Shadi
> 
> -- 
> Shadi Abou-Zahra - http://www.w3.org/People/shadi/
> Activity Lead, W3C/WAI International Program Office
> Evaluation and Repair Tools Working Group (ERT WG)
> Research and Development Working Group (RDWG)

-- 
Detlev Fischer
testkreis - das Accessibility-Team von feld.wald.wiese
c/o feld.wald.wiese
Thedestraße 2
22767 Hamburg

Tel   +49 (0)40 439 10 68-3
Mobil +49 (0)1577 170 73 84
Fax   +49 (0)40 439 10 68-5

http://www.testkreis.de
Beratung, Tests und Schulungen für barrierefreie Websites
