
Re: some comments/questions on techniques instructions document for submitters

From: Detlev Fischer <fischer@dias.de>
Date: Mon, 22 Aug 2011 12:28:17 +0200
Message-ID: <4E522F41.6050301@dias.de>
To: public-wai-evaltf@w3.org
Hi Eval TF list, Denis and Léonie,

Looking at what Denis and Léonie have described (see below), I think 
there may not be so much disagreement regarding the atomic level of 
assessment. I was not suggesting a high-level approach that has a tester 
just cursorily glance at a page and pick out a few things he or she 
notices. It is much more thorough than that, and I entirely agree that 
it has to be.

What I was suggesting when talking about aggregating atomic tests is 
that testing (in our case) will be guided by the instances where a 
particular SC is applicable. It will process the atomic requirements 
not mechanically, but in conjunction, along those instances.

I hope it is OK to explain a little more what I mean.

Our web-based testing application starts with page sample selection. I 
will pass over the details of this for now (in particular, the different 
dynamic states of pages that need to be covered complicate the matter).

Let's again take SC 1.1.1 as an example. While in our test procedure SC 
1.1.1 is divided into four individual checkpoints, the practical 
starting point will be the list of images displayed via the 'list 
images' function of the Web Accessibility Toolbar (WAT).

For each image, checks would then move between the WAT list display and 
the page context and apply atomic checks up to the point where an image 
instance is 'saturated'. Atomic checks are, for example:

* Is the alt attribute missing altogether? (F65)
* If the alt attribute exists and is empty:
   - is the image merely decorative? (F39)
   - alternatively, is the image with empty alt part of a link with
     link text that sufficiently describes the destination? (H2)
   - or is it part of a group of images where one alt text does the
     job for the rest? (G196)
* Is the alt text adequate for the content or function, including long
   descriptions in the image context (we find longdesc rarely used in
   practice)? (F30, F67, G95, G94, G92, G100)
* If images change (e.g. in animated merry-go-rounds or to reflect
   state changes) does the alt text change too? (F20)
* If images do not appear in the WAT listing (i.e. they are background
   images), do they convey important information? (F3)
   - If so, do background images replace text, and is that text
     accessible (i.e. not hidden with display: none) and adequate?

The point is that not *all* the atomic tests need to be applied to each 
image since one (or a few of them) may progressively saturate the 
instance. The order in which atomic checks are applied is suggested by 
the instance itself and runs until saturation. That way, we are not 
"forgetting things along the way" as Denis feared.

If, for example, an alt text is present, there is of course no need for 
atomic checks for missing or empty alt text. If we then find that the 
alt text of an image describes the image content, we still need to look 
at the page context. If the image acts as a teaser and links to another 
page with no further link text in the same link, we now know that the 
alt text, while adequate for the image content, still does not reflect 
the function. This will then be noted as a deficiency in our comments 
field (one per page and checkpoint) and will be reflected in the rating 
(one rating per checkpoint and page, on a range of five steps from a 
clear 'pass' to a clear 'fail').

So here, in the midst of the evaluation, is the place where the 
intelligence and judgement of the tester come in. A page with a number 
of images will often not be perfect in the sense that every single image 
has the perfect alt text. Common sense dictates that a page where we 
find a single issue with, say, the alt text of just one out of a number 
of teaser images will, on the aggregated page level, not be ranked as 
badly as, say, a page with a main navigation menu made up of images, all 
with empty alt attributes.

This is why, on the level of aggregated ranking, our test procedure 
allows for a differentiated score:
* pass
* mostly compliant
* partly compliant
* mostly not compliant
* fail

- and, for critical failures, the option of marking down the entire 
site as 'badly accessible'.
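As a toy illustration of this aggregation, the sketch below maps 
per-checkpoint ratings onto the five-step scale and applies the 
critical-failure override. The simple averaging is my own assumption 
for the example; the actual BITV-Test weighting is described at the 
link below, not here:

```python
# Hypothetical aggregation sketch; names and the averaging rule are
# illustrative, not the actual BITV-Test formula.

RATINGS = ["fail", "mostly not compliant", "partly compliant",
           "mostly compliant", "pass"]

def site_verdict(ratings, critical_failure=False):
    """ratings: list of indices into RATINGS (0 = fail .. 4 = pass)."""
    if critical_failure:
        # A critical failure marks the whole site down, regardless
        # of how well the other checkpoints scored.
        return "badly accessible"
    avg = sum(ratings) / len(ratings)
    return RATINGS[round(avg)]

print(site_verdict([4, 4, 3, 2]))                      # one weak page
print(site_verdict([4, 4, 4], critical_failure=True))  # override wins
```

This captures the intent that a single imperfect alt text should pull 
an otherwise good result down only slightly, while a critical barrier 
dominates everything else.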

The procedure is described in more detail here:
http://www.bitvtest.eu/bitv_test/intro/overview.html

The comments recorded for each test step per page provide the rationale 
for the chosen ranking. (By the way, the evaluation report is 
automatically generated as a PDF at the end.)

What the report will *not* contain is an exhaustive list of each checked 
instance on the page and its corresponding ranking. We believe such 
extensive documentation would be overwhelming and of little use to 
customers.

Of course, there can be other methods of arriving at an aggregate 
ranking. I believe UWEM 1.0 did an instance-by-instance ranking (just 
pass or fail, no intermediate steps) and calculated the overall score 
per page from that. The issue we would have with this is that the 
practical impact of a failed instance ranges from utterly critical to 
negligible, and a sensible aggregate score must reflect this impact. And 
then, there are cases (alt texts are a good example) where the solution 
is not ideal, but neither a clear fail nor a clear pass.
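A toy comparison makes the objection to plain instance counting 
concrete. Everything here is invented for illustration (the severity 
numbers especially); it only shows why the same failure count can 
deserve very different aggregate ratings:

```python
# Entirely illustrative: instance-by-instance pass rate versus a
# severity-weighted score for the same set of checked instances.

instances = [
    # (passed, severity: 0.0 = negligible .. 1.0 = critical)
    (False, 1.0),  # e.g. missing alt on a main navigation image
    (True,  0.0),
    (True,  0.0),
    (True,  0.0),  # e.g. three fine teaser images
]

# UWEM-style: every instance counts the same.
pass_rate = sum(1 for ok, _ in instances if ok) / len(instances)

# Severity-weighted: a failure costs its severity, a pass costs nothing.
penalty = sum(sev for ok, sev in instances if not ok)
weighted = max(0.0, 1.0 - penalty)

print(pass_rate)  # looks mostly fine
print(weighted)   # the critical navigation failure dominates
```

With one critical failure among four instances, the plain pass rate is 
0.75 and looks acceptable, while the weighted score collapses to 0.0, 
which is much closer to the practical impact on users.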

So human judgement is indispensable here. To mitigate the potential for 
errors and oversights, our final compliance tests always involve two 
independent testers who go through an arbitration phase at the end, 
arriving at a consensus result.

So much for now,
Detlev

On 21.08.2011 20:25, Léonie Watson wrote:
> Denis Boudreau wrote:
> "While I agree with Detlev on some level, I do not believe we can be thorough and confident to cover all the related techniques and failures associated to a specific success criterion while auditing if we do not go down to that atomic level."
>
> 	Unless an assessment goes down to that atomic level, I believe it's vulnerable to inconsistency. High-level evaluations have a wide margin for interpretation. It must be possible to apply the methodology consistently (and easily).
>
> Denis Boudreau wrote:
> "It may look like a lot of tests at first, but it turns out that it's not so bad because we never audit every page on a website, but rather pick a set of representative pages based on various templates. So in the worst cases, we rarely end up with more than 12 pages to audit."
>
> 	We do something very similar. We manually evaluate the representative sample of pages using the atomic tests, then run higher level automated tests across the whole of the website (or at least a thousand pages). We then follow up with a more heuristic evaluation using different access technologies.
>
> Denis Boudreau wrote:
> "For us at least, web accessibility auditing is always at least a two-phase process: a first assessment of what's out there and another one, after the recommendations have been put in place, to see how well the developers did. And so we also came to realize that the best way to ensure people would fix all pages and not only the ones that were audited was simply to retain, on the second round of evaluation, about 60% of the pages that were first audited and then go pick a few new ones just to see if they measure up with the ones that were fixed."
>
> 	Again, this is remarkably similar to our process. We don't usually take the approach of selecting a few new pages when it comes to the retest, but it's a brilliant idea!
>
> Denis Boudreau wrote:
> "All in all, we usually plan about two hours per page audited, screen reader testing included. We feel an audit cannot be considered complete without combining both those checklists and user testing. So a 10 page evaluation would require anywhere between say, 15 to 20 hours of work per round. I'm very curious/interested to compare these numbers with what you folks currently do."
>
> 	We tend to work in days rather than hours, but our estimates come out about the same in the end, I think. The biggest challenge for us is writing up the results into a meaningful report. Finding the balance between informative and information overload is often quite troublesome.
>
> Léonie.
>
>
>
> -----Original Message-----
> From: public-wai-evaltf-request@w3.org [mailto:public-wai-evaltf-request@w3.org] On Behalf Of Denis Boudreau
> Sent: 20 August 2011 15:02
> To: Eval TF
> Cc: WCAG WG
> Subject: Re: some comments/questions on techniques instructions document for submitters
>
> Good morning everyone,
>
> I guess this is a good opportunity to dive right into the Eval TF work and share a bit of our experience with methodology.
>
> While I agree with Detlev on some level, I do not believe we can be thorough and confident in covering all the related techniques and failures associated with a specific success criterion while auditing if we do not go down to that atomic level. Grouping different elements together to limit the number of tests will make it easy on the auditor, no doubt about that, but in my humble opinion, it would naturally lead to forgetting things along the way.
>
> The example of SC 1.1.1 is great because of the quantity of elements to look for, and so would be 1.3.1. When there are so many things to look out for, it's easy to either forget one or feel overwhelmed by the quantity. But on the other hand, this is just the reality of accessibility testing.
>
> When we do WCAG 2.0 assessment work at the office, we go over a series of 170 atomic tests for all 61 SC, divided like so:
>
> * 105 tests for WCAG 2.0 A
> * 27 tests for WCAG 2.0 AA (for a total of 132 tests for lvl A and AA)
> * 38 tests for WCAG 2.0 AAA (for a total of 170 tests for all three levels of conformance)
>
> This means that we've broken down each and every criterion into a list of things to look out for. Those checklists come from either the techniques and failures, or from experience encountering accessibility barriers using various assistive technologies. For example, for SC 1.1.1 alone, we end up with 24 individual tests. Some of them are made using various browser extensions in IE or Firefox, but a significant number have to be verified manually (SC 1.4.3 for images naturally comes to mind, as would SC 1.4.8 or SC 2.1.2, for instance).
>
> It may look like a lot of tests at first, but it turns out that it's not so bad because we never audit every page on a website, but rather pick a set of representative pages based on various templates. So in the worst cases, we rarely end up with more than 12 pages to audit. This selection is usually built up with:
>
> * the homepage
> * various section level homepages
> * various inside pages that present a lot of diverse content (headings, lists, paragraphs and so on)
> * at least one page containing a reasonably sized form (if any)
> * at least one page containing a reasonably sized data table (if any)
> * the site map
>
> With time, we came to realize that doing more was unnecessary, because what people need is not a site-wide diagnosis of their website's accessibility, but rather some recommendations as to how to improve what's already there. By insisting on a limited set of representative pages and making sure the developers apply the proper corrections across all pages, we can get to pretty satisfying results without having to resort to full-blown auditing, which would require an insanely huge amount of time on our part, not to mention sky-rocketing costs.
>
> For us at least, web accessibility auditing is always at least a two-phase process: a first assessment of what's out there and another one, after the recommendations have been put in place, to see how well the developers did. And so we also came to realize that the best way to ensure people would fix all pages and not only the ones that were audited was simply to retain, on the second round of evaluation, about 60% of the pages that were first audited and then go pick a few new ones just to see if they measure up with the ones that were fixed.
>
> All in all, we usually plan about two hours per page audited, screen reader testing included. We feel an audit cannot be considered complete without combining both those checklists and user testing. So a 10 page evaluation would require anywhere between say, 15 to 20 hours of work per round. I'm very curious/interested to compare these numbers with what you folks currently do.
>
> Anyway, that's it for a first message to this list; it's already long enough.
>
> Best,
>
> --
> Denis Boudreau, président
> Coopérative AccessibilitéWeb
> 1751 rue Richardson, bureau 6111
> Montréal (Qc), Canada H3K 1G6
> Téléphone : +1 877.315.5550
>
> ----------------------------------------------------
> |	** a11yMTL 2011 - only 6 days to go! **	|
> |	* All the details at www.a11ymtl.org *	|
> ----------------------------------------------------
>
>
>
>
> On 2011-08-20, at 5:01 AM, Shadi Abou-Zahra wrote:
>
>> Dear Tim, Detlev,
>>
>> On 19.8.2011 19:50, Boland Jr, Frederick E. wrote:
>>> Thanks for your insightful comments.  I think they are worthy of serious consideration.
>>> My thoughts as you suggest were just as an input or starting point to
>>> further discussion on this topic.  Perhaps as part of the work of the
>>> EVAL TF we can come up with principles or characteristics of how an evaluation should be performed.
>>
>> Yes, I agree that this is a useful discussion to have in Eval TF, and bring back consolidated suggestions to WCAG WG.
>>
>>
>>> Thanks and best wishes
>>> Tim Boland NIST
>>>
>>> PS - is it OK to post this discussion to the EVAL TF mailing list (it
>>> might be useful  information for the members of the TF)?
>>
>> Yes it is. I have CC'ed Eval TF.
>>
>> Best,
>>   Shadi
>>
>>
>>> -----Original Message-----
>>> From: w3c-wai-gl-request@w3.org [mailto:w3c-wai-gl-request@w3.org] On
>>> Behalf Of Detlev Fischer
>>> Sent: Friday, August 19, 2011 12:14 PM
>>> To: w3c-wai-gl@w3.org
>>> Subject: Re: some comments/questions on techniques instructions
>>> document for submitters
>>>
>>> Hi Tim Boland,
>>>
>>> EVAL TF has just started so I went back to the level of atomic tests
>>> to see what their role might be in a practical accessibility
>>> evaluation approach.
>>>
>>>    Atomic tests limited to a specific technique are certainly useful
>>> as a heuristic for implementers of such a technique to check whether
>>> they have implemented it correctly, and the points in the techniques
>>> instructions as well as your points on writing a 'good test' are
>>> therefore certainly valid on this level.
>>>
>>> However, any evaluation procedure checking conformance of content to
>>> particular SC criteria needs to consider quite a number of techniques
>>> in conjunction. The 'complication' you mention can be avoided on the
>>> level of technique, not any longer on the level of SC.
>>>
>>> Stating conformance to a particular SC might involve a large number
>>> of techniques and failures, some applied alternatively, others in
>>> conjunction. For example, checking for compliance of all page content
>>> to SC 1.1.1 (Non-Text Content), any of the following 15 techniques
>>> and failures might be relevant: G95, G94, G100, G92, G74, G73, G196,
>>> H37, H67, H45, F67, F3, F20, F39, F65. And this does not even include
>>> the techniques which provide accessible text replacements for background images.
>>>
>>> My belief is that in *practical terms*, concatenating a large number
>>> of partly interrelated atomic tests to arrive at a SC conformance
>>> judgement is just not a practical approach for human evaluation. If
>>> we want a *usable*, i.e., manageable procedure for a human tester to
>>> check whether the images on a page have proper alternative text, what
>>> *actually* happens is something more like a pattern matching of
>>> known (recognized) failures:
>>>
>>> * Display all images together with alt text (and, where available,
>>> href)
>>> * Scan for instances of known failures - this also requires
>>>     checking the image context for cases like G74 and G196
>>> * Render page with custom colours (images now disappear) and check
>>>     whether text replacements for background images are displayed
>>>
>>> Moreover, if the *severity* of failure needs to be reflected in the
>>> conformance claim or associated tolerance metrics, then the failure
>>> to provide alt text for a main navigation item or graphical submit
>>> button must not be treated the same way as the failure to provide alt
>>> on some supporter's logo in the footer of the page.
>>>
>>> My point is that while I am all for precision, the requirements for a
>>> rather complex integrated human assessment of a multitude of
>>> techniques and failures practically rule out an atomic approach where
>>> each applicable test of each applicable technique is carried out
>>> sequentially along the steps provided and then processed according to
>>> the logical concatenation of techniques given in the "How to meet"
>>> document. It simply would be far too cumbersome.
>>>
>>> I realise that you have not maintained that evaluation should be done
>>> that way - I just took your thoughts as a starting point. We have
>>> only just started with the EVAL task force work - I am curious what
>>> solutions we will arrive at to ensure rigor and mappability while
>>> still coming up with a manageable, doable approach.
>>>
>>> Regards,
>>> Detlev
>>>
>>> On 05.08.2011 16:28, Boland Jr, Frederick E. wrote:
>>>> For
>>>>
>>>> http://www.w3.org/WAI/GL/wiki/Technique_Instructions
>>>>
>>>> General Comments:
>>>>
>>>> Under "Tests" should there be guidance on limiting the number of
>>>> steps in a testing procedure (not making tests too involved)?
>>>>
>>>> (this gets to "what makes a good test"?)
>>>>
>>>> In .. http://www.w3.org/QA/WG/2005/01/test-faq#good
>>>>
>>>> "A good test is:
>>>>
>>>>    * Mappable to the specification (you must know what portion of the
>>>>      specification it tests)
>>>>    * Atomic (tests a single feature rather than multiple features)
>>>>    * Self-documenting (explains what it is testing and what output it
>>>>      expects)
>>>>    * Focused on the technology under test rather than on ancillary
>>>>      technologies
>>>>    * Correct "
>>>>
>>>> Does the information under "Tests" clearly convey information in
>>>> these items to potential submitters?
>>>>
>>>> Furthermore, do we want to have some language somewhere in the
>>>> instructions that submitted techniques should not be too "complicated"
>>>> (should just demonstrate simple features or atomic actions if possible)?
>>>>
>>>> Editorial Comments:
>>>>
>>>> under "Techniques Writeup Checklist "UW2" should be expanded to
>>>> "Understanding WCAG2"
>>>>
>>>> 3rd bullet under "applicability" has lots of typos.
>>>>
>>>> Thanks and best wishes
>>>>
>>>> Tim Boland NIST
>>>>
>>>
>>>
>>
>> --
>> Shadi Abou-Zahra - http://www.w3.org/People/shadi/ Activity Lead,
>> W3C/WAI International Program Office Evaluation and Repair Tools
>> Working Group (ERT WG) Research and Development Working Group (RDWG)
>>
>
>
>


-- 
---------------------------------------------------------------
Detlev Fischer PhD
DIAS GmbH - Daten, Informationssysteme und Analysen im Sozialen
Management: Thomas Lilienthal, Michael Zapp

Phone: +49-40-43 18 75-25
Mobile: +49-157 7-170 73 84
Fax: +49-40-43 18 75-19
E-Mail: fischer@dias.de

Address: Schulterblatt 36, D-20357 Hamburg
Amtsgericht Hamburg HRB 58 167
Managing directors: Thomas Lilienthal, Michael Zapp
---------------------------------------------------------------
Received on Monday, 22 August 2011 10:30:09 GMT
