Re: some initial questions from the previous thread from Detlev Fischer on 2011-08-24 (public-wai-evaltf@w3.org from August 2011)

From: Detlev Fischer <fischer@dias.de>
Date: Wed, 24 Aug 2011 13:30:01 +0200
To: public-wai-evaltf@w3.org
Message-ID: <4E54E0B9.9070402@dias.de>
On 24.08.2011 11:54, Shadi Abou-Zahra wrote:
> Hi Detlev,
>
> On 24.8.2011 10:43, Detlev Fischer wrote:
>> On 24.08.2011 02:11, Vivienne CONWAY wrote:
>>> Thanks Detlev.
>>>
>>> As I will be looking at these sites regularly (primarily with automated
>>> tools due to the number of sites), what is puzzling me most is how to
>>> actually score them. While I like the pass/fail/near approach for the
>>> site owner's use, to compare them I need percentages. Such as:
>>> P: 65%
>>> O: 80%
>>> U: 25%
>>> R: 90%
>>> Overall: 65%
>>>
>> Confronted with a large number of sites, one solution is of course to use
>> automated tools. The problem, as we all know, is that many serious
>> problems are not caught by them, and in turn, the percentage values
>> suggest a kind of accuracy that is not really backed by the full
>> evidence but simply based on just those issues amenable to automatic
>> testing.
>>
>> We know many people are quite happy when they get a nice score or chart,
>> they don't understand or even want to know how shaky these may be. I
>> just think that *if* you can influence *how* some aggregate score is
>> computed, the guiding principle would be for it to reflect the actual
>> difficulty people have in accessing a site.
>>
>> Maybe it's a corny example but let's compare a site to a car. You can
>> make all sorts of checks, engine and brakes and mirrors and body and
>> interior are all fine, but if one critical thing (steering,
>> transmission) fails, you can't use the car. (Think of visual CAPTCHA for
>> blind people, or keyboard trap for keyboard users). The risk is that
>> *any* aggregation of scores, even weighted scores, will 'drown' critical
>> failures.
>>
>> So I believe that if there is no time for detailed testing, testing for
>> critical failures is still more relevant than creating an automated test
>> score. If you have both, even better.
>>
>> If it was indeed possible to decide on a limited number of
>> evidence-based critical failures (say, 10), i.e., frequently observed
>> aspects without which a site would be unusable or very hard to use for
>> some populations, you could probably also compile a numeric outcome,
>> summing up 1 or 0 per issue covered. In this case however, if you'd have
>> less than 10 out of 10, it's time for service... It's rough and simple,
>> but that reflects the coarseness of approach and seems therefore
>> adequate.
>
> Interesting thought.
>
> Besides the difficulty of defining "critical failures", could this in
> the long run lead to developers only aiming to fulfill these minimum
> issues and leaving out other important ones?

Yes, I think that is a risk. How much will developers adapt to an
ecosystem where only critical failures are noted? However, if a
full-scale evaluation must be ruled out due to time or humanpower
constraints, a meaningful 'quick check' is still better than one that
misses critical issues. (BTW, developers may also adapt to an automatic
testing ecosystem: ensure stuff validates, alt is present even if
meaningless, headings are nested without gaps even if counter-intuitive,
etc.)
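
A minimal sketch of such a quick check in Python, assuming a hypothetical set of ten critical-failure checks. The check names are illustrative labels for the improvised list quoted further down, not an agreed set. Each check contributes 1 or 0, and anything below the full count means the site needs attention:

```python
# Hypothetical check names: the set of critical failures is an assumption,
# loosely following the improvised list quoted later in this message.
CRITICAL_CHECKS = [
    "keyboard_accessible",          # SC 2.1.1, 2.1.2
    "important_images_have_alt",    # SC 1.1.1
    "captcha_has_alternative",      # SC 1.1.1
    "videos_have_captions",         # SC 1.2.2, 1.2.4
    "text_contrast_sufficient",     # SC 1.4.3
    "focus_visible",                # SC 2.4.7
    "background_controls_have_text",  # SC 1.1.1
    "important_fields_labelled",    # SC 2.4.6
    "content_structured",           # SC 1.3.1
    "animation_can_be_stopped",     # SC 2.2.1, 2.2.2
]


def quick_check(results):
    """Sum 1 or 0 per critical check.

    `results` maps each check name to True (no critical failure found)
    or False (critical failure found). Returns the score and the list
    of failed checks, in the order of CRITICAL_CHECKS.
    """
    missing = [c for c in CRITICAL_CHECKS if c not in results]
    if missing:
        raise ValueError(f"no result recorded for: {missing}")
    failed = [c for c in CRITICAL_CHECKS if not results[c]]
    score = len(CRITICAL_CHECKS) - len(failed)
    return score, failed


if __name__ == "__main__":
    results = {check: True for check in CRITICAL_CHECKS}
    results["focus_visible"] = False
    score, failed = quick_check(results)
    print(f"{score} out of {len(CRITICAL_CHECKS)}; failed: {failed}")
```

The list of failed checks is returned alongside the score on purpose: the number alone says that something is wrong, while the names say what to fix.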
>
> Can the set of "critical failures" be defined to be WCAG 2.0 Level A?

Well, it would be nice to use Level A and on the whole, it may work. It 
would also be better because it does not create additional schemes 
beyond or above WCAG. However, it appears to me that some A-level 
criteria such as 1.4.1 Use of Color are in practice usually not *that* 
critical, while some AA-Level criteria are absolutely critical, such as 
2.4.7 Focus Visible for sighted keyboard users. (Of course, as in other 
cases, workarounds often exist: users may choose another UA or an add-on 
that highlights focus regardless of CSS rules; CAPTCHAS can be solved 
with Webvisum, etc.)
>
>
>> On another note, I also wonder what people would do with the break-down
>> of percentages in the POUR schema. On some level, the results should be
>> actionable and to be so, it would be nice to be able to point to the
>> most glaring problems. Maybe it is more useful to know what section of
>> the population with disabilities is badly served. So that would suggest a
>> scheme where you group things by
>>
>> M: Motor impairment
>> B: Blindness
>> V: Visual impairment
>> H: Hearing impairment
>> C: Cognitive impairment
>>
>> Success criteria may then be allocated to M,B,V,H,C and you would have
>> many double allocations, for example, between M and B.
>>
>> Maybe critical failures could be allocated to populations (incl. double
>> counts). I just run through the improvised list of 10 critical failures
>> I made up earlier and add up:
>>
>> M: 4
>> B: 6
>> V: 4
>> H: 2
>> C: 2
>>
>> Whether that kind of result would be more meaningful I am not really
>> sure about. At least if one of the groups is really badly served, their
>> associations and interest groups would have better evidence when they
>> campaign for improvements.
>
> I always feel uncomfortable using "categories of people" as it risks
> missing individuals with less frequent profiles (type of disability), or
> forcing individuals into performa categories.

Yes, I see what you mean. Maybe there are other ways to organise by type 
of impairment that don't equate impairment with a particular group of 
people, like a filter metaphor that would also do justice to multiple 
impairments.
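
A rough sketch of what such a filter could look like in Python. The failure names and their impairment tags are hypothetical and only illustrate the idea: failures are tagged with types of impairment rather than groups of people, and a filter may combine several types at once:

```python
# Hypothetical tagging of failures with impairment types (an assumption
# for illustration, not an agreed allocation of success criteria).
FAILURE_TAGS = {
    "keyboard_trap": {"motor", "blindness"},
    "captcha_without_alternative": {"blindness", "visual"},
    "no_captions": {"hearing"},
    "low_text_contrast": {"visual"},
}


def failures_affecting(observed, impairments):
    """Return the observed failures tagged with at least one of the
    given impairment types, sorted by name.

    Passing several types models a user with multiple impairments:
    a failure counts once even if it matches more than one type.
    """
    wanted = set(impairments)
    return sorted(
        failure
        for failure in set(observed)
        if FAILURE_TAGS.get(failure, set()) & wanted
    )


if __name__ == "__main__":
    observed = ["keyboard_trap", "no_captions", "low_text_contrast"]
    print(failures_affecting(observed, ["motor"]))
    print(failures_affecting(observed, ["hearing", "blindness"]))
```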

> It also risks user
> representations campaigning against each other rather than together.
Is there evidence for this in the past? I doubt that a differentiated 
presentation would invite groups to campaign against each other. The 
criteria often are a common interest, sometimes complementary, and 
rarely the stuff of conflict.


>
> Best,
> Shadi
>
>
>> Regards, Detlev
>>
>>> Problem is, how to work out that percentage. I could use number of
>>> violations/number of pages checked. However this does not weight the
>>> more critical errors - like the ones you cited. I could work out
>>> some kind of algorithm where violations of the critical issues were
>>> say 1.5:1, items such as non-critical validation errors were .5:1
>>> or something similar. Thoughts?
>>>
>>>
>>> Regards
>>>
>>> Vivienne L. Conway
>>> ________________________________________
>>> From: public-wai-evaltf-request@w3.org
>>> [public-wai-evaltf-request@w3.org] On Behalf Of fischer@dias.de
>>> [fischer@dias.de]
>>> Sent: Tuesday, 23 August 2011 11:04 PM
>>> To: public-wai-evaltf@w3.org
>>> Subject: RE: some initial questions from the previous thread
>>>
>>> Quoting Vivienne CONWAY<v.conway@ecu.edu.au>:
>>>
>>>> Hi all
>>>> Just thought I'd weigh in on this one as I'm currently puzzling over
>>>> the issue of how to score websites. I'm just about to start a
>>>> research project where I'll have over 100 websites assessed monthly
>>>> over a period of 2 + years.
>>>
>>> If you will be doing this on your own or without a team, this work
>>> programme translates to checking more than 4-5 sites per day! And if
>>> the compliance level is AA you probably need to focus on some key
>>> requirements, especially those where a failure would make a site
>>> completely inaccessible to some population. Just looking at WCAG
>>> success criteria, these may be the ones which most often exclude
>>> people, ordered by importance from testing experience (feel free to
>>> disagree):
>>>
>>> * Lack of keyboard accessibility (SC 2.1.1, 2.1.2)
>>> * Important images like controls without alt text (1.1.1)
>>> * CAPTCHAs w/o alternative (SC 1.1.1)
>>> * Lack of captions in videos (SC 1.2.2, 1.2.4)
>>> * Really low contrast of text (SC 1.4.3)
>>> * Bad or no visibility of focus (SC 2.4.7)
>>> * Important controls implemented as background image without text
>>> replacement (SC 1.1.1)
>>> * Important fields (such as search text input) w/o labels (SC 2.4.6)
>>> * Lack of structure (e.g. no or inconsistent headings) (SC 1.3.1)
>>> * Self-starting / unstoppable animation, carousels, etc (SC 2.2.1,
>>> 2.2.2)
>>>
>>> Well, having written this, it may seem a bit arbitrary - but I believe
>>> the list has many or most of the grave errors that we encounter in
>>> testing.
>>>
>>> If there was a statistic on "show stoppers", things that make sites
>>> inaccessible or impede access severely, such an approach would have a
>>> better basis, of course...
>>>
>>> Just my 2 cents,
>>> Detlev
>>>
>>>
>>> ) that can be tested relatively quickly and without going into too
>>> much detail.
>>>
>>> I think as long as the method is transparent, / documented and its
>>> limitations are clearly stated, the results can still be valuable. I
>>> need to come up with a scoring method
>>>> (preferably a percentage) due to the need to compare a website
>>>> within those of its own classification (e.g. federal government,
>>>> corporate, etc), and compare the different classifications. I am
>>>> thinking of a method where the website gets a percentage score for
>>>> each of the POUR principles, and then an overall score. What I'm
>>>> struggling with is what scoring method to use and how to put
>>>> different weights upon different aspects and at different levels.
>>>> I'll be assessing to WCAG 2.0 AA (as that's the Australian
>>>> standard). All input and suggestions are gratefully accepted and
>>>> may also be useful to our discussions here as it's a real-life
>>>> situation for me. It also relates to many of the questions raised in
>>>> this thread by Shadi. Looking forward to some interesting discussion.
>>>>
>>>>
>>>> Regards
>>>>
>>>> Vivienne L. Conway
>>>> ________________________________________
>>>> From: public-wai-evaltf-request@w3.org
>>>> [public-wai-evaltf-request@w3.org] On Behalf Of Shadi Abou-Zahra
>>>> [shadi@w3.org]
>>>> Sent: Monday, 22 August 2011 7:34 PM
>>>> To: Eval TF
>>>> Subject: some initial questions from the previous thread
>>>>
>>>> Dear Eval TF,
>>>>
>>>> From the recent thread on the construction of WCAG 2.0 Techniques, here
>>>> are some questions to think about:
>>>>
>>>> * Is the "evaluation methodology" expected to be carried out by one
>>>> person or by a group of more than one persons?
>>>>
>>>> * What is the expected level of expertise (in accessibility, in web
>>>> technologies etc) of persons carrying out an evaluation?
>>>>
>>>> * Is the involvement of people with disabilities a necessary part of
>>>> carrying out an evaluation versus an improvement of the quality?
>>>>
>>>> * Are the individual test results binary (ie pass/fail) or a score
>>>> (discrete value, ratio, etc)?
>>>>
>>>> * How are these test results aggregated into an overall score (plain
>>>> count, weighted count, heuristics, etc)?
>>>>
>>>> * Is it useful to have a "confidence score" for the tests (for example
>>>> depending on the degree of subjectivity or "difficulty")?
>>>>
>>>> * Is it useful to have a "confidence score" for the aggregated result
>>>> (depending on how the evaluation is carried out)?
>>>>
>>>>
>>>> Feel free to chime in if you have particular thoughts on any of these.
>>>>
>>>> Best,
>>>> Shadi
>>>>
>>>> --
>>>> Shadi Abou-Zahra - http://www.w3.org/People/shadi/
>>>> Activity Lead, W3C/WAI International Program Office
>>>> Evaluation and Repair Tools Working Group (ERT WG)
>>>> Research and Development Working Group (RDWG)
>>>>
>>>> This e-mail is confidential. If you are not the intended recipient
>>>> you must not disclose or use the information contained within. If
>>>> you have received it in error please return it to the sender via
>>>> reply e-mail and delete any record of it from your system. The
>>>> information contained within is not the opinion of Edith Cowan
>>>> University in general and the University accepts no liability for
>>>> the accuracy of the information provided.
>>>>
>>>> CRICOS IPC 00279B
>>>>
>>>>
>>>
>>>
>>
>>
>


-- 
---------------------------------------------------------------
Detlev Fischer PhD
DIAS GmbH - Daten, Informationssysteme und Analysen im Sozialen
Management: Thomas Lilienthal, Michael Zapp

Phone: +49-40-43 18 75-25
Mobile: +49-157 7-170 73 84
Fax: +49-40-43 18 75-19
E-Mail: fischer@dias.de

Address: Schulterblatt 36, D-20357 Hamburg
Hamburg District Court (Amtsgericht Hamburg) HRB 58 167
Managing directors: Thomas Lilienthal, Michael Zapp
---------------------------------------------------------------
Received on Wednesday, 24 August 2011 11:30:25 UTC