Re: some initial questions from the previous thread

Am 24.08.2011 02:11, schrieb Vivienne CONWAY:
> Thanks Detlev.
>
> As I will be looking at these sites regularly (primarily with automated
> tools due to the number of sites), what is puzzling me most is how to
> actually score them.  While I like the pass/fail/near approach for the
> site owner's use, to compare them I need percentages.  Such as:
> P: 65%
> O: 80%
> U: 25%
> R: 90%
> Overall: 65%
>
Confronted with a large number of sites, one solution is of course to use 
automated tools. The problem, as we all know, is that many serious 
problems are not caught by them; in turn, the percentage values 
suggest a kind of accuracy that is not really backed by the full 
evidence but is based only on those issues amenable to automated 
testing.

We know many people are quite happy when they get a nice score or chart; 
they don't understand, or even want to know, how shaky it may be. I 
just think that *if* you can influence *how* an aggregate score is 
computed, the guiding principle should be that it reflects the actual 
difficulty people have in accessing a site.

Maybe it's a corny example, but let's compare a site to a car. You can 
make all sorts of checks: engine, brakes, mirrors, body and 
interior are all fine, but if one critical thing (steering, 
transmission) fails, you can't use the car. (Think of a visual CAPTCHA 
for blind people, or a keyboard trap for keyboard users.) The risk is 
that *any* aggregation of scores, even weighted scores, will 'drown' 
critical failures.
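
To make this concrete, here is a minimal sketch in Python (check counts 
and weights are invented for illustration, not taken from any real test 
run) of how a weighted aggregate can hide a single show-stopper:

    # 19 minor checks pass, one critical check (keyboard access) fails.
    results = [1] * 19 + [0]   # pass = 1, fail = 0
    weights = [1] * 19 + [3]   # even with a 3x weight on the critical check...

    score = 100 * sum(r*w for r, w in zip(results, weights)) / sum(weights)
    print(f"{score:.0f}%")     # -> 86%: looks decent, yet keyboard users
                               #    cannot use the site at all

However you tune the weights, the one failure that makes the site 
unusable never pulls the aggregate anywhere near zero.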

So I believe that if there is no time for detailed testing, testing for 
critical failures is still more relevant than creating an automated test 
score. If you have both, even better.

If it were indeed possible to decide on a limited number of 
evidence-based critical failures (say, 10), i.e. frequently observed 
issues without which a site would be unusable or very hard to use for 
some populations, you could probably also compile a numeric outcome by 
summing 1 or 0 per issue covered. In that case, however, anything less 
than 10 out of 10 means it's time for service... It's rough and simple, 
but that reflects the coarseness of the approach and therefore seems 
adequate.
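
As a rough sketch of that tally (in Python, with the ten issues written 
as shorthand identifiers for the improvised list from my earlier message):

    # One point per critical failure a site manages to avoid.
    CRITICAL_FAILURES = [
        "keyboard_inaccessible", "controls_without_alt_text",
        "captcha_without_alternative", "videos_without_captions",
        "very_low_contrast", "no_visible_focus",
        "background_image_controls_without_text", "unlabelled_fields",
        "missing_structure", "unstoppable_animation",
    ]

    def critical_score(found):
        """Return x out of 10; anything below 10 means 'time for service'."""
        return sum(1 for issue in CRITICAL_FAILURES if issue not in found)

    print(critical_score({"no_visible_focus", "very_low_contrast"}))  # -> 8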

On another note, I also wonder what people would do with the breakdown 
of percentages in the POUR schema. On some level, the results should be 
actionable, and for that it would be nice to be able to point to the 
most glaring problems. Maybe it is more useful to know which section of 
the population with disabilities is badly served. That would suggest a 
scheme where you group things by:

M: Motor impairment
B: Blindness
V: Visual impairment
H: Hearing impairment
C: Cognitive impairment

Success criteria may then be allocated to M, B, V, H and C, and you 
would have many double allocations, for example between M and B.
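
As a data structure, such an allocation might look like this (the 
assignments below are only illustrative guesses, not a worked-out mapping):

    # Illustrative allocation of WCAG success criteria to affected groups;
    # double allocations (e.g. 2.1.1 affecting both M and B) are expected.
    SC_TO_GROUPS = {
        "1.1.1": {"B", "V"},   # non-text content
        "1.2.2": {"H"},        # captions (prerecorded)
        "1.4.3": {"V"},        # contrast (minimum)
        "2.1.1": {"M", "B"},   # keyboard
        "2.4.7": {"M", "V"},   # focus visible
        "3.3.2": {"C", "B"},   # labels or instructions
    }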

Maybe critical failures could be allocated to populations (incl. double 
counts). I just ran through the improvised list of 10 critical failures 
I made up earlier and added up:

M: 4
B: 6
V: 4
H: 2
C: 2
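
A quick sketch of how such a tally could be computed from the failures 
found on a site (the failure-to-group allocations are again my guesses, 
double counts intended):

    from collections import Counter

    FAILURE_TO_GROUPS = {
        "keyboard_inaccessible":       {"M", "B"},
        "captcha_without_alternative": {"B", "V"},
        "very_low_contrast":           {"V", "C"},
        "videos_without_captions":     {"H"},
        "no_visible_focus":            {"M", "V"},
    }

    def groups_affected(found):
        """Tally how often each group (M, B, V, H, C) is affected."""
        tally = Counter()
        for failure in found:
            tally.update(FAILURE_TO_GROUPS.get(failure, set()))
        return tally

    print(groups_affected({"keyboard_inaccessible", "no_visible_focus"}))
    # -> Counter({'M': 2, 'B': 1, 'V': 1})  (order of ties may vary)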

Whether that kind of result would be more meaningful, I am not really 
sure. But at least if one of the groups is really badly served, their 
associations and interest groups would have better evidence when they 
campaign for improvements.

Regards, Detlev

> Problem is, how to work out that percentage.  I could use number of
> violations/number of pages checked.  However this does not weight the
> more critical errors - like the ones you cited.  I could work out
> some kind of algorithm where violations of the critical issues were
> say 1.5:1, items such as non-critical validation errors were .5:1
> or something similar.  Thoughts?
>
>
> Regards
>
> Vivienne L. Conway
> ________________________________________
> From: public-wai-evaltf-request@w3.org [public-wai-evaltf-request@w3.org] On Behalf Of fischer@dias.de [fischer@dias.de]
> Sent: Tuesday, 23 August 2011 11:04 PM
> To: public-wai-evaltf@w3.org
> Subject: RE: some initial questions from the previous thread
>
> Quoting Vivienne CONWAY<v.conway@ecu.edu.au>:
>
>> HI all
>> Just thought I'd weigh in on this one as I'm currently puzzling over
>> the issue of how to score websites.  I'm just about to start a
>> research project where I'll have over 100 websites assessed monthly
>> over a period of 2 + years.
>
> If you will be doing this on your own or without a team, this work
> programme translates to checking more than 4-5 sites per day! And if
> the compliance level is AA you probably need to focus on some key
> requirements, especially those where a failure would make a site
> completely inaccessible to some population. Just looking at WCAG
> success criteria, these may be the ones which most often exclude
> people, ordered by importance from testing experience (feel free to
> disagree):
>
> * Lack of keyboard accessibility (SC 2.1.1, 2.1.2)
> * Important images like controls without alt text (1.1.1)
> * CAPTCHAs w/o alternative (SC 1.1.1)
> * Lack of captions in videos (SC 1.2.2, 1.2.4)
> * Really low contrast of text (SC 1.4.3)
> * Bad or no visibility of focus (SC 2.4.7)
> * Important controls implemented as background image without text
>     replacement (SC 1.1.1)
> * Important fields (such as search text input) w/o labels (SC 2.4.6)
> * Lack of structure (e.g. no or inconsistent headings) (SC 1.3.1)
> * Self-starting / unstoppable animation, carousels, etc. (SC 2.2.1, 2.2.2)
>
> Well, having written this, it may seem a bit arbitrary - but I believe
> the list has many or most of the grave errors that we encounter in
> testing.
>
> If there were statistics on "show stoppers", i.e. things that make
> sites inaccessible or impede access severely, such an approach would
> have a better basis, of course...
>
> Just my 2 cents,
> Detlev
>
>
> ) that can be tested relatively quickly and without going into too
> much detail.
>
> I think as long as the method is transparent / documented and its
> limitations are clearly stated, the results can still be valuable.
>> I need to come up with a scoring method
>> (preferably a percentage) due to the need to compare a website
>> within those of its own classification (e.g. federal government,
>> corporate, etc), and compare the different classifications.  I am
>> thinking of a method where the website gets a percentage score for
>> each of the POUR principles, and then an overall score.  What I'm
>> struggling with is what scoring method to use and how to put
>> different weights upon different aspects and at different levels.
>> I'll be assessing to WCAG 2.0 AA (as that's the Australian
>> standard).  All input and suggestions are gratefully accepted and
>> may also be useful to our discussions here as it's a real-life
>> situation for me.  It also relates to many of the questions raised in
>> this thread by Shadi.  Looking forward to some interesting discussion.
>>
>>
>> Regards
>>
>> Vivienne L. Conway
>> ________________________________________
>> From: public-wai-evaltf-request@w3.org
>> [public-wai-evaltf-request@w3.org] On Behalf Of Shadi Abou-Zahra
>> [shadi@w3.org]
>> Sent: Monday, 22 August 2011 7:34 PM
>> To: Eval TF
>> Subject: some initial questions from the previous thread
>>
>> Dear Eval TF,
>>
>>   From the recent thread on the construction of WCAG 2.0 Techniques, here
>> are some questions to think about:
>>
>> * Is the "evaluation methodology" expected to be carried out by one
>> person or by a group of more than one person?
>>
>> * What is the expected level of expertise (in accessibility, in web
>> technologies etc) of persons carrying out an evaluation?
>>
>> * Is the involvement of people with disabilities a necessary part of
>> carrying out an evaluation versus an improvement of the quality?
>>
>> * Are the individual test results binary (ie pass/fail) or a score
>> (discrete value, ratio, etc)?
>>
>> * How are these test results aggregated into an overall score (plain
>> count, weighted count, heuristics, etc)?
>>
>> * Is it useful to have a "confidence score" for the tests (for example
>> depending on the degree of subjectivity or "difficulty")?
>>
>> * Is it useful to have a "confidence score" for the aggregated result
>> (depending on how the evaluation is carried out)?
>>
>>
>> Feel free to chime in if you have particular thoughts on any of these.
>>
>> Best,
>>     Shadi
>>
>> --
>> Shadi Abou-Zahra - http://www.w3.org/People/shadi/
>> Activity Lead, W3C/WAI International Program Office
>> Evaluation and Repair Tools Working Group (ERT WG)
>> Research and Development Working Group (RDWG)
>>
>>
>
>


-- 
---------------------------------------------------------------
Detlev Fischer PhD
DIAS GmbH - Daten, Informationssysteme und Analysen im Sozialen
Geschäftsführung: Thomas Lilienthal, Michael Zapp

Telefon: +49-40-43 18 75-25
Mobile: +49-157 7-170 73 84
Fax: +49-40-43 18 75-19
E-Mail: fischer@dias.de

Anschrift: Schulterblatt 36, D-20357 Hamburg
Amtsgericht Hamburg HRB 58 167
Geschäftsführer: Thomas Lilienthal, Michael Zapp
---------------------------------------------------------------

Received on Wednesday, 24 August 2011 08:44:27 UTC