Re: some initial questions from the previous thread from RichardWarren on 2011-08-23 (public-wai-evaltf@w3.org from August 2011)

From: RichardWarren <richard.warren@userite.com>
Date: Tue, 23 Aug 2011 15:14:30 +0100
To: "Vivienne CONWAY" <v.conway@ecu.edu.au>, "Shadi Abou-Zahra" <shadi@w3.org>, "Eval TF" <public-wai-evaltf@w3.org>
Message-ID: <67645D5779704E258241009CC77B0101@DaddyPC>
Hi,

For scoring we currently use:
Pass - Complies with the guideline, does not disadvantage a disabled user
Fail - Does not comply, presents a barrier or disadvantages some/many 
disabled users
Not Applicable - Technology (such as video) is not used

And then we have a score titled "NEAR"
NEAR - A qualitative (heuristic) score indicating that some disabled people 
will find this issue, or part of the site, page or component, more difficult 
than need be, but not impossible or so frustrating that a reasonable person 
would give up and go elsewhere.
Examples of NEAR are an occasional icon, such as a telephone icon called 
"phone.gif", that has no text alternative but is followed by a phone number. 
Or a code validation error that does not affect assistive tools, such as an 
unescaped ampersand in a URL.
We tried weighting the NEAR score using % to say how near to a pass or fail 
the issue was - but it all got too complicated. So we just ask our testers 
if they noticed the error, and if so, did they find it a serious problem.

When we send the report to the client we tell them to fix the failing items 
first then to look at the NEAR items and try to fix them as and when they 
have the resources available.
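
As a rough illustration of the reporting order described above (fix the 
failing items first, then the NEAR items), here is a minimal Python sketch. 
It is not our actual tooling: the names Outcome, Finding and fix_list are 
hypothetical, and apart from the phone.gif and video cases the sample 
findings and guideline numbers are invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    NEAR = "near"
    NOT_APPLICABLE = "n/a"


@dataclass(frozen=True)
class Finding:
    guideline: str   # hypothetical identifier for the guideline tested
    outcome: Outcome
    note: str = ""


# Assumed reporting priority: FAIL items first, then NEAR items.
FIX_ORDER = {Outcome.FAIL: 0, Outcome.NEAR: 1}


def fix_list(findings):
    """Return only items needing work, FAIL before NEAR, otherwise in input order."""
    todo = [f for f in findings if f.outcome in FIX_ORDER]
    # sorted() is stable, so items with the same outcome keep their input order.
    return sorted(todo, key=lambda f: FIX_ORDER[f.outcome])


# Illustrative sample data only.
findings = [
    Finding("1.1.1", Outcome.NEAR, "phone.gif has no text alternative, number follows"),
    Finding("1.2.1", Outcome.NOT_APPLICABLE, "no video used"),
    Finding("2.1.1", Outcome.FAIL, "menu not keyboard operable"),
    Finding("2.4.2", Outcome.PASS),
]

for f in fix_list(findings):
    print(f.outcome.name, f.guideline, f.note)
```

Pass and Not Applicable items are left out of the fix list altogether, and 
NEAR carries no percentage weighting, in line with the simpler approach we 
settled on.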

For me the most important issue is that the eventual evaluation method is 
capable of producing consistent results no matter who follows it (given that 
they have a sufficient, agreed level of knowledge and experience). The more complex 
the method the more difficult it will be to achieve that. There needs to be 
some leeway for common sense and practicality, but I am nervous of systems 
that introduce complex weighting or "levels of confidence".

Regards

Richard


-----Original Message----- 
From: Vivienne CONWAY
Sent: Tuesday, August 23, 2011 1:28 PM
To: Shadi Abou-Zahra ; Eval TF
Subject: RE: some initial questions from the previous thread

Hi all,
Just thought I'd weigh in on this one as I'm currently puzzling over the 
issue of how to score websites.  I'm just about to start a research project 
where I'll have over 100 websites assessed monthly over a period of 2+ 
years.  I need to come up with a scoring method (preferably a percentage) 
due to the need to compare a website within those of its own classification 
(e.g. federal government, corporate, etc), and compare the different 
classifications.  I am thinking of a method where the website gets a 
percentage score for each of the POUR principles, and then an overall score. 
What I'm struggling with is what scoring method to use and how to put 
different weights upon different aspects and at different levels.  I'll be 
assessing to WCAG 2.0 AA (as that's the Australian standard).  All input and 
suggestions are gratefully accepted and may also be useful to our 
discussions here as it's a real-life situation for me.  It also relates to 
many of the questions raised in this thread by Shadi.  Looking forward to 
some interesting discussion.
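
To make the idea of a percentage per POUR principle plus an overall score 
concrete, here is one possible unweighted starting point in Python. 
Everything in it is an assumption made for illustration: the function name 
pour_scores, the rule that a principle's score is passes divided by 
applicable tests, and the choice of a plain mean of the principle scores as 
the overall figure. The weighting question raised above is deliberately left 
open.

```python
from collections import defaultdict

PRINCIPLES = ("Perceivable", "Operable", "Understandable", "Robust")


def pour_scores(results):
    """Compute a percentage per principle and an unweighted overall mean.

    results: iterable of (principle, passed) pairs, where passed is
    True, False, or None for a test that is not applicable.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for principle, ok in results:
        if ok is None:
            continue  # assumption: not-applicable tests do not count either way
        total[principle] += 1
        if ok:
            passed[principle] += 1
    # Principles with no applicable tests are omitted rather than scored as 0.
    per_principle = {
        p: 100.0 * passed[p] / total[p] for p in PRINCIPLES if total[p]
    }
    if not per_principle:
        return per_principle, None
    overall = sum(per_principle.values()) / len(per_principle)
    return per_principle, overall


# Invented sample results for a single website.
sample = [
    ("Perceivable", True),
    ("Perceivable", False),
    ("Operable", True),
    ("Operable", True),
    ("Understandable", None),
    ("Robust", False),
]

per_principle, overall = pour_scores(sample)
print(per_principle)
print(overall)
```

Whether the overall score should be a mean of the four principle scores or a 
pooled ratio over all tests is itself a design choice: the two give different 
answers whenever the principles have different numbers of applicable tests.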


Regards

Vivienne L. Conway
________________________________________
From: public-wai-evaltf-request@w3.org [public-wai-evaltf-request@w3.org] On 
Behalf Of Shadi Abou-Zahra [shadi@w3.org]
Sent: Monday, 22 August 2011 7:34 PM
To: Eval TF
Subject: some initial questions from the previous thread

Dear Eval TF,

From the recent thread on the construction of WCAG 2.0 Techniques, here
are some questions to think about:

* Is the "evaluation methodology" expected to be carried out by one
person or by a group of people?

* What is the expected level of expertise (in accessibility, in web
technologies etc) of persons carrying out an evaluation?

* Is the involvement of people with disabilities a necessary part of
carrying out an evaluation versus an improvement of the quality?

* Are the individual test results binary (ie pass/fail) or a score
(discrete value, ratio, etc)?

* How are these test results aggregated into an overall score (plain
count, weighted count, heuristics, etc)?

* Is it useful to have a "confidence score" for the tests (for example
depending on the degree of subjectivity or "difficulty")?

* Is it useful to have a "confidence score" for the aggregated result
(depending on how the evaluation is carried out)?


Feel free to chime in if you have particular thoughts on any of these.

Best,
   Shadi

--
Shadi Abou-Zahra - http://www.w3.org/People/shadi/
Activity Lead, W3C/WAI International Program Office
Evaluation and Repair Tools Working Group (ERT WG)
Research and Development Working Group (RDWG)

Received on Wednesday, 24 August 2011 12:33:54 UTC