RE: EvalTF Error Margin discussion 5.5 and actual evaluation

Hi Everyone--

Just to clarify, the success criteria in WCAG 2.0 are what a website 
must satisfy to be compliant for a particular level (A, AA, or AAA). The 
techniques are suggested as ways to determine if success criteria have 
been met.

It is up to the evaluator to decide which, if any, of the techniques 
should be used to verify compliance. What is important is that the 
techniques be identified so their results can be replicated, or, if 
necessary, challenged.

I think we may be worrying too much about quantifying error margins. It 
isn't a question of some x or y% threshold one must achieve for a 
site to be compliant. A site is compliant if it meets the 
success criteria for A, AA or AAA on every page. The lowest level of 
compliance on any page determines the overall compliance level. If a 
website fails any success criterion, it is non-compliant for that level.
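
In code terms the roll-up is simple (a rough Python sketch of the logic 
just described, nothing more authoritative than that):

    LEVELS = ["A", "AA", "AAA"]

    def page_level(sc_results):
        # sc_results maps each level to the pass/fail outcomes of its
        # success criteria, e.g. {"A": [True, ...], "AA": [...], ...}
        achieved = "none"
        for level in LEVELS:
            if not all(sc_results[level]):
                break          # one failed SC caps the page here
            achieved = level   # AA presupposes A; AAA presupposes both
        return achieved

    def site_level(pages):
        # the lowest page level determines the overall level
        order = ["none"] + LEVELS
        return min((page_level(p) for p in pages), key=order.index)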

Since we are living in a less-than-perfect world, however, we have to 
recognize that a website may have varying degrees of accessibility. Does 
a missing description for one image mean an entire site should be 
considered inaccessible? Probably not (depending on the content 
communicated by that image). It will fail to be compliant, of course. 
But the significance of that missing description rests on the 
evaluator's judgment of how it impacts use of the rest of the website.

An evaluator's judgment of accessibility (which is what I think we've 
really been talking about), therefore, is by definition subjective and 
difficult to quantify. That is why I, as an evaluator, would be more 
comfortable reporting when and why a website has failed a success 
criterion, and how it could be revised to become compliant, than stating 
that it is "nearly" compliant, or "95%" compliant. If it fails a success 
criterion, it fails to be compliant.

Mike Elledge


Hi Alistair,

As mentioned in my other mail, I also disagree with failures as starting
points, for other reasons.

 > > I disagree because we ultimately want to determine two things - the
 > > website conforms / the website does not conform.
 > >
 > > If we only used the failure conditions we would only be able to
 > > determine one thing - that the website does not conform.  It would not
 > > satisfy what we want to do.
A very important point, I think, and one that also says something about
attitudes. What I like most is that WCAG 2.0 speaks about success
criteria, not failure criteria.

 > > The only way to determine if a website conforms, or not, is by testing
 > > whether all relevant implemented 'sufficient' techniques (or their
 > > equivalents???) have been implemented correctly - or not as the case
 > > may be.
I'm still not convinced. As mentioned in the other mail, I feel "relevant"
is a critical term. I think we have to be very exact with the wording in
our methodology. Even if we as a TF reach consensus, that will not
guarantee that other people won't interpret things differently.

Just for better understanding and my own clarification: after we come out
with our methodology, people will develop or change their testing
procedures according to it. One way or another, they will have to use
formulations/wording for checkpoints in the report and in the whole
documentation. Does "testing whether all relevant..." mean that the
techniques themselves are checkpoints, or did I misunderstand?

I'm asking because, if sufficient techniques are used as checkpoints,
testing organizations and evaluators in general will have to revise those
checkpoints regularly, and - another problem - the SCs can also be met
with other, undocumented techniques. Techniques may change, new techniques
will be added, and we will probably also face deletions. Not only from a
sustainability viewpoint, I think it is more effective to use the SCs
themselves as checkpoints, because they will not change, and evaluators
could react more quickly and flexibly to updates of WCAG. I'm also not
sure whether testing techniques answers the conformance question, or
rather the question of whether those techniques are implemented
sufficiently. Is this the same thing or something else?
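
To illustrate what I mean (a rough sketch only, with made-up entries): a
report keyed by the stable SC numbers survives updates of the Techniques
document, because the techniques appear only as evidence:

    report = {
        "1.1.1": {"result": "fail",
                  "evidence": "H37 test failed: img elements without alt"},
        "1.4.3": {"result": "pass",
                  "evidence": "G18 test procedure passed on sampled pages"},
    }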


Best

Kerstin



Hi all,

I'm currently working with a Dutch organization on exactly the same 
problem. It does seem we've departed from the actual topic of error 
margin by quite a bit. But I think this is a much more fundamental 
discussion which will help us figure out how to do error margins as 
well. I've gone over this problem quite a few times, and there are many 
(many!) angles to consider here.

Whatever solution we decide upon, there are a few things required of it:
- Ensure that the way a technology is used is accessibility supported 
(taking into account what software / hardware visitors will be using). 
This point is required by WCAG 2.
- Provide consistent results when applied by experts, regardless of 
where they had their training (it must be repeatable).
- Be efficient enough to warrant using the methodology.
- Give an indication of whether the pages in the sample conform to 
WCAG, and if not, what problems exist.

Does anyone else have any thoughts on what would be required of the 
evaluation? I think it might clear things up some more.

I think that if we use the AS requirement of conformance, the 
informative status of the techniques shouldn't be a problem. If you know 
whether a technique is accessibility supported, you can make claims 
about passing success criteria without using AT for the evaluation. This 
doesn't change even if the technique is changed. Theoretically, even if 
your website used the old technique and the new technique turned out to 
be different in such a way that the old one wouldn't pass the test any 
more, you are still accessibility supported, and so you still conform. 
And if you are still accessibility supported, then a mistake was made 
while changing the technique. I believe this is the only situation 
in which a website can go from conforming to not conforming, when 
evaluating with techniques, without changing content or AS requirements.
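
Put as a sketch (hypothetical names, merely restating the reasoning 
above, not an agreed rule): given unchanged content and unchanged AS 
requirements, the verdict after a Techniques update hinges only on 
accessibility support, not on the updated test:

    def verdict_after_technique_update(still_accessibility_supported):
        # content and AS requirements are assumed unchanged
        if still_accessibility_supported:
            return "still conforms (the document changed, not the site)"
        return "a mistake was made while changing the technique"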

Regards,

Wilco

________________________________________
From: Kerstin Probiesch [k.probiesch@googlemail.com]
Sent: Wednesday, 18 January 2012 10:47
To: 'Detlev Fischer'; public-wai-evaltf@w3.org
Subject: RE: EvalTF discussion 5.5 and actual evaluation

Hi Detlev, all,

> I read the warning quoted below by Kerstin as a reminder that a failure
> of one technique does not imply that the SC is not met. In that sense,
> a
> better starting point are the WCAG Failures since any fail of a test in
> a WCAG Failure means that the SC is not met.

I think neither the sufficient techniques nor the failures are good
starting points, because of the informative character of the whole
document. Some identified failures will be integrated in future updates
of the document, other failures might change their descriptions, and
still other failures might be deleted. Relying upon techniques in a
testing procedure - even upon failures - could mean for a client that
there is conformance before an update of the document but no longer
after it, and vice versa. Relying on techniques or failures as
checkpoints is also a problem for the sustainability of a testing
procedure, especially when checkpoints follow techniques or failures.

> Still, I find the Techniques useful in that combined, they cover all
> common and documented ways of implementing conformance to the SC,

...which are only those that are known, and that is just one of the
problems.

> I also think at some point there was consensus here (contrary to what I
> had originally expected) that the methodology would not delve into the
> practicalities of specifying all the tests necessary to check if one
> technique has been used successfully to meet the SC.

I see this discussion as somewhat of a side track, "just" for making
things clearer and for better assurance that we are all speaking about
the same things.

> "Make sure the tests of all known relevant techniques are applied to
> determine whether a SC has been met."

"relevant" sounds for me a bit wishy-washy. Who will decide what is relevant
and what is not relevant?

Regards

Kerstin


> If a testing tool wants to be more specific than that to help
> non-expert testers, it would be free to do so as long as it follows
> the general methodological framework.
>
> Regards,
> Detlev
>
> On 16.01.2012 09:11, Kerstin Probiesch wrote:
> > Hi Alistair, all,
> >
> > I think we should be very careful with any testing procedures which
> > rely on techniques. Techniques are mainly for developers/authors. In
> > the Techniques Document we find:
> >
> > "Test procedures are provided in techniques to help verify that the
> > technique has been properly implemented."
> >
> > And:
> >
> > "In particular, test procedures for individual techniques should not
> be
> > taken as test procedures for the WCAG 2.0 success criteria overall."
> >
> > Best
> >
> > Kerstin
> >
> >
> >
> >> -----Original Message-----
> >> From: Alistair Garrison [mailto:alistair.j.garrison@gmail.com]
> >> Sent: Saturday, 14 January 2012 14:45
> >> To: Eval TF
> >> Subject: Re: EvalTF discussion 5.5 and actual evaluation
> >>
> >> Dear All,
> >>
> >> To my mind there are no massively different ways to evaluate the
> >> WCAG 2.0 guidelines - seemingly, intentionally so.  We also don't
> >> need to take one of the WCAG 2.0 checkpoints and determine a way to
> >> assess it - as this has already been done for us.
> >>
> >> From WCAG 2.0 it seems reasonably clear that you (in some way)
> >> determine which techniques are applicable to the content in the
> >> pages you want to assess, then you simply follow the Test Procedures
> >> prescribed in each of the applicable techniques. It does not matter
> >> if you do this one by one, per theme, per technology etc... that is
> >> surely up to whatever you think is best at the time.
> >>
> >> Again, I'm a little concerned that we might be wandering towards
> >> recreating test procedures for individual techniques, when as
> >> mentioned that part has already been done by the WCAG 2.0 techniques
> >> working group. Isn't it the higher level question of how to approach
> >> the evaluation of a website (or conformance claim), and capture
> >> results, in a systematic way that we need to be answering?
> >>
> >> For example, an approach such as...
> >>
> >> 1) Clearly define what you want to test - the WCAG 2.0 Conformance
> >> Claim (or, in its absence, our website scoping method)...
> >> 2) Determine which techniques are applicable - by looking through
> >> these pages and finding relevant content, marking techniques
> >> non-applicable if no applicable content can be found.
> >> 3) Run all relevant test procedures (defined in applicable
> >> techniques) against all applicable content (found in 2).
> >> 4) Finally, record pass, fail or non-applicable for each relevant
> >> technique, and then determine from this all passed, failed and
> >> non-applicable checkpoints / guidelines (a rough sketch follows
> >> below).  Noting that there are several techniques available for
> >> doing certain things.  (Note: this is another reason why we might
> >> use the Conformance Claim, as techniques which have been used will
> >> hopefully be recorded, rather than us having to assess all
> >> techniques for a certain thing until one is passed.)
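> >>
> >> In rough code terms (a sketch only - the names and helpers here are
> >> made up, not from any agreed methodology):
> >>
> >>     results = {}   # technique id -> "pass" | "fail" | "na"
> >>     for t in applicable_techniques:          # from step 2
> >>         results[t] = run_test_procedure(t)   # step 3
> >>
> >>     def sc_met(sufficient_ids, results):
> >>         # step 4: an SC counts as met when at least one of its
> >>         # sufficient techniques passed; "na" techniques are ignored
> >>         return any(results.get(t) == "pass" for t in sufficient_ids)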
> >>
> >> Just my thoughts...
> >>
> >> Very best regards
> >>
> >> Alistair
> >>
> >> On 14 Jan 2012, at 05:38, Vivienne CONWAY wrote:
> >>
> >>> Hi Richard and all TF,
> >>> While I understand the need to look at the procedures from an
> >>> overall perspective first, I agree with Richard that it may be time
> >>> to try out a few ideas for practical implementation.  It may be a
> >>> good idea for us all to take one of the WCAG 2.0 checkpoints and
> >>> determine a way to assess it.  However, I remember someone (I think
> >>> it might have been Detlev) proposed that we do this, and it was
> >>> decided that we wouldn't be dealing with each point individually.
> >>> Or did I misunderstand?
> >>>
> >>>
> >>> Regards
> >>>
> >>> Vivienne L. Conway, B.IT (Hons)
> >>> PhD Candidate & Sessional Lecturer, Edith Cowan University, Perth,
> >>> W.A.
> >>> Director, Web Key IT Pty Ltd.
> >>> v.conway@ecu.edu.au<mailto:v.conway@ecu.edu.au>
> >>> v.conway@webkeyit.com<mailto:v.conway@webkeyit.com>
> >>> Mob: 0415 383 673
> >>>
> >>> ________________________________
> >>> From: Richard Warren [richard.warren@userite.com]
> >>> Sent: Saturday, 14 January 2012 10:32 AM
> >>> To: Eval TF
> >>> Subject: Re: EvalTF discussion 5.5 and actual evaluation
> >>>
> >>> Dear TF,
> >>>
> >>> I cannot help thinking that we would save a lot of time and
> >>> discussion if we concentrated on procedures for evaluation (5.3),
> >>> where we are going to try “to propose different ways to evaluate
> >>> the guidelines: one by one, per theme, per technology, etc”.  As we
> >>> do that we will come across the various technologies (5.2) and
> >>> possibly come up with a few acceptable ways of dealing with
> >>> “occasional errors” etc. if and when relevant to a particular
> >>> guideline. This approach may be more efficient than trying to
> >>> define systemic and incidental errors in a non-specific guideline
> >>> context.
> >>>
> >>> I wonder if now is the time to get to the core of our task and
> >>> start working on actual procedures, where we can discuss levels of
> >>> compliance and any effect in a more narrow, targeted environment.
> >>>
> >>> Regards
> >>> Richard
> >>>
> >>>
> >>> From: Elle <mailto:nethermind@gmail.com>
> >>> Sent: Friday, January 13, 2012 11:35 PM
> >>> To: Vivienne CONWAY <mailto:v.conway@ecu.edu.au>
> >>> Cc: Alistair Garrison <mailto:alistair.j.garrison@gmail.com>;
> >>> Shadi Abou-Zahra <mailto:shadi@w3.org>; Eval TF
> >>> <mailto:public-wai-evaltf@w3.org>; Eric Velleman
> >>> <mailto:evelleman@bartimeus.nl>
> >>> Subject: Re: EvalTF discussion 5.5
> >>>
> >>> TF:
> >>>
> >>> I have been reading the email discussions with avid interest and
> >>> very little ability to add anything valuable yet.  My point of view
> >>> seems to be very different from most in the group, as my job is to
> >>> meet and maintain this conformance at a large organization. I'm
> >>> learning quite a bit from all of you.
> >>>
> >>> I've been following this particular topic with a keen interest in
> >>> seeing what a "margin of error" would be defined as, in part
> >>> because our company is about to launch into a major site
> >>> consolidation and I'm curious about how to scale our current
> >>> testing process.  Until now, we've actually been testing every page
> >>> we can with both automated scans and manual audits.
> >>>
> >>> From a purely layman's point of view, the only confidence I have
> >>> when testing medium to large volume websites (greater than 500
> >>> pages) is by doing the following:
> >>>
> >>> 1. automated scans of every single page (a rough sketch follows
> >>> below)
> >>> 2. manual accessibility testing modeled after the user acceptance
> >>> test cases, to test the critical user paths as defined by the
> >>> business
> >>> 3. manual accessibility testing of each page type and/or widget or
> >>> component (templates, in other words)
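> >>>
> >>> (For step 1, a minimal sketch of what such a scan could look like
> >>> - the URLs are placeholders, and this covers only missing alt
> >>> attributes, one automated check among many:)
> >>>
> >>>     import requests
> >>>     from bs4 import BeautifulSoup
> >>>
> >>>     def missing_alts(url):
> >>>         # alt="" is a legitimate marker for decorative images,
> >>>         # so only a missing alt attribute is flagged here
> >>>         soup = BeautifulSoup(requests.get(url).text, "html.parser")
> >>>         return [img.get("src") for img in soup.find_all("img")
> >>>                 if not img.has_attr("alt")]
> >>>
> >>>     for page in ["https://example.com/", "https://example.com/a"]:
> >>>         print(page, missing_alts(page))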
> >>>
> >>> So, I felt the need to chime in on "margin of error," because it
> >>> worries me when we start quantifying a percentage of error. I see
> >>> this from the corporate side.  Putting a percentage on this may
> >>> actually undermine the overall success of accessibility specialists
> >>> working inside of a large organization.  We may find ourselves with
> >>> more technical compliance and less overall usability for disabled
> >>> users. As for me, I need to be able to point to an evaluation
> >>> technique that encompasses more than a codified measurement in my
> >>> assessment of a website's conformance.  Ideally, the methodology
> >>> really needs to account for user experience.  It's one of the fail
> >>> safes in the current 508 Compliance requirements that I've taken
> >>> shelter in, actually, as outdated as they are - functional
> >>> performance criteria.
> >>>
> >>> I really appreciate the work everyone in this group is doing, as I
> >>> will likely be a direct recipient of the outcome as I put these
> >>> concepts into action over the course of their creation.  Consider
> >>> me the intern who will try to see if these dogs will hunt. :)
> >>>
> >>>
> >>> Much appreciated,
> >>> Elle
> >>>
> >>>
> >>> On Thu, Jan 12, 2012 at 8:10 PM, Vivienne CONWAY
> >>> <v.conway@ecu.edu.au<mailto:v.conway@ecu.edu.au>> wrote:
> >>> Hi Alistair and TF,
> >>> You have raised an interesting point here.  I'm thinking I like
> >>> your idea better than the 'margin of error' concept.  It removes
> >>> the obstacle of trying to decide what constitutes an 'incidental'
> >>> or 'systemic' error.  I think it's obvious that most of the time a
> >>> website with systemic errors would not pass, unless the error was
> >>> system-wide and didn't pose any serious problem, e.g. a colour
> >>> contrast that's 0.1 off the 4.5:1 rule.  I think I like the
> >>> statement idea coupled with a comprehensive scope statement of what
> >>> was tested.
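> >>>
> >>> (For reference, the 4.5:1 figure is the WCAG 2.0 contrast ratio,
> >>> which is mechanical to compute - a small sketch:)
> >>>
> >>>     def linear(c):                     # sRGB channel, 0-255
> >>>         c = c / 255.0
> >>>         return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
> >>>
> >>>     def luminance(r, g, b):
> >>>         return 0.2126 * linear(r) + 0.7152 * linear(g) + 0.0722 * linear(b)
> >>>
> >>>     def contrast_ratio(fg, bg):
> >>>         hi, lo = sorted([luminance(*fg), luminance(*bg)], reverse=True)
> >>>         return (hi + 0.05) / (lo + 0.05)
> >>>
> >>>     # e.g. contrast_ratio((119, 119, 119), (255, 255, 255)) is
> >>>     # about 4.48 - narrowly failing SC 1.4.3's 4.5:1 threshold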
> >>>
> >>>
> >>> Regards
> >>>
> >>> Vivienne L. Conway, B.IT (Hons)
> >>> PhD Candidate & Sessional Lecturer, Edith Cowan University, Perth,
> >>> W.A.
> >>> Director, Web Key IT Pty Ltd.
> >>> v.conway@ecu.edu.au<mailto:v.conway@ecu.edu.au>
> >>> v.conway@webkeyit.com<mailto:v.conway@webkeyit.com>
> >>> Mob: 0415 383 673
> >>>
> >>> ________________________________________
> >>> From: Alistair Garrison
> >>> [alistair.j.garrison@gmail.com<mailto:alistair.j.garrison@gmail.com>]
> >>> Sent: Thursday, 12 January 2012 6:41 PM
> >>> To: Shadi Abou-Zahra; Eval TF; Eric Velleman
> >>> Subject: Re: EvalTF discussion 5.5
> >>>
> >>> Hi,
> >>>
> >>> The issue of "margin of error" relates to the size of the website
> >>> and the number of pages actually being assessed.  I'm not so keen
> >>> on the "5% incidental error" idea.
> >>>
> >>> If you assess 1 page from a 1 page website there should be no
> >>> margin of error.
> >>> If you assess 10 pages from a 10 page website there should be no
> >>> margin of error.
> >>> If you assess 10 pages from a 100 page website you will have
> >>> certainty for 10 pages and uncertainty for 90.
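> >>>
> >>> (Just to show how large that uncertainty is - a rough sketch,
> >>> assuming failing pages are spread randomly through the site:)
> >>>
> >>>     from math import comb
> >>>
> >>>     def p_sample_misses_all(total, failing, sample):
> >>>         # chance a uniform random sample contains none of the
> >>>         # failing pages (hypergeometric)
> >>>         return comb(total - failing, sample) / comb(total, sample)
> >>>
> >>>     # 10 pages sampled from 100, 5 of which fail somewhere:
> >>>     # p_sample_misses_all(100, 5, 10) is about 0.58 - more likely
> >>>     # than not that the sample shows a clean site.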
> >>>
> >>> Instead of exploring the statistical complexities involved in
> >>> trying to accurately define how uncertain we are (which could take
> >>> a great deal of precious time) - could we not just introduce a
> >>> simple disclaimer, e.g.
> >>>
> >>> "The evaluator has tried their hardest to minimise the margin for
> >>> error by actively looking for all content relevant to each
> >>> technique being assessed which might have caused a fail."
> >>>
> >>> Food for thought...
> >>>
> >>> Alistair
> >>>
> >>> On 12 Jan 2012, at 10:04, Shadi Abou-Zahra wrote:
> >>>
> >>>> Hi Martijn, All,
> >>>>
> >>>> Good points, but it sounds like we are speaking more of the impact
> >>>> of errors rather than of the incidental vs systemic aspects of
> >>>> them. Intuitively one could say that an error that causes a
> >>>> barrier to completing a task on the web page needs to be weighted
> >>>> more significantly than an error that does not have the same
> >>>> impact, but it will be difficult to define what a "task" is. Maybe
> >>>> listing specific situations as you did is the way to go, but I
> >>>> think we should not mix the two aspects together.
> >>>>
> >>>> Best,
> >>>> Shadi
> >>>>
> >>>>
> >>>> On 12.1.2012 09:41, Martijn Houtepen wrote:
> >>>>> Hi Eric, TF
> >>>>>
> >>>>> I would like to make a small expansion to your list, as follows:
> >>>>>
> >>>>> Errors can be incidental unless:
> >>>>>
> >>>>> a) it is a navigation element
> >>>>> b) the alt-attribute is necessary for the understanding of the
> >>>>> information / interaction / essential to a key scenario or
> >>>>> complete path
> >>>>> c) other impact related thoughts?
> >>>>> d) there is an alternative
> >>>>>
> >>>>> So an unlabeled (but required) field in a form (part of some key
> >>>>> scenario) will be a systemic error.
> >>>>>
> >>>>> Martijn
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Velleman, Eric
> >>>>> [mailto:evelleman@bartimeus.nl<mailto:evelleman@bartimeus.nl>]
> >>>>> Sent: Wednesday, 11 January 2012 15:01
> >>>>> To: Boland Jr, Frederick E.
> >>>>> CC: Eval TF
> >>>>> Subject: RE: EvalTF discussion 5.5
> >>>>>
> >>>>> Hi Frederick,
> >>>>>
> >>>>> Yes, agreed, but I think we can have both discussions at the same
> >>>>> time. So:
> >>>>> 1. How do we define an error margin to cover non-structural
> >>>>> errors?
> >>>>> 2. How can an evaluator determine the impact of an error?
> >>>>>
> >>>>> I could imagine we make a distinction between structural and
> >>>>> incidental errors. The 1 failed alt-attribute out of 100 correct
> >>>>> ones would be incidental... unless (and there comes the impact):
> >>>>>   a) it is a navigation element
> >>>>>   b) the alt-attribute is necessary for the understanding of the
> >>>>>   information / interaction
> >>>>>   c) other impact related thoughts?
> >>>>>   d) there is an alternative
> >>>>>
> >>>>> We could set an acceptance rate for incidental errors. Example:
> >>>>> the site would be totally conformant, but with a statement that
> >>>>> for alt-attributes there are 5% incidental fails.
> >>>>> This also directly relates to conformance in WCAG 2.0,
> >>>>> specifically section 5, Non-interference.
> >> section 5 Non-interference.
> >>>>>
> >>>>> Eric
> >>>>>
> >>>>>
> >>>>>
> >>>>> ________________________________________
> >>>>> From: Boland Jr, Frederick E.
> >>>>> [frederick.boland@nist.gov<mailto:frederick.boland@nist.gov>]
> >>>>> Sent: Wednesday, 11 January 2012 14:32
> >>>>> To: Velleman, Eric
> >>>>> Subject: RE: EvalTF discussion 5.5
> >>>>>
> >>>>> As a preamble to this discussion, I think we need to define more
> >>>>> precisely ("measure"?) what an "impact" would be (for example,
> >>>>> impact to whom/what, and what specifically are the consequences
> >>>>> of said impact)?
> >>>>>
> >>>>> Thanks Tim
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Velleman, Eric
> >>>>> [mailto:evelleman@bartimeus.nl<mailto:evelleman@bartimeus.nl>]
> >>>>> Sent: Wednesday, January 11, 2012 4:15 AM
> >>>>> To: public-wai-evaltf@w3.org<mailto:public-wai-evaltf@w3.org>
> >>>>> Subject: EvalTF discussion 5.5
> >>>>>
> >>>>> Dear all,
> >>>>>
> >>>>> I would very much like to discuss section 5.5 about Error Margin.
> >>>>>
> >>>>> If one out of 1 million images on a website fails the
> >>>>> alt-attribute, this could mean that the complete website scores a
> >>>>> fail, even if the "impact" would be very low. How do we define an
> >>>>> error margin to cover these non-structural errors that have a low
> >>>>> impact? This is already partly covered inside WCAG 2.0. But input
> >>>>> and discussion would be great.
> >>>>>
> >>>>> Please share your thoughts.
> >>>>> Kindest regards,
> >>>>>
> >>>>> Eric
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> Shadi Abou-Zahra - http://www.w3.org/People/shadi/
> >>>> Activity Lead, W3C/WAI International Program Office
> >>>> Evaluation and Repair Tools Working Group (ERT WG)
> >>>> Research and Development Working Group (RDWG)
> >>>>
> >>>
> >>> --
> >>> If you want to build a ship, don't drum up the people to gather
> >>> wood, divide the work, and give orders. Instead, teach them to
> >>> yearn for the vast and endless sea.
> >>> - Antoine De Saint-Exupéry, The Little Prince
> >>>
> >>>

Received on Wednesday, 18 January 2012 22:41:29 UTC