Re: Discussion: How to weight different accessibility warnings?

[disclaimer -- this was written Tuesday.

I held back from posting it because I had not addressed Nick's top concern,
which was to assess the reasonableness of the confidences he had assigned.  I
still haven't done that.

Neither have I digested the whole chat log.  But in the chat log Wendy was
questioning the need for 'confidence' beyond pass/fail, and I strongly support
Nick's attempt to characterize the relationship between the condition that the
computer can evaluate and the checkpoint in statistical confidence terms,
roughly speaking.  So for better or worse, here is a long brain dump in
stream-of-unconsciousness form that Nick's post inspired.  - Al]

The basic principle is that the "overall evaluation" question needs to be
answered by WCAG, not ER.

A composite rollup combining priority and confidence is an application of the
priorities, and comes within the scope of the WCAG in terms of
"interpretations" of the guidelines.

This does not mean that the WCAG are ready to give a quick answer.

On the other hand, the WCAG would be considerably aided in their search for
consensus on what in WCAG2 should fill the role of priorities in WCAG 1.
Finding a way to apply this tool to the WCAG2 draft criteria, letting WCAG look
at sample reports and react to that level of prototyping, would allow WCAG as a
body to understand the choices before them much better than is possible without
the tool support.

More inline below.

At 08:49 PM 2002-02-05 , Nick Kew wrote:
>
>Page Valet now offers fairly comprehensive page evaluation against
>the WCAG and US Section 508 accessibility guidelines.
>
>I'm now working through the issues of
>
>(1) distinguishing errors from warnings, and
>(2) assigning an overall evaluation to a document
>
>To do so, I've established a set of confidence levels, and assigned
>one to each test.  This is in principle orthogonal to the WCAG
>priorities, and should measure how likely Page Valet thinks it is
>that a guideline has in fact been breached:
>
>e.g. - a Frame without a title is clearly a breach, so we can flag it
>       with high confidence.
>     - <strong>This text is emphasised</strong> might possibly be a
>       header, so we query whether it should be.  But the chances are
>       it's being correctly used, so this is a low-confidence warning.
>
>I've now used five levels:
>  - Certain: we know this violates a guideline; no human check required.
>  - High: A construct that is likely to be wrong, but we're not certain.
>  - Medium: We can't tell; human checking required
>  - Low: Something that's probably OK, but should be flagged for checking.
>  - "-": Messages that definitely don't mean there's a problem.

This last needs better definition.  Why is any event thrown at all?

Commonly there is a category of loggable events which are not, in and of
themselves, signs that anything is certainly wrong, but which aid in the
traceback when something is wrong.  They are events exceptional enough to be
worth noting, in case they become an issue later.

Major milestones on the success path, such as "form submitted," are in this
category.

>
>In producing an overall document score, we simply evaluate the
>highest confidence warning anywhere in the document:
>
>  - Certain => Fail
>  - High => Probable Fail - check messages
>  - Medium => Uncertain - check messages carefully!
>  - Low => Probable Pass - check messages
>  - '-' => Pass - no problems found
>

It would be interesting to set quantitative targets for what the statistics of
these grades would be in an ideal world.

In monitoring semiconductor production at TI, they used to plot the following
quantile points in lot parameters:  5%, 25%, 50%, 75%, 95%.  They found this to
be highly revealing and about the right amount of information to present so as
not to lose important events in a cluttered display.

The tradition in the social sciences is more like
1%, 5%, 50%, 95%, 99%.

YMMV.
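
To make that concrete, here is a rough sketch (Python; the names and data
shapes are my own invention, not Page Valet's) of the max-confidence rollup
Nick describes above, plus a tally of grades over a sample of pages, so that
the observed distribution could be compared against whatever targets get set:

  # Sketch only -- level names and data shapes are assumptions, not Page Valet's.
  from collections import Counter

  # Order matters: later entries dominate the rollup.
  LEVELS = ["-", "Low", "Medium", "High", "Certain"]
  GRADE = {
      "Certain": "Fail",
      "High":    "Probable Fail - check messages",
      "Medium":  "Uncertain - check messages carefully",
      "Low":     "Probable Pass - check messages",
      "-":       "Pass - no problems found",
  }

  def page_grade(warning_levels):
      """Overall grade = grade of the highest-confidence warning on the page."""
      if not warning_levels:
          return GRADE["-"]
      worst = max(warning_levels, key=LEVELS.index)
      return GRADE[worst]

  def grade_distribution(pages):
      """pages: iterable of lists of warning levels, one list per document."""
      return Counter(page_grade(levels) for levels in pages)

One could then ask, over a representative sample, whether the fraction of
documents landing in each grade looks anything like the ideal.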

But talking about confidence levels invites comparison with confidence
assessment practices in statistics, so we sorta need to either get quantitative
or get another word.

>(unconditional pass is very hard indeed, but /WAI/ER/ scores it
>at WCAG single-A :-)

I beg your pardon?  There is no purely machinable way to arrive at a WCAG 1.0
single-A assertion.

>
>Now, the Big Issue is assigning priorities.  While the basic principle
>is to describe confidences, that is inevitably often subjective,
>and I'd really like some feedback on whether people agree with my
>assignments.  

Get Jim Ley to build you a spider and gather some data.

Actually, get Jim Ley to build WCAG a spider, and let WCAG do the 'truth'
assessments that you compare with the 'prognostic' assessments that you can
generate in an automated first pass.  Then base the confidence on field
demographics.  Nobody can argue with a statement of the form

"95% of the hit-weighted web content on the web that flunks this test flunks
in-depth manual assessment by the experts of the WCAG WG.  So please give it
your careful attention."
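
A minimal sketch of what I mean by basing confidence on field demographics --
everything here is hypothetical, and the hit weights and expert verdicts would
have to come from the spider crawl and from WCAG reviewers respectively:

  # Hypothetical: per test, the hit-weighted fraction of machine-flagged pages
  # that in-depth expert review also judged to be real failures.
  def empirical_confidence(samples):
      """samples: list of dicts with keys
           'test'        -- which automated test fired
           'hits'        -- traffic weight of the page
           'expert_fail' -- True if WCAG reviewers confirmed a violation
      Returns {test: confidence in [0, 1]}, hit-weighted."""
      flagged, confirmed = {}, {}
      for s in samples:
          flagged[s["test"]] = flagged.get(s["test"], 0) + s["hits"]
          if s["expert_fail"]:
              confirmed[s["test"]] = confirmed.get(s["test"], 0) + s["hits"]
      return {t: confirmed.get(t, 0) / flagged[t] for t in flagged}

A test whose value comes out at 0.95 has earned exactly that kind of statement.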

>I should add that I have made some conscious decisions to
>stray from the True Path of Confidence, in deference to real-world
>considerations.  For example, presentational HTML will generate
>a message "Use CSS for layout and presentation" at WCAG-AA or higher.
>(http://www.w3.org/TR/WCAG10/#tech-style-sheets), but the "border"
>attribute is low-confidence (IMO it's not really harmful and it
>does have legit. uses as a browser workaround) while other
>presentational things will generate higher-confidence warnings.
>

As Kynn has pointed out, a candidate warning nominated by the detection of
presentational attributes in the HTML may be entirely pruned away by checking
for the presence of the CSS to "do it right."  That's a higher-level rollup
that you can do on top of the logic of the checkpoint itself.
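
A sketch of that kind of pruning, purely as illustration -- the attribute list
and the stylesheet test are my assumptions, not how Page Valet or Kynn would
do it:

  # Illustrative pruning rule: drop "use CSS" warnings nominated by
  # presentational attributes when the page already links or embeds a stylesheet.
  PRESENTATIONAL_ATTRS = {"align", "bgcolor", "background", "border"}

  def prune_css_warnings(warnings, page_has_stylesheet):
      """warnings: list of dicts with 'rule' and 'attribute' keys (assumed shape)."""
      if not page_has_stylesheet:
          return warnings
      return [w for w in warnings
              if not (w["rule"] == "use-css-for-presentation"
                      and w.get("attribute") in PRESENTATIONAL_ATTRS)]

Whether the mere presence of a stylesheet is enough, or whether you would want
to check that the specific property is actually handled in the CSS, is itself a
judgement call; the point is only that the rollup sits above the individual
checkpoint tests.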

The kind of thing that tool algorithms can do without treading on WCAG turf is
prioritizing the display to the user of items that are equivalent in WCAG
priority terms.  And there is a lot to be done here.

Heuristics can guess which of the IMG elements lacking ALT is likely the most
egregious offender, so as to convince the person receiving the report that this
is a real problem they need to consider.

These heuristics are a subject for demographic research and the results would
be useful for WCAG in terms of prioritizing their efforts.
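
For what it is worth, here is the sort of heuristic I have in mind, with
made-up weights -- exactly the kind of thing that demographic research would
tune:

  # Made-up scoring heuristic: which IMG lacking ALT is most worth showing first?
  def alt_severity(img):
      """img: dict with 'width', 'height', 'inside_link', 'is_spacer' (assumed shape).
      Higher score = more likely to be an egregious, convincing example."""
      score = 0.0
      area = (img.get("width") or 0) * (img.get("height") or 0)
      if img.get("inside_link"):
          score += 3.0   # the image is a link's only content: navigation is lost
      if area > 10000:
          score += 2.0   # a large image probably carries real content
      if img.get("is_spacer"):
          score -= 2.0   # 1x1 spacer gifs make the least convincing examples
      return score

  def worst_offenders(images_without_alt, n=1):
      return sorted(images_without_alt, key=alt_severity, reverse=True)[:n]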

One example of what is wrong, presented in extenso using the detailed data (the
actual image, for example) with a mockup of an authoring-tool prompt for ALT
(embedded in the surrounding text so that the flow through the ALT text is
graphically obvious) and then, at a hyperlink's remove, a list of similar
violations.  Then on to a qualitatively distinct category of flagged items.
Design the report as an effective web page.

How these groups are ordered can perhaps be influenced by WCAG priority levels,
but within groups you get to play games.

The idea of an exhaustive report is exhausting.  At least to my fevered brain
at the moment, what we want to do is to _lead the user through a full repair
cycle for one defect_ before moving on to others where they must make judgement
calls.  There is a cost-performance ratio to be considered in deciding which
errors to present first.  The ones that are gimme's -- where the fix is easy --
may be
what you want to do first.  Where all they have to do is say "yes, change it to
what you have suggested."  Then gradually move up the cost scale to other items
where they have to work more to make a repair, and down the benefit scale where
the impacts are smaller and/or the evidence less clear.
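
A sketch of that ordering, assuming each flagged item can be tagged with a
rough repair cost and a rough benefit (both of which would be guesses at
first):

  # Assumed shape: each item carries 'cost' (effort to repair) and 'benefit'
  # (impact weighted by confidence).  Present the cheap, high-benefit gimme's
  # first, then walk up the cost scale and down the benefit scale.
  def repair_order(items):
      return sorted(items, key=lambda i: (i["cost"], -i["benefit"]))

An ALT text the tool can propose itself might be cost 1, benefit 5; restyling a
layout table might be cost 8, benefit 3.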

Al

>Please folks, play with it, and let me know if you think my
>confidence levels make sense!
>
>-- 
>Nick Kew
>
>Site Valet - the mark of Quality on the Web.
><URL:http://valet.webthing.com/>
>  

Received on Thursday, 7 February 2002 19:12:59 UTC