Re: Seeking for advices on results from automated web evaluation tools from Karl Groves on 2010-10-04 (w3c-wai-ig@w3.org from October to December 2010)

From: Karl Groves <karl@karlgroves.com>
Date: Mon, 4 Oct 2010 10:32:35 -0400
To: Salinee Kuakiatwong <salinee20@gmail.com>
Cc: w3c-wai-ig@w3.org
Message-ID: <AANLkTi=FcVnbczeacPjBqiuJcu=HwiyHOKiu9uU_0vtL@mail.gmail.com>
Salinee,

Right now there is a big shift in the way automated testing is
performed.  If you were to do an inventory of all of the automated
testing tools out there (free and non-free), you'd find four things
that create differences between one tool and the next:
1) Standards Support: tools which only support WCAG 1.0 vs. those
which support WCAG 2.0
2) What They Test: Tools which test the document source as a string
vs. tools which test the DOM
3) How They Handle Subjective Guidelines:  in other words, what they
do with a guideline that takes human judgement
4) Report Clarity: how clear the reports, including individual findings, are

Under criterion #1, typically what you'll find with "WCAG 1.0/Section
508-only" tools is that they're old.  Their testing rules are out of
date and they're not under active development anymore.  Throw these
away.  Luckily, I don't think there are many "enterprise" automated
testing tools these days that don't support WCAG 2.0.

The second criterion, however, is a big deal.  In the early days of
automated testing tools, what you'd basically have is tools which
tested the document source as it was sent by the server.  I call this
"as a string" because all they'd do is grab the source (just like when
you "View Source" in your browser), parse it to create a
multidimensional array of elements, loop through them running a bunch
of tests, and generate reports.  This is (usually) just fine if what
you're testing is a completely static document with no client-side
interactivity or other scripted DOM manipulation.  The big problem
with such an approach is that, in order to be accurate, an automated
tool must test what the end user is getting.  That means, in the case
of pages with client-side scripting, the testing tool needs to test
the modified version (after the scripting has changed the DOM).  So,
depending upon what each tool tests, you could get significantly
different results simply because they're testing different things
altogether.
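
As a rough illustration of the string vs. DOM difference, here is a
minimal Python sketch using only the standard library.  The page
content, the function name images_missing_alt and the "DOM after
script" string are all made up for the example; a real DOM-based tool
would get that second string from a browser engine rather than have it
written out by hand.

```python
from html.parser import HTMLParser


class ImgAltChecker(HTMLParser):
    """Collects the src of every <img> that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and "alt" not in attributes:
            self.missing_alt.append(attributes.get("src"))


def images_missing_alt(markup):
    """Run the same alt check on whatever markup it is handed."""
    checker = ImgAltChecker()
    checker.feed(markup)
    checker.close()
    return checker.missing_alt


# What the server sends: the only <img> is created later by a script,
# so a source-based tool never sees it as an element.
SOURCE_AS_SENT = """
<html><body>
<div id="banner"></div>
<script>
  document.getElementById("banner").innerHTML = '<img src="sale.png">';
</script>
</body></html>
"""

# Hypothetical serialization of the DOM after the script has run.
DOM_AFTER_SCRIPT = """
<html><body>
<div id="banner"><img src="sale.png"></div>
</body></html>
"""

print(images_missing_alt(SOURCE_AS_SENT))    # prints []
print(images_missing_alt(DOM_AFTER_SCRIPT))  # prints ['sale.png']
```

The check itself is identical in both calls; only the input differs,
which is enough to turn "no issues" into "one image without alt".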

The next issue is how the tools handle guidelines which require
subjective interpretation.
In my own personal research, I've found that there are only 16 basic
types of accessibility tests.  For the sake of brevity I won't list
them here, but they're things like "Element ____ contains attribute
______"  or  "Element _____ has child element ______".    Using this
type of structure (and a tool which can test the DOM), you can
generate hundreds of tests.   *Some* of those tests are rather
absolute in nature.   For instance the classic test for whether an
image has an alt attribute or not is an example of a test that any
tool can perform and report on accurately and clearly.
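
To make the "Element ____ contains attribute ______" pattern concrete,
here is a small Python sketch of one such parameterized test.  The
class and function names are invented for illustration and it works on
a markup string for simplicity; it is not how any particular tool is
implemented.

```python
from html.parser import HTMLParser


class AttributeTest(HTMLParser):
    """Generic test: every <element> start tag must carry <attribute>."""

    def __init__(self, element, attribute):
        super().__init__()
        self.element = element
        self.attribute = attribute
        self.failing_lines = []  # line number of each offending start tag

    def handle_starttag(self, tag, attrs):
        if tag == self.element and self.attribute not in dict(attrs):
            self.failing_lines.append(self.getpos()[0])


def run_attribute_test(markup, element, attribute):
    """Return the line numbers where `element` lacks `attribute`."""
    test = AttributeTest(element, attribute)
    test.feed(markup)
    test.close()
    return test.failing_lines


PAGE = (
    '<html lang="en">\n'
    '<img src="a.png" alt="Logo">\n'
    '<img src="b.png">\n'
    '</html>'
)

# The same template yields many concrete tests just by changing the blanks.
print(run_attribute_test(PAGE, "img", "alt"))
print(run_attribute_test(PAGE, "html", "lang"))
```

With this page the first call reports the image on line 3 and the
second reports nothing.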

It is when you get into subjective interpretation that you begin to
see wildly different results from automated tools.  For instance, when
testing for WCAG 2.0 Success Criterion 1.1.1 we're not just testing
for the existence of an alt attribute.  The criterion covers many
possible situations that each require their own type of alternative
text - some (most) of which simply cannot be tested with any degree of
certainty using automated testing.  In such cases, the best that can
be done is to generate a "warning" signifying that something *might*
be wrong.  In other cases, it might even be prudent not to test a
given success criterion at all.  You will see major
differences in what each tool tests and how they test it when it comes
to subjective tests.  For instance, some may have tests against some
specified string length threshold for img alt attributes and generate
an error if it is too short or too long.  Clearly there are instances
where a short alt attribute is appropriate and only human subjectivity
can determine if that is really an issue, so some may call this a
"warning" and not an error.

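The error vs. warning split described above could be sketched like
this in Python.  The two length thresholds and the name classify_alt
are assumptions made up for the example, not values taken from WCAG or
from any real tool; two tools that pick different thresholds, or that
label the same finding differently, will produce different reports for
the same page.

```python
MIN_ALT_LENGTH = 5    # assumed threshold, chosen only for illustration
MAX_ALT_LENGTH = 125  # assumed threshold, chosen only for illustration


def classify_alt(alt):
    """Return 'error', 'warning' or 'pass' for one image's alt value.

    `alt` is None when the attribute is absent entirely.
    """
    if alt is None:
        # Objective: the attribute is missing, no judgement needed.
        return "error"
    if not MIN_ALT_LENGTH <= len(alt) <= MAX_ALT_LENGTH:
        # Subjective: unusually short or long text *might* be a problem,
        # so flag it for a human instead of failing it outright.
        return "warning"
    return "pass"


for value in (None, "OK", "Company logo", "x" * 200):
    print(repr(value)[:20], "->", classify_alt(value))
```
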
Last is the issue of report clarity.  So I've already listed that
tools may test different things (string vs. DOM), may test more (or
less) thoroughly, and may include issues that require subjective
interpretation.  On top of it all, the reports that each tool gives
you may vary significantly in clarity.  Each tool may use different
nomenclature than the next.  Further, their explanations of issues may
also differ - and the more subjective the test result, the more the
issue descriptions can differ.  What this means for your research is
that both tools might have found a particular issue but report it
differently.  Be sure to read the reports carefully when doing your
comparison.

As Chaals stated, you'll want to do a manual evaluation yourself.  I'd
recommend creating a web page (or, even better, a whole site) filled
with errors, listing those errors as your test criteria, and then
running each tool against it to see how they perform.
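
One possible way to score each tool against such a seeded test page is
sketched below in Python.  The issue identifiers and the two tools'
findings are invented for the example, and it assumes you have already
mapped each tool's own wording onto a shared set of identifiers by
hand.

```python
def score_tool(seeded, reported):
    """Compare one tool's findings with the errors planted in the page."""
    seeded, reported = set(seeded), set(reported)
    found = seeded & reported
    return {
        "found": sorted(found),
        "missed": sorted(seeded - reported),
        "unexpected": sorted(reported - seeded),
        # Share of the planted errors that the tool detected.
        "recall": len(found) / len(seeded) if seeded else 0.0,
    }


# Hypothetical data: errors deliberately planted in the test page ...
SEEDED = {"img-missing-alt", "input-missing-label",
          "html-missing-lang", "low-contrast"}
# ... and what two imaginary tools reported, after normalizing names.
TOOL_A = {"img-missing-alt", "html-missing-lang", "alt-too-short"}
TOOL_B = {"img-missing-alt", "input-missing-label", "html-missing-lang"}

for name, findings in (("Tool A", TOOL_A), ("Tool B", TOOL_B)):
    print(name, score_tool(SEEDED, findings))
```

The "unexpected" list is worth reading by hand, since it is where
warnings about subjective issues tend to show up.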

Best of luck to you,

Karl



On Mon, Oct 4, 2010 at 4:14 AM, Salinee Kuakiatwong <salinee20@gmail.com> wrote:
> Dear All,
> I'm writing a research paper to investigate the inter-reliability of
> automated evaluation tools. I used two automated web evaluation tools to
> scan the same web pages. The findings indicate there are large
> discrepancies in the results between the two tools although they're based on
> the same standard (WCAG 2.0).
> I'm new to the field. Any explanation for such a case?
> Thanks!
Received on Monday, 4 October 2010 14:55:06 UTC