% and "almost passed" Re: browsable test results from Charles McCathieNevile on 2003-09-12 (www-qa@w3.org from September 2003)

From: Charles McCathieNevile <charles@w3.org>
Date: Fri, 12 Sep 2003 12:43:44 -0400 (EDT)
To: Patrick Curran <Patrick.Curran@Sun.COM>
Cc: david_marston@us.ibm.com, www-qa@w3.org, w3c-wai-er-ig@w3.org
Message-ID: <Pine.LNX.4.55.0309121223420.25313@homer.w3.org>

On Fri, 12 Sep 2003, Patrick Curran wrote:

>david_marston@us.ibm.com wrote:
>>From the point of view of QAWG guidelines, I see a problem in the
>result table where you report that a product has passed 100% of the
>test cases for each group. The string "No failures found" would be
>more appropriate. The issues about percentages include:
>1. Implication that the current suite is 100% of all the tests that
>   should be there. Test suites that are being expanded frequently
>   won't have a stable notion of 100%.
>
Patrick:
>Isn't there a more fundamental problem? Test suites should be versioned. It
>ought to be OK to state "I passed x% of the test cases for version y.z of
>the test suite." The number of test cases in any particular version of the
>test suite should be, of course, fixed.

Well, there may be a case for making a statement like that, but in general
David's following points (and your agreement) suggest to me that claiming
percentages is not often a good thing.

EARL explicitly rejected the idea in discussions in the past, although we did
discuss the idea of people subclassing earl:fail - a failure because some
prerequisite requirement is failed isn't always the same thing as a total
catastrophic failure, and this would be useful in certain domains. (Just as
the OWL group seem to find duration of a test useful information in their
domain).

An RDF processor that can do every spec in the framework, and all the OWL
stuff, except that it uses dc: and foaf: as built-in non-varying namespaces
and doesn't check the xmlns declaration for them might pass almost all of the
tests, but it does have a major flaw. Or is it a trivial quirk?

The other notion implicit in the EARL design is that Test Suites are
collections of tests, and that one test is "passing all of a Test Suite" - in
other words they are quite likely to be nestable. Hence the heuristic mode
for a test.

Put another way, if test suite B is really test suite A plus tests b.1 b.2
b.3 and minus a.6, then passing A, b.1, b.2, b.3 probably means you pass B.
Passing A except for a.6, and passing b.1, b.2, b.3 also means you pass B.
(This is a simplification of the real-world situation with WCAG [1] and the
US government's Section 508 requirements for accessible web content.)
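
As a rough illustration of that nesting, here is a minimal Python sketch that treats a suite as a plain set of test identifiers. The identifiers a.1 to a.6 and the helper passes_suite are hypothetical, made up for this example; they are not taken from WCAG, Section 508 or EARL.

```python
# Hypothetical test identifiers; not taken from WCAG or Section 508.
SUITE_A = frozenset({"a.1", "a.2", "a.3", "a.4", "a.5", "a.6"})

# Suite B is suite A, minus a.6, plus b.1, b.2 and b.3.
SUITE_B = (SUITE_A - {"a.6"}) | {"b.1", "b.2", "b.3"}


def passes_suite(passed, suite):
    """Return True when every test in `suite` is among the passed tests."""
    return suite <= set(passed)


# Passing all of A plus b.1, b.2, b.3 implies passing B.
passed_everything = SUITE_A | {"b.1", "b.2", "b.3"}

# Passing A except for a.6, plus b.1, b.2, b.3, also implies passing B,
# even though suite A itself is not passed.
passed_without_a6 = (SUITE_A - {"a.6"}) | {"b.1", "b.2", "b.3"}
```

With these sets, both result sets satisfy B, while only the first one satisfies A.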

One of the requirements of EARL was to support gathering different test
results from different sources and using them together (as Shadi said) to get
some interesting information. In real world use for accessibility there are a
number of implemented tools and there is active development. These overlaps
get very complex, but having the raw information about which things were
deemed to have been passed (and by whom, to provide for trust management)
makes this possible - even if some results describe a suite of things that
were passed, such as "double-A conformance to WCAG". Having a percentage of
things passed (or more to the point failed) breaks this pretty
comprehensively for many testing processes - although it is encouraging to
generate the number and see it going up...
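
To make the point about raw results concrete, here is a small Python sketch, assuming each result is recorded as a simple (assertor, test, outcome) tuple. This is not the EARL vocabulary, and the tool names, test ids and helper functions are invented for illustration. Per-test results with a known source can be merged and filtered by trust; a bare percentage from each source could not be combined this way.

```python
from collections import defaultdict

# Each raw result is (assertor, test_id, outcome). All values are made up.
RESULTS = [
    ("tool-x", "t.1", "pass"),
    ("tool-x", "t.2", "fail"),
    ("tool-y", "t.2", "pass"),
    ("tool-y", "t.3", "pass"),
    ("human-reviewer", "t.3", "fail"),
]


def merge_results(results, trusted):
    """Group outcomes per test, keeping only assertors listed in `trusted`."""
    merged = defaultdict(dict)
    for assertor, test_id, outcome in results:
        if assertor in trusted:
            merged[test_id][assertor] = outcome
    return dict(merged)


def disputed(merged):
    """Return the sorted test ids on which the kept assertors disagree."""
    return sorted(
        test_id
        for test_id, by_assertor in merged.items()
        if len(set(by_assertor.values())) > 1
    )
```

Changing the `trusted` set changes which results are considered, which is roughly what keeping the "by whom" information buys you.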

cheers

chaals

[1] http://www.w3.org/TR/WCAG10
David:
>2. Implication that high numbers under 100% are pretty good. Each
>   class of product may have its own notion of how seriously interop
>   has been hurt by a score of, say, 96%.
>3. Implication that all tests count equally. Product X's 96% might
>   be much worse than Product Y's 96%, depending on the cases that
>   comprise the failing 4% on each.
>.................David Marston
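
A tiny Python sketch of point 3, using invented test names and two hypothetical products: both score 96% on a 25-test suite, but the single failure differs, so the percentage alone cannot tell them apart.

```python
# A hypothetical suite of 25 tests named t.1 .. t.25.
SUITE = [f"t.{i}" for i in range(1, 26)]

# Each product fails exactly one (different) test; names are made up.
FAILED = {
    "product-x": {"t.1"},   # imagine this is a core feature
    "product-y": {"t.25"},  # imagine this is an obscure corner case
}


def pass_rate(failed, suite):
    """Fraction of tests in `suite` that are not in `failed`."""
    return (len(suite) - len(failed & set(suite))) / len(suite)


rate_x = pass_rate(FAILED["product-x"], SUITE)
rate_y = pass_rate(FAILED["product-y"], SUITE)
```

Both rates come out as 24/25, yet the sets of failing tests have nothing in common.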

Patrick:
>Right. Attempting to interpret any claim other than "I passed all the tests"
>is a dangerous business. It can have value for the implementors, and help
>them to figure out where they need to improve, but should not be used to
>compare one implementation with another...

Received on Friday, 12 September 2003 12:43:44 UTC