Re: Proposal: severity axis on test result

Charles McCathieNevile wrote:
 > [numeric value vs enumerated value for confidence and severities]
 >
 > You will have an internal mapping of numeric results that is
 > specific to your needs, and the next person will too, but theirs
 > will be slightly different. If you map both of those to the same
 > EARL scheme by numeric value, you lose the ability to compare
 > between them without doing complicated searching based on who said
 > what and then doing the mapping

I'm not convinced you have that ability now.

If Alice has two confidence levels ('confident' and 'unsure') and Bob
has two confidence levels ('1' and '2'), and Alice maps 'confident' to
EARL's 'high' confidence and 'unsure' to EARL's 'low' confidence,
while Bob maps '1' to 'high' and '2' to 'medium', then your data is
just as useless as if they had mapped their values to four different
numbers.
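
To sketch that in Python (the mappings here are invented, obviously):

    # Alice and Bob each map their local confidence scale onto EARL's
    # enumerated levels, but they never agreed on a convention:
    alice_to_earl = {"confident": "high", "unsure": "low"}
    bob_to_earl   = {"1": "high", "2": "medium"}

    # Alice's 'unsure' and Bob's '2' mean much the same thing locally,
    # yet come out as different EARL values, so the merged data can't
    # be compared without first working out who said what:
    print(alice_to_earl["unsure"])   # -> low
    print(bob_to_earl["2"])          # -> medium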

I don't think EARL having enumerated types will make it any easier to
compare merged results from people who did not agree on a convention
beforehand.

Similarly, if David has a numeric confidence scale, then when he maps
it to EARL's three levels he is actually going to lose detail.

I think it is important that you not _lose_ data when exporting to
EARL. IMHO, you should be able to round-trip data from a system to
EARL and back again without losing information. (The opposite is not
true; there is no guarantee that arbitrary EARL can be round-tripped
through an arbitrary EARL-aware system, because that depends entirely
on that system's capabilities.)
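
To make David's problem concrete, a Python sketch (the quantisation
bands here are invented; EARL doesn't define any):

    def export_confidence(value):
        """Squeeze a 0.0-1.0 confidence into three enumerated levels."""
        if value > 0.66:
            return "high"
        if value > 0.33:
            return "medium"
        return "low"

    def import_confidence(level):
        """Best effort on the way back: the midpoint of each band."""
        return {"high": 0.83, "medium": 0.5, "low": 0.17}[level]

    # The round trip is lossy: 0.9 goes out as "high", comes back 0.83.
    assert import_confidence(export_confidence(0.9)) != 0.9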


 > you don't get essentially meaningless EARL data like "this is 66%
 > conformant to a checkpoint that may or may not have a definition of
 > what that means"

"this is 'low' conformant to a checkpoint that may or may not have a
definition of what that means"

I don't see how enumerated types in any way solve that problem.


 >>> For some use cases your Yb - pass with unrelated errors would
 >>> count as a pass, and for some cases it would score as a fail. So
 >>> we would need to know what they are.
 >
 > We need to know what the use case is, and what the errors are, to
 > categorise whether a result means "only failed because of
 > dependencies" or "passed the essential bit".

Oh, right. Ok. This test:

    http://www.hixie.ch/tests/adhoc/css/inheritance/content/002.xml

...would get a Y if it read:

    (Control: PASSED)
    Test 1 has: PASSED
    Test 2 has: PASSED

However, if it said that but in addition some of the lines overlapped
each other, then the test would have passed with unrelated errors, and
would get a Yb (for "Yes with bugs"). (PASS, Confidence: 100%,
Severity: 90%)

However, if it read

    (Control: PASSED)
    Test 1 has:
    Test 2 has:

...as it does in recent Mozilla builds, then the test would get a B
(for "Buggy") (FAIL, Confidence: 100%, Severity: 50%).


Does that answer your question?


 > But maybe this isn't necessary if we just say "use a more atomic
 > test". I don't know if that works in the real world or not.

Atomic testing is not the only kind of testing required. It is
absolutely imperative that browsers be tested with very complicated
tests, as many bugs only become apparent when two or more features
interact. If EARL is to be useful in this scenario, it has to deal
with non-atomic tests.

For example, the Box Acid Test:

    http://www.w3.org/Style/CSS/Test/CSS1/current/test5526c.htm

If this test looks like the screenshot, then in my nomenclature the UA
gets a Y ("Yes") for this test. If it looks like the screenshot but
the radio buttons don't work (e.g. the UA thinks they are all part of
the same group), then it gets a Yb ("Yes, with minor unrelated bugs").
If the page is mostly correct but the <body> isn't sized correctly, it
gets a B ("Buggy"), and if the page doesn't look right, it gets a D
("Destroys this feature", since a wrong-looking page means the UA
doesn't support the block box model, a critical part of CSS1).

This is not an atomic test; it's one of the most complicated tests in
the W3C CSS1 test suite. And it is far from the most complicated CSS
test around.


 >> You don't need "Can't Tell" as that should just be "Pass" or "Fail"
 >> with "Confidence: 0%". (Whether you pick "pass" or "fail" depends
 >> on which is the "default" -- e.g. in a test where supporting the
 >> feature correctly and not even trying to support the feature are
 >> indistinguishable, you would use "Pass", while in a test where
 >> trying to support the feature but doing so _incorrectly_ is
 >> indistinguishable from not supporting the feature at all you would
 >> use "Fail".)
 > Not sure that I agree here. Having a pass and a fail be equivalent
 > under circumstances that are outside the RDF model we are using
 > seems likely to lead to problems in processing.

They aren't equivalent. Pass with 0% confidence is a good result (it
means the UA could in fact have passed the test completely), whereas
Fail with 0% confidence is a bad result (the test definitely failed;
it's just unclear whether that was because of lack of support or
because of a bug).


 > If there is a cannotTell then I think it should be a single defined
 > property. More to the point, what is the use case for it in the
 > first place?

This test:

    http://www.hixie.ch/tests/evil/css/import/extra/linklanguage.html

...looks exactly the same in UAs that support CSS completely and in
UAs that have zero CSS support. Therefore you can't tell whether or
not it passed, only whether it failed.

So if the test fails, it gets a B (FAIL, Confidence: 100%, Severity:
50%), but if it passes it gets an M (for "Maybe"; in EARL terms that
would be PASS, Confidence: 0%, Severity: 100%).
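
In Python terms (same invented tuple layout as before):

    def grade_linklanguage(looks_correct):
        """Grade the test from the only thing you can observe."""
        if looks_correct:
            # M: may have passed, may be zero CSS support -- can't tell
            return ("PASS", 0, 100)
        # B: definitely failed
        return ("FAIL", 100, 50)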

If you were checking whether the UA passed a checkpoint that had this
test as a prerequisite, then you would say the checkpoint had passed
if the test was marked PASS, even with Confidence: 0%.

On the other hand, if you had a test with FAIL, Confidence: 0%, then
you would know that the checkpoint had not been passed; you just
wouldn't know for sure whether it had failed because of intentional
lack of support or because of a bug.
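
That logic, sketched in Python (hypothetical structures, not anything
EARL defines):

    def checkpoint_passed(test_results):
        """test_results: (outcome, confidence) pairs for the
        checkpoint's prerequisite tests."""
        return all(outcome == "PASS" for outcome, _ in test_results)

    # PASS with Confidence: 0% still counts towards the checkpoint...
    assert checkpoint_passed([("PASS", 0)])
    # ...but FAIL with Confidence: 0% means it was not passed; you just
    # don't know whether the failure was a bug or missing support.
    assert not checkpoint_passed([("FAIL", 0)])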


 >> "Not Applicable" is important for tests that neither pass nor fail, such as a
 >> test for making sure all images have alternate text, when applied to a document
 >> with no images, or a test to make sure that 'red' is rendered different from
 >> 'green', on a monochromatic device.
 >
 > (CMN I agree with your conclusion, but testing whether 'green' and 'red' are
 > rendered differently on a monochromatic device is in fact important...)

...or a test for whether the 'color' property is correctly parsed, in
a UA that doesn't support CSS, then. :-)

-- 
Ian Hickson                                      )\._.,--....,'``.    fL
"meow"                                          /,   _.. \   _\  ;`._ ,.
http://index.hixie.ch/                         `._.-(,_..'--(,_..'`-.;.'
