Re: Proposal: severity axis on test result

On Mon, 1 Jul 2002, Ian Hickson wrote:

  Charles McCathieNevile wrote:
  > Hmm. This was similar to something the UAAG group wanted.
  >
  > I would propose that we have qualitative rather than quantitative ratings.

  Why? I personally would much rather both "confidence" and "severity" were
  changed to use percentages (or some other scale, e.g. 0.0 .. 1.0). Having an
  enumerated set of values artificially limits applications. For example, I need
  four severities (100%, 90%, 50%, 0%) and two confidence levels (100%, 0%),
  whereas the current spec only allows for 1 severity (100%) and three confidence
  levels (~100%, ~66%, ~33%).

  Using a numeric scale doesn't reduce interoperability, as applications can
  simply "pigeon hole" values into their internal enumerated types. (The problem
  of round tripping through systems that use enumerated sets is already present,
  since it is most likely that applications will not have exactly matching sets of
  severities and confidences.)

  For example, I intend to map any Pass values with severity 80%-99% into the
  "pass with unrelated errors (Yb)" category when summarising results.

  (Note: Internally, I store severities and confidences as integers from 0 to 255.
  That is the natural computer equivalent of percentages. I would be quite happy
  if EARL used the integer scale 0..255. I would also be fine with EARL using a
  floating point scale between two arbitrary values, such as 0.0 and 1.0.)

CMN

Exactly my point. You will have an internal mapping of numeric results that
is specific to your needs, and the next person will too, but theirs will be
slightly different. If you map both of those to the same EARL scheme by
numeric value, you lose the ability to compare between them without doing
complicated searching based on who said what and then doing the mapping. If
instead you have them as separate namespaces attached to the comments, then
you only need to do the namespace mapping itself, and you don't get
essentially meaningless EARL data like "this is 66% conformant to a
checkpoint that may or may not have a definition of what that means" (your
own test setup should have a definition of numeric values if it is going to
generate them...)
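
(To make that concrete, here is a rough Python sketch. The tool names,
categories and thresholds are all invented for illustration; nothing here
comes from the current draft.)

    # Two hypothetical tools, each flattening its own categories onto 0..100.
    TOOL_A = {"pass": 100, "pass-with-unrelated-errors": 90,
              "partial": 50, "fail": 0}
    TOOL_B = {"ok": 100, "minor-problems": 60, "fail": 0}

    def compare_by_number(a_value, b_value):
        # Comparing the raw numbers looks easy, but 50 from tool A and 60
        # from tool B are not the same notion of "partly passed" -- to find
        # out what they meant you have to go back to who said what.
        return a_value == b_value

    # Keeping the tool-specific term (effectively a separate namespace) and
    # mapping the terms explicitly keeps the comparison meaningful:
    A_TO_B = {"pass": "ok",
              "pass-with-unrelated-errors": "minor-problems",
              "partial": "minor-problems",
              "fail": "fail"}

    def compare_by_term(a_term, b_term):
        return A_TO_B[a_term] == b_term

    print(compare_by_number(TOOL_A["partial"], TOOL_B["minor-problems"]))  # False
    print(compare_by_term("partial", "minor-problems"))                    # True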

  > For some use cases your Yb - pass with unrelated errors would count as a pass,
  > and for some cases it would score as a fail. So we would need to know what
  > they are.

  Assuming "they" refers to the "unrelated errors" then yes, you would; that's
  what the "comments" field is for, presumably.

CMN
Sorry, my bad writing. What I meant was that we need to know what the use
case is, and what the errors are, to categorise whether a result means "only
failed because of dependencies" or "passed the essential bit". But maybe this
isn't necessary if we just say "use a more atomic test". I don't know if that
works in the real world or not.

  > The question also arises as to how many kinds of result we should include in
  > earl and at what point we should leave people to subclass them for their own
  > more detailed uses.

  I think you only need:

      Pass
      Fail
      Not Applicable
      Not Tested

  All the other values I can think of are simply variants of those four result
  types with various values for "Severity" and "Confidence".

  You don't need "Can't Tell" as that should just be "Pass" or "Fail" with
  "Confidence: 0%". (Whether you pick "pass" or "fail" depends on which is the
  "default" -- e.g. in a test where supporting the feature correctly and not even
  trying to support the feature are indistinguishable, you would use "Pass", while
  in a test where trying to support the feature but doing so _incorrectly_ is
  indistinguishable from not supporting the feature at all you would use "Fail".)

CMN Not sure that I agree here. Having a pass and a fail be equivalent under
circumstances that are outside the RDF model we are using seems likely to
lead to problems in processing. If there is a cannotTell then I think it
should be a single defined property. More to the point, what is the use case
for it in the first place?
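
(As a toy illustration of the processing worry - again, everything here is
invented, not drawn from the draft - a consumer that wants to recover
"cannot tell" from a result-plus-confidence pair has to know, per test,
whether the author chose Pass or Fail as the zero-confidence default, and
that knowledge sits outside the reported data:)

    # Invented for illustration only: the test names and the per-test
    # "defaults" table are not part of any EARL draft.
    RESULTS = [
        ("test-img-alt",  "Pass", 0),    # author used Pass as the 0% default
        ("test-colour",   "Fail", 0),    # author used Fail as the 0% default
        ("test-headings", "Fail", 100),
    ]

    # The consumer needs this out-of-band table to tell a genuine result
    # from a "cannot tell" -- exactly the knowledge the RDF itself lacks.
    ZERO_CONFIDENCE_DEFAULT = {"test-img-alt": "Pass", "test-colour": "Fail"}

    def interpret(test, outcome, confidence):
        if confidence == 0 and ZERO_CONFIDENCE_DEFAULT.get(test) == outcome:
            return "cannot tell"
        return outcome

    for row in RESULTS:
        print(row[0], "->", interpret(*row))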

  Note that "Not Tested" is present only for completeness, as I expect most
  applications would simply not include the result in that case.

CMN agree. At some point when we have some tools we should look to see if
there is a use case that justifies it - otherwise we might as well remove it
as dead weight.

  "Not Applicable" is important for tests that neither pass nor fail, such as a
  test for making sure all images have alternate text, when applied to a document
  with no images, or a test to make sure that 'red' is rendered different from
  'green', on a monochromatic device.

(CMN I agree with your conclusion, but testing whether 'green' and 'red' are
rendered differently on a monochromatic device is in fact important...)

-- 
Charles McCathieNevile    http://www.w3.org/People/Charles  phone: +61 409 134 136
W3C Web Accessibility Initiative     http://www.w3.org/WAI  fax: +33 4 92 38 78 22
Location: 21 Mitchell street FOOTSCRAY Vic 3011, Australia
(or W3C INRIA, Route des Lucioles, BP 93, 06902 Sophia Antipolis Cedex, France)

Received on Monday, 1 July 2002 21:51:30 UTC