Re: last part of todays telecon from Charles McCathieNevile on 2001-12-01 (w3c-wai-gl@w3.org from October to December 2001)

From: Charles McCathieNevile <charles@w3.org>
Date: Fri, 30 Nov 2001 22:48:17 -0500 (EST)
To: Al Gilman <asgilman@iamdigex.net>
cc: <w3c-wai-gl@w3.org>
Message-ID: <Pine.LNX.4.30.0111302230520.30991-100000@tux.w3.org>
Sorry folks, but I have to agree with Kynn here: "we think that a group of
people will agree on this" seems like a particularly subjective measure, and
since it is not defined who the "we" that think this are, it doesn't seem to
be one that I could reasonably support.

For the record I do not necessarily believe that 80% of people will come to
the same conclusion on Guideline 1 as written, and cite the rather endless
discussions over how to do good alt text as one piece of evidence. Without
testing this assertion I don't think it has a lot of value, and it doesn't
accord with my experience.

I agree with Al that breaking the questions down will make it easier to get
agreement on them, and I think that is what the success criteria do for our
guidelines. I disagree that if we try to provide percentage ranges we will
get closer agreement. Testing is a difficult area, but there is a good reason
why most testing uses a scale of no more than 7 plus or minus two possible
responses.

I would like to propose a different test of whether we think something is
sufficiently clear to be a normative requirement. I am aware that this
involves more work than the process of guessing how people will decide, and
that "they" don't get work done, "we" do, and only by doing it. However, I
propose the following five steps:

1. For any requirement we collect test cases, which specify which requirement
they are a test case for.

2. Any person can submit a test case, and cases will be accepted unless the
working group is convinced that a test case merely duplicates an existing
case.

3. If there are at least 12 test cases for a requirement, the working group
is asked to decide whether each test case meets or fails the requirement.

4. Until one half of the members in good standing have submitted a judgement
on each test case associated with a requirement, we are undecided.

5. If one half of the members have submitted judgements, and for each test
case more than 80% of the judgements agree, we declare that we have
sufficient consensus to take this forward at working draft level. (I would
propose that to get to last call we raise the bar to somewhere near 95%.)

Note that I have proposed that if any one test case doesn't get the 80%
agreement then the whole requirement is not yet ready. I think this is
important.
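
The five steps above can be sketched roughly as follows. This is only an
illustration, not part of the proposal itself: the function name
`requirement_status`, the status strings, and the shape of the `judgements`
mapping are all hypothetical. It also assumes that the one-half quorum is
counted per test case, and that a requirement with fewer than 12 test cases
is simply not evaluated yet (the steps do not say what happens in that case).

```python
from fractions import Fraction

MIN_TEST_CASES = 12                    # step 3: at least 12 test cases
QUORUM = Fraction(1, 2)                # step 4: one half of members in good standing
WORKING_DRAFT_BAR = Fraction(80, 100)  # step 5: more than 80% must agree


def requirement_status(judgements, members, bar=WORKING_DRAFT_BAR):
    """Classify one requirement from the judgements submitted so far.

    judgements: {test_case_id: {member: True if "meets", False if "fails"}}
    members:    set of members in good standing
    Returns one of four hypothetical status strings.
    """
    if len(judgements) < MIN_TEST_CASES:
        return "too few test cases"  # assumption: not evaluated yet
    if not members:
        return "undecided"

    # Only judgements from members in good standing are counted.
    counted = {
        case: [vote for member, vote in votes.items() if member in members]
        for case, votes in judgements.items()
    }

    # Step 4: undecided until every test case has a quorum of judgements.
    for votes in counted.values():
        if Fraction(len(votes), len(members)) < QUORUM:
            return "undecided"

    # Step 5: every single test case needs MORE than `bar` agreement,
    # whichever way (meets or fails) the majority went.
    for votes in counted.values():
        meets = sum(votes)
        majority = max(meets, len(votes) - meets)
        if Fraction(majority, len(votes)) <= bar:
            return "not ready"
    return "sufficient consensus"
```

Under these assumptions, a single test case that splits 4 to 1 (exactly 80%)
leaves the whole requirement "not ready", which is the behaviour the note
above asks for. Passing `bar=Fraction(95, 100)` would model the higher bar
suggested for last call.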

cheers

Charles

On Fri, 30 Nov 2001, Al Gilman wrote:

  At 11:38 AM 2001-11-30 , Kynn Bartlett wrote:
  >At 5:37 PM -0600 11/29/01, Gregg Vanderheiden wrote:
  >>We began by reviewing the guidelines one at a time to determine whether
  >>or not:
  >>1. they met the “80% or better” (80%+) objectivity criterion
  >>
  >>For number 1, "Provide a text equivalent for all non-text content", we
  >>found:
  >>• We believed it would pass the “80%+” objective test
  >>
  >>For guideline number 2, "Provides synchronized media equivalence for
  >>time dependent presentations", we found:
  >>• We believed items 1 and 2 would pass the 80%+ objectivity test
  >
  >Now I'm getting even more weirded out by this "objectivity"
  >we've embraced.  In my last email, I said:
  >
  >      So 80% of people, who meet subjective criteria for inclusion,
  >      then make subjective determinations, and if they happen to agree,
  >      we label this "objective"?
  >
  >Apparently the way we are using our newfound "objectivity" criteria
  >is as follows:
  >
  >      A group of people -- who may or may not meet subjective criteria
  >      for inclusion -- "reach consensus" on whether or not they
  >      subjectively believe that at least 80% of an undefined group of
  >      people -- who meet subjective criteria for inclusion -- would
  >      make agreeing subjective determinations on arbitrary undefined
  >      specific applications, ... and we label this "objective?"
[snip]
  AG::

  Let me suggest an interpretation of how we might describe the facts as they
  regard these two guidelines.

  Preview:  The meta-question that I think we are dealing with, here, could be
  stated "Have we defined a boolean yes/no question which would generate
  substantial agreement among the results of a pool of reasonably qualified
  evaluators?"

  To get at what is actually going on, we have to break the guidelines down into
  more fine-grained steps, because the answers for the pieces are different to a
  significant degree.

  Guideline 1:

  Question:

  Does the content "Provide a text equivalent for all non-text content"?

  Subquestion 1: Are items of non-text content identifiable?  [objective]
  Subquestion 2: Are text items provided and associated with the non-text items?
  [objective]
  Subquestion 3: Are the text items equivalent to the associated non-text items?
  [judgemental]

[snip]

  In the above two lists of sub-questions, I have marked some as
  'judgemental' where in my expectation, the level of agreement on boolean
  outcomes would be predictably and significantly less than in the case of
  the sub-questions that I have rated as 'objective.' An added dimension of
  these questions is that I expect that if the raters had a more
  differentiated scale with which to express the form and degree of
  equivalence between the content fragments compared, that a more stable
  and convincing sort of agreement would be visible in the data than in the
  case that the raters are forced to express a yes/no answer.  In other
  words, for these sub-questions I believe that by refining the question
  asked of the evaluators, we could achieve repeatability in the evaluation
  results, although at the present level of roll-up of the results there
  could be problems in the repeatability of the results of evaluating with
  respect to this question.
[snip]
  In particular, I would suggest the following battery of questions as a
  means of rating a pair A,B of media objects or fragments.  There are
  three ways the question gets subdivided.  First, rate B as a substitute
  for A and independently rate A as a substitute for B.  Second, rate
  breadth of information coverage separately from depth, so that a summary
  or precis would have the same breadth and less depth than what it is
  summarizing.  Third, ask for percentage coverage answers and not yes/no
  answers.
Received on Friday, 30 November 2001 22:48:22 UTC