- From: Charles McCathieNevile <charles@w3.org>
- Date: Fri, 30 Nov 2001 22:48:17 -0500 (EST)
- To: Al Gilman <asgilman@iamdigex.net>
- cc: <w3c-wai-gl@w3.org>
Sorry folks, but I have to agree with Kynn here: "we think that a group of people will agree on this" seems like a particularly subjective measure, and since it is not defined who the "we" that think this are, it doesn't seem to be one that I could reasonably support. For the record, I do not necessarily believe that 80% of people will come to the same conclusion on Guideline 1 as written, and I cite the rather endless discussions over how to do good alt text as one piece of evidence. Without testing this assertion I don't think it has a lot of value, and it doesn't accord with my experience.

I agree with Al that breaking the questions down will make it easier to get agreement on them, and I think that is what the success criteria do for our guidelines. I disagree that if we try to provide percentage ranges we will get closer agreement. Testing is a difficult area, but there is a good reason why most testing uses a scale of no more than seven plus or minus two possible responses.

I would like to propose a different test of whether we think something is sufficiently clear to be a normative requirement. I am aware that this involves more work than the process of guessing how people will decide, and that "they" don't get work done, "we" do, and only by doing it. However, I propose the following five steps:

1. For any requirement we collect test cases, each of which specifies which requirement it is a test case for.

2. Any person can submit a test case, and cases will be accepted unless the working group is convinced that a test case merely duplicates an existing one.

3. If there are at least 12 test cases for a requirement, the working group is asked to decide, for each test case, whether it meets or fails the requirement.

4. Until one half of the members in good standing have submitted a judgement on each test case associated with a requirement, we are undecided.

5. If one half of the members have submitted judgements, and for each test case more than 80% of the judgements agree, we declare that we have sufficient consensus to take the requirement forward at working draft level. (I would propose that to get to last call we raise the bar to somewhere near 95%.)

Note that I have proposed that if any one test case doesn't get the 80% agreement, then the whole requirement is not yet ready. I think this is important. (A rough sketch of this decision rule appears after the quoted thread below.)

cheers

Charles

On Fri, 30 Nov 2001, Al Gilman wrote:

At 11:38 AM 2001-11-30, Kynn Bartlett wrote:
>At 5:37 PM -0600 11/29/01, Gregg Vanderheiden wrote:
>>We began by reviewing the guidelines one at a time to determine whether
>>or not:
>>1. they met the “80% or better” (80%+) objectivity criterion
>>
>>For number 1, “Provide a text equivalent for all non-text content”, we
>>found:
>>• We believed it would pass the “80%+” objective test
>>
>>For guideline number 2, “Provides synchronized media equivalence for
>>time dependent presentations”, we found:
>>• We believed items 1 and 2 would pass the 80%+ objectivity test
>
>Now I'm getting even more weirded out by this "objectivity"
>we've embraced. In my last email, I said:
>
>  So 80% of people, who meet subjective criteria for inclusion,
>  then make subjective determinations, and if they happen to agree,
>  we label this "objective"?
>
>Apparently the way we are using our newfound "objectivity" criteria
>is as follows:
>
>  A group of people -- who may or may not meet subjective criteria
>  for inclusion -- "reach consensus" on whether or not they
>  subjectively believe that at least 80% of an undefined group of
>  people -- who meet subjective criteria for inclusion -- would
>  make agreeing subjective determinations on arbitrary undefined
>  specific applications, ... and we label this "objective?"

[snip]

AG:: Let me suggest an interpretation of how we might describe the facts as they regard these two guidelines.

Preview: The meta-question that I think we are dealing with here could be stated: "Have we defined a boolean yes/no question which would generate substantial agreement among the results of a pool of reasonably qualified evaluators?"

To get at what is actually going on, we have to break the guidelines down into more fine-grained steps, because the answers for the pieces are different to a significant degree.

Guideline 1:

Question: Does the content "Provide a text equivalent for all non-text content"?

Subquestion 1: Are items of non-text content identifiable? [objective]
Subquestion 2: Are text items provided and associated with the non-text items? [objective]
Subquestion 3: Are the text items equivalent to the associated non-text items? [judgemental]

[snip]

In the above two lists of sub-questions, I have marked some as 'judgemental' where, in my expectation, the level of agreement on boolean outcomes would be predictably and significantly less than in the case of the sub-questions that I have rated as 'objective'.

An added dimension of these questions is that I expect that if the raters had a more differentiated scale with which to express the form and degree of equivalence between the content fragments compared, a more stable and convincing sort of agreement would be visible in the data than in the case where the raters are forced to give a yes/no answer. In other words, for these sub-questions I believe that by refining the question asked of the evaluators we could achieve repeatability in the evaluation results, although at the present level of roll-up of the results there could be problems in the repeatability of evaluations against this question.

[snip]

In particular, I would suggest the following battery of questions as a means of rating a pair A,B of media objects or fragments. There are three ways the question gets subdivided. First, rate B as a substitute for A, and independently rate A as a substitute for B. Second, rate breadth of information coverage separately from depth, so that a summary or precis would have the same breadth as, and less depth than, what it is summarizing. Third, ask for percentage coverage answers and not yes/no answers.
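Here is a minimal sketch of the decision rule in Charles's five-step proposal above. The function name, the data shapes, and the status strings are assumptions made only for illustration; nothing here is defined by the working group.

```python
from collections import Counter

def requirement_status(judgements_by_case, members_in_good_standing,
                       agreement_threshold=0.80):
    """Sketch of steps 3-5: returns 'undecided', 'not ready', or 'consensus'.

    judgements_by_case maps each test-case id to a dict of
    {member: 'meets' | 'fails'} judgements (hypothetical shape).
    """
    # Step 3: the question is only put once at least 12 test cases exist.
    if len(judgements_by_case) < 12:
        return "undecided"

    quorum = len(members_in_good_standing) / 2

    # Step 4: half the members in good standing must have judged every case.
    if any(len(j) < quorum for j in judgements_by_case.values()):
        return "undecided"

    # Step 5 plus the closing note: every single test case must clear the
    # bar; one case at or below the threshold leaves the requirement not ready.
    for judgements in judgements_by_case.values():
        top_count = Counter(judgements.values()).most_common(1)[0][1]
        if top_count / len(judgements) <= agreement_threshold:
            return "not ready"

    return "consensus"  # pass agreement_threshold=0.95 for the last-call bar
```

For example, with 20 members in good standing, a test case judged by 10 members of whom 9 agree counts as 90% agreement and clears the working-draft bar, but the same split on every case would still not reach the proposed 95% last-call bar.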
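Similarly, the three-way battery Al suggests at the end of the quoted message could be captured roughly as follows. The class and field names are hypothetical, chosen only to show the two directions of substitution, the breadth/depth split, and the percentage-coverage answers.

```python
from dataclasses import dataclass

@dataclass
class DirectionalRating:
    """How well one object substitutes for the other, in one direction."""
    breadth_coverage_pct: float  # share of the information types covered
    depth_coverage_pct: float    # share of the detail carried for those types

@dataclass
class EquivalenceRating:
    """Rating for a pair A,B: each direction is rated independently."""
    b_for_a: DirectionalRating   # B as a substitute for A
    a_for_b: DirectionalRating   # A as a substitute for B

# Example (invented numbers): a precis of a long article might cover the
# article's full breadth but much less of its depth, while the article
# trivially covers the precis in both respects.
precis_vs_article = EquivalenceRating(
    b_for_a=DirectionalRating(breadth_coverage_pct=100.0, depth_coverage_pct=30.0),
    a_for_b=DirectionalRating(breadth_coverage_pct=100.0, depth_coverage_pct=100.0),
)
```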
Received on Friday, 30 November 2001 22:48:22 UTC