- From: Al Gilman <asgilman@iamdigex.net>
- Date: Fri, 30 Nov 2001 16:26:12 -0500
- To: <w3c-wai-gl@w3.org>
At 11:38 AM 2001-11-30, Kynn Bartlett wrote:
>At 5:37 PM -0600 11/29/01, Gregg Vanderheiden wrote:
>>We began by reviewing the guidelines one at a time to determine whether
>>or not:
>>1. they met the "80% or better" (80%+) objectivity criterion
>>
>>For number 1, "Provide a text equivalent for all non-text content", we
>>found:
>>• We believed it would pass the "80%+" objective test
>>
>>For guideline number 2, "Provides synchronized media equivalence for
>>time dependent presentations", we found:
>>• We believed items 1 and 2 would pass the 80%+ objectivity test
>
>Now I'm getting even more weirded out by this "objectivity"
>we've embraced. In my last email, I said:
>
>  So 80% of people, who meet subjective criteria for inclusion,
>  then make subjective determinations, and if they happen to agree,
>  we label this "objective"?
>
>Apparently the way we are using our newfound "objectivity" criteria
>is as follows:
>
>  A group of people -- who may or may not meet subjective criteria
>  for inclusion -- "reach consensus" on whether or not they
>  subjectively believe that at least 80% of an undefined group of
>  people -- who meet subjective criteria for inclusion -- would
>  make agreeing subjective determinations on arbitrary undefined
>  specific applications, ... and we label this "objective"?
>
>This is newspeak of the worst kind, folks. If we want credibility
>for our work, we don't suddenly label as "objective" things which are
>clearly and absolutely subjective. A subjective decision doesn't
>suddenly become objective if you vote on it.
>
>The specific process you've defined may or may not be useful and I'm
>not suggesting we reject that out of hand -- but if you keep it, you
>MUST rename it to something else OTHER than "objective."

AG:: Let me suggest an interpretation of how we might describe the facts as they regard these two guidelines.

Preview: The meta-question that I think we are dealing with here could be stated, "Have we defined a boolean yes/no question which would generate substantial agreement among the results of a pool of reasonably qualified evaluators?"

To get at what is actually going on, we have to break the guidelines down into more fine-grained steps, because the answers for the pieces differ to a significant degree.

Guideline 1:

Question: Does the content "provide a text equivalent for all non-text content"?

Subquestion 1: Are items of non-text content identifiable? [objective]
Subquestion 2: Are text items provided and associated with the non-text items? [objective]
Subquestion 3: Are the text items equivalent to the associated non-text items? [judgemental]

The same pattern will recur for Guideline 2.

Question: Does the content "provide synchronized media equivalents for time-dependent presentations"?

Subquestion 1: Are time-dependent presentations identifiable? [objective]
Subquestion 2: Are parallel items provided and associated with the time-dependent presentations? [objective]
Subquestion 3: Are the parallel items synchronized to their associated time-dependent presentations? [objective]
Subquestion 4: Are the parallel items equivalent to their associated time-dependent presentations? [judgemental]

In the above two lists of sub-questions, I have marked some as 'judgemental' where, in my expectation, the level of agreement on boolean outcomes would be predictably and significantly lower than for the sub-questions I have rated as 'objective.'
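Purely for concreteness, here is a rough sketch, in Python, of one way such a decomposition could be represented so that the objective* and judgemental parts of a guideline can be reported separately. The class and field names are mine alone, nothing the group has adopted.

  # Illustrative sketch only -- not an adopted representation.
  from dataclasses import dataclass
  from typing import List

  @dataclass
  class SubQuestion:
      text: str
      kind: str  # "objective" or "judgemental"

  @dataclass
  class Guideline:
      text: str
      subquestions: List[SubQuestion]

      def objective_part(self) -> List[SubQuestion]:
          # Sub-questions expected to yield repeatable yes/no answers.
          return [q for q in self.subquestions if q.kind == "objective"]

      def judgemental_part(self) -> List[SubQuestion]:
          # Sub-questions where rater agreement is expected to be weaker.
          return [q for q in self.subquestions if q.kind == "judgemental"]

  guideline_1 = Guideline(
      "Provide a text equivalent for all non-text content",
      [
          SubQuestion("Are items of non-text content identifiable?",
                      "objective"),
          SubQuestion("Are text items provided and associated with the "
                      "non-text items?", "objective"),
          SubQuestion("Are the text items equivalent to the associated "
                      "non-text items?", "judgemental"),
      ],
  )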
An added dimension of these questions is that I expect that if the raters had a more differentiated scale with which to express the form and degree of equivalence between the content fragments compared, a more stable and convincing sort of agreement would be visible in the data than if the raters are forced to give a yes/no answer. In other words, for these sub-questions I believe that by refining the question asked of the evaluators we could achieve repeatability in the evaluation results, although at the present level of roll-up there could be problems with the repeatability of evaluations against this question.

There is a strong parallel between this scenario and the testing tools applied by the UPnP Implementers Corporation in evaluating UPnP implementations for the purposes of certification. They have machine-implementable [up through syntax] tests that are required for certification but are not sufficient to determine that a service has actually been rendered on an end-to-end basis. They also talk about semantic tests, which would determine that the service asked for was actually rendered, but these may be a matter of current experimentation and are not part of the certification process as presently employed.

I think that we can handle the 'newspeak' problem by creating some label, either 'objective*' or 'expectedToShowConsistentResults' or an opaque token, with the definition "consensus gut expectation that reasonably repeatable results would be obtainable in independent applications of the stated 'test' or criterion." But I think we owe it to ourselves to move beyond an up-or-down vote at the guideline level, and explore the frontier between questions that are easy to accept as objective* and those that generate doubt, call them objectivityInQuestion. Often one has to subdivide the question to isolate the point of resistance or pain, but that exercise is useful, as it may expose a set of propositions that gain comfortable consensus and reveal more focused questions meriting further work.

In particular, I would suggest the following battery of questions as a means of rating a pair A, B of media objects or fragments. The question gets subdivided in three ways. First, rate B as a substitute for A and independently rate A as a substitute for B. Second, rate breadth of information coverage separately from depth, so that a summary or precis would have the same breadth as, and less depth than, what it is summarizing. Third, ask for percentage coverage answers and not yes/no answers. With these elaborations on the question, we would probably find a compelling level of agreement in the resulting ratings from reasonably qualified observers.

In other words, take two media objects or fragments, and the questions become:

1. What fraction of the topic of A does B cover (breadth)? Answer as a percentage between 0% and 100%.
2. How fully does B inform you about what it covers, as compared with the information available from A (depth or completeness)? [percentage]
3. What fraction of the topic of B does A cover (similarly)?
4. What depth or completeness of coverage does A provide, as a percentage of what B provides?

In this metric framework, I believe we would achieve a comfortable repeatability of results across independent trials with the same items to be evaluated and different evaluators but the same questions asked.
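Purely for illustration, the four-question battery and the kind of repeatability check I have in mind might look like the following sketch in Python. The field names and the 15-point spread threshold are placeholders of my own, not agreed numbers.

  # Illustrative sketch only: four percentage ratings per (A, B) pair,
  # plus a crude check that independent raters' answers cluster tightly.
  from dataclasses import dataclass
  from statistics import pstdev
  from typing import List

  @dataclass
  class EquivalenceRating:
      # Each value is a percentage, 0-100.
      breadth_b_covers_a: float  # What fraction of the topic of A does B cover?
      depth_b_vs_a: float        # How fully does B inform, relative to A, on what it covers?
      breadth_a_covers_b: float  # What fraction of the topic of B does A cover?
      depth_a_vs_b: float        # How fully does A inform, relative to B, on what it covers?

  def repeatable(ratings: List[EquivalenceRating],
                 max_spread: float = 15.0) -> bool:
      # True if the spread (population standard deviation) of the raters'
      # answers stays within max_spread points on every one of the four scales.
      for field in ("breadth_b_covers_a", "depth_b_vs_a",
                    "breadth_a_covers_b", "depth_a_vs_b"):
          values = [getattr(r, field) for r in ratings]
          if pstdev(values) > max_spread:
              return False
      return True

  # Example: three raters comparing a transcript (B) against a video clip (A).
  ratings = [
      EquivalenceRating(90, 70, 100, 95),
      EquivalenceRating(85, 75, 100, 90),
      EquivalenceRating(95, 65, 95, 100),
  ]
  print(repeatable(ratings))  # True: the spreads are within the threshold

The point is only that with percentage answers on four separate scales, agreement among raters becomes something we can measure across independent trials rather than something we vote on.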
I believe that the objectivity question, to first order, is "has the question produced consistent results in independent trials?" and the best we can do without actually carrying out independent trials is "do we expect this question to produce consistent results in independent trials?"

The more useful question in my mind is, "If anyone has doubts as to the ability of this question to elicit consistent and useful answers in independent application, is there some way to subdivide the question into parts on which we expect consistent results and parts where we have less faith that consistent results would be forthcoming?" And then, if there is such a factorization, work on "is there another question to apply in place of the shaky one that we feel would produce consistent and useful results?"

Again, the classic issue is reading level. It is not readily agreeable to ask "is the reading level of this text universal?" But one can reasonably ask "Is the reading level demand posed by this content reasonably self-consistent? Is it documented in available metadata? Is it reasonable for [defined audience]?" Reading level, if evaluated by the content generation activity and served as metadata, is useful in conjunction with 'topic' metadata in the process of user choice among alternatives discovered on the Web. So it is not necessary to reduce reading level to a boolean question to produce value added for the user in accessing content that they can use. Demanding that what we provide to the authoring community be entirely in the form of boolean questions is counter-productive in this area, because it eliminates a productive area of activity on their part.

Al

>--Kynn
>
>--
>Kynn Bartlett <kynn@idyllmtn.com>
>http://www.kynn.com/
Received on Friday, 30 November 2001 16:18:20 UTC