Re: last part of todays telecon from Al Gilman on 2001-11-30 (w3c-wai-gl@w3.org from October to December 2001)

From: Al Gilman <asgilman@iamdigex.net>
Date: Fri, 30 Nov 2001 16:26:12 -0500
To: <w3c-wai-gl@w3.org>
Message-Id: <200111302118.QAA1586835@smtp2.mail.iamworld.net>
At 11:38 AM 2001-11-30 , Kynn Bartlett wrote:
>At 5:37 PM -0600 11/29/01, Gregg Vanderheiden wrote:
>>We began by reviewing the guidelines one at a time to determine whether
>>or not:
>>1. they met the “80% or better” (80%+) objectivity criterion
>>
>>For number 1, "Provide a text equivalent for all non-text content", we
>>found:
>>• We believed it would pass the  “80%+” objective test
>>
>>For guideline number 2, "Provides synchronized media equivalence for
>>time dependent presentations", we found:
>>• We believed items 1 and 2 would pass the 80%+ objectivity test
>
>Now I'm getting even more weirded out by this "objectivity"
>we've embraced.  In my last email, I said:
>
>      So 80% of people, who meet subjective criteria for inclusion,
>      then make subjective determinations, and if they happen to agree,
>      we label this "objective"?
>
>Apparently the way we are using our newfound "objectivity" criteria
>is as follows:
>
>      A group of people -- who may or may not meet subjective criteria
>      for inclusion -- "reach consensus" on whether or not they
>      subjectively believe that at least 80% of an undefined group of
>      people -- who meet subjective criteria for inclusion -- would
>      make agreeing subjective determinations on arbitrary undefined
>      specific applications, ... and we label this "objective?"
>
>This is newspeak of the worst kind, folks.  If we want credibility
>for our work, we don't suddenly label as "objective" things which are
>clearly and absolutely subjective.  A subjective decision doesn't
>suddenly become objective if you vote on it.
>
>The specific process you've defined may or may not be useful and I'm
>not suggesting we reject that out of hand -- but if you keep it, you
>MUST rename it to something else OTHER than "objective."
>

AG::

Let me suggest an interpretation of how we might describe the facts as they
regard these two guidelines.

Preview:  The meta-question that I think we are dealing with here could be
stated as "Have we defined a boolean yes/no question which would generate
substantial agreement among the results of a pool of reasonably qualified
evaluators?"

To get at what is actually going on, we have to break the guidelines down
into more fine-grained steps, because the answers for the pieces are
different to a significant degree.

Guideline 1:

Question:

Does the content "Provide a text equivalent for all non-text content"?

Subquestion 1: Are items of non-text content identifiable?  [objective]
Subquestion 2: Are text items provided and associated with the non-text items?
[objective]
Subquestion 3: Are the text items equivalent to the associated non-text items?
[judgemental]

The same pattern will recur for Guideline 2.

Question:

Does the content "Provide synchronized media equivalents for time-dependent
presentations"?

Subquestion 1: Are time-dependent presentations identifiable?  [objective]
Subquestion 2: Are parallel items provided and associated with the
time-dependent presentations? [objective]
Subquestion 3: Are the parallel items synchronized to their associated
time-dependent presentations? [objective]
Subquestion 4: Are the parallel items equivalent to their associated
time-dependent presentations? [judgemental]

In the above two lists of sub-questions, I have marked some as 'judgemental'
where, in my expectation, the level of agreement on boolean outcomes would be
predictably and significantly less than in the case of the sub-questions that
I have rated as 'objective.'  An added dimension of these questions is that I
expect that if the raters had a more differentiated scale with which to
express the form and degree of equivalence between the content fragments
compared, a more stable and convincing sort of agreement would be visible in
the data than in the case that the raters are forced to express a yes/no
answer.  In other words, for these sub-questions I believe that by refining
the question asked of the evaluators, we could achieve repeatability in the
evaluation results, although at the present level of roll-up of the results
there could be problems in the repeatability of the results of evaluating
with respect to this question.
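
As a rough illustration of the "80% or better" reading of agreement on a
boolean question, the Python sketch below computes the share of evaluators
who gave the most common answer.  The function names, the sample votes, and
the 0.8 default threshold are assumptions made for illustration; nothing in
this thread defines how agreement would actually be measured.

```python
from collections import Counter


def majority_agreement(answers):
    """Return the fraction of evaluators who gave the most common answer."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(answers)


def expected_consistent(answers, threshold=0.8):
    """True when the majority share reaches the (assumed) 80% threshold."""
    return majority_agreement(answers) >= threshold


# Hypothetical yes/no votes from ten evaluators on two sub-questions.
# "Is a text item associated with the non-text item?"
associated = [True] * 9 + [False]
# "Is the text item equivalent to the non-text item?"
equivalent = [True, False, True, True, False, True, False, True, False, True]

print(majority_agreement(associated), expected_consistent(associated))
print(majority_agreement(equivalent), expected_consistent(equivalent))
```

With these made-up votes the association sub-question clears the threshold
and the equivalence sub-question does not, which is the split between
'objective' and 'judgemental' described above.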

There is a strong parallel between this scenario and the testing tools applied
by the UPnP Implementers' Corporation in evaluating UPnP implementations for
the purposes of certification.  They have machine-implementable [up through
syntax] tests that are required for certification but are not sufficient to
determine that a service has actually been rendered on an end-to-end basis. 
They also talk about semantic tests, but do not require them for
certification.  It is possible that semantic tests which would determine that
the service asked for was actually rendered are a matter of current
experimentation, but are not part of the certification process as presently
employed.

I think that we can handle the 'newspeak' problems by creating some label,
either 'objective*' or 'expectedToShowConsistentResults' or an opaque token,
with the definition of "consensus gut expectation that reasonably repeatable
results would be obtainable in independent applications of the stated 'test'
or criterion."

But I think that we owe it to ourselves to move beyond an up-or-down vote at
the guideline level, and explore the frontier between questions that are easy
to dismiss as objective* and those that generate doubt, call them
objectivityInQuestion.  Often one has to subdivide the question to isolate the
point of resistance or pain, but that exercise is useful, as it may expose a
set of propositions that gain comfortable consensus and reveal more focussed
questions meriting further work.

In particular, I would suggest the following battery of questions as a means
of rating a pair A,B of media objects or fragments.  There are three ways the
question gets subdivided.  First, rate B as a substitute for A and
independently rate A as a substitute for B.  Second, rate breadth of
information coverage separately from depth, so that a summary or precis would
have the same breadth and less depth than what it is summarizing.  Third, ask
for percentage coverage answers and not yes/no answers.

With these elaborations on the question, we would probably find a compelling
level of agreement in the resulting ratings from reasonably qualified
observers.

In other words, take two media objects or fragments, and the questions become:

What fraction of the topic of A does B cover (breadth)?  Answer as a
percentage between 0% and 100%.
How fully does B inform you about what it covers, as compared to the
information available from A (depth or completeness)? [percentage]
What fraction of the topic of B does A cover (similarly)?
What depth or completeness of coverage does A provide as a percentage of
what B provides?

In this metric framework, I believe we would achieve a comfortable
repeatability of results across independent trials with the same items to be
evaluated and different evaluators but the same questions asked.
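
To make the four-question battery concrete, here is a minimal Python sketch
of one evaluator's ratings for a pair A,B and a per-question summary across
evaluators.  The class name, field names, sample numbers, and the choice of
mean plus population standard deviation as the measure of spread are all
assumptions for illustration; the proposal above does not prescribe a
statistic.

```python
from dataclasses import dataclass, fields
from statistics import mean, pstdev


@dataclass(frozen=True)
class PairRating:
    """One evaluator's answers for a pair A,B, each a percentage 0-100."""
    breadth_b_of_a: float  # what fraction of the topic of A does B cover?
    depth_b_vs_a: float    # how fully does B inform, compared to A?
    breadth_a_of_b: float  # what fraction of the topic of B does A cover?
    depth_a_vs_b: float    # how fully does A inform, compared to B?

    def __post_init__(self):
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0 <= value <= 100:
                raise ValueError(f"{f.name} must be 0-100, got {value}")


def summarize(ratings):
    """Map each question to (mean, population std dev) across evaluators."""
    if not ratings:
        raise ValueError("need at least one rating")
    summary = {}
    for f in fields(PairRating):
        values = [getattr(r, f.name) for r in ratings]
        summary[f.name] = (mean(values), pstdev(values))
    return summary


# Hypothetical ratings from three evaluators, with A a chart image and B its
# text equivalent.
ratings = [
    PairRating(90.0, 40.0, 100.0, 100.0),
    PairRating(80.0, 50.0, 100.0, 90.0),
    PairRating(100.0, 30.0, 100.0, 100.0),
]

for question, (avg, spread) in summarize(ratings).items():
    print(f"{question}: mean={avg:.1f} spread={spread:.1f}")
```

A small spread on a question would be one sign that independent evaluators
are converging; where the spread is large, that question is a candidate for
further subdivision.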

I believe that the objectivity question to first order is "has the question
produced consistent results in independent trials?" and the best we can do
without actually carrying out independent trials is "do we expect this
question to produce consistent results in independent trials?"  The more
useful question in my mind is, "If anyone has doubts as to the ability of this
question to elicit consistent and useful answers in independent application,
is there some way to subdivide the question into parts on which we expect
consistent results and parts where we have less faith that consistent results
would be forthcoming?"  And then, if there is such a factorization, work on
"is there another question to apply in place of the shaky one that we feel
would produce consistent and useful results?"

Again, the classic issue is reading level.

It is not readily agreeable to ask "is the reading level of this text
universal?"  But one can reasonably ask "Is the reading level demand posed by
this content reasonably self-consistent?  Is it documented in available
metadata?  Is it reasonable for [defined audience]?"  Reading level, if
evaluated by the content generation activity and served as metadata, is useful
in conjunction with 'topic' metadata in the process of user choice among
alternatives discovered on the Web.  So it is not necessary to reduce reading
level to a boolean question to produce value added to the user in accessing
content that they can use.  Demanding that what we provide to the authoring
community be entirely in the form of boolean questions is counter-productive
in this area, because it eliminates a productive area of activity on their
part.
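
One way to picture reading level working as metadata alongside 'topic' is
the Python sketch below, which selects among discovered alternatives without
reducing reading level to a pass/fail judgement.  The record layout, the
grade-level scale, the example.org URLs, and the function name are all
hypothetical; no particular metadata vocabulary is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alternative:
    url: str
    topic: str          # author-supplied topic label (hypothetical)
    reading_grade: int  # author-declared reading level (hypothetical scale)


def usable_alternatives(alternatives, topic, max_grade):
    """Return alternatives on the topic that the user can read, easiest first."""
    matches = [
        a for a in alternatives
        if a.topic == topic and a.reading_grade <= max_grade
    ]
    return sorted(matches, key=lambda a: a.reading_grade)


# Hypothetical alternatives discovered on the Web.
found = [
    Alternative("http://example.org/captions-spec", "captioning", 14),
    Alternative("http://example.org/captions-intro", "captioning", 7),
    Alternative("http://example.org/captions-howto", "captioning", 9),
    Alternative("http://example.org/alt-text-intro", "text equivalents", 6),
]

for alt in usable_alternatives(found, "captioning", max_grade=10):
    print(alt.reading_grade, alt.url)
```

Here the author only declares a level; the choice of what is "reasonable"
is made on the user's side, per audience, rather than by a single boolean
test of the content.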

Al





>--Kynn
>
>-- 
>Kynn Bartlett <kynn@idyllmtn.com>
><http://www.kynn.com/>
>
Received on Friday, 30 November 2001 16:18:20 UTC