- From: Al Gilman <asgilman@iamdigex.net>
- Date: Sat, 08 Dec 2001 15:14:59 -0500
- To: <w3c-wai-gl@w3.org>
At 06:21 PM 2001-12-04 , Gregg Vanderheiden wrote:

>Absolutely. Test cases (both selected and random) need to be a key
>part of our evaluation process. In fact, the procedure I think you are
>suggesting is just what has been discussed though not formalized.
>
>So let's take this opportunity to begin that process.

[snip]

>NOTE: the above is a VERY rough description of a procedure as I run to a
>meeting. But I would like to see if we can get this ball rolling.
>Comments and suggestions welcome.

[rest of quote below]

AG::

Let's talk in terms of experiment design, and of gathering evidence in support of a Recommendation.

In terms of experiment design, it is good to have some sense of the shape of the space we are exploring. I would describe this space as a relation, or scatter-plot, in which we collect experience with web content: what you would immediately think of as user experience, and also guideline experience -- outcomes against the criteria on which we expect to base a recommendation as to usage, both outcomes and prognostics for outcomes.

The domain of this relation breaks down into two principal subspaces: content items and delivery contexts. A content item is often a web page, but it may be a site, a feature within a page, or a pattern of similar features (such as navbars) across all pages on one site. A delivery context is a combination of user and client workstation configuration components that we suspect might make a difference in the outcomes. Assistive technology items and release versions should therefore be documented as thoroughly as we can, even though we will later be aggregating the results and hiding some of these details.

The range of this relation contains two kinds of outcomes: technical or prognostic outcomes, as defined by the concepts of the working group's working baseline, and more bottom-line measures of how effectively the service is delivered to the user.

What we are looking for to be the Recommendation is the least restrictive family of readily applied technical or prognostic tests that are reasonably necessary and reasonably sufficient for the achievement of good bottom-line results across as broad a range of delivery contexts as we can reasonably achieve. There may be too much 'reasonable' mumble in that for some, but I think when we get down to facts we may have to make some final arm-wrestling agreements as to what appears to be beyond our reach.

The main point of my post is that we need enough bookkeeping in recording experience so that we can not only compare applications of the exact same test case across different exposures (different delivery contexts, different test conditions), but also compare aggregated bottom-line outcomes collected under different filters based on profiles of technical or prognostic criteria.

** evidence and argument

The reason we should view our collected experience in this way is that, for web consumers and web providers to consensually agree that a given profile of technical or prognostic criteria is reasonable to make generally applicable, we have to succeed in two distinct arguments. We have to convince the consumers that the profile of criteria is reasonably sufficient to achieve generally good outcomes. We have to convince the providers that the profile is reasonably necessary, that there is no less burdensome profile of criteria which would have comparable success in assuring broad access. This is why, as evidence, we need to ask comparative questions about combinations of criteria.
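To make the bookkeeping concrete, here is a rough sketch in Python of the kind of record each exposure could generate. The field names, the pass/fail vocabulary, and the success_rate helper are purely illustrative assumptions on my part, not anything the group has agreed on:

    from dataclasses import dataclass, field

    @dataclass
    class ContentItem:
        """The thing under test: a page, a site, a feature, or a recurring pattern."""
        item_id: str
        kind: str                  # "page" | "site" | "feature" | "pattern"
        uri: str

    @dataclass
    class DeliveryContext:
        """User plus client configuration components that might affect outcomes."""
        user_profile: str          # disability / usage profile
        user_agent: str
        assistive_tech: str
        versions: dict = field(default_factory=dict)   # component -> release version

    @dataclass
    class Observation:
        """One exposure of one content item in one delivery context."""
        content: ContentItem
        context: DeliveryContext
        prognostic_results: dict   # criterion id -> "pass" | "fail" | "n/a"
        bottom_line: dict          # e.g. {"task_success": True, "time_s": 140, "rating": 4}

    def meets_profile(obs, profile):
        """True if the observation passes every criterion named in the profile."""
        return all(obs.prognostic_results.get(c) == "pass" for c in profile)

    def success_rate(observations, profile):
        """Share of observations meeting the profile whose bottom line was good."""
        hits = [o for o in observations if meets_profile(o, profile)]
        return sum(1 for o in hits if o.bottom_line.get("task_success")) / len(hits) if hits else None

With records like these we can filter the collection by any candidate profile of prognostic criteria and compare the aggregated bottom-line outcomes inside and outside the filter, which is exactly the kind of evidence the sufficiency and necessity arguments both need.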
So we need to put enough structure into how we treat the [candidate success criteria] description of the _prognostic_ criteria that we can compare them as to how they differ in their own space, and not only how they differ in their consequences. We need the rudiments of a test case description language or schema, and not just references to points in the draft document. We need a systematic way to generate small variations on a test case within the coordinate frame articulated in the success-criteria language.

** detailed variations

Let me introduce two further ramifications for people to think about. One has to do with how we conduct experiments in areas of principle where we don't have consensus language for a test that we consensually expect to prove repeatable in practice. The other has to do with how we gauge how representative our collection of test cases is of the web genre as it is actually used.

I do think that for areas of concern such as cognitive disabilities and reading-level measures we should proceed to do the comparative studies without flogging the horse of debate any further. Where we have failed to reach consensus on a verbalization of a criterion that we expect will be reasonably necessary and sufficient, let us back off to experimenting with related measurable properties of content items and look for frontiers between good and bad bottom-line outcomes that emerge in the collected experience. This is discerning the frontier of success bottom-up, off the data. Since we need this bottom-up check anyway -- that our profile of prognostics draws the frontier of success where the evidence puts it -- this just means we have to be a little less focused in our initial distribution of test cases.

* Do we cover typical web usage?

I think we may want to come up with a rough-and-ready heuristic taxonomy of web genres or idioms, so as to develop some confidence that we are covering the range of capabilities that people count on the web to deliver. The measures would be things like how many primary logical regions there are on a page, links per page, etc. These could help us ascertain how representative our test cases are as a group.

** click streams

There is an open question as to whether we want to instrument the test application context to record what paths the users followed through the content. In a search for priorities, content providers might be inclined to say that a fault which breaks a more-used pathway is of higher priority than a fault which breaks a less-used pathway. Maybe someone else would be able to launder this kind of data to share with us. Some kind of logging might be readily achievable and could be quite informative.

There are always going to be users who declare failure because they failed to discover alternatives. This is partly the quality of the informal content, and partly the nut behind the wheel. Maybe our test protocol needs a retest step where the user is given hints about options they didn't try the previous time. This is even disclosed in the IUSR CIF format <http://www.nist.gov/iusr>, so it may well be a common feature.

We have to design the test user's activity like a usability test, or we will wind up re-learning lessons that usability test designers have already suffered through. I myself would much rather have a robot tracking where I actually went than have to enter all that information manually.
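As a sketch of how little instrumentation the logging might take -- the format and field choices here are my own assumptions, nothing agreed -- a test harness could append one row per page view, in Python something like:

    import csv
    import time

    class ClickStreamLog:
        """Append-only record of the pages a tester actually visits in a session."""

        def __init__(self, session_id, path):
            self.session_id = session_id
            self._file = open(path, "a", newline="")
            self._writer = csv.writer(self._file)

        def visit(self, uri, via_link=None):
            # one row per page view: session, timestamp, target URI, link followed
            self._writer.writerow([self.session_id, time.time(), uri, via_link or ""])
            self._file.flush()

        def close(self):
            self._file.close()

    # During a test session (hypothetical URIs):
    # log = ClickStreamLog("session-042", "clickstream.csv")
    # log.visit("http://example.org/")
    # log.visit("http://example.org/contact", via_link="Contact us")

A log like this would also let us confirm, rather than conjecture, which pages a tester never reached.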
The more we limit manual entry to subjective and bottom-line assessments, the easier it will be to get complete and objective data, and the more ground a tester can cover in a finite span of time. Someone's bottom-line generalizations may be hard to interpret without this. For example, I can conjecture that Bob never went to the color-blindness page, but only because I did.

<http://lists.w3.org/Archives/Public/w3c-wai-ig/2001OctDec/0800.html>

Al

>
>Let me pose the following to begin discussion.
>
>
>1 - create a collection of representative (as much as there is such a
>thing) pages or sites that sample the RANGE of different pages,
>approaches and technologies on the Web.
>2 - look at the items (particularly success criteria) - identify any
>additional sample pages or sites needed to explore the item (if sample
>is not good enough to)
>3 - run quick tests by team members with these stimuli to see if
>agreement. If team agrees that it fails, work on it. If it passes team
>or is ambiguous then move on to testing with external sample of
>people while fixing any problems identified in the internal screening
>test.
>4 - proceed in this manner to keep improving items and learning about
>objectivity or agreement as we move toward the final version and final
>testing.
>5 - in parallel with the above, keep looking at the items with the
>knowledge we acquire and work to make items stronger
>
>
>The key to this is the Test Case Page Collection. We have talked about
>this. But no one has stepped forward to help build it. Can we form a
>side team to work on this?
>
>
>NOTE: the above is a VERY rough description of a procedure as I run to a
>meeting. But I would like to see if we can get this ball rolling.
>Comments and suggestions welcome.
>
>Gregg
>
>-- ------------------------------
>Gregg C Vanderheiden Ph.D.
>Professor - Human Factors
>Dept of Ind. Engr. - U of Wis.
>Director - Trace R & D Center
>Gv@trace.wisc.edu <mailto:Gv@trace.wisc.edu>, <http://trace.wisc.edu/>
>FAX 608/262-8848
>For a list of our listserves send “lists” to listproc@trace.wisc.edu
><mailto:listproc@trace.wisc.edu>
>
>
>-----Original Message-----
>From: Charles McCathieNevile [mailto:charles@w3.org]
> Subject: Re: "objective" clarified
>
><snip>
>
>I think that for an initial assessment the threshold of 80% is fine, and I
>think that as we get closer to making this a final version we should be
>lifting that requirement to about 90 or 95%. However, I don't think that it
>is very useful to think about whether people would agree in the absence of
>test cases. There are some things where it is easy to describe the test in
>operational terms. There are others where it is difficult to describe the
>test in operational terms, but it is easy to get substantial agreement. (The
>famous "I don't know how to define illustration, but I recognise it when I
>see it" explanation).
>
>It seems to me that the time spent in trying to imagine whether we would
>agree on a test would be more usefully spent in generating test cases, which
>we can then use to very quickly find out if we agree or not.
>The added value is that we then have those available as examples to show
>people - when it comes to people being knowledgeable of the tests and
>techniques they will have the head start of having seen real examples and
>what the working group thought about them as an extra guide.
>
><snip>