- From: Al Gilman <asgilman@iamdigex.net>
- Date: Sat, 08 Dec 2001 15:14:59 -0500
- To: <w3c-wai-gl@w3.org>
At 06:21 PM 2001-12-04 , Gregg Vanderheiden wrote:

>Absolutely. Test cases (both selected and random) need to be a key
>part of our evaluation process. In fact, the procedure I think you are
>suggesting is just what has been discussed though not formalized.
>
>So let's take this opportunity to begin that process.

[snip]

>NOTE: the above is a VERY rough description of a procedure as I run to a
>meeting. But I would like to see if we can get this ball rolling.
>Comments and suggestions welcome.

[rest of quote below]

AG::

Let's talk in terms of experiment design, and of gathering evidence in support of a Recommendation.

In terms of experiment design, it is good to have some sense of the shape of the space we are exploring. I would describe this space as a relation, or scatter-plot, in which we collect experience with web content: what you would immediately think of as user experience, and also guideline experience -- outcomes against the criteria on which we expect to base a recommendation as to usage, both outcomes and prognostics for outcomes.

The domain of this relation breaks down into two principal subspaces: content items and delivery contexts. A content item is often a web page, but it may be a site, a feature within a page, or a pattern of similar features (such as navbars) across all pages on one site. A delivery context is a combination of user and client workstation configuration components that we suspect might make a difference in the outcomes. Assistive technology items and release versions should therefore be documented as thoroughly as we can, even though we will later be aggregating the results and hiding some of these details.

The range of this relation contains two kinds of outcomes: technical or prognostic outcomes, as defined by the concepts of the working group's working baseline, and more bottom-line measures of how effectively the service is delivered to the user.

What we are looking for to be the Recommendation is the least restrictive family of readily applied technical or prognostic tests that are reasonably necessary and reasonably sufficient for the achievement of good bottom-line results across as broad a range of delivery contexts as we can reasonably achieve. There may be too much 'reasonable' mumble in that for some, but I think when we get down to facts we may have to make some final arm-wrestling agreements as to what appears to be beyond our reach.

The main point of my post is that we need enough bookkeeping in recording experience so that we can not only compare applications of the exact same test case across different exposures (different delivery contexts, different test conditions), but also compare aggregated bottom-line outcomes collected under different filters based on profiles of technical or prognostic criteria.

** evidence and argument

The reason we should view our collected experience in this way is that, for web consumers and web providers to consensually agree that a given profile of technical or prognostic criteria is reasonable to make generally applicable, we have to succeed in two distinct arguments. We have to convince the consumers that the profile of criteria is reasonably sufficient to achieve generally good outcomes. We have to convince the providers that the profile is reasonably necessary, that there is no less burdensome profile of criteria which would have comparable success in assuring broad access. This is why, as evidence, we need to ask comparative questions about combinations of criteria.
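To make the bookkeeping concrete, here is a rough sketch in Python of the kind of record each exposure could generate. The field names, the pass/fail vocabulary, and the success_rate helper are purely illustrative assumptions on my part, not anything the group has agreed on:

    from dataclasses import dataclass, field

    @dataclass
    class ContentItem:
        """The thing under test: a page, a site, a feature, or a recurring pattern."""
        item_id: str
        kind: str                  # "page" | "site" | "feature" | "pattern"
        uri: str

    @dataclass
    class DeliveryContext:
        """User plus client configuration components that might affect outcomes."""
        user_profile: str          # disability / usage profile
        user_agent: str
        assistive_tech: str
        versions: dict = field(default_factory=dict)   # component -> release version

    @dataclass
    class Observation:
        """One exposure of one content item in one delivery context."""
        content: ContentItem
        context: DeliveryContext
        prognostic_results: dict   # criterion id -> "pass" | "fail" | "n/a"
        bottom_line: dict          # e.g. {"task_success": True, "time_s": 140, "rating": 4}

    def meets_profile(obs, profile):
        """True if the observation passes every criterion named in the profile."""
        return all(obs.prognostic_results.get(c) == "pass" for c in profile)

    def success_rate(observations, profile):
        """Share of observations meeting the profile whose bottom line was good."""
        hits = [o for o in observations if meets_profile(o, profile)]
        return sum(1 for o in hits if o.bottom_line.get("task_success")) / len(hits) if hits else None

With records like these we can filter the collection by any candidate profile of prognostic criteria and compare the aggregated bottom-line outcomes inside and outside the filter, which is exactly the kind of evidence the sufficiency and necessity arguments both need.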
So we need to put enough structure into how we treat the [candidate success criteria] description of the _prognostic_ criteria that we can compare them as to how they differ in their own space, and not only how they differ in their consequences. We need the rudiments of a test case description language or schema, and not just references to points in the draft document. We need a systematic way to generate small variations on a test case within the coordinate frame articulated in the success-criteria language.

** detailed variations

Let me introduce two further ramifications for people to think about. One has to do with how we conduct experiments in areas of principle where we don't have consensus language for a test that we consensually expect to prove repeatable in practice. The other has to do with how we gauge how representative our collection of test cases is of the web genre as it is actually used.

I do think that for areas of concern such as cognitive disabilities and reading-level measures we should proceed to do the comparative studies without flogging the horse of debate any further. Where we have failed to reach consensus on a verbalization of a criterion that we expect will be reasonably necessary and sufficient, let us back off to experimenting with related measurable properties of content items and look for frontiers between good and bad bottom-line outcomes that emerge in the collected experience. This is discerning the frontier of success bottom-up, off the data. Since we need this bottom-up check anyway -- that our profile of prognostics draws the frontier of success where the evidence puts it -- this just means we have to be a little less focused in our initial distribution of test cases.

* Do we cover typical web usage?

I think we may want to come up with a rough-and-ready heuristic taxonomy of web genres or idioms, so as to develop some confidence that we are covering the range of capabilities that people count on the web to deliver. The measures would be things like how many primary logical regions there are on a page, links per page, etc. These could help us ascertain how representative our test cases are as a group.

** click streams

There is an open question as to whether we want to instrument the test application context to record what paths the users followed through the content. In a search for priorities, content providers might be inclined to say that a fault which breaks a more-used pathway is of higher priority than a fault which breaks a less-used pathway. Maybe someone else would be able to launder this kind of data to share with us. Some kind of logging might be readily achievable and could be quite informative.

There are always going to be users who declare failure because they failed to discover alternatives. This is partly the quality of the informal content, and partly the nut behind the wheel. Maybe our test protocol needs a retest step where the user is given hints about options they didn't try the previous time. This is even disclosed in the IUSR CIF format <http://www.nist.gov/iusr>, so it may well be a common feature.

We have to design the test user's activity like a usability test, or we will wind up re-learning lessons that usability test designers have already suffered through. I myself would much rather have a robot tracking where I actually went than have to enter all that information manually.
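As a sketch of how little instrumentation the logging might take -- the format and field choices here are my own assumptions, nothing agreed -- a test harness could append one row per page view, in Python something like:

    import csv
    import time

    class ClickStreamLog:
        """Append-only record of the pages a tester actually visits in a session."""

        def __init__(self, session_id, path):
            self.session_id = session_id
            self._file = open(path, "a", newline="")
            self._writer = csv.writer(self._file)

        def visit(self, uri, via_link=None):
            # one row per page view: session, timestamp, target URI, link followed
            self._writer.writerow([self.session_id, time.time(), uri, via_link or ""])
            self._file.flush()

        def close(self):
            self._file.close()

    # During a test session (hypothetical URIs):
    # log = ClickStreamLog("session-042", "clickstream.csv")
    # log.visit("http://example.org/")
    # log.visit("http://example.org/contact", via_link="Contact us")

A log like this would also let us confirm, rather than conjecture, which pages a tester never reached.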
The more we limit manual entry to subjective and bottom-line assessments, the easier it will be to get complete and objective data, and the more ground a tester can cover in a finite span of time. Someone's bottom-line generalizations may be hard to interpret without this. For example, I can conjecture that Bob never went to the color-blindness page, but only because I did.

<http://lists.w3.org/Archives/Public/w3c-wai-ig/2001OctDec/0800.html>

Al

>
>Let me pose the following to begin discussion.
>
>
>1 - create a collection of representative (as much as there is such a
>thing) pages or sites that sample the RANGE of different pages,
>approaches and technologies on the Web.
>2 - look at the items (particularly success criteria) - identify any
>additional sample pages or sites needed to explore the item (if sample
>is not good enough to)
>3 - run quick tests by team members with these stimuli to see if
>agreement. If team agrees that it fails, work on it. If it passes team
>or is ambiguous then move on to testing with external sample of
>people while fixing any problems identified in the internal screening
>test.
>4 - proceed in this manner to keep improving items and learning about
>objectivity or agreement as we move toward the final version and final
>testing.
>5 - in parallel with the above, keep looking at the items with the
>knowledge we acquire and work to make items stronger
>
>
>The key to this is the Test Case Page Collection. We have talked about
>this. But no one has stepped forward to help build it. Can we form a
>side team to work on this?
>
>
>NOTE: the above is a VERY rough description of a procedure as I run to a
>meeting. But I would like to see if we can get this ball rolling.
>Comments and suggestions welcome.
>
>Gregg
>
>-- ------------------------------
>Gregg C Vanderheiden Ph.D.
>Professor - Human Factors
>Dept of Ind. Engr. - U of Wis.
>Director - Trace R & D Center
>Gv@trace.wisc.edu <mailto:Gv@trace.wisc.edu>, <http://trace.wisc.edu/>
>FAX 608/262-8848
>For a list of our listserves send “lists” to listproc@trace.wisc.edu
><mailto:listproc@trace.wisc.edu>
>
>
>-----Original Message-----
>From: Charles McCathieNevile [mailto:charles@w3.org]
> Subject: Re: "objective" clarified
>
><snip>
>
>I think that for an initial assessment the threshold of 80% is fine, and I
>think that as we get closer to making this a final version we should be
>lifting that requirement to about 90 or 95%. However, I don't think that it
>is very useful to think about whether people would agree in the absence of
>test cases. There are some things where it is easy to describe the test in
>operational terms. There are others where it is difficult to describe the
>test in operational terms, but it is easy to get substantial agreement. (The
>famous "I don't know how to define illustration, but I recognise it when I
>see it" explanation).
>
>It seems to me that the time spent in trying to imagine whether we would
>agree on a test would be more usefully spent in generating test cases, which
>we can then use to very quickly find out if we agree or not.
>The added value is that we then have those available as examples to show
>people - when it comes to people being knowledgeable of the tests and
>techniques they will have the head start of having seen real examples and
>what the working group thought about them as an extra guide.
>
><snip>