Re: Methodology excluding interpretation?

Hi Detlev, All,

I agree that a successful "methodology" needs to be accompanied by 
robust "tests", but I think these pieces do not necessarily have to be 
developed at the same time or even by the same group. It is important 
for this group to define the "interface" between these two pieces (in 
other words, what these "tests" should look like and how they should 
function).
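
Purely as an illustration of what such an "interface" could look like 
(the names and fields below are my own assumptions, nothing this group 
has agreed on), a single "test" result might be no more than a small 
record that the methodology knows how to consume:

    # Illustrative sketch only: one possible shape for a "test" result
    # that a methodology could consume; all names here are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class TestResult:
        sc_id: str       # e.g. "1.1.1"
        page_url: str    # the page (or instance) the test was run on
        outcome: str     # "pass", "fail", or "cannot tell"
        notes: str = ""  # free-text rationale for the rating

    result = TestResult(sc_id="1.1.1",
                        page_url="http://example.org/page.html",
                        outcome="fail",
                        notes="Informative image has an empty alt attribute.")
    print(result)

The point is only that the methodology would define fields like these, 
while the actual tests that fill them in could be developed separately.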

I also encourage people from this group to contribute to the upcoming 
RDWG online symposium on "web accessibility metrics". It is actually 
designed to inform the development of future "tests":
  - <http://www.w3.org/WAI/RD/2011/metrics/>

Best,
   Shadi


On 22.9.2011 08:55, Detlev Fischer wrote:
> Hi Eric,
>
> I can see why you would rather exclude interpretation (i.e., whether a
> particular real-life instance should be considered a PASS or a FAIL when
> tested against a particular SC) from the method to be developed. It
> certainly makes the description of steps simpler and cleaner.
>
> However, if the whole purpose of the methodology is that we want to
> arrive at a rating per SC (whether 'degree of conformance' or just PASS
> / FAIL), giving *no assistance at all* for the assessment of real world
> instances just means that we open the door to the widest possible array
> of judgements / rating results.
>
> The collection of typical, or model, cases to support reliable rating
> has been brought up here before. I personally think this is the real
> 'meat' of the methodology. I also think that in line with the
> technology-neutral and community-oriented approach of WCAG, such a case
> collection per SC has to be continuously checked and amended by a
> consensus of experts, as new conforming techniques are invented by
> developers or become available thanks to new technologies such as HTML5.
> Part of that will be to exercise the community function described in
> WCAG for the assessment of whether a particular technology / technique
> can be deemed sufficiently 'accessibility supported' in its context of use.
>
> That is why I believe that the Carter image exercise could be quite
> instructive for us, at least if we agree that rating case collections
> per SC are indeed a useful part of the methodology. I personally cannot
> see how a methodology could even come close to being reliable or
> approach replicability without such a collection of model cases. The
> examples in the techniques are an obvious starting point, but negative
> examples will also be needed, plus examples of implementations that
> 'miss the mark' without being a clear failure. Why? Because these are
> the most difficult cases, where testers will disagree most often.
>
> Regards,
> Detlev
>
>
> On 21.09.2011 23:01, Velleman, Eric wrote:
>> Hi Detlev,
>>
>> Although I like the approach you take, this is in my opinion not part of
>> the Methodology but input for the WCAG group. We will describe the steps
>> to take when evaluating a website, but not how to interpret the guidelines
>> and tests. We can send comments and input to the other WGs if we see
>> potential problems in interpretation, as they can seriously influence the
>> results of two people doing an evaluation. These differences could occur
>> during the cross-checking and testing of the methodology on a webpage
>> and/or website.
>>
>> My proposal would be to offer a section on the need for, and advantages
>> of, cross-checking and quality control.
>> Kindest regards,
>>
>> Eric
>>
>> ________________________________________
>> From: public-wai-evaltf-request@w3.org
>> [public-wai-evaltf-request@w3.org] on behalf of Detlev Fischer
>> [fischer@dias.de]
>> Sent: Monday, 19 September 2011 21:16
>> To: public-wai-evaltf@w3.org
>> Subject: Carter image - quick alt text exercise - do not reply to list
>>
>> Hi Michael, hi all,
>>
>> I propose to turn this into a little exercise - by no means meant to be
>> a competition about who's the best rater/tester, just meant to give us
>> an idea of whether we might arrive at similar - reliable - results.
>>
>> This is what I have in mind. Go here (for some reason I have replaced
>> Clinton with Carter):
>> http://en.wikipedia.org/wiki/Jimmy_Carter
>>
>> Look at Michael's suggestion below: Is the text short enough? Correct?
>> Descriptive enough? (We can probably drop 'translated correctly' here.)
>>
>> The right sidebar has an image of Carter with an empty alt attribute.
>> Below I suggest 12 different alt texts and would like you, just for
>> fun, to rate each proposed alt text for conformance to SC 1.1.1,
>> using TRUE (T) or FALSE (F) for "conforming" or "not conforming".
>>
>> Please send your answers not to the list but to fischer@dias.de or
>> df@oturn.net, and I will process and report on the results when I have
>> input from all those in EVAL TF who want to participate.
>>
>> All text in quotation marks is the value of the alt attribute of that
>> image. Since the context of the site might influence the assessment of
>> what is appropriate, let's just assume we are evaluating this
>> Wikipedia entry.
>>
>> 1) "Photo of Jimmy Carter, former President of the United States" --
>> Rate:
>> 2) "Jimmy Carter, 38. President of the United States" -- Rate:
>> 3) "Jimmy Carter, former American President" -- Rate:
>> 4) "Jimmy Carter, 2002 Nobel peace prize winner" -- Rate:
>> 5) "Jimmy Carter" -- Rate:
>> 6) "James Earl Carter" -- Rate:
>> 7) "President Carter" -- Rate:
>> 8) "US President Carter" -- Rate:
>> 9) "The 39. President of the U.S." -- Rate:
>> 10) "J. Carter, former Governor of Georgia" -- Rate:
>> 11) "JimmyCarterPortrait2.jpg" -- Rate:
>> 12) "" -- Rate:
>>
>> I look forward to your responses. Please do not reply to the list so
>> you do not influence other respondents.
>>
>> Best regards, and have fun!
>> Detlev
>>
>> Quoting Michael S Elledge<elledge@msu.edu>:
>>
>>> I agree with Kerstin that Objectivity is important enough to warrant
>>> its own requirement, but perhaps for different reasons.
>>>
>>> To me, reliability and objectivity are two different concepts.
>>> Reliability means that the methodology (I like to think of
>>> methodology as the implementation of a test, so if it's easier,
>>> substitute "test" for "methodology") will return the same result
>>> each time. I use the "evaluating alt text method" and each time it
>>> gives the same result, no matter who uses it, for the same image.
>>> For example, Kerstin and I evaluate alt text for a certain image on
>>> a website, and the method tells us the alt text is "photo of Bill
>>> Clinton, former President of the United States."
>>>
>>> Objectivity, however, is the part of the methodology that ensures
>>> that the answer Kerstin and I get is not open to interpretation.
>>> Does it provide the information we need to be sure whether something
>>> is accessible or not?
>>>
>>> This is probably where success criteria come in. Without looking
>>> them up, the success criteria may be that the alt text is short
>>> enough, correct, translated properly, descriptive enough, etc. If we
>>> define the length of the alt text as not exceeding 120 characters or
>>> 20 words, this example would pass the "short enough" criterion. If it
>>> is accurate, it would pass the "correct" criterion. If it has been
>>> translated properly (say, from French), it would pass the "translated
>>> properly" criterion. If it provides additional information that both
>>> reflects the intention of the web designer who included the image and
>>> adds context, then it passes the "descriptive enough" criterion.
>>>
>>> I think this applies whether we are talking about applying the
>>> methodology to a single instance (an image) or to an entire web page.
>>> In either case, we are looking for a methodology that is reliable in
>>> the results it gives us, and that gives us objective results that are
>>> not subject to interpretation.
>>>
>>> How to ensure that a methodology will be both reliable and objective
>>> for a sample of pages is another question, which I will have to give
>>> more thought to, although Kerstin has given us a good start.
>>>
>>> Mike
>>>
>>> On 9/16/2011 4:52 AM, Kerstin Probiesch wrote:
>>>> Hi all,
>>>>
>>>> One could think that Objectivity is included in Reliability, because a
>>>> non-objective test is not reliable and consequently also not valid.
>>>> In the worst case, a test would not measure what we want to measure. I
>>>> wrote about that a few mails ago.
>>>>
>>>> As Objectivity is an important concept, I think it is not only
>>>> necessary but essential to have it as its own Requirement, guided by an
>>>> explanation of the biases which can influence the result of a test.
>>>> Every testing procedure can produce its own violations of
>>>> Objectivity. As I see it, there are three alternatives for an
>>>> Evaluation Methodology:
>>>>
>>>> - Testing every SC on every page
>>>> - Testing a sample of X pages
>>>> - Testing a sample of X pages _and_, for those SCs which are not
>>>> violated, testing those SCs on other pages and parts of the website
>>>>
>>>> Leaving feasibility and pragmatism aside for a moment, let's just have
>>>> a look at Objectivity and Reliability.
>>>>
>>>> 1. Testing every SC on every page
>>>> A tester checks all SCs on every page. Afterwards he/she gives the
>>>> protocol to an independent second tester with the same qualification.
>>>> The second tester can find out whether every SC on every page was
>>>> checked. If not, there are two possibilities: A. the first tester
>>>> overlooked something, or B. he has not overlooked anything, but the
>>>> second tester comes to a different result.
>>>>
>>>> One reason for A can be a measurement error: perhaps the tool was not
>>>> the right one for this test, or it is buggy. Measurement errors like
>>>> this are not a matter of Objectivity or Reliability.
>>>>
>>>> Some other reasons for A can be: the tester really overlooked
>>>> something ("measurement error"), or - and now "Objectivity" comes in -
>>>> the tester may think: "Well, the layout is nice, so the accessibility
>>>> is also nice" or "I know the web agency, they are doing good work, so
>>>> there is no need to look at all SCs on every page". A bias may also be
>>>> "not enough time". More are possible - I suggest collecting them in a
>>>> document for every alternative.
>>>>
>>>> And this is very brief; it is probably clearer to imagine, for those
>>>> two steps, a group of testers and a second group as second testers.
>>>>
>>>> Let's have a look at B. Both checked every SC on every page, but not
>>>> with the same result. This is a question of Reliability, as long as
>>>> there was no measurement error (buggy tool), and it is of course
>>>> influenced by whether we use pass/fail or a score - and a score can
>>>> itself be a bias.
>>>>
>>>> It is really complex, and what I'm writing are just some considerations
>>>> about these things. If I had the perfect test, there would be no need
>>>> for discussion.
>>>>
>>>> 2. Testing a sample of X pages
>>>> For testing a sample of X pages, a preliminary proceeding is needed.
>>>> During this, a tester will not only have the collection of pages in
>>>> mind but will also have a look at the SCs - I think it's an illusion
>>>> that a tester will only look at whether a page is typical/common or
>>>> not. Even if there is no protocol for the preliminary proceeding, he
>>>> will have this "protocol" in mind.
>>>>
>>>> After this proceeding he has to decide which pages should be checked.
>>>> This is a very critical point, because here a tester can influence the
>>>> result in both directions. Of course this shouldn't happen, but a
>>>> tester is not a Buddha. One additional critical point at this stage is
>>>> the number of pages: the more pages, the less possibility there is to
>>>> influence the result.
>>>>
>>>> This doesn't sound nice, I know, but we have to speak about this and
>>>> should not act as if we were the already mentioned Buddhas. There can
>>>> be a lot of biases at this point:
>>>>
>>>> - it's a nice layout, so the tester probably doesn't look very deeply,
>>>> or the tester doesn't like the layout...
>>>> - the tester knows the web developer (he/she is a friend) or he/she
>>>> doesn't like the web developer
>>>>
>>>> and so on
>>>>
>>>> Even the character of the website can influence the selection of the
>>>> X pages:
>>>>
>>>> - it's the website of a political party that the tester (or the second
>>>> tester) likes or doesn't like
>>>> - it's an organisation which a tester supports or doesn't support
>>>>
>>>> Even the personal attitude towards some probable barriers could be a
>>>> bias.
>>>>
>>>> Another critical issue is what I have written in another mail: the
>>>> risk that a website owner has already counted the points/percentages
>>>> of passes/fails before the test.
>>>>
>>>> If the second tester or the second group of testers comes to a
>>>> different result, and this has consequences for the question
>>>> accessible/not accessible, it could easily be a question of
>>>> Objectivity. I'm not sure whether this is really controllable.
>>>>
>>>> If the second independent tester comes to a different result when
>>>> checking the same pages, it's a question of Reliability. If the second
>>>> testers come to the same result even though they have checked other
>>>> (content) pages, we have a good degree of Objectivity and Reliability.
>>>> If the second tester(s) come to a nearly identical result, especially
>>>> when they have checked other content pages, it's a question of the
>>>> metrics.
>>>>
>>>> How can we control that, when a web page contains videos, those videos
>>>> will be checked? And if the tester didn't check the videos, how can we
>>>> tell whether he/she really overlooked them or had some other reason?
>>>>
>>>> And we also shouldn't forget the Levels A, AA, AAA...
>>>>
>>>> The further the Methodology moves away from pass/fail (and maybe
>>>> towards "nearly passes", which is a question of the Metrics) and the
>>>> fewer pages are checked, the more critical it becomes and the less
>>>> controlled the test is.
>>>>
>>>> 3. Testing a sample of X pages _and_, for those SCs which are not
>>>> already violated, testing them elsewhere might be a good way. There
>>>> would be room for what the tester found in the preliminary proceeding
>>>> and for those SCs which are not already violated. Even in this case
>>>> there can be violations of Objectivity (and of course Reliability). I
>>>> haven't thought about this alternative very deeply, but it could be a
>>>> good way.
>>>>
>>>> - The first Alternative is good for small websites but not very
>>>> practical for huge websites, and for every test we need additional
>>>> testers to control the biases. It might be the best and most
>>>> controllable Methodology, but it costs a lot of time.
>>>> - The second Alternative is good for websites where we have roughly a
>>>> 1:1 ratio (number of pages, or fewer, to number of pages selected for
>>>> the test), and probably for Processes, but I fear it closes the door
>>>> to Validity with respect to the Conformance Requirements (even for
>>>> smaller pages), and there is a high risk for Objectivity and
>>>> Reliability as well. And failing all three criteria becomes even more
>>>> likely when we have a nearly uncontrollable test design *and* are
>>>> adding Tolerance/Metrics.
>>>>
>>>> I would suggest spending some time discussing the third Alternative,
>>>> as a mix of page-centered and problem-centered testing is a good way,
>>>> especially in the light of the Conformance Levels (met in full,
>>>> satisfies all the Level A Success Criteria). No test design is 100%,
>>>> but this alternative could probably work, guided by documents (how to
>>>> test, definition of the metrics, an independent second tester or,
>>>> better, a group of testers).
>>>>
>>>> I agree with Eric that we need a process for testing the Methodology.
>>>>
>>>>
>>>> Best
>>>>
>>>> Kerstin
>>>>
>>>> 2011/9/15 Michael S Elledge<elledge@msu.edu>:
>>>>> Hi all--
>>>>>
>>>>> I'm not sure what is meant by a controlled test design. Is this the
>>>>> same as
>>>>> a test protocol?
>>>>>
>>>>> Also, when we are talking about objectivity, are we saying that a
>>>>> method must lead to an unbiased result, that the reviewer must be
>>>>> unbiased, that our criteria are not subjective, or all three?
>>>>>
>>>>> A bit confused.
>>>>>
>>>>> Mike
>>>>>
>>>>> On Sep 15, 2011, at 4:24 AM, Kerstin
>>>>> Probiesch<k.probiesch@googlemail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Detlev, all,
>>>>>>
>>>>>> Because one cannot be sure of 100 percent objectivity, a test design
>>>>>> should be a controlled test design. In our case - we haven't decided
>>>>>> on the approach yet - this can happen, for example, through the
>>>>>> number of pages or the number of pages per SC, and also through
>>>>>> other descriptions of the testing procedures.
>>>>>>
>>>>>> Best
>>>>>>
>>>>>> Kerstin
>>>>>>
>>>>>> Via Mobile
>>>>>>
>>>>>> On 15.09.2011 at 07:39, Detlev Fischer <fischer@dias.de> wrote:
>>>>>>
>>>>>>> Quoting Kerstin Probiesch<k.probiesch@googlemail.com>:
>>>>>>>
>>>>>>>> Central question:
>>>>>>>>
>>>>>>>> Do we want a tester to be able to manipulate the results?
>>>>>>> DF: Of course not, but this cannot be ensured by objectivity
>>>>>>> (whatever that would mean in practice), only by some measure of
>>>>>>> quality control: a second tester or independent verification of
>>>>>>> results (and also verification of the adequacy of the page sample).
>>>>>>>> I don't mean the case that something was overlooked, but the case
>>>>>>>> that something was deliberately overlooked. Or the other way round.
>>>>>>> DF: Well, if someone wants to distort results there will probably
>>>>>>> always be ways to do that; I would not start from that assumption.
>>>>>>> Is one imperfect or missing alt attribute TRUE or FALSE for SC 1.1.1
>>>>>>> applied to the entire page? What about a less-than-perfect heading
>>>>>>> structure? Etc., etc. There is, "objectively", always leeway and
>>>>>>> room for interpretation, and I think we unfortunately DO need
>>>>>>> agreement with reference to cases / examples that set out a model
>>>>>>> for how they should be rated.
>>>>>>>> If not, we need Objectivity as a Requirement. Just agreement on
>>>>>>>> something is not enough.
>>>>>>> DF: Can you explain what in your view the requirement of
>>>>>>> "objectivity"
>>>>>>> should entail *in practice*, as part of the test procedure the
>>>>>>> methodology
>>>>>>> defines?
>>>>>>>
>>>>>>>> And again: No Objectivity - no standardized methodology.
>>>>>>>>
>>>>>>>> Kerstin
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Via Mobile
>>>>>>>>
>>>>>>>> On 14.09.2011 at 12:09, Detlev Fischer <fischer@dias.de> wrote:
>>>>>>>>
>>>>>>>>> DF: Just one point on objective, objectivity:
>>>>>>>>> This is not an easy concept - it relies on a proof protocol. For
>>>>>>>>> example, you would *map* a page instance tested to a
>>>>>>>>> documented inventory of
>>>>>>>>> model cases to establish how you should rate it against a
>>>>>>>>> particular SC.
>>>>>>>>> Often this is easy, but there are many "not ideal" cases to be
>>>>>>>>> dealt with.
>>>>>>>>> So "objective" sounds nice but it does not remove the problem that
>>>>>>>>> there will be cases that do not fit the protocol, at which
>>>>>>>>> point a human (or
>>>>>>>>> group, community) will have to make an informed mapping
>>>>>>>>> decision or extend
>>>>>>>>> the protocol to include the new instance. I think "agreed
>>>>>>>>> interpretation"
>>>>>>>>> hits it nicely because there is the community element in it
>>>>>>>>> which is quite
>>>>>>>>> central to WCAG 2.0 (think of defining accessibility support).
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Detlev
>>>>>>>>>
>>>>>>>>>> Comment (KP): I understand Denis' arguments. The more I think
>>>>>>>>>> about this: neither "unique interpretation" nor "agreed
>>>>>>>>>> interpretation" works very well. I would like to suggest
>>>>>>>>>> "Objective", for the following reason: it would be one of the
>>>>>>>>>> criteria for the quality of tests and includes execution
>>>>>>>>>> objectivity, analysis objectivity and interpretation objectivity.
>>>>>>>>>> If in some cases we have 100 percent, fine; if not, we can
>>>>>>>>>> discuss the "tolerance". I would suggest:
>>>>>>>>>>
>>>>>>>>>> (VC) I'm still contemplating this one. I can see both arguments
>>>>>>>>>> as plausible. I'm okay with 'objectivity' but think it needs more
>>>>>>>>>> explanation, i.e. who defines how objective it is?
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> ---------------------------------------------------------------
>>>>>>> Detlev Fischer PhD
>>>>>>> DIAS GmbH - Daten, Informationssysteme und Analysen im Sozialen
>>>>>>> Management: Thomas Lilienthal, Michael Zapp
>>>>>>>
>>>>>>> Phone: +49-40-43 18 75-25
>>>>>>> Mobile: +49-157 7-170 73 84
>>>>>>> Fax: +49-40-43 18 75-19
>>>>>>> E-Mail: fischer@dias.de
>>>>>>>
>>>>>>> Address: Schulterblatt 36, D-20357 Hamburg
>>>>>>> Hamburg District Court (Amtsgericht), HRB 58 167
>>>>>>> Managing Directors: Thomas Lilienthal, Michael Zapp
>>>>>>> ---------------------------------------------------------------
>>>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>> --
>> ---------------------------------------------------------------
>> Detlev Fischer PhD
>> DIAS GmbH - Daten, Informationssysteme und Analysen im Sozialen
>> Management: Thomas Lilienthal, Michael Zapp
>>
>> Phone: +49-40-43 18 75-25
>> Mobile: +49-157 7-170 73 84
>> Fax: +49-40-43 18 75-19
>> E-Mail: fischer@dias.de
>>
>> Address: Schulterblatt 36, D-20357 Hamburg
>> Hamburg District Court (Amtsgericht), HRB 58 167
>> Managing Directors: Thomas Lilienthal, Michael Zapp
>> ---------------------------------------------------------------
>>
>>
>>
>
>

-- 
Shadi Abou-Zahra - http://www.w3.org/People/shadi/
Activity Lead, W3C/WAI International Program Office
Evaluation and Repair Tools Working Group (ERT WG)
Research and Development Working Group (RDWG)

Received on Thursday, 22 September 2011 08:50:29 UTC