Re: Re: Re: Goodness criteria

Hi Detlev, all,

Concerning reliability, and as I understand the concept, it will be good to distinguish between "external" factors and factors of the test construction itself. A tester, even an experienced tester, makes mistakes or overlooks something. As we are all just humans, it is very likely that such observational errors happen. They can arise from the tester's qualification, but also if a tester has not enough time and has to hurry, is not concentrated enough, is sick, or for several other reasons.

Reliability is the part of the variance that can be explained by actual differences rather than by measurement error or fluctuation of the measured characteristic. Reliability can range between 0 and 1.
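
In classical test theory this idea is usually written as the ratio of true-score variance to total observed variance. A minimal sketch of that ratio (the function name and the numbers are mine, purely for illustration, not from any actual test):

```python
# Classical test theory: observed variance = true-score variance + error variance.
# Reliability is the share of observed variance explained by actual differences.

def reliability(true_variance, error_variance):
    """Share of observed variance explained by actual differences (0..1)."""
    return true_variance / (true_variance + error_variance)

# A test whose error variance equals its true-score variance is only 50% reliable.
print(reliability(4.0, 4.0))  # 0.5
# Shrinking the measurement error pushes reliability toward 1.
print(reliability(4.0, 1.0))  # 0.8
```

As the sketch shows, reducing observational errors (the error variance) is exactly what raises reliability toward 1.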

Therefore the angles must lie in the test construction itself, and I think we are on an excellent track (ok, probably except for the optional score, which will depend on what is meant by "score" ;-) )

> In my view, the critical question is to what extent we can nail down
> specific requirements, say, for reliability, in WCAG-EM.
> 
> You can think of different angles on that:

> * Requiring a particular evaluator qualification (quite difficult if
> only for national differences)
> * Requiring a particular level of evaluator experience (but how do we
> measure that?)

As written above, observational errors might happen. I haven't understood "if only for national differences": WCAG 2 is WCAG 2. If there are national deviations from WCAG 2, it is no longer WCAG 2. But perhaps I have misunderstood this point.

> * Defining what the sample must include (mostly done although this
> might need changes)

I think the sample is one angle on reliability, and we have to be very exact on this point. The more flexible the sampling requirement is, the less reliable a test will probably be. And concerning this issue, I think we are also on a very good track.

> * Requiring that a test must be performed independently by more than
> one tester (this would improve reliability but is costly and will not be
> mandated by WCAG-EM if I am not mistaken)

I think the role of a second tester in a particular test (for example, testing the website of company X) is to find the observational errors of the first tester. So it is excellent for evaluating the specific result of a particular test, but not, I think, for evaluating the reliability of the test/methodology itself.

But this is an important point: after the test construction is done, it is important to evaluate the test itself.

@Shadi and Eric: Will there be a phase in our work where we evaluate our methodology itself? For example: testing the same page, eliminating observational errors, and then checking whether the methodology is reliable enough or not. I think this would be an important step and should be done as soon as possible to minimize the risk of "failures" in our methodology.

> * Designing some process to resolve differences in evaluator ratings /
> assessments (that is the BITV-Test approach)
> * Setting some threshold, or offset, for rating differences between
> independent testers that must be met so that the required level of
> confidence is met

I think these are important angles for minimizing observational errors.

But I also think I don't quite get this point. Probably we don't mean the same thing by the term "independent tester". From my point of view there is a difference between a "second tester" and an "independent second tester". A "second tester" can be a tester from the same testing organization or an external tester. An "independent second tester" must be external; otherwise the tester is not independent (as I understood the term "independent").
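
The "threshold for rating differences between independent testers" mentioned above is often operationalized with a chance-corrected agreement statistic such as Cohen's kappa. A minimal sketch (the tester names, ratings, and any threshold value would be invented for illustration; none of them come from this thread):

```python
# Cohen's kappa: agreement between two raters on the same items,
# corrected for the agreement expected by pure chance.
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(ratings_a)
    # Observed proportion of items where both raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement if both raters assigned categories at random
    # according to their own marginal frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail ratings of the same six checks by two testers.
tester_1 = ["pass", "pass", "fail", "pass", "fail", "fail"]
tester_2 = ["pass", "fail", "fail", "pass", "fail", "fail"]
kappa = cohens_kappa(tester_1, tester_2)
print(round(kappa, 2))  # 0.67
```

A methodology could then require that kappa between independent testers stays above some agreed threshold before results are considered sufficiently reliable.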

Best

Kerstin

> Regards,
> Detlev
> 
> --
> testkreis c/o feld.wald.wiese
> Borselstraße 3-7 (im Hof), 22765 Hamburg
> 
> Mobil +49 (0)1577 170 73 84
> Tel +49 (0)40 439 10 68-3
> Fax +49 (0)40 439 10 68-5
> 
> http://www.testkreis.de
> Consulting, testing and training for accessible websites
> 
> 
> 
> ----- Original Message -----
> From: k.probiesch@googlemail.com
> To: shadi@w3.org
> Date: 22.03.2012 16:05:44
> Subject: Re: Re: Goodness criteria
> 
> 
> > Hi Shadi,
> >
> > I try my best.
> >
> > The mentioned article is about goodness criteria in qualitative
> > studies. (There is a long-running dispute over methods between
> > researchers doing quantitative research and researchers doing
> > qualitative research.) One can for sure discuss whether those goodness
> > criteria should have that much relevance in _qualitative_ research.
> >
> > But: evaluating websites is not qualitative. Evaluating websites
> > belongs to the quantitative field, and in the quantitative field there
> > is no question and no discussion at all about the relevance of
> > reliability, objectivity and validity.
> >
> > Sorry, it seems today my English is worse than ever before.
> >
> > Best
> >
> > Kerstin
> >
> >
> >
> >
> >
> >
> >
> >
> >> -----Original Message-----
> >> From: Shadi Abou-Zahra [mailto:shadi@w3.org]
> >> Sent: Thursday, 22 March 2012 15:02
> >> To: Kerstin Probiesch
> >> Cc: public-wai-evaltf@w3.org
> >> Subject: Re: Re: Goodness criteria
> >>
> >> Hi Kerstin,
> >>
> >> I must admit that I have difficulty understanding your specific
> >> suggestion or request, despite having read it several times.
> >>
> >> Would you mind rephrasing your comment more clearly?
> >>
> >> Thanks,
> >>    Shadi
> >>
> >>
> >> On 22.3.2012 14:32, Kerstin Probiesch wrote:
> >> > Hi Detlev, all,
> >> >
> >> > evaluating websites or pages without a standardized methodology is
> >> > for me nearly worthless. Users and also clients have not only an
> >> > interest in but the right that websites are tested with reliable
> >> > tests, and that accessibility means the same in every country and
> >> > doesn't depend on personal interpretation.
> >> >
> >> > Anyway. The link is very interesting and written from the
> >> > perspective of qualitative social research: "When it comes to
> >> > discussing goodness or quality criteria of the (qualitative) social
> >> > sciences…". The article was published at Forum: Qualitative Social
> >> > Research and is, I believe, part of the apologetic scientific
> >> > literature of qualitative researchers in the context of the dispute
> >> > over methods between scientists doing quantitative and those doing
> >> > qualitative research.
> >> >
> >> > Nothing is "wrong" with that. Qualitative research is about people,
> >> > and typical methods are interviews (narrative, problem-centered, ...),
> >> > for example in the case of ethnographic field studies, where great
> >> > researchers like Malinowski did fundamental research – especially on
> >> > participant observation.
> >> >
> >> > Very interesting would be the results of qualitative interviews
> >> > with web developers about why some advocate accessibility and
> >> > others do not, or how they see their own knowledge about
> >> > accessibility.
> >> >
> >> > But: evaluating websites is not qualitative social research.
> >> >
> >> > Kerstin
> >> >
> >> >> -----Original Message-----
> >> >> From: detlev.fischer@testkreis.de
> >> >> [mailto:detlev.fischer@testkreis.de]
> >> >> Sent: Thursday, 22 March 2012 12:40
> >> >> To: public-wai-evaltf@w3.org
> >> >> Subject: Goodness criteria
> >> >>
> >> >> Hi list,
> >> >>
> >> >> just a few words about Kerstin's request to bring goodness criteria
> >> >> into the section 1.1 on scope.
> >> >>
> >> >> I'm not sure what this inclusion will add for those applying the
> >> >> methodology when conducting tests (or defining test procedures for
> >> >> others to follow).
> >> >>
> >> >> Here are my 2 cents on the three terms objectivity, validity,
> >> >> reliability:
> >> >>
> >> >> Objectivity
> >> >> Normally this refers to minimising individual (inter-evaluator)
> >> >> differences in observation or judgement.
> >> >> While we can objectively measure temperature, dimensions, etc. based
> >> >> on normative scales, there are several factors that make objectivity
> >> >> little more than an ideal that can be approached but never reached
> >> >> in website evaluation. Several aspects contribute to that:
> >> >>
> >> >> 1. Evaluators have different backgrounds and dispositions. One can
> >> >> try to minimise these differences by uniform curricula and training,
> >> >> and in dialogues aimed at a consensual adjustment of judgements in
> >> >> typical cases.
> >> >>
> >> >> 2. Web content out there is complex and often fails to fit the
> >> >> patterns described in documented techniques. There is nothing we can
> >> >> do about that :-)
> >> >>
> >> >> 3. The rating of Success Criteria is often not strictly independent
> >> >> of other SC. Instances can fail several SC at the same time, and
> >> >> context must be taken into account to judge instances. How that is
> >> >> done will often vary across evaluators.
> >> >>
> >> >> Validity
> >> >> The validity of an evaluation is ultimately the degree to which an
> >> >> evaluation result reflects the actual degree of accessibility across
> >> >> users with disabilities. So there is a strong temporal element here.
> >> >> The validity of assessments will depend, for example, on the current
> >> >> degree of accessibility support of techniques used to claim
> >> >> conformance. As the web changes and relevant accessibility
> >> >> techniques change with it, maintaining validity means maintaining
> >> >> the timeliness and relevance of the techniques and failures that
> >> >> operationalize the general success criteria (or, if a tester wants
> >> >> to avoid any reference to documented techniques, maintaining the
> >> >> knowledge of what is currently supported and what is not, or not
> >> >> yet).
> >> >> As WCAG-EM just references techniques maintained outside its scope,
> >> >> I wonder whether it is the right place to cover validity.
> >> >>
> >> >> Reliability
> >> >> Reliability seems to depend on several aspects:
> >> >>
> >> >> 1. the knowledge, diligence and amount of time invested by the
> >> >> individual evaluator across all relevant steps
> >> >>
> >> >> 2. the degree of operationalization: the more prescriptive the test
> >> >> procedure, the higher the likelihood of replicability. As WCAG-EM
> >> >> will not (for good reasons) go into detail regarding tools or
> >> >> particular procedures based on tools, I doubt that WCAG-EM alone can
> >> >> safeguard replicability (which might be the job of more prescriptive
> >> >> procedures based on it)
> >> >>
> >> >> 3. The number of testers carrying out the same test (re-test,
> >> >> replicate) or the availability of additional quality assurance -
> >> >> again something probably to be defined beyond the scope of WCAG-EM
> >> >>
> >> >> As a last comment, I am not convinced that "goodness criteria are
> >> >> defined and internationally agreed in the scientific community"
> >> >> means that these are a given that can simply be referenced and taken
> >> >> for granted. This may be true for hard sciences, but an evaluation
> >> >> is subject to many 'soft' social and contextual aspects. One should
> >> >> aim to keep these in check, but it is impossible to eliminate them
> >> >> entirely. Instead, they must be managed. Perhaps this article has
> >> >> some useful pointers:
> >> >>
> >> >> http://www.qualitative-research.net/index.php/fqs/article/view/919/2008
> >> >>
> >> >> Conclusion
> >> >> While I think mentioning the goodness criteria in the section on
> >> >> scope probably does no harm, I am not convinced that this will
> >> >> improve the way WCAG-EM is used. It could be useful, however, to
> >> >> give guidance on how to approach or improve the aims of objectivity,
> >> >> validity, reliability in practical terms. Whether such guidance can
> >> >> be prescriptive for operational procedures based on WCAG-EM, I am
> >> >> not so sure about. Let's discuss...
> >> >>
> >> >> Best regards,
> >> >> Detlev
> >> >>
> >> >> --
> >> >> testkreis c/o feld.wald.wiese
> >> >> Borselstraße 3-7 (im Hof), 22765 Hamburg
> >> >>
> >> >> Mobil +49 (0)1577 170 73 84
> >> >> Tel +49 (0)40 439 10 68-3
> >> >> Fax +49 (0)40 439 10 68-5
> >> >>
> >> >> http://www.testkreis.de
> >> >> Consulting, testing and training for accessible websites
> >> >>
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >> >
> >>
> >> --
> >> Shadi Abou-Zahra - http://www.w3.org/People/shadi/
> >> Activity Lead, W3C/WAI International Program Office
> >> Evaluation and Repair Tools Working Group (ERT WG)
> >> Research and Development Working Group (RDWG)
> >

Received on Friday, 23 March 2012 09:28:05 UTC