Review of Web Accessibility Metrics Research Report

Dear Markel, dear Giorgio, dear Joshue,

After the deadline for review was extended, I thought I should use 
the opportunity to send you some feedback.

First of all, I think the report is very comprehensive. It highlights 
many different aspects and has a clear structure and language. You did a 
great job in summarising the online symposium.

Now the details...

Comments on Research Report on Web Accessibility Metrics
(W3C Working Draft 30 August 2012)

Section 1.1 Definition and background

At the beginning of this section the concepts "metric" and "indicator" 
get mixed up. I'd suggest using the term "indicator" to refer to single 
dimensions that can be assessed objectively (such as the number of 
pictures, violations, etc.). Maybe you mean the same thing when you 
refer to "basic metrics". In my opinion, a metric combines several 
indicators using different mathematical operations, weighting 
parameters, etc. - exactly as in your example (readability metrics).
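
To illustrate what I mean, here is a minimal sketch in Python (all 
indicator names, values, and weights are made up for illustration):

# An indicator is a single, objectively measurable dimension;
# a metric combines several indicators into one score.
def metric(indicators, weights):
    total = sum(weights.values())
    return sum(weights[name] * value
               for name, value in indicators.items()) / total

# Hypothetical indicator values, e.g. normalised violation rates.
indicators = {"images_without_alt": 0.2, "contrast_violations": 0.1}
# Weighting parameters belong to the metric, not to the data.
weights = {"images_without_alt": 2.0, "contrast_violations": 1.0}

print(metric(indicators, weights))  # one combined score, here ~0.167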

As a side note: The item "The severity of an accessibility barrier." 
doesn't fit in the list because it is not an indicator (at least I 
don't know how it could be measured objectively).

The list "different types of data can be produced" mixes "ordinal 
values" and "conformance levels". These should be distiguished:
* Conformance levels (AAA, AA, A) have a fixed frame of reference 
(WCAG). It is possible to determine the conformance level of a single 
web site.
* Ordinal values (ordinal means "ordered") refer to something like a 
ranking, i.e. you can compare two web sites and determine which one is 
better, but not necessarily to what extent one is better than the 
other. It does not make sense to compute an ordinal value for a single 
site.

The distinction you want to make here is maybe between discrete and 
continuous values.
* Discrete values: for instance school grades "A, B, C, D, E, F"
* Continuous values: for instance values between 0 and 1 (maybe this is 
what you call "Quantitative ratio values").

As a side note: There are other mathematical properties of the results 
that could be interesting, such as "bounded vs. unbounded".
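
To make the difference between these value types concrete, a small 
sketch (the checks, scores, and site names are all hypothetical):

# Conformance level: fixed frame of reference (WCAG), meaningful for
# a single site on its own.
def conformance_level(failed_checks):
    if failed_checks["A"] > 0:
        return "none"
    if failed_checks["AA"] > 0:
        return "A"
    if failed_checks["AAA"] > 0:
        return "AA"
    return "AAA"

print(conformance_level({"A": 0, "AA": 2, "AAA": 5}))  # "A"

# Ordinal value: a ranking relative to other sites - it tells you which
# site is better, not by how much, and means nothing for a single site.
scores = {"site1": 0.7, "site2": 0.4, "site3": 0.9}  # continuous in [0, 1]
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['site3', 'site1', 'site2']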


Section 1.2 The Benefits of Using Metrics
The reasons given in this section all relate to automated calculation of 
metrics. The last paragraph of the previous section also discusses the 
relationship (mainly disadvantages) between metrics and automated 
testing.
Suggestions: Make it more explicit that metrics are not the same as 
automated testing, and discuss the benefits and disadvantages in the 
same section.


Section 2.1 Validity
The example (picture without alt) seems to question the validity of 
WCAG. The goal of the guidelines is to describe accessibility for the 
widest possible range of users. How can the definition of users in 
"accessibility-in-use" address this issue?


Section 2.3 Sensitivity
The logic in the sentence is reversed. It should say: "how changes in a 
given website are reflected in the metric output". The web site cannot 
reflect the metric because it is independent of it.
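
This is how I understand sensitivity; a toy sketch (the metric and the 
page snippets are made up):

# Toy metric: share of images that have an alt attribute.
def toy_metric(html):
    imgs = html.count("<img")
    return html.count("alt=") / imgs if imgs else 1.0

# Sensitivity: how a change in a given website is reflected in the
# metric output (not the other way around).
def metric_delta(metric, page, changed_page):
    return metric(changed_page) - metric(page)

before = '<img src="a.png"><img src="b.png">'
after = '<img src="a.png" alt="logo"><img src="b.png">'
print(metric_delta(toy_metric, before, after))  # 0.5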


Section 3.5 Novel Measurement Approaches
Wording: "counter-example techniques" -> "common failures"


Section 4.2 Validity
"Conformance" can not be viewed independent of the requirements to which 
conformance is claimed. That means that "validity with respect to 
conformance" is directly related to "validity of the requirements". But 
validity of requirements (or guidelines) is clearly beyond the scope of 
this TR. How can this research question be refined?


Section 4.3 Reliability

Question about the first item: In other parts of this report you say 
that a tool produces data and the metrics calculate the score from this 
data. So this research question can be interpreted in two ways: (1) 
compare the results of the same metric applied to the output of 
different tools. (2) compare the results of different metrics applied to 
the same tool output. - Both could be interesting.
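
In code, the two readings could be sketched like this (all tool and 
metric names are hypothetical stubs):

# Tools produce data; metrics calculate a score from that data.
def tool_x(site): return {"violations": 3}
def tool_y(site): return {"violations": 5}
def metric_a(data): return 1 / (1 + data["violations"])
def metric_b(data): return max(0.0, 1 - data["violations"] / 10)

site = "http://example.org"
# (1) the same metric applied to the output of different tools:
print([metric_a(tool(site)) for tool in (tool_x, tool_y)])
# (2) different metrics applied to the same tool output:
print([m(tool_x(site)) for m in (metric_a, metric_b)])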

Question about the second item: Is the data really independent of the 
guideline? - I think it is not, but that is of course also the view 
presented in my paper. The guidelines already contain a lot of 
information that can help shape the indicators (i.e. the collected 
data) AND the metrics.


Section 4.4.3 Complexity
In some parts of the report you say that a simple metric is not 
necessarily a good metric. This is not the whole truth. Complex metrics 
(formulae with many unknown parameters, such as weights for disability 
and severity) also cause many problems in terms of parameter estimation 
and justification. So in these cases simple might be better.
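
A quick illustration of how fast the number of unknown parameters 
grows (the groups and levels are hypothetical):

# One weight per (disability group, severity level) pair already gives
# 5 * 3 = 15 free parameters that need estimation and justification -
# before any other terms of the formula are considered.
disability_groups = ["blind", "low_vision", "motor", "cognitive", "deaf"]
severity_levels = ["minor", "major", "blocking"]
print(len(disability_groups) * len(severity_levels))  # 15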


Section 5. A Corpus for Benchmarking Metrics
A comment on tools: software has bugs, which can of course affect the 
validity of the results. A benchmarking corpus could be used to improve 
the quality of the software.

It is important to define whether the corpus should consist of labeled 
or unlabeled examples. And what would the labels be? Binary labels 
(accessible vs. not accessible) are not sufficient. But on the other 
hand, any more complex definition of labels would be a metric in itself.
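
For illustration, corpus entries might look like this (purely 
hypothetical structure and values):

# Binary labels are easy to define but not sufficient; any richer
# label (e.g. a score) already presupposes a metric.
corpus = [
    {"url": "http://example.org/a", "label": "accessible"},
    {"url": "http://example.org/b", "label": "not accessible"},
    {"url": "http://example.org/c", "label": 0.63},  # defined by which metric?
]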


Section 5.2 User-tailored metrics
It would be helpful to clarify the relationship of "user-tailored 
metrics" to the concept of "accessibility-in-use" mentioned earlier.


Other comment:
And finally some input from the discussions during the ICCHP session: A 
topic that came up several times was the idea of enhancing automated 
tests by combining them with expert or user input. This should also be 
mentioned in the road map.



I hope my comments are helpful for the finalisation of your report. I'd 
be happy to discuss and provide further details, in case you have any 
questions.

Kind regards
Annika

-- 
Annika Nietzio
email an@ftb-volmarstein.de
web          www.ftb-net.de
phone  +49 (0) 2335 9681-29

Forschungsinstitut Technologie und Behinderung (FTB)
der Evangelischen Stiftung Volmarstein
Grundschoetteler Str. 40  D-58300 Wetter/Ruhr  Germany
