This is a submission for the RDWG Symposium on Website Accessibility Metrics. It has not yet been reviewed or accepted for publication. Please refer to the RDWG Frequently Asked Questions (FAQ) for more information about RDWG symposia and publications.
Monitoring of web accessibility through regular benchmarks raises awareness and thus encourages improvements to web sites. The eGovernment Monitoring (eGovMon) project [1] has collaborated with a group of Norwegian municipalities for more than two years, achieving encouraging results through a combination of evaluations and consultancy.
Initially, eGovMon used benchmarking tools based on the Unified Web Evaluation Methodology 1.2 (UWEM) [2], which describes conformance evaluations and large-scale benchmarking for WCAG 1.0. However, since WCAG 1.0 has been superseded by WCAG 2.0, the eGovMon approach (including the implemented tests) had to be updated to accommodate the new guidelines.
This paper describes the requirements and challenges identified during the update of the metrics and reporting functions for WCAG 2.0 benchmarking. The core part of the reporting is the score function, which summarises the accessibility status of a web page or site in a single number. We focus on web page scores because we assume that a web site score can easily be defined on top of a soundly constructed web page score.
First we look at the requirements perspective: what are the desirable properties of an accessibility score function? Then we take a WCAG 2.0-specific view, giving special consideration to the new properties of WCAG 2.0 as compared to WCAG 1.0. Finally, the last section of the paper presents some ideas for developing a unified WCAG 2.0 score function, which would allow the comparison of WCAG 2.0 evaluations carried out by different tools.
Ideally, not only should results from different tools be comparable; it is also desirable to gain more insight into the comparability of expert evaluations and automated tests. However, this topic is beyond the scope of this paper.
During the work on UWEM 1.2, a process for indicator requirement analysis was established. First, all potential properties were collected and grouped according to the parts of the evaluation process: there are requirements for crawling and sampling, requirements addressing mathematical and statistical properties of the score, and requirements that describe how the score reflects certain features of the web content. Afterwards, a theoretical analysis investigated the dependencies and selected a set of non-conflicting properties for the score function. Finally, several suggested score functions were compared with regard to how well they meet the properties, and the best candidate function was chosen. This process and its outcomes are described in the UWEM Indicator Refinement Report [3]. Because of the good experience with this process, we applied it to WCAG 2.0 as well. The structural differences between WCAG 1.0 and 2.0 make it necessary to revise some of the requirements; Section 3 looks into this aspect.
Vigo and Brajnik [4] analysed the desirable properties of web metrics for various application scenarios. They also present some quality attributes for benchmarking. The most relevant items for the score function are low sensitivity towards small changes in the web page and adequacy of scale and range of the score values.
The score function should be tailored to the structure of the test set (in this case WCAG 2.0). Therefore, we begin the design of the score function with an analysis of WCAG 2.0.
The WCAG 1.0 tests (as defined in UWEM 1.2) are independent: failure of a test also means failure of a WCAG 1.0 Checkpoint. The structure of WCAG 2.0 is different. The Techniques, with their detailed test procedures, provide a natural starting point for the implementation of an evaluation tool. In the presentation of results, however, the dependencies between the Techniques must be taken into account. On the one hand, there are Common Failures, which directly cause the web content to fail a Success Criterion (SC). On the other hand, conclusions from Sufficient Techniques can only be drawn if the logical combinations [5] are considered. We therefore suggest deriving the implementation of tests for a Success Criterion from the Common Failures and from the Sufficient Techniques combined according to their logical combinations, as sketched below.
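To make the intended aggregation concrete, the following minimal Python sketch derives an SC verdict from Technique-level outcomes. It is an illustration under our own assumptions (the data structures and the three-valued verdict are hypothetical), not the eGovMon implementation:

```python
# Illustrative sketch (not the eGovMon implementation) of deriving a
# Success Criterion verdict from Technique-level outcomes.

def sc_verdict(common_failures, sufficient_combinations):
    """Derive a verdict for one Success Criterion.

    common_failures: list of booleans, True where a Common Failure
        was detected in the content.
    sufficient_combinations: the logical combinations of Sufficient
        Techniques; each combination is a list of Technique outcomes
        ("pass", "fail", "untested") and is satisfied only if all of
        its Techniques pass.
    """
    # A detected Common Failure makes the content fail the SC directly.
    if any(common_failures):
        return "fail"
    # The SC is met if at least one sufficient combination passes in full.
    for combination in sufficient_combinations:
        if all(outcome == "pass" for outcome in combination):
            return "pass"
    # Otherwise no automatic conclusion can be drawn; flag for review.
    return "needs human judgement"

# Example: no Common Failure; the first combination passes completely.
print(sc_verdict(
    common_failures=[False, False],
    sufficient_combinations=[["pass", "pass"], ["fail", "untested"]],
))  # -> "pass"
```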
Moreover, some Techniques are used by several Success Criteria. For these reasons, interpreting results below the level of Success Criteria is not meaningful; we therefore select Success Criteria as the first level of aggregation.
This is a major difference from UWEM 1.2, which has a widely varying number of tests per WCAG 1.0 Checkpoint: each test contributes equally to the score, causing Checkpoints with many tests to be over-represented in the result. Using Success Criteria as an intermediate aggregation level has several further advantages. The Level of the Success Criterion can be included in the score, the influence of Success Criteria with many Techniques is balanced, and in an automated tool it becomes easy to highlight which Success Criteria need human judgement or were not tested.
Disadvantages of the approach are that the number of instances of a specific feature (such as a form control) does not influence the score, and that no conclusion can be drawn if a tool does not implement all Techniques related to an SC.
We suggest the following score function, which takes the above considerations into account. The SC-level result for page p is defined as one minus the ratio of instances where tests for Success Criterion c failed on page p:

R_c(p) = 1 - f_c(p) / n_c(p)

where f_c(p) denotes the number of instances where tests for c failed and n_c(p) denotes the number of all instances where tests for c were applied.
The page score S(p) is calculated as the average of the SC-level page results.
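As a toy illustration of this definition, a few lines of Python compute R_c(p) and S(p) from per-SC instance counts; the counts and Success Criterion identifiers below are made up, not eGovMon data:

```python
# Sketch of the suggested score function, assuming per-SC counts as
# defined above: f[c] failed instances, n[c] applied instances.

def page_score(f, n):
    """Average of the SC-level results R_c(p) = 1 - f_c(p) / n_c(p).

    f, n: dicts mapping a Success Criterion id to instance counts.
    Only SCs whose tests were applied at least once contribute.
    """
    results = [1 - f.get(c, 0) / n[c] for c in n if n[c] > 0]
    return sum(results) / len(results) if results else None

# Example: SC 1.1.1 fails on 2 of 10 instances, SC 2.4.4 fully passes.
f = {"1.1.1": 2, "2.4.4": 0}
n = {"1.1.1": 10, "2.4.4": 5}
print(page_score(f, n))  # (0.8 + 1.0) / 2 = 0.9
```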
The eGovMon project is developing an online checker for WCAG 2.0 that uses the new page score function. Initial experiments show that the results of the function meet the requirements and are understandable for the potential users of the checker. We have also started work on a web site score function, which involves addressing several open questions.
Finally, further testing of the checker tool is planned to ensure that the score function meets the main "soft requirement": the score value must make sense when presented to the users.
Large-scale benchmarking of web accessibility often relies on tools due to resource limitations. Although a number of tools claim to check according to WCAG 2.0, their results are still not comparable.
This problem is mainly caused by the varying granularity of tests (some tools implement several tests per Success Criterion while others have only one) and by differences in counting instances (some tools count every checked HTML element while others count each instance only once). The tools also differ in how outcomes are grouped into categories such as "error", "potential error", and "warning".
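A small made-up example shows how much the counting convention alone can move the SC-level result for one and the same page (all numbers are hypothetical):

```python
# Hypothetical illustration: the same page yields different failure
# ratios depending on whether every checked HTML element is counted
# or each distinct instance is counted only once.

# Say a decorative image appears 4 times and fails the same test each time.
per_element = {"failed": 4, "applied": 10}   # every element counted
per_instance = {"failed": 1, "applied": 7}   # duplicates collapsed

for name, counts in [("per element", per_element),
                     ("per instance", per_instance)]:
    r = 1 - counts["failed"] / counts["applied"]
    print(f"{name}: R_c = {r:.2f}")  # 0.60 vs. 0.86 for the same page
```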
Another reason lies in the WCAG 2.0 documents themselves: the aggregation of results from the Techniques is sometimes not well documented. Alonso et al. [6] describe the consequences of this challenge:
"This could lead to a situation where different evaluators use different aggregation strategies and thus produce different evaluation results."
A first, simple step to increase the comparability of results from different tools would be the introduction of aggregation at the level of Success Criteria, as suggested in this paper.
To define a truly unified WCAG 2.0 score and thus achieve actual inter-tool reliability, as demanded by Vigo and Brajnik [4], a dedicated collaboration between tool developers and researchers would be necessary to harmonise, among other things, test granularity, instance counting, and the aggregation of Technique results.
The eGovMon project is co-funded by the Research Council of Norway under the VERDIKT program. Project no.: VERDIKT 183392/S10.