This is a submission for the RDWG Symposium on Website Accessibility Metrics. It has not yet been reviewed or accepted for publication. Please refer to the RDWG Frequently Asked Questions (FAQ) for more information about RDWG symposia and publications.
1.1. Measuring Web accessibility means measuring the quality of Web content from an accessibility standpoint. Two key goals stand out when measuring the current level of accessibility of a website / collection of Web content:
1.2. An enterprise attempting to improve Web accessibility may depend on accessibility metrics to answer questions like:
1.3. Most users who depend on accessibility would want to know the accessibility ranking of Web pages returned by a search engine. In this context, a ranking expressed in simple terms, say on a scale of 1 to 5 (low to high accessibility) or good / average / poor, is likely to be very usable. The discriminative power of such rankings will help users choose preferred websites for online tasks.
1.4. Web accessibility metrics should cater to the above goals, be practical to compute, and be easy to interpret. Like any metric, they should be valid and reliable. In short, meeting the above goals requires reliable methods that yield consistent results for well-defined accessibility measures.
Several factors are relevant when discussing what constitutes an accessibility violation and what makes a Web accessibility evaluation process reliable.
2.1. Elaborate supporting documentation exists for WCAG 2.0, and the success criteria have been drafted in a manner that is considered to be highly “testable”. Yet every success criterion has multiple facets that need to be evaluated before one can reliably conclude that the criterion has been met. Not all facets can be checked reliably in an automated fashion; checking them requires the exercise of human judgment because an element of subjectivity is involved, e.g. the appropriateness of alt text for images or of labels on form controls, as the sketch below illustrates.
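To make the distinction concrete, the following is a minimal sketch in Python (using only the standard library; it is not part of any evaluation tool) of an automated check for text alternatives on images. The presence of an alt attribute is machine-decidable, but the appropriateness of its value can only be flagged heuristically for human review.

    from html.parser import HTMLParser

    # Alt values a heuristic can flag as probably meaningless; the
    # final judgment still belongs to a human evaluator.
    SUSPICIOUS_ALT = {"", "image", "photo", "picture", "graphic", "spacer"}

    class ImgAltChecker(HTMLParser):
        def __init__(self):
            super().__init__()
            self.failures = []      # definite violations: alt attribute missing
            self.needs_review = []  # alt present but possibly meaningless

        def handle_starttag(self, tag, attrs):
            if tag != "img":
                return
            attrs = dict(attrs)
            src = attrs.get("src", "<unknown>")
            if "alt" not in attrs:
                # Reliable automated finding: no text alternative at all.
                self.failures.append(src)
            else:
                alt = (attrs["alt"] or "").strip().lower()
                if alt in SUSPICIOUS_ALT or alt.endswith((".jpg", ".png", ".gif")):
                    # Heuristic only: whether the alt text actually describes
                    # the image cannot be decided without human judgment.
                    self.needs_review.append((src, attrs["alt"]))

    checker = ImgAltChecker()
    checker.feed('<img src="logo.png">'
                 '<img src="chart.png" alt="chart.png">'
                 '<img src="ceo.jpg" alt="Portrait of the CEO">')
    print("violations:", checker.failures)        # ['logo.png']
    print("human review:", checker.needs_review)  # [('chart.png', 'chart.png')]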
2.2. “Accessibility in use” is not well defined and is subjective, so the inter-rater reliability of such accessibility measures is likely to be poor. Some accessibility barriers may prevent users from perceiving certain Web content or its associated functionality altogether; in such cases, methods like crowd-sourcing would likely yield invalid accessibility measures.
2.3. Therefore, for this paper, accessibility is measured in terms of conformance with WCAG 2.0. WCAG 2.0 requires a website to meet all applicable Level-A success criteria (SC) in order to conform at that level. Level-AA SC build on the critical accessibility level attained by conforming to Level-A SC. Measuring success against Level-AA SC is therefore relevant only for pages that are already Level-A conformant.
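This conformance logic can be summarized in a short sketch (the data shape is assumed for illustration and is not an API of any WCAG tool): a page that fails any applicable Level-A success criterion conforms at no level, so evaluating its Level-AA results adds nothing.

    def conformance_level(results):
        """results: dict mapping a level ('A' or 'AA') to a list of
        booleans, one per applicable success criterion at that level."""
        if not all(results.get("A", [])):
            return None  # fails Level A, so not conformant at any level
        if results.get("AA") and all(results["AA"]):
            return "AA"
        return "A"

    print(conformance_level({"A": [True, True], "AA": [True, False]}))  # A
    print(conformance_level({"A": [True, False], "AA": [True, True]}))  # None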
2.4. Automated Web accessibility evaluation is speedy, but tools differ in their evaluation techniques, algorithms, and scope of coverage. Therefore, the results produced by two automated evaluators can vary significantly. Reliability issues, both inter-tool and intra-tool (as algorithms are enhanced over time), are very likely. Tools are known to produce false positives, so results need vetting; it is widely acknowledged that tools can identify only some accessibility barriers without requiring human confirmation. They also fail to detect some violations altogether (false negatives) [1].
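As a hedged illustration of such inter-tool variance (the violation entries below are invented, not output from real tools), the divergence between two evaluators' reports can be quantified by treating each report as a set and computing a simple agreement score:

    # Each violation is identified as (page, success criterion, element);
    # all entries are hypothetical.
    tool_a = {("page1", "1.1.1", "img#logo"),
              ("page1", "1.3.1", "table#nav"),
              ("page1", "2.4.4", "a#more")}
    tool_b = {("page1", "1.1.1", "img#logo"),
              ("page1", "4.1.2", "input#q")}

    # Jaccard similarity: shared findings over all findings from either tool.
    jaccard = len(tool_a & tool_b) / len(tool_a | tool_b)
    print(f"agreement: {jaccard:.2f}, only tool A: {len(tool_a - tool_b)}, "
          f"only tool B: {len(tool_b - tool_a)}")
    # agreement: 0.25, only tool A: 2, only tool B: 1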
2.5. On the other hand, manual testing, involving code inspection to confirm or discover issues by accessibility experts and real users of assistive technologies, is expensive, time-consuming, and impractical when thousands of pages need to be evaluated. But the results are subject to a smaller degree of invalidity [1]. Inter-rater reliability issues remain likely, whereas intra-rater unreliability would perhaps be very low.
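The inter-rater reliability concern can be made measurable. The following worked example (with illustrative numbers only) computes Cohen's kappa, which corrects the raw agreement between two human evaluators' pass/fail judgments for the agreement expected by chance:

    def cohens_kappa(rater1, rater2):
        n = len(rater1)
        observed = sum(a == b for a, b in zip(rater1, rater2)) / n
        # Chance agreement from each rater's marginal pass/fail rates.
        p1, p2 = sum(rater1) / n, sum(rater2) / n
        expected = p1 * p2 + (1 - p1) * (1 - p2)
        return (observed - expected) / (1 - expected)

    # Two experts judging the same 10 success criteria (1 = pass, 0 = fail):
    r1 = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
    r2 = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]
    print(f"kappa = {cohens_kappa(r1, r2):.2f}")
    # kappa = 0.52, despite 80% raw agreement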
2.6. Accessibility support for a specific Web technology, as required by WCAG 2.0, on a particular platform is influenced by the interoperability of particular assistive technologies (AT) with the browsers generally used by people with disabilities on that platform. The versions of AT and browsers are important in determining whether something is indeed accessibility supported, especially with respect to the Web 2.0 world and the use of new technologies like WAI-ARIA, HTML5, and AJAX. Rapidly changing versions do not make this determination simpler; they may quickly invalidate the results of recent accessibility testing. This may not affect ordinary non-dynamic HTML content as much.
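A hedged sketch of recording such determinations follows (the AT, browser, and support values are hypothetical, chosen only to show the shape of the data): accessibility support is a property of a (technology feature, AT version, browser version) triple, and any new release makes the corresponding entries stale.

    # Hypothetical support matrix; every entry would need real testing,
    # and a new AT or browser release invalidates the affected rows.
    SUPPORT = {
        ("WAI-ARIA live regions", "JAWS 12", "Internet Explorer 8"): False,
        ("WAI-ARIA live regions", "JAWS 12", "Firefox 7"): True,
        ("HTML5 video captions", "NVDA 2011.2", "Firefox 7"): False,
    }

    def accessibility_supported(feature, at, browser):
        # Unknown combinations call for fresh testing, not an assumption.
        return SUPPORT.get((feature, at, browser), "untested")

    print(accessibility_supported("WAI-ARIA live regions", "JAWS 12", "Firefox 7"))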
2.7. The reported violation count will also depend on how the testing process handles:
Based on the foregoing, it may be evident that:
5.1. The choice of accessibility evaluation methodology and Web accessibility metrics depends on the goal.
5.2. Dependable Web accessibility metrics can be produced only when evaluation processes identify valid violations reliably and consistently. Today, this is possible only through a combination of automated testing, manual testing and AT-based testing.
5.3. Over time, a tester must develop enough confidence in a tool’s abilities to know which violations reported by the tool are dependable. Testers may use more than one tool, but the intent should be to exploit the strengths of the different tools to obtain a comprehensive list of accessibility issues on a website.
5.4. When the intent is to compare accessibility levels across collections / pages or to predict accessibility levels, the evaluation typically relies solely on automated testing, partly because it is time-bound and the ability to vet results or perform additional manual tests is limited. The results would therefore most likely be incomplete. Attempting to relate such data to parameters like page size or the number of HTML elements (links, form controls, images, objects, tables, frames, etc. on the page) in order to formulate a more sophisticated accessibility ranking algorithm would produce a metric of questionable validity, as the sketch below suggests. Accessibility predictions based on such data will be of poor quality.
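For concreteness, here is the kind of naive ranking formula in question (all counts are hypothetical): normalizing an automated violation count by the number of potentially affected elements and mapping the rate onto a 1-5 scale. Because the violation count comes from incomplete automated testing, the resulting rank inherits that incompleteness.

    def naive_rank(violations, elements):
        """Map a failure rate to a 1-5 rank (5 = most accessible)."""
        checkable = sum(elements.values())  # links, images, controls, ...
        failure_rate = min(violations / checkable, 1.0) if checkable else 0.0
        return round(5 - 4 * failure_rate)

    page = {"links": 120, "images": 30, "form_controls": 12, "tables": 3}
    # 18 automatically detected violations over 165 elements -> rank 5,
    # even though undetected (false-negative) violations may abound.
    print(naive_rank(violations=18, elements=page))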
5.5. When the intent is to implement accessibility fixes, it is sufficient to:
Submitted on Nov 1, 2011 to public-wai-rd-submission@w3.org