This is a submission for the RDWG Symposium on Website Accessibility Metrics. It has not yet been reviewed or accepted for publication. Please refer to the RDWG Frequently Asked Questions (FAQ) for more information about RDWG symposia and publications.
Our contribution focuses on one fundamental aspect of website accessibility metrics: the adequate rating of content on a web page against WCAG success criteria (SC). We will discuss the problems of using a binary pass/fail rating approach, and propose a graded rating scale instead.
Real-life websites often show less-than-perfect accessibility – even sites that make an effort to be accessible. For example, the text of some alt attributes on a page may not be fully descriptive, some enumerations may not use HTML elements for lists, or some headings in an otherwise sensible headings hierarchy may not be marked up correctly.
For some SC, binary pass/fail ratings make sense. The language of a page (SC 3.1.1), for example, is either correctly defined or it is not – there is nothing in between. For most SC, however, there is no discrete flip-over point at which a “pass” turns into a “fail”.
The German BITV-Test, a web-based accessibility evaluation tool, demonstrates such a graded rating approach. We focus solely on its page-level rating scheme and will not address other aspects of our approach.
Difficulties related to the binary rating approach exist on a number of levels:
Relevant website accessibility metrics must reflect the complexity of real-life web content.
We argue that pass/fail ratings make evaluations less valid and less reliable. If a failure for a single instance means the SC is failed for the whole web page, hardly any website would ever conform to WCAG. And a reasonably accessible (but not perfect) web page would not necessarily achieve a better rating than a glaringly inaccessible page.
Using a pass/fail rating, the evaluator is often forced to be either too strict or too lenient. When rating a good but not quite perfect page against a particular SC, the evaluator must choose between failing the whole page because of one or two flawed instances among many good ones (too strict), or ignoring the flawed instances entirely (too lenient).
Different evaluators are likely to draw the line between pass and fail differently. With only two extremes to choose from, no amount of precision in the test procedure can ensure that individual evaluators will rate less-than-perfect content the same way.
Accessibility metrics should reflect the degree of success in meeting each success criterion. Therefore, we propose a rating system that is more granular than just pass/fail. Also, aggregating the granular page-level ratings to an overall website rating will reflect the overall level of accessibility more appropriately. This is why the BITV-Test uses a graded rating system.
BITV-Test checkpoints map to WCAG level AA SC. When testing against a particular checkpoint, evaluators assess the total set of applicable instances or patterns across a page and rate the overall degree of conformance on a graded Likert-type scale with five rating levels: from 100% for full conformance, to 0% for a clear failure.
Each assessment must reflect the criticality of individual flaws. When rating alt texts, for instance, a page with a crucial image-based navigation element with missing alt text would be rated as completely unacceptable (0%), whereas a page where just one of several teaser images has inadequate alt text would be rated as marginally acceptable (75%). In the latter case, the checkpoint would still contribute ¾ of its individual value to the overall score.
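The graded scheme described above can be sketched as a simple weighted aggregation. The following is a hypothetical illustration only – the checkpoint names, weights, and the `page_score` helper are our own assumptions for clarity, not the BITV-Test's actual implementation:

```python
# Hypothetical sketch of graded checkpoint scoring (not the actual
# BITV-Test implementation). The five rating levels correspond to the
# Likert-type scale described above: 100 %, 75 %, 50 %, 25 %, 0 %.

RATING_LEVELS = {1.0, 0.75, 0.5, 0.25, 0.0}

def page_score(ratings, weights):
    """Aggregate graded checkpoint ratings into a page score.

    ratings: dict mapping checkpoint name -> rating in RATING_LEVELS
    weights: dict mapping checkpoint name -> points the checkpoint is worth
    Returns the percentage of achievable points the page earned.
    """
    for r in ratings.values():
        if r not in RATING_LEVELS:
            raise ValueError(f"invalid rating level: {r}")
    earned = sum(ratings[c] * weights[c] for c in ratings)
    achievable = sum(weights[c] for c in ratings)
    return 100 * earned / achievable

# Example: the alt-text checkpoint is rated 75 % (one flawed teaser image),
# so it still contributes three quarters of its weight to the page score.
ratings = {"alt_texts": 0.75, "headings": 1.0, "language": 1.0}
weights = {"alt_texts": 3, "headings": 2, "language": 1}
print(round(page_score(ratings, weights), 1))  # → 87.5
```

Note that criticality is expressed through the rating level itself: a crucial flaw (such as missing alt text on image-based navigation) maps the whole checkpoint to the 0 % level rather than being averaged away among the good instances.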
The reliability of a test procedure can also be expressed as the degree of replicability. Would another evaluator arrive at the same result?
The BITV-Test is often conducted as a tandem test: two evaluators complete a test based on the same page sample independently, and then run through all the checkpoints they have rated differently and agree on the final rating. This so-called arbitration phase helps detect oversights and corrects both too-lenient and too-strict ratings.
Our experience shows that a five-level graded rating system is highly reliable. What is still needed are better criteria for determining which violations are critical enough to fail conformance, independent of the aggregated site rating.