This is a submission for the RDWG Symposium on Website Accessibility Metrics. It has not yet been reviewed or accepted for publication. Please refer to the RDWG Frequently Asked Questions (FAQ) for more information about RDWG symposia and publications.
The availability of resources such as guidelines, evaluation tools and methods for web accessibility evaluation does not seem to be enough to provide people with disabilities with accessible Web content. One of the most salient problems is the lack of awareness of the interaction context (Sloan, 2006). Although it is of utmost importance, guidelines mention it only vaguely and tools do not support it. To tackle this problem we developed a flexible framework to evaluate accessibility taking the interaction context into consideration (Vigo et al., 2007b). As far as metrics are concerned, we hypothesize that if the evaluation reports we obtain are context-tailored, so will the resulting accessibility metrics be.
If we consider that the interaction context of users is constrained by the content, accessing device, user agent, assistive technology or situation, Web accessibility evaluation and measurement should take these aspects into account. Current accessibility metrics rely on traditional accessibility guidelines such as WCAG or Section 508, which target all types of users. Normally, for a more precise assessment of specific user groups (blind, deaf, elderly, physically impaired, etc.), only those guidelines that target a given user group are chosen. As far as accessing devices are concerned, the Mobile Web Best Practices aim at guiding developers to build Web interfaces that provide a satisfactory user experience on mobile devices. Since mobile devices differ in their physical features, software support and interaction modalities, a common denominator device, namely the Default Delivery Context, is proposed by the Mobile Web Initiative.
Grouping guidelines according to the disabilities they target, or having a reference device, is a first step towards the specialization of accessibility metrics; but this direction can be pursued further. Metrics based on traditional guideline groupings are ad hoc solutions that cannot be applied if the interaction context is to be considered.
WAQM (Vigo et al., 2007) is too tied to WCAG, so we decided to adopt a more flexible approach: the Logic Scoring of Preference (LSP) method (Dujmovic, 1996), an aggregation model that computes a global score from intermediate scores. These intermediate scores consist of failure rates or absolute numbers of accessibility problems. LSP is formulated as follows:
E = (W1·E1^p(d) + .. + Wi·Ei^p(d) + .. + Wn·En^p(d))^(1/p(d))
where the evaluation results produced by individual metrics are a set of normalized scores E1, .., En, with 0 ≤ Ei ≤ 1. When the evaluated components have different impacts, positive normalized weights W1, .., Wn are associated with each evaluation result, where 0 ≤ Wi ≤ 1 and sum(Wi) = 1. Values of p(d) are predefined elsewhere (Dujmovic, 1996) and are selected according to the required logical relationship between the elements of the system, i.e., different degrees of conjunction and disjunction. The output of the p(d) function depends on the number of elements to measure and on d, the degree of disjunction. The value of d ranges from total disjunction (d = 1), through the arithmetic mean (d = 0.5), to conjunction (d = 0), in steps of 1/16. When simultaneity in satisfying the requirements is necessary, conjunction is applied; conversely, when it suffices that any single subcomponent succeeds, disjunction is applied. Normally, intermediate values are preferred, as the extreme cases rarely apply. These intermediate ranges are 0 < d < 0.5 for quasiconjunctions and 0.5 < d < 1 for quasidisjunctions. Depending on the value of d, the relationships between elements can be weak, medium or strong.
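As a minimal sketch (not the authors' implementation), the aggregation can be written in Python directly from the formula; the guard for zero scores reflects the conjunctive limit of the weighted power mean:

```python
def lsp(scores, weights, p):
    """Logic Scoring of Preference: E = (sum_i Wi * Ei^p)^(1/p).

    scores  -- normalized evaluation results Ei, 0 <= Ei <= 1
    weights -- positive normalized weights Wi, sum(Wi) == 1
    p       -- exponent p(d) chosen for the desired degree of
               conjunction/disjunction (Dujmovic, 1996)
    """
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    # With a negative exponent (conjunctive operators), Ei = 0 makes
    # Ei^p diverge; the limit of the aggregate is 0, so return it directly.
    if p < 0 and any(e == 0 for e in scores):
        return 0.0
    return sum(w * e ** p for w, e in zip(weights, scores)) ** (1.0 / p)

# p = 1 reduces LSP to the weighted arithmetic mean (d = 0.5):
print(lsp([0.8, 0.4], [0.5, 0.5], 1.0))  # 0.6, up to float rounding
```

Note that the zero-score guard is also a practical necessity: in Python, raising 0.0 to a negative power raises ZeroDivisionError rather than returning infinity.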
Let us assume that for the "provide alternative text to pictures" technique, an evaluation tool defines two test cases: one checks the existence of the alt attribute and the other checks whether its value is adequate. Let us also assume that both tests have the same weight and that we select the appropriate value for the strong quasiconjunction (see Dujmovic (1996)), in this case -3.15. Suppose there is an alt attribute for the picture (thus score1 = 1) but its content is empty (hence score2 = 0). To satisfy the checkpoint both test cases have to be successful, so we apply the strong quasiconjunction for simultaneity, obtaining:
E = (0.5·1^(-3.15) + 0.5·0^(-3.15))^(1/-3.15) = 0
In this case E will be high only if both values are high.
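This worked example can be reproduced numerically. Below is a sketch of a two-input, equal-weight LSP aggregate (the helper name is ours, not from the paper):

```python
def lsp2(e1, e2, p):
    """Two-input LSP aggregate with equal weights:
    E = (0.5*e1^p + 0.5*e2^p)^(1/p)."""
    if p < 0 and (e1 == 0 or e2 == 0):
        return 0.0  # conjunctive limit: a zero score annuls the aggregate
    return (0.5 * e1 ** p + 0.5 * e2 ** p) ** (1.0 / p)

# Strong quasiconjunction for two inputs (p = -3.15, as in the example):
print(lsp2(1.0, 0.0, -3.15))  # alt present but empty: 0.0
print(lsp2(1.0, 1.0, -3.15))  # both test cases pass: 1.0
```

A partially adequate alt value, e.g. lsp2(1.0, 0.5, -3.15), yields a score pulled towards the weaker of the two inputs, which is exactly the penalizing behavior the strong quasiconjunction is chosen for.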
It is not straightforward to adjust the LSP values to a specific interaction context; this requires a trial-and-error strategy until reasonable values are found.
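One way to mechanize such a trial-and-error search is sketched below. The calibration targets and the candidate exponents are purely illustrative assumptions (they are not from the paper or from Dujmovic's tables); the idea is simply to sweep candidate p values and keep the one whose aggregates best match scores judged appropriate for the context:

```python
def lsp(scores, weights, p):
    """LSP aggregate: E = (sum_i Wi * Ei^p)^(1/p), with conjunctive limit."""
    if p < 0 and any(e == 0 for e in scores):
        return 0.0
    return sum(w * e ** p for w, e in zip(weights, scores)) ** (1.0 / p)

# Hypothetical calibration set: (test-case scores, aggregate score judged
# appropriate for this interaction context). Both columns are assumptions.
calibration = [
    ([1.0, 0.5], 0.6),
    ([0.9, 0.9], 0.9),
    ([1.0, 0.0], 0.0),
]
weights = [0.5, 0.5]

# Sweep candidate exponents (illustrative values only; p = 0 is excluded
# since 1/p is undefined) and keep the lowest squared error.
candidates = [-8.0, -3.15, -1.0, 0.5, 1.0, 2.0]
best_p = min(candidates, key=lambda p: sum(
    (lsp(s, weights, p) - target) ** 2 for s, target in calibration))
print(best_p)
```

With these hypothetical targets the search settles on a strongly conjunctive exponent, as one would expect when a single failing test case is meant to annul the score.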
This measurement method was deployed in two specific contexts, where LSP was adapted to the particular characteristics of the guideline sets involved.
Taking advantage of the framework's device-tailored accessibility reports, 102 pages (desktop versions and their mobile counterparts) were evaluated considering the specific characteristics of two distinct mobile devices. Results show that higher scores are obtained for pages designed for mobile devices, and that Web pages score higher on better-featured devices. These results may seem obvious, but they shed some light on the validity of the metric.
Then, 20 users were asked to conduct a search-by-navigation task on 9 web pages of different accessibility levels using a PDA. We found that task completion time and satisfaction were strongly correlated with the scores produced by non-device-tailored metrics, and this correlation was even higher with device-tailored metrics. This means that automatic conformance to guidelines entails higher usability levels even for non-contextual assessment, although device-tailored assessment correlates more strongly than non-tailored assessment. This goes against the common belief that tool conformance does not entail usability.
Similarly, 16 screen reader users took part in an experiment where links were annotated with the accessibility score of the page each link points to. They conducted two tasks on two websites containing directories of 10 pages: browsing by navigating (idle browsing) and searching by navigating (with a specific target). Results suggest that there was no agreement about the accessibility scores, although users believe that the scores reflect the perceived accessibility level to some extent. Based on comments made by users, we also concluded that the annotation technique prevails over the scores themselves. None of the users chose the path of the most accessible links. Surprisingly, however, we found that users stopped browsing sequentially and browsed according to their preferences within the subset of more accessible pages.
The outcomes of the above research studies led us to make some theoretical proposals on adaptive accessibility metrics as a natural next step (Vigo and Brajnik, 2010).