Submission for the RDWG Symposium on Website Accessibility Metrics

This is a submission for the RDWG Symposium on Website Accessibility Metrics. It has not yet been reviewed or accepted for publication. Please refer to the RDWG Frequently Asked Questions (FAQ) for more information about RDWG symposia and publications.

A zero in AChecker means 10 in eXaminator: a comparison between two metrics by their scores

1. Problem Addressed

A zero on the scale of the Accessibility Checker validator [1] is, in theory, the same as a 10 on the scale of the eXaminator validator [2]: both mean that the page passed the full battery of tests and, in principle, correspond to good practice. But to what extent do the two metrics coincide in their judgments and scores? What type of correlation can be inferred from the scores? Is one metric stricter than the other? Do they tend to be more lenient with lower scores and more conservative in assigning higher ones?

2. Background

When, in 2005, UMIC - Knowledge Society Agency of the Ministry of Science, Technology and Higher Education put the automatic validator eXaminator online, perhaps the first non-experimental validator based on a quantitative metric, the users' reaction was interesting: accessibility experts reacted badly ("accessibility cannot be reduced to a score"), but almost everyone else, people without a comprehensive and authoritative knowledge of accessibility, which is to say almost all of our target audience, understood the score quite clearly: a score from 0 to 10, where 10 means good practice.

Deep down, however, eXaminator has strong roots in the expert manual evaluations of a team that has been doing such evaluations since 2000. It was based on the WAI checkpoint lists, but it soon became necessary to assign star ratings to the analysed practices in order to build rankings of sites. Unlike metrics such as WAQM - Web Accessibility Quantitative Metric, which seeks a failure rate for each page, or the UWEM accessibility metric, which seeks a failure rate for each checkpoint [4], eXaminator performs an exercise equivalent to that of an expert evaluator.

For each test, the eXaminator metric aggregates the results (errors or positive practices) into occurrences, always in relation to a single page. It then assigns to each occurrence, as a teacher would, a score that tries to reflect both the seriousness of the occurrence found and the effort the designer/developer needs to make to reach good practice [see Table 1].

Table 1: The occurrences of a test - alternative text in images.

Test i                     | Occurrence j                                      | Score (Xij)
Alternative text on images | All images on the page have an alt                | 10
                           | There is an image on the page without alt         | 3
                           | There are several images on the page without alt  | 1
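
As an illustration, the per-occurrence scoring of Table 1 can be sketched in Python as a small function. This is a hypothetical reconstruction for this paper: the names, and the treatment of pages without images, are our assumptions, not eXaminator's actual code.

    def score_alt_text_test(total_images, images_without_alt):
        """Score the test 'alternative text on images' for one page,
        following the occurrence scores (Xij) of Table 1."""
        if total_images == 0:
            return 10  # assumption: nothing to fail, treated as a pass
        if images_without_alt == 0:
            return 10  # all images on the page have an alt
        if images_without_alt == 1:
            return 3   # there is an image on the page without alt
        return 1       # there are several images on the page without alt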

Thus, the eXaminator metric responds faithfully to the W3C definition of conformance [5] and never loses sight of the unit of conformance: the page. This becomes even more important when we try to analyse data at levels other than the page, at the level of a website or of an aggregation of websites. We can then express the data as "in X% of pages 'there is an image on the page without alt'" [see Figure 1].

Another curious thing to notice is that, using the eXaminator metric, the results are often not the expected ones. The following is an assumption made by the W3C in the background information of this symposium:

"(...) For example, is a web page with two images with faulty text alternatives out of ten more accessible than another page with only one image with a faulty text alternative out of five? "

eXaminator's response to this type of occurrence is unconventional: the page with only one image with a faulty text alternative would be rated as a more positive practice (Xij = 3) than the one with two images without alternative text (Xij = 1). For a designer/developer, which page would require less effort to reach conformance? The answer is obvious: the page with only one image left to caption, regardless of the total number of images on the page.
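
Applied to the W3C example above, the sketch from earlier in this section gives:

    # Page with two faulty images out of ten vs. one faulty image out of five:
    score_alt_text_test(total_images=10, images_without_alt=2)  # -> 1
    score_alt_text_test(total_images=5, images_without_alt=1)   # -> 3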

Our analyses [6] [7] show that metrics that count failure rates per checkpoint produce systematically more positive (inflated) results than those that respect the W3C concept of conformance [see Figure 1].

Figure 1: results of analysing images in the study [6] (Portuguese Municipalities 2009)


Description of Figure 1: on the left, a pie chart shows a 20/80 split, where "on 20% of pages, all images have an alt". The pie chart on the right shows a 38/62 split, where "38% of images have an alt". According to the W3C definition of conformance, the statement "on 20% of pages, all images have an alt" is the appropriate way to express conformity. Both charts come from the same sample, so it is evident that the results by element (image) are inflated.

The metric of aggregating results by occurrences was also tested in the study [8] presented in 2005 during the UK Presidency of the EU. It was then possible, in study [6], to do some comparative analysis by HTML element between the UK study and the data collected by eXaminator for the Portuguese municipalities:

"By comparing some of its metrics collected in the present study we found that the nature of faults found and its extension are similar. The failure of alternative text in the picture is slightly more severe in Portuguese municipalities than that detected in the public services of Member States of the EU: 80% vs 64%, the use of alt in the areas of image maps is slightly better: 40% vs 50 %; pages without headings is much worse in Portuguese municipalities: 74% vs 28%; HTML errors: 90% vs 100% and equal levels of depricated code in the order of 95%."

In addition to the subjective scores (Xij) assigned, the final result of each test is weighted according to the priority level of the checkpoint to which it relates.

Table 2: Weighting of tests by priority

Tests related to...        | Wi
checkpoints of priority 1  | 10
checkpoints of priority 2  | 8
checkpoints of priority 3  | 6

In the eXaminator metric for WCAG 1.0, all 61 tests are based on the same formula, expressed in the figure below:

Figure 2: formula to calculate the global index of the eXaminator metric, web@x (web at eXaminator)

web@x = Σi (Xij × Wi) / Σi Wi

Xij represents the score corresponding to the occurrence j found in a given test i (see Table 1 for an example).

Wi corresponds to the weight of each test according to the priority level of the checkpoint to which it relates (see Table 2).
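
A minimal sketch of this aggregation, assuming each test contributes one occurrence score for the page (the function name and input format are ours, for illustration only):

    def webax(tests):
        """Global index web@x = sum(Xij * Wi) / sum(Wi), where each item
        of 'tests' is a pair (Xij, Wi): the occurrence score of test i on
        the page and the weight of its checkpoint priority (10, 8 or 6)."""
        return sum(x * w for x, w in tests) / sum(w for _, w in tests)

    # Example: two priority 1 tests scoring 3 and 10, one priority 2 test scoring 1:
    webax([(3, 10), (10, 10), (1, 8)])  # -> (30 + 100 + 8) / 28 = 4.93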

3. Strategy

To test for correlation between the overall scores of the Accessibility Checker (UWEM 1.2 metric) and eXaminator, we turned to a bank of pages collected in January 2010, corresponding to the home pages of the 308 Portuguese municipalities [9]. In eXaminator we only used the priority 1 and 2 checkpoints of WCAG 1.0; originally, eXaminator also had tests for some priority 3 checkpoints.

Through eXaminator we filtered out special cases: pages with basic errors (more than one <body> element or more than one <head> element), pages made entirely in Flash, pages based on frames, pages with iframes, and pages with few elements (fewer than 50 HTML elements). With these pages filtered out, we were left with 254 pages; these are what, in eXaminator, we call pages with typical cases. The filter can be sketched roughly as shown below.
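
This is our reconstruction of the filter; the Flash heuristic in particular is a simplification of whatever eXaminator actually checks.

    from html.parser import HTMLParser

    class TagCounter(HTMLParser):
        """Count the start tags that the 'typical page' filter needs."""
        def __init__(self):
            super().__init__()
            self.counts = {}

        def handle_starttag(self, tag, attrs):
            self.counts[tag] = self.counts.get(tag, 0) + 1

    def is_typical_page(html):
        counter = TagCounter()
        counter.feed(html)
        c = counter.counts
        if c.get("body", 0) > 1 or c.get("head", 0) > 1:
            return False  # basic structural errors
        if c.get("frame", 0) or c.get("frameset", 0) or c.get("iframe", 0):
            return False  # frame- or iframe-based pages
        if c.get("embed", 0) or c.get("object", 0):
            return False  # crude stand-in for 'page made completely in Flash'
        if sum(c.values()) < 50:
            return False  # too few HTML elements
        return True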

We then ordered the Accessibility Checker series and made a graphical observation of both scales, drawing a line corresponding to a linear distribution (trendline) [see the Outcomes section of this paper].

To represent both scores on the same scale, we transformed AChecker's scale, originally 0-1, into a scale of 0 to 10, where 1 in AChecker corresponds to 0 in eXaminator and 0 in AChecker corresponds to 10 in eXaminator. For this purpose we used the following formula:

x = (1 - y) × 10

where x is the score on the eXaminator scale and y is the score on the AChecker scale.

Finally, we calculated the Pearson correlation coefficient between the distributions of AChecker and eXaminator scores.
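
Putting the rescaling and the correlation together (the score lists below are hypothetical placeholders, not our data; scipy.stats.pearsonr computes the coefficient):

    from scipy.stats import pearsonr

    achecker = [0.0, 0.1, 0.35, 0.6, 0.9]    # hypothetical UWEM scores, 0 = best
    examinator = [9.5, 8.0, 6.0, 4.5, 2.0]   # hypothetical web@x scores, 10 = best

    # Rescale AChecker to eXaminator's orientation: x = (1 - y) * 10
    achecker_rescaled = [(1 - y) * 10 for y in achecker]

    r, _ = pearsonr(achecker_rescaled, examinator)
    # pearsonr(achecker, examinator) would give exactly -r.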

4. Major Difficulties

The way the two metrics rank their tests makes them very difficult to compare. At times it seems the tools are looking at completely different pages.

5. Outcomes

By observing the two score distributions graphically, with the AChecker distribution ordered, it is possible to verify that there is "some" correlation between the two series. Fitting a linear trendline to each series allows us to conclude that eXaminator's way of scoring results in more conservative scores than AChecker's: at the lower values the difference is in the order of 2.5 points, while at the higher values it is in the order of 5 points.

The calculation of Pearson's correlation coefficient between the two distributions gives r = 0.41, which indicates a moderate positive correlation. If we work with the original AChecker distribution (i.e. 0 to 1), Pearson's coefficient is negative, indicating a negative correlation between the two series: good practice in AChecker means going from 1 to 0, while the same good practice in eXaminator means going from 0 to 10. Since the rescaling is linear, it flips the sign of the coefficient without changing its magnitude.

But since both metrics are based on the priority 1 and 2 checkpoints of WCAG 1.0, one would expect values of r closer to 1.

Figure 3: a graphical comparison between the score distributions of eXaminator and AChecker
(the AChecker scores are ordered)


6. Open Research Avenues

Using the same sample and comparing the scores of eXaminator (WCAG 1.0) with those of AccessMonitor (WCAG 2.0), in operation for the Portuguese Public Administration in beta since early 2011 [3], we obtain r = 0.949, which shows a strong correlation between the two metrics. We need quantitative metrics based on WCAG 2.0 to compare with our new metric, which uses not a single formula for all checkpoints but four types of formulas to score the success criteria according to their nature.

References

  1. Accessibility Checker validator: http://accessibility.egovmon.no/en/pagecheck/
  2. eXaminator validator (in Portuguese): http://www.acesso.umic.pt/webax/examinator.php
  3. AccessMonitor validator (in Portuguese): http://www.acesso.umic.pt/accessmonitor/
  4. Brajnik, G. (2007). Effects of sampling methods on web accessibility evaluations. Proceedings of ASSETS'07. DOI: 10.1145/1296843.1296855
  5. WCAG 1.0 - Conformance chapter: http://www.w3.org/TR/WCAG10/#Conformance
  6. Fernandes, J. (2009). Web Content Accessibility of the Portuguese Municipalities 2009 - analysing the first page (in Portuguese). Lisboa: SUPERA - Sociedade Portuguesa de Engenharia de Reabilitação e Acessibilidade. September 2009. http://www.supera.org.pt/index.php/actualidades/the-news/27-news/63-municipios09.html
  7. Fernandes, J. (2011). Study "Compliance of Portuguese Public Administration websites with WCAG 1.0 - 2008-2010". http://www.apdsi.info/uploads/news/id410/jorge%20fernandes.pdf
  8. UK Presidency of the EU 2005 (November 2005). eAccessibility of public sector services in the European Union. Retrieved October 20, 2007, from http://fastlink.headstar.com/coi2
  9. Portuguese Municipalities (database of home pages frozen in January 2010): http://www.acesso.umic.pt/cm/