Re: After today's call (Comments to John Foliot's alternative scoring proposal)

Hi John, hi all,

I have gone through your slides, John, and written some comments (see 
below). Thanks for the tons of time and effort that must have gone into 
this - even though I favour a different approach.

*Slide 3: Questioning the role of WCAG*
The focus on measurable outcomes is contrasted here with usability. I 
think it needs to be held against all those criteria that are not 
usability-related but cannot be measured so easily. Your assumption here 
seems to be that measurable = repeatable without a subjective dimension, 
as in ACT tests.
The problem is that there is subjectivity in many SC PASS/FAIL 
assessments, both in addressing the quality and the quantity of instances.
Simply defaulting to FAIL by ignoring quantity (say, failing a site with 
100 images when two images have insufficient alternative text) is the 
easy solution for an automated approach, but fails to reflect the actual 
criticality of the problem to people with disabilities.
Failing to account for critical errors (say, the menu icon with empty 
alt vs. a hundred teaser images with missing or poor alt that are each 
followed by a meaningful link and therefore do not constitute a 
significant barrier) leads to results that may be consistent / 
repeatable BUT (often) do not reflect criticality. The issue IMO is not 
subjectivity in addressing usability, the issue is subjectivity in 
assessing the criticality of a11y failures - from show-stopper to 
negligible.

*Slide 6: Scoring Tests Based on Functional Categories Part 3*
One way to introduce a weighting of issues across the functional 
categories affected would be to aggregate how particular issues 
negatively affect the collection of outcomes when these are specific to 
functional categories.
In a subtractive scheme where the sum of outcomes = 100%, an issue 
affecting several outcomes (say, a custom menu element affecting the 
three hypothetical outcomes 1. keyboard-focusable, 2. AT-focusable, 3. 
name, role, state exposed to AT) would thereby negatively impact the 
score (by subtraction) more strongly than an issue affecting just one 
outcome.
When a particular issue negatively impacts several functional 
categories, this could simply be reflected by failing a greater number 
of specific outcomes belonging to these categories. In the case above, 
"focusable by keyboard" and "focusable by AT" should be separate 
outcomes so that they can reflect the different functional categories in 
a granular way and aggregate the overall impact. So this would take care 
of reflecting how "valuable" a particular requirement is.
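
To make the arithmetic concrete, here is a minimal sketch in Python (the 
outcome names and the total are made up for illustration, not taken from 
your slides):

# Assume 50 applicable outcomes for the technology, so each is worth 2 points.
TOTAL_OUTCOMES = 50
POINTS_PER_OUTCOME = 100 / TOTAL_OUTCOMES

# A broken custom menu fails three outcomes at once ...
menu_issue = ["keyboard-focusable", "AT-focusable", "name-role-state-exposed"]
# ... while a single image with a bad alt fails only one.
image_issue = ["text-alternative-available"]

menu_penalty = POINTS_PER_OUTCOME * len(menu_issue)    # 6 points subtracted
image_penalty = POINTS_PER_OUTCOME * len(image_issue)  # 2 points subtracted
score = 100 - menu_penalty - image_penalty             # 92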

*Slide 8: Scoring Tests Based on Functional Categories*
Instead of scoring points for applicable categories (which will vary 
strongly over different media / technologies) I suggest *subtracting* 
points when individual outcomes FAIL, plus keeping track of CRITICAL 
FAILS separately, to be able to reflect criticality for the user. A 
particular technology (say, the HTML/CSS/JS stack) has a total number of 
theoretically applicable outcomes. This is the base number from which 
individual failed outcomes are subtracted. For not applicable (N.A.) 
cases, nothing is subtracted.
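
A rough sketch of what I have in mind, in Python (the outcome names and 
the total number of applicable outcomes are invented for illustration 
only):

# Outcomes theoretically applicable to the technology (say, the HTML/CSS/JS stack)
APPLICABLE_OUTCOMES = 80

# Per-outcome results (excerpt; outcomes not listed are assumed to pass):
# "pass", "fail", "critical-fail" or "n.a."
results = {
    "keyboard-focusable": "critical-fail",
    "text-alternative-available": "fail",
    "page-language-exposed": "pass",
    "captions-provided": "n.a.",   # no media present: nothing is subtracted
}

failed = [o for o, r in results.items() if r in ("fail", "critical-fail")]
critical = [o for o, r in results.items() if r == "critical-fail"]

score = 100 * (1 - len(failed) / APPLICABLE_OUTCOMES)
# Critical fails are reported alongside the score rather than buried in it.
report = {"score": round(score, 1), "critical_fails": critical}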

*Slide 10: Scoring Tests Based on Functional Categories Part 7*
Some ACT atomic tests may map onto exactly one outcome; in other cases, 
several ACT tests may need to PASS in order for one outcome to PASS.
When a platform does not support specific semantics, the corresponding 
outcome is not part of the total set of outcomes for this technology 
(obviously this can change, for example, when native apps do reflect 
heading hierarchy - as they now do for embedded web views).
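
As a sketch of that mapping in Python (the test names below are 
illustrative stand-ins, not actual ACT rule IDs):

# Results of individual ACT-style atomic tests on the sample
act_results = {
    "image-has-non-empty-accessible-name": True,
    "image-accessible-name-is-descriptive": False,
    "html-page-has-lang-attribute": True,
}

# An outcome passes only if every ACT test mapped to it passes.
outcome_to_act_tests = {
    "text-alternative-available": [
        "image-has-non-empty-accessible-name",
        "image-accessible-name-is-descriptive",
    ],
    "page-language-exposed": ["html-page-has-lang-attribute"],
}

outcome_results = {
    outcome: all(act_results[test] for test in tests)
    for outcome, tests in outcome_to_act_tests.items()
}
# -> {'text-alternative-available': False, 'page-language-exposed': True}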

*Slide 11: A Return to Principles (consistency & equity)*
I find equating the 4 Principles, useful as they may still be, with 
"equity based on user needs" a bit forced. The Principles contain 
various requirements that are relevant for different user needs. I think 
drawing in user needs / functional categories is enough to provide a 
more fine-grained reading of an overall conformance score. When looking 
at the total score, or at outcomes split into functional categories, it 
will be clear whether issues (= negative outcomes) fall more into the 
area of AT element semantics, graphic aspects, cognitive aspects, etc.
Consistency with WCAG 2.X must be weighed against the internal 
coherence of the new conformance and scoring scheme. Today, we see many 
overlaps between Principle 1 (say, 1.1.1 Non-text Content) and Principle 
4 (say, 4.1.2 Name, Role, Value) that go far beyond Robustness. I think 
the Principles as overarching structure should be replaced by the 
reference to functional categories. Keeping both will create confusion.

*Slide 13: A Return to Principles: Treat each Principle as an equal 
part of a total conformance*
I am not sure whether the proposed division corresponds in any way to 
the impact of a11y issues across the outcomes affected. Maybe it does, 
but I find the summary split into Principles arbitrary and unnecessary. 
The many overlaps of issues in terms of outcomes affected mean that a 
single allocation to one Principle is often not possible anyway. An 
issue such as the wrong role of a menu icon (say, it's just a div) may 
affect not just Principle 4 Robustness but also Principle 2 Operability 
(cannot be operated by speech input, cannot be keyboard-focused) and 
Principle 1 Perceivability (the name may not be rendered to AT since an 
aria-label on a div may not be exposed). So the very progress of 
technology has made the separation by Principle increasingly awkward as 
a superstructure.

*Slide 14: Using Profiles for Testing*
As mentioned above, I think the technology profiles are just a result 
of the set of outcomes that are theoretically applicable to them. In a 
subtractive scheme, this will mean that any test (e.g. an ACT test) that 
cannot be carried out due to the lack of elements to which it could 
apply would not contribute any failed outcomes. A simple page with a bit 
of text cannot fail as many things, and as badly, as a highly complex 
interactive or media-rich page. Which outcomes are applicable to a 
particular technology seems a better measure than the static allocation 
of a fixed share of the overall score to a particular Principle.

*Slide 15: Measuring the Unmeasurable*
To me, this seems driven by an emphasis on what can be measured 
repeatedly and unambiguously (preferably in an automated fashion), 
disregarding what *should* be measured if the impact on the user is the 
main concern. In the common situation where both quantitative and 
qualitative aspects contribute to the rating of an SC / an outcome on a 
page, turning a blind eye to the qualitative aspect (how important is 
the missing alt on an element? Is it the main navigation control or a 
logo in the footer? - contextual aspects) fails to account for the 
actual impact on the user and fails to allow for tolerances (PASS when 
the impact is very low). I think subjective disagreements in rating are to
some extent unavoidable - it is necessary to keep these low by offering 
finer grades of measurement, and by *managing*, rather than artificially 
excluding, the subjective factor.
But this may be what is often called a "philosophical issue".

*Slide 17: Protocols and Assertions*
I cannot tell to what extent the ability to "collect points" by making 
assertions / buying in to protocols is likely to be abused. Larger 
companies with significant internal marketing resources will no doubt 
latch on to this, but it will often be difficult and time-consuming to 
check what this professed adherence means in practice. In my view, a 
conformance assessment should be based on a factual check of the test 
object that can be documented and verified by others, not on assertions 
that are difficult to understand and verify, and that may not even have 
affected the accessibility of the content at the time a conformance 
claim for it is made.

*Slide 25: Adding It All Up (abandon user flows and 'happy paths')*
"This proposal no longer attempts to measure user flows or ‘happy 
paths’, as it is impossible to predict user behavior in a consistent 
fashion".
I think the addition of tasks is valuable when teasing out critical 
functionality in complex applications and distinguishing it from 
secondary or less critical content. Defining an exact path to be 
evaluated is precisely the prediction of a particular consistent 
behaviour, and it makes it possible to apply the outcome tests without 
getting lost in the large number of possible permutations. I guess it is 
excluded here because it cannot easily be automated without significant 
effort spent on scripting these paths?
I think the option in the Silver conformance approach so far (if I 
understand it correctly) that the scope can be set to a particular 
critical path, with the steps on this path then belonging to the test 
sample, is valuable and should not be abandoned, but reconciled with the 
extant page-based conformance approach (which is certainly not easy). 
The quoted complexity of scoring mechanisms seems orthogonal to the 
decision whether or not critical paths can be defined as the scope of a 
conformance claim.

*Slide 26: Adding It All Up (don't count instances of failures)*
I agree with getting away from counting instances and looking at the 
(aggregate) view level instead, but I assume that exactly this makes it 
necessary to apply a measured judgement that accounts for the quantity 
AND the quality of instances in a view when rating (see my comments on 
slide 15 above). Automated assessments such as ACT tests can speed up 
this process, but a value judgement remains to be made to do justice to 
the actual impact of an a11y issue on the user. (And in terms of ACT 
tests, this is often the human part of an overall test.)
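
One conceivable shape of such a view-level judgement, sketched in Python 
(the thresholds and severity labels are arbitrary placeholders, not a 
worked-out proposal):

def rate_outcome_for_view(failed_instances, total_instances, worst_severity):
    """Rate one outcome on one view as 'pass', 'fail', 'critical-fail' or 'n.a.'.

    failed_instances / total_instances would typically come from automated
    (e.g. ACT) checks; worst_severity ('negligible', 'moderate', 'blocking')
    is the human judgement on the worst instance found in context."""
    if total_instances == 0:
        return "n.a."
    if worst_severity == "blocking":          # e.g. the menu icon with empty alt
        return "critical-fail"
    failure_rate = failed_instances / total_instances
    if failure_rate <= 0.05 and worst_severity == "negligible":
        return "pass"                         # tolerance when impact is very low
    return "fail"

# Two bad teaser alts out of 100 images, judged negligible in context -> "pass"
print(rate_outcome_for_view(2, 100, "negligible"))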


*Slide 27: Adding It All Up (measurable requirements)*
I am not sure I understand what is meant by "This proposal now suggests 
that a Critical Failure at the view level will have less impact on the 
overall score, as it constrains the failure to its source view." 
Critical Failures only appear here, at the very end of the presentation 
(or I missed them somewhere), so to me it is unclear what role, if any, 
they ought to have in this proposed alternative scoring scheme. Could it 
be that they are played down because automated checks may not (yet) be 
that good at flagging criticality?
When the unit at which a Critical Failure is registered is the view, it 
remains to be seen what that means for the overall score of a site or 
system under test. It would certainly make a difference whether the 
critical failure occurs in a view at the end of a critical process 
(i.e., is a show-stopper) or in some tangential part. Contextual factors 
also matter - can the user work around a failure or not?

*Slide 28: Adding It All Up: Testing and Scoring Process: Keep It Simple!*
In spite of the assurance to the contrary, the scoring process proposed 
here does not strike me as simple. Whether the division of the points 
score per view by the number of views evaluated arrives at a useful 
number remains to be seen, especially given that aspects that have 
nothing to do with the number of views (the protocols and assertions) 
are allowed to contribute.
In my view, the idea of 100% or 100 points, or whatever - i.e. an 
idealised optimum that sites can approach - remains the measure that is 
easiest to understand. Given that the number of applicable outcomes 
varies by content technology, the total sum of relevant outcomes per 
technology should make up the 100%. Any subtractions will show how far 
away the site is from the optimum, and what improvements will be 
necessary. Breaking that down by functional category makes it clearer 
which user group is mainly affected and where the bulk of the work will 
be.
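
A last sketch of how that breakdown could look, in Python (the 
functional category tags and figures are invented purely for 
illustration):

# Applicable outcomes for the technology define the 100% baseline; each
# outcome is tagged with the functional categories it mainly serves.
outcomes = {
    # outcome name: (failed?, functional categories mainly served)
    "keyboard-focusable":      (True,  ["without-vision", "limited-manipulation"]),
    "name-role-state-exposed": (True,  ["without-vision"]),
    "sufficient-contrast":     (False, ["limited-vision"]),
    "consistent-navigation":   (False, ["cognitive"]),
}

overall = 100 * (1 - sum(failed for failed, _ in outcomes.values()) / len(outcomes))

per_category = {}
for failed, categories in outcomes.values():
    for cat in categories:
        passed, total = per_category.get(cat, (0, 0))
        per_category[cat] = (passed + (not failed), total + 1)

breakdown = {cat: f"{100 * p / t:.0f}%" for cat, (p, t) in per_category.items()}
# overall == 50.0; breakdown e.g. {'without-vision': '0%', 'limited-vision': '100%', ...}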

Detlev

-- 
Detlev Fischer
DIAS GmbH
(Testkreis is now part of DIAS GmbH)

Mobile +49 (0)157 57 57 57 45

http://www.dias.de
Consulting, testing and training for accessible websites

Received on Thursday, 12 August 2021 16:36:23 UTC