- From: Detlev Fischer <detlev.fischer@testkreis.de>
- Date: Thu, 12 Aug 2021 18:36:05 +0200
- To: w3c-wai-gl@w3.org
- Message-ID: <b54aa7c4-36e1-84a9-f7f9-6386739130f7@testkreis.de>
Hi John, hi all,

I have gone through your slides, John, and written some comments (see below). Thanks for the tons of time and effort that must have gone into this - even though I favour a different approach.

*Slide 3: Questioning the role of WCAG*

The focus on measurable outcomes is contrasted here with usability. I think it needs to be held against all those criteria that are not usability but cannot be measured so easily. Your assumption here seems to be that measurable = repeatable without a subjective dimension, as in ACT tests. The problem is that there is subjectivity in many SC PASS/FAIL assessments, both in addressing the quality and the quantity of instances. Simply defaulting to FAIL by ignoring quantity (say, failing a site with 100 images when two images have insufficient alternative text) is the easy solution for an automated approach, but fails to reflect the actual criticality of the problem for people with disabilities. Failing to account for critical errors (say, the menu icon with an empty alt vs. a hundred teaser images with lacking or bad alt that are followed by a meaningful link and therefore do not constitute a significant barrier) leads to results that may be consistent / repeatable BUT (often) do not reflect criticality. The issue IMO is not subjectivity in addressing usability; the issue is subjectivity in assessing the criticality of a11y failures - from show stopper to negligible.

*Slide 6: Scoring Tests Based on Functional Categories Part 3*

One way to introduce weighting of issues across the functional categories affected would be to aggregate how particular issues negatively affect the collection of outcomes when these are specific to functional categories. In a subtractive scheme where the sum of outcomes = 100%, an issue affecting several outcomes (say, a custom menu element affecting the three hypothetical outcomes 1. keyboard-focusable, 2. AT-focusable, 3. name, role, state exposed to AT) would thereby negatively impact the score (by subtraction) more strongly than an issue affecting just one outcome. When a particular issue negatively impacts several functional categories, this could simply be reflected by failing a greater number of specific outcomes belonging to these categories. In the case above, "focusable by keyboard" and "focusable by AT" should be separate outcomes to allow them to reflect the different functional categories in a granular way and to aggregate overall impact. So this would take care of reflecting how "valuable" a particular requirement is.

*Slide 8: Scoring Tests Based on Functional Categories*

Instead of scoring points for applicable categories (which will vary strongly across different media / technologies) I suggest *subtracting* points when individual outcomes FAIL, plus keeping track of CRITICAL FAILS separately to be able to reflect criticality for the user. A particular technology (say, the HTML/CSS/JS stack) has a total number of theoretically applicable outcomes. This is the base number from which individual failed outcomes are subtracted. For not applicable (N.A.) cases, nothing is subtracted.

*Slide 10: Scoring Tests Based on Functional Categories Part 7*

Some ACT atomic tests may map onto exactly one outcome; in other cases several ACT tests may need to PASS in order to PASS one outcome. When a platform does not support specific semantics, the corresponding outcome is not part of the total set of outcomes for this technology (obviously this can change, for example when native apps do reflect heading hierarchy - as they now do for embedded web views).
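Very roughly, the mechanics I have in mind could look like the following sketch (the outcome names, functional categories and ACT rule IDs are invented purely for illustration, not a worked-out proposal):

    # Rough sketch of a subtractive outcome score (all names are hypothetical).

    # Outcomes theoretically applicable to a technology, tagged with the
    # functional categories they serve; this set defines the 100% optimum.
    OUTCOMES = {
        "keyboard-focusable":       {"motor"},
        "AT-focusable":             {"blind", "low-vision"},
        "name-role-state-exposed":  {"blind", "low-vision"},
        "text-alternative-present": {"blind", "cognitive"},
    }

    # Several ACT atomic rules may have to pass for a single outcome to pass.
    ACT_RULES_PER_OUTCOME = {
        "keyboard-focusable":       ["act-rule-a"],
        "AT-focusable":             ["act-rule-b"],
        "name-role-state-exposed":  ["act-rule-c", "act-rule-d"],
        "text-alternative-present": ["act-rule-e"],
    }

    def outcome_passes(outcome, act_results):
        """An outcome passes only if none of its ACT rules fail.
        act_results maps rule id -> True (pass), False (fail) or None (N.A.);
        not-applicable rules subtract nothing."""
        return all(act_results.get(rule) is not False
                   for rule in ACT_RULES_PER_OUTCOME[outcome])

    def score(act_results, critical_fails):
        """Subtract an equal share of 100% for every failed outcome and keep
        CRITICAL FAILS as a separate figure instead of hiding them in the sum."""
        share = 100.0 / len(OUTCOMES)
        failed = [o for o in OUTCOMES if not outcome_passes(o, act_results)]
        affected = set()
        for o in failed:
            affected |= OUTCOMES[o]
        return {
            "score": 100.0 - share * len(failed),
            "failed_outcomes": failed,
            "affected_categories": sorted(affected),
            "critical_fails": critical_fails,
        }

A custom menu built from divs that fails the rules behind keyboard focus, AT focus and name/role/state then subtracts three shares at once, while a single sloppy alt text subtracts at most one - which is exactly the weighting by outcomes affected described above.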
*Slide 11: A Return to Principles (consistency & equity)*

I find the equating of the 4 Principles, useful as they may still be, with "equity based on user needs" a bit forced. The principles contain various requirements that are relevant for different user needs. I think drawing in user needs / functional categories is enough to provide a more fine-grained reading of an overall conformance score. When looking at the total score or at outcomes split by functional category, it will be clear whether issues (= negative outcomes) fall more into the area of AT element semantics, pertain to graphic aspects, or to cognitive aspects, etc. Consistency with WCAG 2.X must be weighed against the internal coherence of the new conformance and scoring scheme. Today, we see many overlaps between Principle 1 (say 1.1.1 Non-text Content) and Principle 4 (say, 4.1.2 Name, Role, Value) that go far beyond Robustness. I think the principles as an overarching structure should be replaced by the reference to functional categories. Keeping both will create confusion.

*Slide 13: A Return to Principles: Treat each Principle as an equal part of a total conformance*

I am not sure whether the proposed division corresponds in any way to the impact of a11y issues across the outcomes affected. Maybe it does, but I find the summary split into principles arbitrary and unnecessary. The many overlaps of issues in terms of outcomes affected mean that a single allocation to a principle is often not possible anyway. An issue such as the wrong role of a menu icon (say, it's just a div) may affect not just Principle 4 Robustness but also Principle 2 Operability (cannot be operated by speaking, cannot be keyboard-focused) and Perceivability (the name may not be rendered to AT since an aria-label on a div may not be exposed). So the very progress of technology has made the separation by Principle increasingly awkward as a superstructure.

*Slide 14: Using Profiles for Testing*

As mentioned above, I think the profiles of technologies are just a result of the set of outcomes that are theoretically applicable to them. In a subtractive scheme, this will mean that any test (e.g. an ACT test) that cannot be carried out due to the lack of elements to which it could apply will not contribute any outcome failures. A simple page with a bit of text cannot fail as many things, or as badly, as a highly complex interactive or media-rich page. Which outcomes are applicable to a particular technology seems a better measure than the static allocation of a fixed share of the overall score to a particular Principle.

*Slide 15: Measuring the Unmeasurable*

To me, this seems driven by an emphasis on what can be measured repeatedly and unambiguously (preferably in an automated fashion), disregarding what *should* be measured if the impact on the user is the main concern. In the common situation where both quantitative and qualitative aspects contribute to the rating of an SC / an outcome on a page, closing our eyes to the qualitative aspect (how important is the missing alt on an element? Is it the main navigation control or a logo in the footer? Contextual aspects) fails to account for the actual impact on the user and fails to allow for tolerances (PASS when the impact is very low). I think subjective disagreements in rating are to some extent unavoidable - it is necessary to keep these low by offering finer grades of measurement, and by *managing*, rather than artificially excluding, the subjective factor. But this may be what is often called a "philosophical issue".
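To make the "quantity AND quality" point from slide 15 a bit more tangible, here is a small sketch of the kind of tolerance-based rating I have in mind (the severity weights and the tolerance threshold are entirely made up):

    # Sketch of a view-level rating for one outcome that weighs the number of
    # failing instances AND their criticality in context. The severity weights
    # and the tolerance threshold are invented for illustration only.

    SEVERITY = {"critical": 1.0, "serious": 0.5, "minor": 0.1}

    def rate_outcome_on_view(instances, tolerance=0.15):
        """instances: list of (passes, severity) tuples for one outcome on one view.
        A show stopper always fails; otherwise small, low-impact slips stay
        within the tolerance instead of flipping the whole view to FAIL."""
        if not instances:
            return "N.A."
        if any(sev == "critical" and not ok for ok, sev in instances):
            return "CRITICAL FAIL"
        impact = sum(SEVERITY[sev] for ok, sev in instances if not ok) / len(instances)
        return "PASS" if impact < tolerance else "FAIL"

With a rating like this, the menu icon with an empty alt is a critical fail no matter how many well-described teaser images surround it, while one sloppy alt among a hundred good ones stays below the tolerance - the judgement is still human, but it is managed rather than excluded.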
*Slide 17: Protocols and Assertions*

I cannot tell to what extent the ability to "collect points" by making assertions / buying in to protocols is likely to be abused. Larger companies with significant internal marketing resources will no doubt latch on to this, but it will often be difficult and time-consuming to check what this professed adherence means in practice. In my view, a conformance assessment should be based on a factual check of the test object that can be documented and verified by others, not on assertions that are difficult to understand and verify and may not even have affected the accessibility of some content at the time a conformance claim for it is made.

*Slide 25: Adding It All Up (abandon user flows and 'happy paths')*

"This proposal no longer attempts to measure user flows or 'happy paths', as it is impossible to predict user behavior in a consistent fashion". I think the addition of tasks is valuable when teasing out critical functionality in complex applications and distinguishing it from secondary or less critical content. Defining exactly the path to be evaluated is precisely the prediction of a particular consistent behaviour, and it allows the outcome tests to be applied without getting lost in the large number of possible permutations. I guess it is excluded here because it cannot easily be done automatically without significant effort in scripting these paths? I think the option in the Silver conformance approach so far (if I understand it correctly) - that the scope can be set to a particular critical path and that the steps on this path then belong in the test sample - is valuable and should not be abandoned, but reconciled with the extant page-based conformance approach (which is certainly not easy). The quoted complexity of scoring mechanisms seems orthogonal to the decision whether or not critical paths can be defined as the scope of a conformance claim.

*Slide 26: Adding It All Up (don't count instances of failures)*

I agree with getting away from counting instances and looking at the (aggregate) view level, but I assume that exactly this makes it necessary to apply a measured judgement that accounts for quantity AND quality of instances in a view when rating (see comments above on slide 15). Automated assessments such as ACT tests can speed up this process, but a value judgment remains to be made to do justice to the actual impact of an a11y issue on the user. (And in terms of ACT tests, this is the frequent human part of an overall test.)

*Slide 27: Adding It All Up (measurable requirements)*

I am not sure I understand what is meant by "This proposal now suggests that a Critical Failure at the view level will have less impact on the overall score, as it constrains the failure to its source view." Critical Failures only appear here, at the very end of the presentation (or I missed them somewhere), so to me it is unclear what role, if any, they ought to have in this proposed alternative scoring scheme. Could it be that they are played down because automated checks may not (yet) be that good at flagging criticality? When the unit at which a Critical Failure is registered is the view, it remains to be seen what that means for an overall score for a site or system under test. It would certainly make a difference whether the critical failure occurs in a view at the end of a critical process (i.e., is a show stopper) or whether it happens in some tangential part. Contextual factors also matter - can the user work around a failure or not?
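To show how both points - critical paths as scope (slide 25) and the location of a critical failure (slide 27) - could be handled together, a minimal sketch (the path names and structure are hypothetical, purely to illustrate the idea):

    # Sketch: a critical path is an ordered list of views that all belong to
    # the test sample; a critical failure on one of its steps blocks the whole
    # task, whereas the same failure on a tangential view only affects that view.

    CRITICAL_PATHS = {
        "checkout": ["/cart", "/address", "/payment", "/confirmation"],
    }

    def views_in_scope(sampled_views):
        """Every step of a declared critical path belongs in the test sample."""
        scoped = set(sampled_views)
        for steps in CRITICAL_PATHS.values():
            scoped.update(steps)
        return scoped

    def task_blocked(path_name, critical_failures_per_view):
        """A task counts as blocked if any step of its path has a critical
        failure that the user cannot work around."""
        return any(
            critical_failures_per_view.get(view)
            for view in CRITICAL_PATHS[path_name]
        )

A critical failure on /payment then shows up as "checkout blocked" in the report, while the same failure on a rarely used help page would only lower the score of that one view.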
*Slide 28: Adding It All Up: Testing and Scoring Process: Keep It Simple!*

In spite of the assurance to the contrary, the scoring process proposed here does not strike me as simple. Whether the division of the points score per view by the number of views evaluated arrives at a useful number - especially given that aspects are allowed to contribute that have nothing to do with the number of views (the protocols and assertions) - remains to be seen. In my view, the idea of 100% or 100 points, or whatever, i.e. an idealised optimum that sites can approach, remains the measure that is easiest to understand. Given that the number of applicable outcomes varies by content technology, the total sum of relevant outcomes per technology should make up the 100%. Any subtractions will show how far away the site is from the optimum, and what improvements will be necessary. Breaking that down by functional category makes it clearer which user group is mainly affected and where the bulk of the work will be.
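Just to make that arithmetic concrete, a last small sketch (the data structures are hypothetical; the per-view failures could come from something like the outcome scoring sketched under slides 8 and 10 above):

    # Sketch: average per-view distance from the 100% optimum and break the
    # loss down by functional category, so the report shows which user groups
    # are mainly affected. Assumes at least one evaluated view.

    from collections import defaultdict

    def site_summary(failed_outcomes_per_view, outcome_categories, applicable_total):
        """failed_outcomes_per_view: {view: set of failed outcomes}
        outcome_categories: {outcome: set of functional categories it serves}
        applicable_total: number of outcomes theoretically applicable to the
        content technology; this set defines the 100% optimum for every view."""
        share = 100.0 / applicable_total
        n_views = len(failed_outcomes_per_view)
        view_scores = {}
        loss_per_category = defaultdict(float)
        for view, failed in failed_outcomes_per_view.items():
            view_scores[view] = 100.0 - share * len(failed)
            for outcome in failed:
                for category in outcome_categories.get(outcome, ()):
                    # an outcome serving several categories counts towards each,
                    # so the breakdown shows who is affected, not a strict split
                    loss_per_category[category] += share / n_views
        return {
            "site_score": sum(view_scores.values()) / n_views,
            "view_scores": view_scores,
            "average_loss_per_category": dict(loss_per_category),
        }

The distance from 100 then directly tells a site owner how far they are from the optimum, and the per-category breakdown tells them which user groups the remaining work matters most to.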
Detlev

--
Detlev Fischer
DIAS GmbH (Testkreis is now part of DIAS GmbH)
Mobil +49 (0)157 57 57 57 45
http://www.dias.de
Beratung, Tests und Schulungen für barrierefreie Websites (consulting, testing and training for accessible websites)

Received on Thursday, 12 August 2021 16:36:23 UTC