Re: After today's call (Comments to John Foliot's alternative scoring proposal)

Hi Detlev,

Thank YOU for taking the time to review my proposal in depth, and for the
thoughtful, high-quality comments you provided in return. There is a lot
here to unpack, and over the next weeks I'm sure all of this will be
discussed at length by our working group. As I am somewhat pressed for time
today, I'll just offer some broad responses to a few general themes, and
will try to get more granular in my responses later.

*Additive versus Subtractive scoring*: I know there are definitely two
camps within the working group on this topic. I've tried to analyze both
perspectives when putting together my proposal, and I arrived at an
additive approach mostly because I was thinking about some social
engineering here as well.

Plainly put, a subtractive mechanism feels like a punitive scheme, where
you *lose points* for not succeeding. In an additive scheme, you are
instead *rewarded points* for doing the right thing (succeeding), which
feels to me to be the opposite of being punished. As my grandmother used to
say, you catch more flies with honey than you do with vinegar. :-)

Seriously, since we are working with a graduated scoring proposal (Bronze,
Silver, and Gold), most organizations will likely first reach the Bronze
level at a minimum: to get a better score you'll need MORE points, and
those additional points would be *added* to your existing score. For
inexperienced teams, trying to earn more points is an easy-to-understand
goal; losing fewer points, less so.
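
For illustration only, here is a minimal sketch in Python of what graduated,
additive levels could look like; the thresholds, point values, and the
level() helper are hypothetical, not numbers from my proposal or the group:

```python
# Hypothetical thresholds, purely to illustrate the additive idea:
# a better level is always reached by earning MORE points.
BRONZE = 60
SILVER = 75
GOLD = 90

def level(points_earned: int, points_possible: int) -> str:
    """Map an additive score (points earned out of points possible) to a level."""
    pct = 100 * points_earned / points_possible
    if pct >= GOLD:
        return "Gold"
    if pct >= SILVER:
        return "Silver"
    if pct >= BRONZE:
        return "Bronze"
    return "Not yet conforming"

print(level(68, 100))  # Bronze
print(level(82, 100))  # Silver
```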

So I personally envision WCAG 3 as a way to encourage content owners to do
more, to add to their accessibility, to 'stretch' towards a better score -
and all of those goals (to me) are additive in nature. Frankly, it also
seems (again, to me) easier to explain: *do more = get more*, versus
*do more = lose less* - and yet both schemes will get entities to a 'score'
that will be <= 100%, which in the end is really what is needed: a score.

Looking at your example (1. keyboard-focusable, 2. AT-focusable, 3. Name,
role, state exposed to AT), it is not at all clear to me what the big
difference is: for discussion's sake, let's presume that each of those 3
'requirements' would be equal to 33.3% of a larger "ask". I fail to
see how adding versus subtracting in that use-case fundamentally changes
anything: if you only accomplish 2 of those three objectives, your
score would be 66.6% whether you add or subtract.
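
To make that arithmetic concrete, here is a minimal sketch using the three
outcome names from your example; the equal 33.3% weighting is assumed
purely for discussion:

```python
# The three hypothetical outcomes, each worth an equal share of the "ask".
outcomes = {
    "keyboard-focusable": True,
    "AT-focusable": True,
    "name/role/state exposed to AT": False,  # the one objective not met
}

share = 100 / len(outcomes)  # 33.3% each

# Additive: start at 0 and award points for every outcome that passes.
additive = sum(share for passed in outcomes.values() if passed)

# Subtractive: start at 100 and deduct points for every outcome that fails.
subtractive = 100 - sum(share for passed in outcomes.values() if not passed)

print(f"{additive:.1f} {subtractive:.1f}")  # 66.7 66.7 - the same score
```

Either way the tally lands on the same number; the only difference is
whether you count up from 0 or down from 100.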

*Non-subjective Tests:* Reviewing the initial round of comments
<https://github.com/w3c/silver/issues> in the aggregate, it became very
clear to me that increasing the subjectivity of testing and reporting
requirements is a non-starter at scale: multiple industry commenters, and
comments on behalf of governmental organizations and NGOs, make it clear
that these organizations will be hard-pressed to support a specification
that is unusable to them at scale. (I attempted to surface some of the
more germane comments in the deck, although Google Sheets sometimes
corrupted the formatting - I can send you the actual deck if you'd like.)

I'm not sure exactly how influential industry is in Europe with regard to
impacting legislation, but in the United States that influence can be (and
is) substantial. Candidly, without the support of large industry players,
getting WCAG 3 adopted into a regulatory framework will be
difficult-to-impossible, and so it is incumbent on us to listen carefully
to that feedback, especially when we hear it from multiple players of
differing sizes in multiple industries with multiple goals. (Unless of
course we are unconcerned about legislative adoption, but I suspect that is
not acceptable to us either.)

This is why I questioned the goal of WCAG 3: is the goal to encourage (and
teach toward) more accessible outcomes, or is the goal to *report on how
well that is being accomplished*? My real fear is that in an attempt to do
both, we sort of fail at both as well - I do not believe we can be all
things to all people and still be effective.

*"Criticality": *you wrote "Simply defaulting to FAIL by ignoring quantity
(say, failing a site with 100 images when two images have insufficient
alternative text) is the easy solution for an automated approach, but fails
to reflect the actual criticality of the problem to people with
disabilities." Respectfully Detlev, which people, and which disability (or
combinations of disabilities)?
(And for clarity, I proposed failing a 'view' - or rather, not awarding
'points' - if one or more images lack alt text in that view: I'm not going
to presume to know which images are critical versus non-critical, because
that is an individual determination - if you fail once, you've failed in
that view and you don't get the points for that view. The net effect is
that content owners remediate *all* images in a view, not just the ones
deemed "critical", in order to get their points - and adding up points is,
I propose, the name of the game.)
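
As a minimal sketch of that all-or-nothing approach (the views, the image
data, and the per-view point value below are entirely hypothetical):

```python
# Hypothetical data: for each view, whether each image in it has alt text.
views = {
    "home":     [True, True, True],
    "search":   [True, False, True],   # one missing alt -> this view earns nothing
    "checkout": [True, True],
}

POINTS_PER_VIEW = 1  # illustrative weighting only

def score(views: dict[str, list[bool]]) -> float:
    # A view earns its points only if *every* image passes; a single failure
    # forfeits the whole view - no judging which images are "critical".
    earned = sum(POINTS_PER_VIEW for images in views.values() if all(images))
    possible = POINTS_PER_VIEW * len(views)
    return 100 * earned / possible

print(f"{score(views):.1f}%")  # 66.7% - two of the three views earn their points
```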

I believe this to be part of the fundamental problem, however: ultimately
we are dealing with individual people, not "peoples", and what may be
considered critical for one user with a specific disability may not be
deemed critical by another user, even if they both share the same
disability. If my 20+ years of experience in our field has taught me
anything, it's that this is a truism that cannot be ignored.

I then argue that if this is indeed true, the criticality of something -
especially when that criticality may also be shaped by other contextual
factors - is simply not something we can evaluate with any degree of
accuracy. And the ability to *accurately and consistently evaluate
something* is at the heart of conformance reporting, even if it has little
to do with usability. Even in your commentary, you note that criticality is
both subjective and contextual, and I've yet to hear a proposal to address
or overcome that specific concern.

So yes, I am proposing that we only "measure and score" that which can be
accurately and uncontroversially measured and scored; that as a goal we
seek to squeeze out the known subjectivity we currently have in WCAG 2.x
even further, and that we avoid at all costs adding more subjectivity to
the scoring and conformance - which, again, IS NOT the same as evaluating
usability, which is by its very nature an individual determination, user by
user.

*User flows and Happy Paths:* Here, you wrote, "I think the addition of
tasks is valuable when teasing out critical functionality in complex
applications and distinguishing it from secondary or less critical
content." ...to which I can only reply that this is true for *usability
testing*, but not for *conformance testing*. This is a useful (valuable)
metric when designing and testing expected outcomes (in the aggregate), *but
it does not (cannot?) reflect how individual users will use a page or
view/screen*.

I didn't want to single out any particular website, but with deference to
our colleague Peter Korn and the sincere effort Amazon is making
towards online accessibility (in fact, in some ways they are at times
leading), what is the primary "happy path" at www.amazon.com?

As I personally use Amazon on a regular basis, I frequently go to the home
page for multiple reasons: to find a new product, to track a product I
purchased, to do comparative price-shopping analysis (without actually
buying anything), or to see if there are any "specials" or featured
products I may need or want. Additionally, I recently had to update all of
my contact information at Amazon when I moved from Texas to Ontario (a
crucial if infrequent task)... so which, if any (or all), of these are
'critical' or 'less critical' functions? (Answer: "that depends...") And
because web pages frequently contain hyperlinks, which are all potential
'forks' in any happy path, how can we test conformance at scale with
consistency? The answer (again, based on industry feedback) is that we
can't.

You also state, "To me, this seems driven by an emphasis on what can be
measured repeatedly and unambiguously (preferably in an automated fashion),
disregarding what *should* be measured if the impact on the user is the
main concern." - and there, right there, is the crux of the problem: are we
measuring and scoring usability (aka the impact on the user), or
conformance (aka we are following all of the predefined rules)?

*This is the big question I believe we need to be asking ourselves.*

At any rate Detlev, thanks again for taking the time to dissect my
proposal - you've done a great job of surfacing some of the issues we'll
need to resolve going forward if this proposal is to see the light of day,
and I truly appreciate the effort, my friend.

JF

On Thu, Aug 12, 2021 at 12:36 PM Detlev Fischer <detlev.fischer@testkreis.de>
wrote:

> Hi John, hi all,
>
> I have gone through your slides, John, and written some comments (see
> below). Thanks for the tons of time and effort that must have gone into
> this - even though I favour a different approach.
>
> *Slide 3: Questioning the role of WCAG*
> The focus on measurable outcomes is contrasted here with usability. I think
> it needs to be held against all those criteria that are not usability but
> cannot be measured so easily. Your assumption seems to be here that
> measurable = repeatable without a subjective dimension, as in ACT tests.
> The problem is that there is subjectivity in many SC PASS/FAIL assessments
> both in addressing quality and quantity of instances.
> Simply defaulting to FAIL by ignoring quantity (say, failing a site with
> 100 images when two images have insufficient alternative text) is the
> easy solution for an automated approach, but fails to reflect the actual
> criticality of the problem to people with disabilities.
> Failing to account for critical errors (say, the menu icon with empty alt
> vs. a hundred teaser images with lacking or bad alt that are followed by a
> meaningful link that do not constitute a significant barrier) leads to
> results that may be consistent / repeatable BUT (often) do not reflect
> criticality. The issue IMO is not subjectivity in addressing usability, the
> issue is subjectivity in assessing the criticality of a11y failures - from
> show stopper to negligible.
>
> *Slide 6: Scoring Tests Based on Functional Categories Part 3*
> One way to introduce weighting of issues across functional categories
> affected would be to aggregate how particular issues negatively affect the
> collection of outcomes when these are specific to functional categories.
> In a subtractive scheme where the sum of outcomes = 100%, an issue
> affecting several outcomes (say, a custom menu element affecting the three
> hypothetical Outcomes (1. keyboard-focusable, 2. AT-focusable, 3. Name,
> role, state exposed to AT)) would thereby negatively impact the score (by
> subtraction) more strongly than an issue affecting just one outcome.
> When a particular issue negatively impacts several functional categories,
> this could simply be reflected by failing a greater number of specific
> outcomes belonging to this category. In the case above, "focusable by
> keyboard" and "focusable by AT" should be separate outcomes to allow these
> outcomes to reflect the different functional categories in a granular way,
> and aggregate overall impact. So this would take care of reflecting how
> "valuable" a particular requirement is.
>
>
> *Slide 8: Scoring Tests Based on Functional Categories*
> Instead of
> scoring points for applicable categories (which will vary strongly over
> different media / technologies) I suggest *subtracting* points when
> individual outcomes FAIL, plus keeping track of CRITICAL FAILS separately
> to be able to reflect criticality for the user. A particular technology
> (say, the HTML/CSS/JS stack) has a total number of theoretically
> applicable outcomes. This is the base number from which individual failed
> outcomes are subtracted. For not applicable (N.A.) cases, nothing is
> subtracted.
>
> *Slide 10: Scoring Tests Based on Functional Categories Part 7*
> Some ACT atomic tests may map onto exactly one outcome, in other cases
> several ACT tests may need to PASS in order to PASS one outcome.
> When a platform does not support specific semantics, the corresponding
> outcome is not part of the total set of outcomes for this technology
> (obviously this can change, for example, when native apps do reflect
> heading hierarchy - as they now do for embedded web views).
>
> *Slide 11: A Return to Principles (consistency & equity)*
> I find the equating of the 4 Principles, useful as they may still be,
> with "equity based on user needs" a bit forced. The principles contain
> various requirements that are relevant for different user needs. I think
> drawing in user needs / functional categories is enough to provide a more
> fine-grained reading of an overall conformance score. When looking at the
> total score, or at outcomes split into functional categories, it will be
> clear whether issues (= negative outcomes) fall more into the area of AT
> element semantics, graphic aspects, cognitive aspects, etc.
> Consistency with WCAG 2.X must be seen against internal coherence of the
> new conformance and scoring scheme. Today, we see many overlaps between
> Principle 1 (say 1.1.1 Non-Text Alternatives) and Principle 4 (say, 4.1.2
> Name, Role, Value) which goes far beyond Robustness. I think the
> principles as an overarching structure should be replaced by the reference to
> functional categories. Keeping both will create confusion.
>
> *Slide 13: A Return to Principles: Treat each Principle as an equal part
> of a total conformance*
> I am not sure whether the proposed division corresponds in any way to the
> impact of a11y issues across outcomes affected. Maybe it does, but I find
> the summary split in principles arbitrary and unnecessary. The many
> overlaps of issues in terms of outcomes affected means that a single
> allocation to principle is often not possible anyway. An issue such as the
> wrong role of a menu icon (say, it's just a div) may affect not just
> Principle 4 Robustness but also Principle 2 Operability (cannot be operated
> by speaking, cannot be keyboard-focused) and Perceivability (name may not
> be rendered to AT since an aria-label on a div may not be exposed). So the
> very progress of technology has made the separation by Principle
> increasingly awkward as a superstructure.
>
> *Slide 14: Using Profiles for Testing*
> As mentioned above, I think the profiles of technology are just a result
> of the set of outcomes that are theoretically applicable to them. In a
> subtractive scheme, this will mean that any test (e.g. ACT test) that
> cannot be carried out due to the lack of elements to which it could apply
> would not contribute any outcome fails. A simple page with a bit of text
> cannot fail as many things and as badly as a highly complex interactive, or
> a media-rich page. What outcomes are applicable to a particular technology
> seems a better measure than the static allocation of a fixed share of the
> overall score to a particular Principle.
>
> *Slide 15: Measuring the Unmeasurable*
> To me, this seems driven by an emphasis on what can be measured repeatedly
> and unambiguously (preferably in an automated fashion), disregarding what
> *should* be measured if the impact on the user is the main concern. In the
> common situation where both quantitative and qualitative aspects contribute
> to the rating of an SC / an outcome on a page, closing the eyes to the
> qualitative aspect (how important is the missing alt on an element? Is it
> the main navigation control or a logo in the footer? Contextual aspects)
> fails to account for the actual impact on the user and fails to allow for
> tolerances (PASS when the impact is very low). I think subjective
> disagreements in rating are to some extent unavoidable - it is necessary to
> keep these low by offering finer grades of measurement, and by *managing*,
> rather than artificially excluding, the subjective factor.
> But this may be what is often called a "philosophical issue".
>
> *Slide 17: Protocols and Assertions*
> I cannot tell to what extent the ability to "collect points" by making
> assertions / buy in to protocols is likely to be abused. Larger companies
> with significant internal marketing resources will no doubt latch on to
> this, but it will often be difficult and time-consuming to check what this
> professed adherence means in practice. In my view, a conformance assessment
> should be based on a factual check of the test object that can be
> documented and verified by others, not on assertions that are difficult to
> understand and to verify, and may not even have affected the
> accessibility of some content yet at the time a conformance claim for it
> is made.
>
> *Slide 25: Adding It All Up (abandon user flows and 'happy paths')*
> "This proposal no longer attempts to measure user flows or ‘happy paths’,
> as it is impossible to predict user behavior in a consistent fashion".
> I think the addition of tasks is valuable when teasing out critical
> functionality in complex applications and distinguishing it from secondary
> or less critical content. Defining an exact path to be evaluated is
> precisely the prediction of a particular consistent behaviour, and it
> allows the outcome tests to be applied without getting lost in the large
> number of possible permutations. I guess it is excluded here because it
> cannot easily be done automatically without the significant effort of
> scripting these paths?
> I think the option in the Silver conformance approach so far (if I
> understand it correctly) that the scope can be set to a particular critical
> path and that the steps on this path then belong into the test sample, is
> valuable and should not be abandened, but reconciled with the extant
> page-based conformance approach (which is certainly not easy). The quoted
> complexity of scoring mechanisms seems orthogonal to the decision whether
> or not critical paths can be defined as scope of a conformance claim.
>
> *Slide 26: Adding It All Up (don't count instances of failures)*
> I agree with getting away from counting instances and looking at the
> (aggregate) view level, but I assume that exactly this makes it necessary
> to apply a measured judgement that accounts for quantity AND quality of
> instances in a view when rating (see comments above to slide 15). Automated
> assessments such as ACT tests can speed up this process but a value
> judgment remains to be made to do justice to the actual impact of an a11y
> issue on the user. (And in terms of ACT tests, this is the frequent human
> part of an overall test).
>
>
>
> *Slide 27: Adding It All Up (measurable requirements)*
> I am not sure I
> understand what is meant by "This proposal now suggests that a Critical
> Failure at the view level will have less impact on the overall score, as it
> constrains the failure to its source view." Critical Failures only appear
> here, at the very end of the presentation (or I missed them somewhere), so
> to me it is unclear what if any role they ought to have in this proposed
> alternative scoring scheme. Could it be that they are played down because
> automated checks may be not that good (yet) at flagging criticality?
> When the unit at which a Critical Failure is registered is the view, it
> remains to be seen what that means for an overall score for a site, or
> system under test. It would certainly make a difference if the critical
> failure occurs in a view at the end of a critical process (i.e., is a show
> stopper) or if it happens in some tangential part. Contextual factors also
> matter - can the user work around a failure or not?
>
>
> *Slide 28: Adding It All Up: Testing and Scoring Process: Keep It Simple!*
> In spite of the assurance to the contrary, the scoring process proposed
> here does not strike me as simple. Whether or not the division of the
> points score per view by the number of views evaluated arrives at a useful
> number, especially given that aspects are allowed to contribute that have
> nothing to do with the number of views (the protocols and assertions),
> remains to be seen.
> In my view, the idea of 100% or 100 points, or whatever, i.e. an
> idealised optimum that sites can approach, remains the measure that is
> easiest to understand. Given that the number of applicable outcomes varies
> by content technology, the total sum of relevant outcomes per tech should
> make up the 100%. Any subtractions will show how far away the site is from
> the optimum, and what improvements will be necessary. Breaking that down by
> functional category makes it clearer what user group is mainly affected and
> where the bulk of the work will be.
>
> Detlev
>
> --
> Detlev Fischer
> DIAS GmbH
> (Testkreis is now part of DIAS GmbH)
>
> Mobil +49 (0)157 57 57 57 45
> http://www.dias.de
> Beratung, Tests und Schulungen für barrierefreie Websites
>
>

-- 
*John Foliot* |
Senior Industry Specialist, Digital Accessibility |
W3C Accessibility Standards Contributor |

"I made this so long because I did not have time to make it shorter." -
Pascal "links go places, buttons do things"

Received on Thursday, 12 August 2021 21:01:54 UTC