RE: Conformance and method 'levels'

Hi John,

If I’m correct, this is the way most of us already test, isn’t it?
But then we’ll add a point system on top (instead of only pass/fail) and try to make usability a more explicit part (I do that already, but more informally…).
So we’ll stay very close to what we already have, with the added challenge of working out how “usability” relates to scoring.

The “net result” will be:

New testing = (current test time/money + adding points) + (new usability framework + new expertise + adding points)

Although I do like it (as said, this is how we already test; we even put usability up front, as it can be much more critical), I also know from experience that this will take at least twice as much time as regular pass/fail tests. Especially as you have to explain much more and need to convince people of the reasons for the needed adjustments.

Will it be a problem if we demand more test / explanation time, or will this just be part of the new model for better A11Y?

Cheers,
Jake



Regards,
Jake Abma

Accessibility Lead ING
Product owner at Team A11Y
WCAG Expert @W3C
http://www.a11yportal.com

Omnichannel FE Platform
ING Nederland
ACT C.02.460, Bijlmerdreef 24
Postbus 1800, 1000 BV Amsterdam
0031 (0)6 - 25 27 52 46
jake.abma@ing.com







From: John Foliot [mailto:john.foliot@deque.com]
Sent: maandag 24 juni 2019 1:53
To: Hall, Charles (DET-MRM) <Charles.Hall@mrm-mccann.com>
Cc: Abma, J.D. (Jake) <Jake.Abma@ing.com>; Alastair Campbell <acampbell@nomensa.com>; Silver Task Force <public-silver@w3.org>; Andrew Kirkpatrick <akirkpat@adobe.com>
Subject: Re: Conformance and method 'levels'

Hi Charles,

In a spread-sheet that I've circulated to a limited group, I've expanded on some of those thoughts.

As you note, testing (and subsequently "fixing") some issues is easier than others. I've broken that 'development/verification' process into three 'buckets' (a rough sketch of how they might be weighted follows the list):

  1.  Automated tests: these are as simple as a "click of a button", and would give you a conformance report on those things that can be automatically tested for (yes, Deque has a rules engine, but most of the major testing platforms are participating in the ACT TF, where they are 'standardizing', if not all of the tests, certainly the format by which those tests are written and expressed).
Given that these requirements are "easy" to identify and fix, they should accrue a smaller number of "points".
  2.  Human *Verification*: these require human intervention/cognition (e.g. verify that the alt text makes sense in context, ensure that caption and audio description files are accurate, verify whether a table is being used for layout or for tabular data, etc.). In these cases, the amount of effort and time (aka "resources") is greater (measured on a scale?), so these checks will likely occur less frequently.
Because these requirements are 'harder' to address (and so likely performed less frequently), they accrue more "points" than category 1 above.
  3.  Finally, the 'hard' tests - the tests and requirements that require human cognition and testing (i.e. cognitive walk-throughs, etc.). These types of verification are complex and 'expensive' to perform, but deliver great value and really drive a site toward the best accessibility.
Because these types of tests are the hardest to perform, they also accrue the most "points".
(As some real-world feedback, at Deque, we have both tools and a process for the first two categories of testing today: our axe-core rules engine, which is also the heart of our free browser extensions, and another tool that also provides a 'guided walk-through' for verifying the rest of the SC that cannot be fully mechanically tested)
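
As a rough sketch of how those buckets might carry different weights (TypeScript, purely illustrative - the bucket names follow the list above, but the per-check point values are made-up numbers, not a proposal):

type Bucket = "automated" | "humanVerification" | "humanCognition";

interface BucketResult {
  bucket: Bucket;
  passedChecks: number; // checks that passed in this bucket
}

// Assumption: harder-to-perform tests accrue more points per passing check.
const POINTS_PER_CHECK: Record<Bucket, number> = {
  automated: 1,
  humanVerification: 3,
  humanCognition: 10,
};

// The 'scoring tool' simply totals the scores from each bucket.
function totalScore(results: BucketResult[]): number {
  return results.reduce(
    (sum, r) => sum + r.passedChecks * POINTS_PER_CHECK[r.bucket],
    0
  );
}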

> as JF  points out, means the point score is only meaningful until the next day the site is published.

"Scoring" a web site the way I am envisioning would then require some kind of centralized data-base, and (yes) some additional tooling to process that data. The 'scoring tool' totals the scores from each of the three categories above, to arrive at the final score. As such, I envision the score to be dynamic in nature.

Charles is right: time takes a toll on the score. The automated tests can be as frequent as "now", the secondary set of tests can happen with some frequency (weekly?), but the 'hard' tests will happen infrequently. To address this concern, I've also thought about perhaps "stale-dating" test results: the older they become, the less valuable they are to your total score.

As a straw-man example, consider the third (hard) category of tests. In that scenario, I could envision something like losing 10% of the score value every 90 days (3 months). If you do a full-court press on the "hard" tests to get a high score at site launch (for example), but then never do those tests again, then over time they deteriorate (depreciate). So if you score 300 points in this category on January 1st, by April 1st they are only worth 270 (or, perhaps, if you scored 300 out of a possible 400, they'd lose 10% of the 400 score, thus 260); by the end of 6 months they've depreciated by 20% (so either a score of 240 [270 - another 30 = 240] or 220 [260 - another 40 = 220]). Even if the site remains 'static', with zero changes in those time-frames, you still lose the points, because while your site may not have evolved, the web (and techniques) do.
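
A minimal sketch of that depreciation idea (TypeScript; the 10%-per-90-days figures are taken from the example above, but the function itself is just one possible shape):

// Depreciate an earned score by 10% of `base` for every full 90-day period.
// `base` can be the earned score itself (300) or the possible maximum (400),
// matching the two interpretations in the example above.
function depreciate(earned: number, base: number, daysSinceTest: number): number {
  const periods = Math.floor(daysSinceTest / 90);
  return Math.max(0, earned - (periods * base) / 10);
}

depreciate(300, 300, 90);  // 270 - loses 10% of the earned 300 per period
depreciate(300, 300, 180); // 240
depreciate(300, 400, 90);  // 260 - loses 10% of the possible 400 per period
depreciate(300, 400, 180); // 220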

We can model this out in different ways to see what works best, but the fundamental idea is that over time you lose points because your test results are getting "stale"; to keep up your score, you have to continually do the "harder" testing too.

Thoughts?

JF



On Sun, Jun 23, 2019 at 4:23 PM Hall, Charles (DET-MRM) <Charles.Hall@mrm-mccann.com> wrote:
I like the idea of 2 currencies.
The ephemeral nature of the web, as JF points out, means the point score is only meaningful until the next day the site is published.

The core problems that the proposed model is trying to solve are all the “it depends” and “it’s partially conformant” cases, while framing the model in a way that encourages better practices versus always aiming for the minimum.

With 2 currencies, we could have something of a “raw score” to be regularly evaluated, and an “achievements score” which reflects if those better practices occurred.

I was originally going to simply reply to John’s message that I thought the frequency of evaluation issue could be mitigated by these practices, as they tend to be less ephemeral than the resulting sites. Example: having 10% or more of your design and development teams made up of people with disabilities; or regularly including participation of people with disabilities in your usability testing; or writing tests that specifically consider intersectional needs are each the sort of encouraged practices that are less likely to change frequently. However, this dismisses the millions of small business sites out there that have no design or development teams or usability testing or awareness of intersectional human functional needs. It biases the model toward large organizations.

In the end, there should be a threshold score for the minimum and a way to measure anything beyond it based on human impact and not based on the resources it took the author / organization to make that impact. I can use a free theme on a free platform and get free hosting and sell widgets. If I make the buttons light grey on white, I lose points. If I write descriptions at a fifth grade reading level, I gain points. If I can’t afford to run usability testing and compensate participants, I should still be able to achieve more than a minimum score.


Charles Hall // Senior UX Architect

charles.hall@mrm-mccann.com
w 248.203.8723
m 248.225.8179
360 W Maple Ave, Birmingham MI 48009
mrm-mccann.com

[MRM//McCann]
Relationship Is Our Middle Name

Ad Age Agency A-List 2016, 2017, 2019
Ad Age Creativity Innovators 2016, 2017
Ad Age B-to-B Agency of the Year 2018
North American Agency of the Year, Cannes 2016
Leader in Gartner Magic Quadrant 2017, 2018, 2019
Most Creatively Effective Agency Network in the World, Effie 2018, 2019



From: "Abma, J.D. (Jake)" <Jake.Abma@ing.com<mailto:Jake.Abma@ing.com>>
Date: Saturday, June 22, 2019 at 10:33 AM
To: John Foliot <john.foliot@deque.com<mailto:john.foliot@deque.com>>, "Hall, Charles (DET-MRM)" <Charles.Hall@mrm-mccann.com<mailto:Charles.Hall@mrm-mccann.com>>
Cc: Alastair Campbell <acampbell@nomensa.com<mailto:acampbell@nomensa.com>>, Silver Task Force <public-silver@w3.org<mailto:public-silver@w3.org>>, Andrew Kirkpatrick <akirkpat@adobe.com<mailto:akirkpat@adobe.com>>
Subject: [EXTERNAL] Re: Conformance and method 'levels'

Just some thoughts:

I do like all of the ideas from all of you, but are they really feasible?

By feasible I mean in terms of the time to test, the money spent, the difficulty of compiling a score, and the expertise needed to judge all of this.

I would love to see a simple framework with clear categories for valuing content, like the list below (a rough calculation sketch follows it):

  *   Original WCAG score => pass/fail                                       = 67/100
  *   How often do pass/fails occur => not often / often / very often        = 90/100
  *   What is the severity of the fails => not that bad / bad / blocking     = 70/100
  *   How easy is it to finish a task => easy / average / hard               = 65/100
  *   What is the quality of the translations / alternative text, etc.       = 72/100
  *   How understandable is the content => easy / average / hard             = 55/100
Total = 69/100
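
As a rough illustration of how such a total could be compiled (TypeScript; the equal weighting is purely an assumption - how the categories should actually be weighted is part of the feasibility question):

// The example categories above, each scored out of 100.
const categoryScores = {
  wcagPassFail: 67,
  failFrequency: 90,
  failSeverity: 70,
  taskCompletion: 65,
  alternativeQuality: 72,
  understandability: 55,
};

// Assumption: equal weights. A real model would likely weight severity,
// task completion, etc. differently.
const scores = Object.values(categoryScores);
const total = scores.reduce((sum, s) => sum + s, 0) / scores.length;
// ~69.8 out of 100 with equal weights - close to the 69/100 above.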

And then we also need to think about the feasibility of this kind of measuring.
Questions like: will it take 6 times as long to test as an audit does now? Will only a few people in the world be able to judge all the categories sufficiently?

Cheers,
Jake

​









________________________________
From: John Foliot <john.foliot@deque.com>
Sent: Saturday, June 22, 2019 12:36 AM
To: Hall, Charles (DET-MRM)
Cc: Alastair Campbell; Silver Task Force; Andrew Kirkpatrick
Subject: Re: Conformance and method 'levels'

Hi Charles,

I for one am under the same understanding, and I see it as far more granular than just Bronze, Silver or Gold plateaus: rather, through the accumulation of points (by doing good things) you can advance from Bronze, to Silver, to Gold - not for individual pages, but rather *for your site*.  (I've come to conceptualize it as similar to your FICO score, which numerically improves or degrades over time, yet is still always inside of a "range" from Bad to Excellent: increasing your score from 638 to 687 is commendable and a good stretch, yet you are still only - and remain - in the "Fair" range, so stretch harder still).

[alt: a semi-circle graph showing the 4 levels of FICO scoring: Bad, Fair, Good, and Excellent, along with the range of score values associated to each section. Bad is a range of 300 points to 629 points, Fair ranges from 630 to 689 points, Good ranges from 690 to 719 points, and excellent ranges from 720 to 850 points.]
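
As a throwaway TypeScript sketch of that analogy (the thresholds are simply the FICO ranges from the image above, not proposed accessibility numbers), the named level is just a band sitting on top of a continuously moving number:

type FicoRange = "Bad" | "Fair" | "Good" | "Excellent";

// Ranges as shown in the image above (300-850 scale).
function ficoRange(score: number): FicoRange {
  if (score >= 720) return "Excellent"; // 720-850
  if (score >= 690) return "Good";      // 690-719
  if (score >= 630) return "Fair";      // 630-689
  return "Bad";                         // 300-629
}

ficoRange(638); // "Fair"
ficoRange(687); // "Fair" - a better score, but still the same band

A Bronze / Silver / Gold accessibility score could work the same way: the points move continuously, and the label only changes when you cross a threshold.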

I've also arrived at the notion that your score is never going to be a "one-and-done" numeric value; rather, your score will change based on the most current data available* (in part because we all know that web sites [sic] are living, breathing, organic things, with content changes being pushed on a regular - in some cases daily or hourly - basis).

This then also leads me to conclude that your "Accessibility Score" will be a floating total of points, with those points being impacted not only by specific "techniques", but equally (if not more importantly) by functional outcomes. And so the model of:


  *   Bronze: EITHER provide AD or transcript
  *   Silver: provide AD and transcript
  *   Gold: Provide live transcript or live AD.

...feels rather simplistic to me. Much of our documentation speaks of scores (which I perceive to be numeric in nature), while what Alastair is proposing is simply Good, Better, Best - with no actual "score" involved.

Additionally, nowhere in Alastair's metric is there a measurement for "quality" of the caption, transcript or audio description (should there be? I believe yes), nor for that matter (in this particular instance) a recognition of the two very varied approaches to providing 'support assets' to the video: in-band or out-of-band (where in-band = the assets are bundled inside of the MP4 wrapper, versus out-of-band, where captions and Audio Descriptions are declared via the <track> element.) From a "functional" perspective, providing the assets in-band, while slightly harder to do production-wise, is a more robust technique (for lots of reasons), so... do we reward authors with a "better" score if they use the in-band method? And if yes, how many more "points" do they get (and why that number?) If no, why not? For transcripts, does providing the transcript as structured HTML earn you more points over providing the transcript as a .txt file?  A PDF? (WCAG 2.x doesn't seem to care about that) Should it?

(* This is already a very long email, so I will just state that I have some additional ideas about stale-dating data as well, as I suspect a cognitive walk-through result from 4 years ago likely has little-to-no value today...)

******************
In fact, if we're handing out points, how many points *do* you get for meeting the minimal functional requirement for "Accessible Media" (aka "Bronze"), and what do I need to do to increase my score to Silver (not on a single asset, but across the "range" of content - a.k.a. pages - scoped by your conformance claim) versus Gold?

Do you get the same number of points for ensuring that the language of the page has been declared (which to my mind is the easiest SC to meet)? Does providing the language of the document have the same impact on users as ensuring that Audio Descriptions are present and accurate? If (like me) you believe one to be far more important than the other, how many points does each requirement start with (as a representation of "perfect" for that requirement)? For that matter, do we count up or down in our scoring (counting up = a minimal score that improves, counting down = a maximum score that degrades)?

(ProTip: I'd also revisit the MAUR<https://www.w3.org/TR/media-accessibility-reqs/> for ideas on how to improve your score for Accessible Media, which is more than just captions and audio description).

Then, of course, there is the conundrum of "page scoring" versus "site scoring", where a video asset is (likely) displayed on a "page", and perhaps there are multiple videos on multiple pages, with accessibility support ranging from "pretty good" in one example to "OMG that is horrible" in another... how do we score that on a site-level basis? If I have 5 videos on my site, and one has no captions, transcripts or Audio Descriptions (AD), two have captions but no AD or transcripts, one has captions and a transcript but no AD, and one has all the required bits (captions, AD, transcript)... what's my score? Am I Gold, Bronze, or Silver? Why?

And if I clean up 3 of those five videos, but leave the other two as-is, do I see an increase in my score? If yes, by how much? Why? Do I get more points for cleaning up the video that lacks AD and a transcript, versus fewer points for cleaning up the video that just needs audio descriptions? Does adding audio descriptions accrue more points than just adding a transcript? Can points, as numeric values, also include decimals (e.g. 16.25 'points' out of a maximum of 25)? Is this the path we are on?
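
Just to make the five-video example concrete, here is one possible (and entirely made-up) TypeScript weighting - which is exactly the kind of arbitrary choice the questions above are pointing at:

interface VideoSupport {
  captions: boolean;
  transcript: boolean;
  audioDescription: boolean;
}

// Hypothetical per-asset weights; nothing today assigns these numbers.
const POINTS = { captions: 10, transcript: 5, audioDescription: 10 };
const MAX_PER_VIDEO = POINTS.captions + POINTS.transcript + POINTS.audioDescription;

function videoScore(v: VideoSupport): number {
  return (v.captions ? POINTS.captions : 0)
    + (v.transcript ? POINTS.transcript : 0)
    + (v.audioDescription ? POINTS.audioDescription : 0);
}

// The five videos described above:
const videos: VideoSupport[] = [
  { captions: false, transcript: false, audioDescription: false },
  { captions: true,  transcript: false, audioDescription: false },
  { captions: true,  transcript: false, audioDescription: false },
  { captions: true,  transcript: true,  audioDescription: false },
  { captions: true,  transcript: true,  audioDescription: true  },
];

const earned = videos.reduce((sum, v) => sum + videoScore(v), 0);  // 60
const siteRatio = earned / (videos.length * MAX_PER_VIDEO);        // 0.48
// 48% of the available media points... but is that Bronze, Silver, or neither?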

Scoring is *everything* if we are moving to a Good, Better, Best model for all of our web accessibility conformance reporting. Saying you are at "Silver" without knowing explicitly how you got there will be a major hurdle, and one that we'll need to be able to explain.

It is for these reasons that I have volunteered to help work on the conformance model: I am of the opinion that all the other migration work will eventually run into this scoring issue as a major blocker. No matter which existing SC I consider, I soon arrive at variants of the questions above (and more), all related to scalability, techniques, impact on different user groups, our move from page conformance reporting to site conformance reporting, and a sliding scale of "points" that we've yet to tackle - points that will come to represent Bronze, Silver and Gold.

JF


On Fri, Jun 21, 2019 at 12:53 PM Hall, Charles (DET-MRM) <Charles.Hall@mrm-mccann.com> wrote:
I understand the logical parallel.

However, my understanding (perhaps influenced by my own intent) of the point system is that points are not directly proportional to the number of features (supported by methods) added, or to the difficulty associated with adding them, but are instead based on meeting functional needs. In this example, transcription, captioning and audio description (recorded) may all be implemented but still only earn sufficient points for silver, while addressing the content itself to be more understandable by people with cognitive issues or intersectional needs would be required for sufficient points to earn gold. The difference being people, not methods.

Am I alone in this view?


Charles Hall // Senior UX Architect

charles.hall@mrm-mccann.com
w 248.203.8723
m 248.225.8179
360 W Maple Ave, Birmingham MI 48009
mrm-mccann.com

Relationship Is Our Middle Name

Ad Age Agency A-List 2016, 2017, 2019
Ad Age Creativity Innovators 2016, 2017
Ad Age B-to-B Agency of the Year 2018
North American Agency of the Year, Cannes 2016
Leader in Gartner Magic Quadrant 2017, 2018, 2019
Most Creatively Effective Agency Network in the World, Effie 2018, 2019



From: Alastair Campbell <acampbell@nomensa.com>
Date: Friday, June 21, 2019 at 12:01 PM
To: Silver Task Force <public-silver@w3.org>
Subject: [EXTERNAL] Conformance and method 'levels'
Resent-From: Silver Task Force <public-silver@w3.org>
Resent-Date: Friday, June 21, 2019 at 12:01 PM

Hi everyone,

I think this is a useful thread to be aware of when thinking about conformance and how different methods might be set at different levels:
https://github.com/w3c/wcag/issues/782

It is about multimedia access, so the 1.2.x section in WCAG 2.x. You might think that it is fairly straightforward, as the solutions are fairly cut & dried (captions, transcripts, AD, etc.).

However, the tricky bit is at what level you require different solutions.

If you had a guideline such as “A user does not need to see in order to understand visual multimedia content”, then Patrick’s levelling in one of the comments<https://github.com/w3c/wcag/issues/782#issuecomment-504038948> makes sense:

  *   Bronze: EITHER provide AD or transcript
  *   Silver: provide AD and transcript
  *   Gold: Provide live transcript or live AD.

I raise this because, if you read the thread, you’ll see how the levels impacted the drafting of the guidelines, and I think we’ll have a similar (or more complex?) dynamic for the scoring in Silver, and for how methods are drafted.

Kind regards,

-Alastair

--

www.nomensa.com / @alastc

This message contains information which may be confidential and privileged. Unless you are the intended recipient (or authorized to receive this message for the intended recipient), you may not use, copy, disseminate or disclose to anyone the message or any information contained in the message. If you have received the message in error, please advise the sender by reply e-mail, and delete the message. Thank you very much.


--
John Foliot | Principal Accessibility Strategist | W3C AC Representative
Deque Systems - Accessibility for Good
deque.com




--
John Foliot | Principal Accessibility Strategist | W3C AC Representative
Deque Systems - Accessibility for Good
deque.com


-----------------------------------------------------------------
ATTENTION:
The information in this e-mail is confidential and only meant for the intended recipient. If you are not the intended recipient, don't use or disclose it in any way. Please let the sender know and delete the message immediately.
-----------------------------------------------------------------

Received on Monday, 24 June 2019 06:47:38 UTC