
Re: Conformance and method 'levels'

From: John Foliot <john.foliot@deque.com>
Date: Mon, 24 Jun 2019 11:49:31 -0500
Message-ID: <CAKdCpxxGwqLyvjcRKH_gqGS_aYhVr3Uwv7ah5_=ZLPLa6SWduQ@mail.gmail.com>
To: Detlev Fischer <detlev.fischer@testkreis.de>
Cc: Silver TF <public-silver@w3.org>
Hi Detlev,

Point taken, although at Deque we do believe that automated testing can
catch between 30% and 50% of page-level issues. There are, to my mind, two
types of issue: 'platform' and 'content'. Platform issues are usually
related to templates or page structure (e.g. SC 1.3.1), while content
issues are, well, related to the content itself (e.g. the quality of an
alt text).

Nonetheless, there remain, to my mind, three levels of 'evaluation' that
are based upon effort (whether at the development level or at the
evaluation level - to my mind two sides of the same coin when it comes to
accessibility conformance), and so at the highest level I am suggesting
that those levels of effort be a variable in our scoring mechanism,
whatever form that takes.

JF

On Mon, Jun 24, 2019 at 8:27 AM Detlev Fischer <detlev.fischer@testkreis.de>
wrote:

> Hi John
> I would take issue with level 1, automated tests, being valid on its
> own for adding any 'points' to a score. Except perhaps for 3.1.1 Language
> of Page and 4.1.1 Parsing, nearly *all* SCs that can be partly tested
> automatically require an additional human check or verification of the
> automatic results (weeding out the false negatives as well as the false
> positives) to arrive at a valid conformance assessment. An automatic
> check can show non-conformance for pages where there are clear issues
> (field without accName, no img alt, etc.) but misses instances where
> non-conformance is caused by the other half (non-descriptive alt, label, or
> captioning; wrong label referenced, wrong state communicated, etc.).
> Automatic tools are great heuristic helpers but cannot be relied upon to
> determine conformance (except for a few SCs) and will have blind spots when
> used to determine non-conformance.
> Detlev
>
> Am 24.06.2019 um 01:53 schrieb John Foliot:
>
> Hi Charles,
>
> In a spread-sheet that I've circulated to a limited group, I've expanded
> on some of those thoughts.
>
> As you note, some issues are easier to test (and subsequently "fix")
> than others. I've broken that 'development/verification' process into
> three 'buckets':
>
>    1. Automated tests: these are as simple as a "click of a button",
>    and would give you a conformance report on those things that can be
>    automatically tested for (yes, Deque has a rules engine, but most of the
>    major testing platforms are participating in the ACT TF, where they are
>    'standardizing', if not all of the tests, certainly the format by which
>    those tests are written and expressed).
>    Given that these requirements are "easy" to identify and fix, they
>    should accrue a smaller number of "points".
>
>    2. Human *Verification*: these require more human
>    intervention/cognition (e.g. verify that the alt text makes sense in
>    context, ensure that caption and audio description files are accurate,
>    verify whether a table is being used for layout or for tabular data,
>    etc.). In these cases, the amount of effort and time (aka "resources")
>    is greater (measured on a scale?), and so these checks likely occur
>    less frequently.
>    Because these requirements are 'harder' to address (and so likely
>    addressed less frequently), they accrue more "points" than category 1
>    above.
>
>    3. Finally, the 'hard' tests - the tests and requirements that require
>    human cognition and testing (i.e. cognitive walk-throughs, etc.). These
>    types of verification are complex and 'expensive' to perform, but
>    deliver great value and really drive a site toward best accessibility.
>    Because these types of tests are the hardest to perform, they also
>    accrue the most "points".
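The three buckets above could be weighted along these lines - a minimal Python sketch; the weights (1, 2, 4) and category names are illustrative assumptions, not values proposed anywhere in this thread:

```python
# Hypothetical weights: harder-to-test categories carry more "points".
WEIGHTS = {"automated": 1, "human_verified": 2, "human_cognition": 4}

def bucket_score(pass_rates: dict) -> float:
    """pass_rates maps each category to the fraction of its checks
    passed (0.0-1.0); returns a weighted score normalised to 0.0-1.0."""
    weighted = sum(WEIGHTS[cat] * rate for cat, rate in pass_rates.items())
    return weighted / sum(WEIGHTS.values())
```

With these assumed weights, passing every automated check but nothing else yields only 1/7 of the total, reflecting the idea that the 'hard' tests drive most of the score.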
>
> (As some real-world feedback, at Deque, we have both tools and a process
> for the first two categories of testing today: our axe-core rules engine,
> which is also the heart of our free browser extensions, and another tool
> that also provides a 'guided walk-through' for verifying the rest of the SC
> that cannot be fully mechanically tested)
>
>
> > as JF  points out, means the point score is only meaningful until the
> next day the site is published.
>
>
> "Scoring" a web site the way I am envisioning would then require some kind
> of centralized database, and (yes) some additional tooling to process that
> data. The 'scoring tool' totals the scores from each of the three
> categories above to arrive at the final score. As such, I envision the
> score to be dynamic in nature.
>
> Charles is right: time takes a toll on the score. The automated tests can
> be as frequent as "now", the secondary set of tests can happen with some
> frequency (weekly?), while the 'hard' tests will happen infrequently. To
> address this concern, I've also thought about "stale-dating" test
> results, so that the older they become, the less valuable they are to your
> total score.
>
> As a straw-man example, consider the third ("hard") category of tests. In
> that scenario, I could envision something like losing 10% of the score
> value every 90 days (3 months). If you do a full-court press on the "hard"
> tests to get a high score at site launch (for example), but then never do
> those tests again, then over time those results deteriorate (depreciate).
> So if you score 300 points in this category on January 1st, by April 1st
> they are only worth 270 (or, if they scored 300 out of a possible 400 and
> lose 10% of the 400 each period, 260); by the end of 6 months they've
> depreciated by 20% (so either a score of 240 [270 - another 30], or 220
> [260 - another 40]). Even if the site remains 'static', with zero
> changes in those time-frames, you still lose the points, because while your
> site may not have evolved, the web (and techniques) do.
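The straw-man depreciation above can be sketched as follows, assuming the first variant (a linear 10% of the original score docked per full 90-day period); the second variant would dock 10% of the maximum possible score instead:

```python
def depreciated(points: float, days_since_test: int,
                rate: float = 0.10, period_days: int = 90) -> float:
    """Dock `rate` of the original points for each full period elapsed
    since the tests were last run; never goes below zero."""
    periods = days_since_test // period_days
    return max(0.0, points * (1 - rate * periods))
```

This reproduces the numbers in the email: a 300-point result is worth 270 after 90 days and 240 after six months, even if the site itself never changed.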
>
> We can model this out in different ways to see what works best, but the
> fundamental idea is that over time you lose points because your test
> results are getting "stale"; so, to keep up your score, you have to
> continually do the "harder" testing too.
>
> Thoughts?
>
> JF
>
>
>
> On Sun, Jun 23, 2019 at 4:23 PM Hall, Charles (DET-MRM) <
> Charles.Hall@mrm-mccann.com> wrote:
>
>> I like the idea of 2 currencies.
>>
>> The ephemeral nature of the web, as JF points out, means the point score
>> is only meaningful until the next day the site is published.
>>
>>
>>
>> The core problems that the proposed model is trying to solve are all the
>> “it depends” and “it’s partially conformant” cases, while framing the model
>> in a way that encourages better practices versus always aiming for the
>> minimum.
>>
>>
>>
>> With 2 currencies, we could have something of a “raw score” to be
>> regularly evaluated, and an “achievements score” which reflects if those
>> better practices occurred.
>>
>>
>>
>> I was originally going to simply reply to John’s message that I thought
>> the frequency of evaluation issue could be mitigated by these practices, as
>> they tend to be less ephemeral than the resulting sites. Example: having
>> 10% or more of your design and development teams made up of people with
>> disabilities; or regularly including participation of people with
>> disabilities in your usability testing; or writing tests that specifically
>> consider intersectional needs are each the sort of encouraged practices
>> that are less likely to change frequently. However, this dismisses the
>> millions of small business sites out there that have no design or
>> development teams or usability testing or awareness of intersectional human
>> functional needs. It biases the model toward large organizations.
>>
>>
>>
>> In the end, there should be a threshold score for the minimum and a way
>> to measure anything beyond it *based on human impact *and not based on
>> the resources it took the author / organization to make that impact. I can
>> use a free theme on a free platform and get free hosting and sell widgets.
>> If I make the buttons light grey on white, I lose points. If I write
>> descriptions at a fifth grade reading level, I gain points. If I can’t
>> afford to run usability testing and compensate participants, I should still
>> be able to achieve more than a minimum score.
>>
>>
>>
>>
>>
>> *Charles Hall* // Senior UX Architect
>>
>>
>>
>> charles.hall@mrm-mccann.com
>> <charles.hall@mrm-mccann.com?subject=Note%20From%20Signature>
>>
>> w 248.203.8723
>>
>> m 248.225.8179
>>
>> 360 W Maple Ave, Birmingham MI 48009
>>
>> mrm-mccann.com <https://www.mrm-mccann.com/>
>>
>>
>>
>> [image: MRM//McCann]
>>
>> Relationship Is Our Middle Name
>>
>>
>>
>> Ad Age Agency A-List 2016, 2017, 2019
>>
>> Ad Age Creativity Innovators 2016, 2017
>>
>> Ad Age B-to-B Agency of the Year 2018
>>
>> North American Agency of the Year, Cannes 2016
>>
>> Leader in Gartner Magic Quadrant 2017, 2018, 2019
>>
>> Most Creatively Effective Agency Network in the World, Effie 2018, 2019
>>
>>
>>
>>
>>
>>
>>
>> *From: *"Abma, J.D. (Jake)" <Jake.Abma@ing.com>
>> *Date: *Saturday, June 22, 2019 at 10:33 AM
>> *To: *John Foliot <john.foliot@deque.com>, "Hall, Charles (DET-MRM)" <
>> Charles.Hall@mrm-mccann.com>
>> *Cc: *Alastair Campbell <acampbell@nomensa.com>, Silver Task Force <
>> public-silver@w3.org>, Andrew Kirkpatrick <akirkpat@adobe.com>
>> *Subject: *[EXTERNAL] Re: Conformance and method 'levels'
>>
>>
>>
>> Just some thoughts:
>>
>>
>>
>> I do like all of the ideas from all of you but are they really feasible?
>>
>>
>>
>> By 'feasible' I mean in terms of time to test, money spent, the
>> difficulty of compiling a score, and the expertise needed to judge all of
>> this.
>>
>>
>>
>> I would love to see a simple framework with clear categories for valuing
>> content, like:
>>
>>    - Original WCAG score => pass/fail = 67/100
>>    - How often do pass/fails occur => not often / often / very often = 90/100
>>    - What is the severity of the fails => not that bad / bad / blocking = 70/100
>>    - How easy is it to finish a task => easy / average / hard = 65/100
>>    - What is the quality of the translations / alternative text, etc. = 72/100
>>    - How understandable is the content => easy / average / hard = 55/100
>>
>> Total = 69/100
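Jake's example total can be reproduced with an unweighted average; the equal weighting and the integer (floor) division are assumptions made to match his figure of 69:

```python
# Jake's example categories, each scored out of 100.
scores = {
    "WCAG pass/fail": 67,
    "frequency of fails": 90,
    "severity of fails": 70,
    "ease of finishing a task": 65,
    "quality of alternatives/translations": 72,
    "understandability of content": 55,
}
total = sum(scores.values()) // len(scores)  # 419 // 6 = 69
```

Whether some categories should weigh more than others (severity more than frequency, say) is exactly the kind of feasibility question Jake raises below.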
>>
>>
>>
>> And then there is also the feasibility of this kind of measuring to
>> think about.
>>
>> Questions like: will it take six times as long to test as an audit does
>> now? Will only a few people in the world be able to judge all categories
>> sufficiently?
>>
>>
>>
>> Cheers,
>>
>> Jake
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> ------------------------------
>>
>> *From:* John Foliot <john.foliot@deque.com>
>> *Sent:* Saturday, June 22, 2019 12:36 AM
>> *To:* Hall, Charles (DET-MRM)
>> *Cc:* Alastair Campbell; Silver Task Force; Andrew Kirkpatrick
>> *Subject:* Re: Conformance and method 'levels'
>>
>>
>>
>> Hi Charles,
>>
>>
>>
>> I for one am under the same understanding, and I see it as far more
>> granular than just Bronze, Silver, or Gold plateaus: rather, through
>> the accumulation of points (by doing good things) you can advance
>> from Bronze to Silver to Gold - not for individual pages, but rather **for
>> your site**. (I've come to conceptualize it as similar to your FICO
>> score, which numerically improves or degrades over time, yet always
>> remains inside a "range" from Bad to Excellent: increasing your score
>> from 638 to 687 is commendable and a good stretch, yet you are still
>> only - and remain - in the "Fair" range, so stretch harder still.)
>>
>>
>>
>> [image: image.png]
>>
>> [alt: a semi-circle graph showing the 4 levels of FICO scoring: Bad,
>> Fair, Good, and Excellent, along with the range of score values associated
>> to each section. Bad is a range of 300 points to 629 points, Fair ranges
>> from 630 to 689 points, Good ranges from 690 to 719 points, and excellent
>> ranges from 720 to 850 points.]
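The FICO analogy maps a numeric score to a band; here is a sketch using the ranges transcribed in the alt text above (treating each boundary as an inclusive upper bound is my reading of the graphic):

```python
# Band boundaries from the FICO graphic's alt text (300-850 scale).
BANDS = [(629, "Bad"), (689, "Fair"), (719, "Good"), (850, "Excellent")]

def band(score: int) -> str:
    """Return the label of the first band whose upper bound covers the score."""
    for upper, label in BANDS:
        if score <= upper:
            return label
    raise ValueError("score outside the 300-850 scale")
```

As in JF's example, `band(638)` and `band(687)` both return "Fair": the numeric score improved, but the plateau did not.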
>>
>>
>>
>> I've also arrived at the notion that your score is never going to be a
>> "one-and-done" numeric value, but that your score will change based on the
>> most current data available* (in part because we all know that web sites
>> are living, breathing, organic things, with content changes being
>> pushed on a regular - in some cases daily or hourly - basis).
>>
>>
>>
>> This then also leads me to conclude that your "Accessibility Score" will
>> be a fluctuating points total, with those points being impacted not only
>> by specific "techniques" but equally (if not more importantly) by
>> functional outcomes. And so the model of:
>>
>>
>>
>>    - *Bronze: EITHER provide AD or transcript*
>>    - *Silver: provide AD and transcript*
>>    - *Gold: Provide live transcript or live AD.*
>>
>>
>>
>> ...feels rather simplistic to me. Much of our documentation *speaks of
>> scores* (which I perceive to be numeric in nature), while what Alastair
>> is proposing is simply Good, Better, Best - with no actual "score" involved.
>>
>>
>>
>> Additionally, nowhere in Alastair's metric is there a measurement for
>> the "quality" of the caption, transcript, or audio description (should
>> there be? I believe yes), nor, for that matter (in this particular
>> instance), a recognition of the two very different approaches to providing
>> 'support assets' for the video: in-band or out-of-band (where in-band
>> means the assets are bundled inside the MP4 wrapper, versus out-of-band,
>> where captions and Audio Descriptions are declared via the <track>
>> element). From a "functional" perspective, providing the assets in-band,
>> while slightly harder to do production-wise, is a more robust technique
>> (for lots of reasons), so... do we reward authors with a "better" score if
>> they use the in-band method? And if yes, how many more "points" do they
>> get (and why that number)? If no, why not? For transcripts, does providing
>> the transcript as structured HTML earn you more points than providing the
>> transcript as a .txt file? A PDF? (WCAG 2.x doesn't seem to care about
>> that.) Should it?
>>
>>
>>
>> (* This is already a very long email, so I will just state that I have
>> some additional ideas about stale-dating data as well, as I suspect a
>> cognitive walk-through result from 4 years ago likely has little-to-no
>> value today...)
>>
>>
>>
>> ******************
>>
>> In fact, if we're handing out points, how many points **do** you get for
>> meeting the minimal functional requirement for "Accessible Media" (aka
>> "Bronze"), and what do I need to do to increase my score to Silver (not on
>> a single asset, but across the "range" of content - a.k.a. pages - scoped
>> by your conformance claim) versus Gold?
>>
>>
>>
>> Do you get the same number of points for ensuring that the language of
>> the page has been declared (which to my mind is the easiest SC to meet)?
>> Does providing the language of the document have the same impact on users
>> as ensuring that Audio Descriptions are present and accurate? If (like me)
>> you believe one to be far more important than the other, how many points
>> does either requirement start with (as a representation of "perfect" for
>> that requirement)? For that matter, do we count up or down in our scoring
>> (counting up = minimal score that improves, counting down = maximum score
>> that degrades)?
>>
>> (ProTip: I'd also revisit the MAUR
>> <https://www.w3.org/TR/media-accessibility-reqs/>
>> for ideas on how to improve your score for Accessible Media, which is more
>> than just captions and audio description.)
>>
>>
>>
>> Then, of course, is the conundrum of "page scoring" versus "site
>> scoring", where a video asset is (likely) displayed on a "page", and
>> perhaps there are multiple videos on multiple pages, with accessibility
>> support ranging from "Pretty good" on one example, to "OMG that is
>> horrible" on another example... how do we score that on a site-level
>> basis? If I have 5 videos on my site, and one has no captions, transcripts
>> or Audio Descriptions (AD), two have captions and no AD or transcripts, one
>> has captions and a transcript but no AD, and one has all the required bits
>> (caption, AD, transcript)... what's my score? Am I Gold, Bronze, or Silver?
>> Why?
>>
>>
>>
>> And if I clean up three of those five videos above, but leave the other
>> two as-is, do I see an increase in my score? If yes, by how much? Why? Do
>> I get more points for cleaning up the video that lacks AD *and* transcript
>> versus fewer points for cleaning up the video that just needs audio
>> descriptions? Does adding audio descriptions accrue more points than
>> just adding a transcript? Can points, as numeric values, also include
>> decimal fractions (e.g. 16.25 'points' out of a maximum of 25)? Is this
>> the path we are on?
>>
>>
>>
>> *Scoring is everything* if we are moving to a Good, Better, Best model
>> for all of our web accessibility conformance reporting. Saying you are at
>> "Silver" without being able to explain exactly how you got there will be
>> a major hurdle.
>>
>>
>>
>> It is for these reasons that I have volunteered to help work on the
>> conformance model: I am of the opinion that all the other migration work
>> will eventually run into this scoring issue as a major blocker. No matter
>> which existing SC I consider, I soon arrive at variants of the questions
>> above (and more), all related to scalability, techniques, impact on
>> different user groups, and our move from page-level to site-level
>> conformance reporting, with a sliding scale of "points" that we've yet to
>> tackle - points that will come to represent Bronze, Silver, and Gold.
>>
>>
>>
>> JF
>>
>>
>>
>>
>>
>> On Fri, Jun 21, 2019 at 12:53 PM Hall, Charles (DET-MRM) <
>> Charles.Hall@mrm-mccann.com> wrote:
>>
>> I understand the logical parallel.
>>
>>
>>
>> However, my understanding (perhaps influenced by my own intent) is that
>> the point system is not directly proportional to the number of features
>> (supported by methods) added, or to the difficulty associated with adding
>> them, but is instead based on meeting functional needs. In this example,
>> transcription, captioning, and audio description (recorded) may all be
>> implemented yet still only earn sufficient points for Silver, while
>> addressing the content itself to be more understandable by people with
>> cognitive issues or intersectional needs would be required for sufficient
>> points to earn Gold. The difference being people, and not methods.
>>
>>
>>
>> Am I alone in this view?
>>
>>
>>
>>
>>
>> *Charles Hall* // Senior UX Architect
>>
>>
>>
>>
>>
>>
>>
>> *From: *Alastair Campbell <acampbell@nomensa.com>
>> *Date: *Friday, June 21, 2019 at 12:01 PM
>> *To: *Silver Task Force <public-silver@w3.org>
>> *Subject: *[EXTERNAL] Conformance and method 'levels'
>> *Resent-From: *Silver Task Force <public-silver@w3.org>
>> *Resent-Date: *Friday, June 21, 2019 at 12:01 PM
>>
>>
>>
>> Hi everyone,
>>
>>
>>
>> I think this is a useful thread to be aware of when thinking about
>> conformance and how different methods might be set at different levels:
>>
>> https://github.com/w3c/wcag/issues/782
>>
>>
>>
>> It is about multimedia access, so the 1.2.x section in WCAG 2.x. You
>> might think it is fairly straightforward, since the solutions are
>> cut & dried (captions, transcripts, AD, etc.).
>>
>>
>>
>> However, the tricky bit is at what level you require different solutions.
>>
>>
>>
>> If you had a guideline such as “A user does not need to see in order to
>> understand visual multimedia content”, then Patrick’s levelling in one
>> of the comments
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_w3c_wcag_issues_782-23issuecomment-2D504038948&d=DwMGaQ&c=Ftw_YSVcGmqQBvrGwAZugGylNRkk-uER0-5bY94tjsc&r=FbsK8fvOGBHiAasJukQr6i2dv-WpJzmR-w48cl75l3c&m=qRlBlL2XbaOAr9ZQ1gk036BFzRHfv3et7ZuRCfnYttk&s=eQu0fdZeTflKCDpdR_3mguGA09aq52UmWnQTBdPRhjE&e=>
>> makes sense:
>>
>>    - Bronze: EITHER provide AD or transcript
>>    - Silver: provide AD and transcript
>>    - Gold: Provide live transcript or live AD.
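Patrick's levelling reads as a simple classifier; this is a sketch of one literal reading, and whether Gold also presumes the Silver assets, or how a partial mix scores, are open questions in the thread, not answered here:

```python
def media_level(ad: bool, transcript: bool, live: bool = False) -> str:
    """Classify a video's media-alternative level per the bullets above."""
    if live:
        return "Gold"    # live transcript or live AD provided
    if ad and transcript:
        return "Silver"  # both recorded assets provided
    if ad or transcript:
        return "Bronze"  # either recorded asset provided
    return "Not conforming"
```

Even this tiny sketch surfaces the scoring questions JF raises later: what does a site with five videos at different levels score overall?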
>>
>>
>>
>> I raise this because, if you read the thread, you’ll see how the levels
>> impacted the drafting of the guidelines, and I think we’ll have a similar
>> (or more complex?) dynamic for the scoring in Silver and for how methods
>> are drafted.
>>
>>
>>
>> Kind regards,
>>
>>
>>
>> -Alastair
>>
>>
>>
>> --
>>
>>
>>
>> www.nomensa.com
>> / @alastc
>>
>>
>>
>> This message contains information which may be confidential and
>> privileged. Unless you are the intended recipient (or authorized to receive
>> this message for the intended recipient), you may not use, copy,
>> disseminate or disclose to anyone the message or any information contained
>> in the message. If you have received the message in error, please advise
>> the sender by reply e-mail, and delete the message. Thank you very much.
>>
>>
>>
>>
>> --
>>
>> *John Foliot* | Principal Accessibility Strategist | W3C AC
>> Representative
>> Deque Systems - Accessibility for Good
>> deque.com
>>
>>
>>
>> -----------------------------------------------------------------
>>
>> ATTENTION:
>>
>> The information in this e-mail is confidential and only meant for the intended recipient. If you are not the intended recipient, don't use or disclose it in any way. Please let the sender know and delete the message immediately.
>>
>> -----------------------------------------------------------------
>>
>>
>
> --
> *​John Foliot* | Principal Accessibility Strategist | W3C AC
> Representative
> Deque Systems - Accessibility for Good
> deque.com
>
>
> --
> Detlev Fischer
> Testkreis
> Werderstr. 34, 20144 Hamburg
>
> Mobil +49 (0)157 57 57 57 45
> http://www.testkreis.de
> Beratung, Tests und Schulungen für barrierefreie Websites
>
>

-- 
*​John Foliot* | Principal Accessibility Strategist | W3C AC Representative
Deque Systems - Accessibility for Good
deque.com

image001.jpg
(image/jpeg attachment: image001.jpg)

Received on Monday, 24 June 2019 16:50:39 UTC

This archive was generated by hypermail 2.4.0 : Thursday, 24 March 2022 20:31:45 UTC