Re: Conformance and method 'levels' from Detlev Fischer on 2019-06-24 (public-silver@w3.org from June 2019)

From: Detlev Fischer <detlev.fischer@testkreis.de>
Date: Mon, 24 Jun 2019 15:26:16 +0200
To: public-silver@w3.org
Message-ID: <673d7005-0608-74b9-0706-7ea0df265794@testkreis.de>
Hi John
I would take issue with level 1 (Automated tests) being valid on its
own for adding any 'points' to a score. Except perhaps for 3.1.1
Language of Page and 4.1.1 Parsing, nearly *all* SCs that can be partly
tested automatically require an additional human check or verification
of the automatic results (weeding out the false negatives as well as the
false positives) to arrive at a valid conformance assessment. An
automatic check can show non-conformance for pages where there are clear
issues (field without accName, no img alt etc.) but misses instances
where non-conformance is caused by the other half (non-descriptive alt,
label, or captioning; wrong label referenced, wrong state communicated
etc.). Automatic tools are great heuristic helpers but cannot be relied
upon to determine conformance (except for a few SCs) and will have blind
spots when used to determine non-conformance.
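
A minimal Python sketch of this blind spot, assuming a deliberately naive check: the `ImgAltChecker` class and the sample markup are hypothetical and not taken from any real testing tool. The check flags an image with no alt attribute, but an image whose alt is just a file name passes and still needs a human to judge it.

```python
from html.parser import HTMLParser


class ImgAltChecker(HTMLParser):
    """Hypothetical naive check: sorts <img> elements by presence of alt."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []  # src of images with no alt attribute at all
        self.has_alt = []      # (src, alt) pairs that pass the automated check

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src", "")
        if "alt" not in attr_map:
            self.missing_alt.append(src)
        else:
            self.has_alt.append((src, attr_map["alt"]))


# Illustrative markup only: one clear failure, one non-descriptive alt,
# and one plausible alt.
SAMPLE = """
<img src="chart.png">
<img src="team.jpg" alt="IMG_0042.jpg">
<img src="logo.png" alt="Company logo">
"""

checker = ImgAltChecker()
checker.feed(SAMPLE)
print("Flagged automatically:", checker.missing_alt)
print("Passed, but needs human review:", checker.has_alt)
```

Only `chart.png` is flagged here. Whether "IMG_0042.jpg" or "Company logo" is an adequate text alternative cannot be decided by this kind of check.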
Detlev

On 24.06.2019 at 01:53, John Foliot wrote:
> Hi Charles,
>
> In a spread-sheet that I've circulated to a limited group, I've 
> expanded on some of those thoughts.
>
> As you note, testing (and subsequently "fixing") some issues is 
> easier than others. I've broken that 'development/verification' 
> process into three 'buckets':
>
>  1. Automated tests: these are as simple as a "click of the button",
>     and it would give you a conformance report on those things that
>     can be automatically tested for (yes, Deque has a rules engine,
>     but most of the major testing platforms are participating in the
>     ACT TF, where they are 'standardizing', if not all of the test,
>     certainly the format by which those tests are written and expressed).
>     Given that these requirements are "easy" to identify and fix, they
>     should accrue a smaller number of "points"
>
>  2. Human *Verification*: these tests require more human
>     intervention and cognition (e.g. verify that the alt text
>     makes sense in context, ensure that caption & audio description
>     files are accurate, verify if a table is being used for layout or
>     for tabular data, etc.). In these cases, the amount of effort and
>     time (aka "resources") is greater (measured on a scale?), and so
>     these tests likely occur less frequently.
>     Because these requirements are 'harder' to address (and so are
>     likely addressed less frequently), they accrue more "points" than
>     category 1 above.
>
>  3. Finally, the 'hard' tests - the tests and requirements that
>     require human cognition and testing (i.e. cognitive
>     walk-throughs, etc.). These types of verifications are complex,
>     and 'expensive' to perform, but deliver great value and really
>     drive the site to best accessibility.
>     Because these types of tests are the hardest to perform, they also
>     accrue the most "points".
>
> (As some real-world feedback, at Deque, we have both tools and a 
> process for the first two categories of testing today: our axe-core 
> rules engine, which is also the heart of our free browser extensions, 
> and another tool that also provides a 'guided walk-through' for 
> verifying the rest of the SC that cannot be fully mechanically tested)
>
>
>     > as JF points out, means the point score is only meaningful
>     > until the next day the site is published.
>
>
> "Scoring" a web site the way I am envisioning would then require some 
> kind of centralized data-base, and (yes) some additional tooling to 
> process that data. The 'scoring tool' totals the scores from each of 
> the three categories above, to arrive at the final score. As such, I 
> envision the score to be dynamic in nature.
>
> Charles is right, time takes a toll on the score. The automated tests 
> can be as frequent as "now", the secondary set of tests can happen 
> with some frequency (weekly?), however the 'hard' tests will happen 
> infrequently. To address this concern, I've also thought about perhaps 
> "stale-dating" test results: the older they become, the less 
> valuable they are to your total score.
>
> As a straw-man example, consider the third (hard) category of tests. 
> In that scenario, I could envision something like losing 10% of the 
> score value every 90 days (3 months). If you do a full-court press on 
> the "hard" tests to get a high score on site launch (for example), but 
> then never do those tests again, then over time, the points deteriorate 
> (depreciate). So if you score 300 points in this category on January 
> 1st, by April 1st they are only worth 270 (or, perhaps, if you 
> scored 300 out of a possible 400, then you'd lose 10% of the 400 
> score, thus 260). By the end of 6 months they've depreciated by 20% 
> (so either a score of 240 [270 - another 30 = 240], or 220 [260 - 
> another 40 = 220]). Even if the site remains 'static', with zero 
> changes in those time-frames, you still lose the points, because while 
> your site may not have evolved, the web (and techniques) do.
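
The straw-man arithmetic in the paragraph above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `depreciated_score` and `periods_since` are hypothetical, the loss is taken as a fixed amount per full 90-day period (as in "270 - another 30 = 240") rather than compounding, and the year 2019 in the usage lines is assumed for the sake of the example.

```python
from datetime import date


def periods_since(tested_on, today, period_days=90):
    """Number of full depreciation periods elapsed since the test date."""
    return (today - tested_on).days // period_days


def depreciated_score(earned, periods_elapsed, basis=None, rate_percent=10):
    """Score left after a number of full periods.

    Each period removes rate_percent of `basis` as a fixed amount.
    `basis` defaults to the earned score (first variant in the email);
    pass the maximum possible score for the second variant.
    """
    if basis is None:
        basis = earned
    loss_per_period = basis * rate_percent / 100
    return max(0, earned - loss_per_period * periods_elapsed)


# Assumed dates: tested on 1 January 2019, checked on 1 April and 1 July.
q1 = periods_since(date(2019, 1, 1), date(2019, 4, 1))
q2 = periods_since(date(2019, 1, 1), date(2019, 7, 1))

print(depreciated_score(300, q1), depreciated_score(300, q2))
print(depreciated_score(300, q1, basis=400), depreciated_score(300, q2, basis=400))
```

The two variants diverge quickly, so which basis the loss is taken from (earned points or possible points) would need to be settled before any modelling.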
>
> We can model this out in different ways to see what works best, but 
> the fundamental idea is that over time, you lose points because your 
> test results are getting "stale", so to keep up your score, you have 
> to continually do the "harder" testing too.
>
> Thoughts?
>
> JF
>
>
>
> On Sun, Jun 23, 2019 at 4:23 PM Hall, Charles (DET-MRM) 
> <Charles.Hall@mrm-mccann.com <mailto:Charles.Hall@mrm-mccann.com>> wrote:
>
>     I like the idea of 2 currencies.
>
>     The ephemeral nature of the web as JF  points out, means the point
>     score is only meaningful until the next day the site is published.
>
>     The core problems that the proposed model is trying to solve are
>     all the “it depends” and “it’s partially conformant” cases, while
>     framing the model in a way that encourages better practices
>     versus always aiming for the minimum.
>
>     With 2 currencies, we could have something of a “raw score” to be
>     regularly evaluated, and an “achievements score” which reflects if
>     those better practices occurred.
>
>     I was originally going to simply reply to John’s message that I
>     thought the frequency of evaluation issue could be mitigated by
>     these practices, as they tend to be less ephemeral than the
>     resulting sites. Example: having 10% or more of your design and
>     development teams made up of people with disabilities; or
>     regularly including participation of people with disabilities in
>     your usability testing; or writing tests that specifically
>     consider intersectional needs are each the sort of encouraged
>     practices that are less likely to change frequently. However, this
>     dismisses the millions of small business sites out there that have
>     no design or development teams or usability testing or awareness
>     of intersectional human functional needs. It biases the model
>     toward large organizations.
>
>     In the end, there should be a threshold score for the minimum and
>     a way to measure anything beyond it *based on human impact* and
>     not based on the resources it took the author / organization to
>     make that impact. I can use a free theme on a free platform and
>     get free hosting and sell widgets. If I make the buttons light
>     grey on white, I lose points. If I write descriptions at a fifth
>     grade reading level, I gain points. If I can’t afford to run
>     usability testing and compensate participants, I should still be
>     able to achieve more than a minimum score.
>
>     *Charles Hall* // Senior UX Architect
>
>     charles.hall@mrm-mccann.com
>     <mailto:charles.hall@mrm-mccann.com?subject=Note%20From%20Signature>
>
>     w 248.203.8723
>
>     m 248.225.8179
>
>     360 W Maple Ave, Birmingham MI 48009
>
>     mrm-mccann.com <https://www.mrm-mccann.com/>
>
>     MRM//McCann
>
>     Relationship Is Our Middle Name
>
>     Ad Age Agency A-List 2016, 2017, 2019
>
>     Ad Age Creativity Innovators 2016, 2017
>
>     Ad Age B-to-B Agency of the Year 2018
>
>     North American Agency of the Year, Cannes 2016
>
>     Leader in Gartner Magic Quadrant 2017, 2018, 2019
>
>     Most Creatively Effective Agency Network in the World, Effie 2018,
>     2019
>
>     *From: *"Abma, J.D. (Jake)" <Jake.Abma@ing.com
>     <mailto:Jake.Abma@ing.com>>
>     *Date: *Saturday, June 22, 2019 at 10:33 AM
>     *To: *John Foliot <john.foliot@deque.com
>     <mailto:john.foliot@deque.com>>, "Hall, Charles (DET-MRM)"
>     <Charles.Hall@mrm-mccann.com <mailto:Charles.Hall@mrm-mccann.com>>
>     *Cc: *Alastair Campbell <acampbell@nomensa.com
>     <mailto:acampbell@nomensa.com>>, Silver Task Force
>     <public-silver@w3.org <mailto:public-silver@w3.org>>, Andrew
>     Kirkpatrick <akirkpat@adobe.com <mailto:akirkpat@adobe.com>>
>     *Subject: *[EXTERNAL] Re: Conformance and method 'levels'
>
>     Just some thoughts:
>
>     I do like all of the ideas from all of you but are they really
>     feasible?
>
>     By feasible I mean in terms of time to test, money spent, the
>     difficulty of compiling a score and the expertise to judge all of
>     this.
>
>     I would love to see a simple framework with clear categories for
>     valuing content, like:
>
>       * Original WCAG score => pass/fail = 67/100
>       * How often do pass/fails occur => not often / often / very
>         often = 90/100
>       * What is the severity of the fails => not that bad / bad /
>         blocking = 70/100
>       * How easy it is to finish a task => easy / average / hard =
>         65/100
>       * What is the quality of the translations / alternative text,
>         etc. = 72/100
>       * How understandable is the content => easy / average / hard =
>         55/100
>
>     Total = 69/100
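
How the total is derived is not stated in the list above. As a hedged check, the sketch below assumes an unweighted mean of the six category values, reading the severity figure as 70 out of 100; the dictionary keys are shorthand labels invented for this example.

```python
# Category values taken from the list above (severity read as 70/100).
category_scores = {
    "wcag_pass_fail": 67,
    "frequency_of_fails": 90,
    "severity_of_fails": 70,
    "task_completion": 65,
    "quality_of_alternatives": 72,
    "understandability": 55,
}

total = sum(category_scores.values())
mean = total / len(category_scores)

print(f"sum = {total}, mean = {mean:.2f}")
print("truncated:", int(mean), "rounded:", round(mean))
```

Under that assumption the quoted 69/100 matches the mean truncated to a whole number, while conventional rounding would give 70. A weighted scheme would of course give a different total.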
>
>     And then also thinking about feasibility of this kind of measuring.
>
>     Questions like: will it take 6 times as long to test as an audit
>     now? Will only a few people in the world be able to judge all
>     categories sufficiently?
>
>     Cheers,
>
>     Jake
>
>     
>
>     ------------------------------------------------------------------------
>
>     *From:*John Foliot <john.foliot@deque.com
>     <mailto:john.foliot@deque.com>>
>     *Sent:* Saturday, June 22, 2019 12:36 AM
>     *To:* Hall, Charles (DET-MRM)
>     *Cc:* Alastair Campbell; Silver Task Force; Andrew Kirkpatrick
>     *Subject:* Re: Conformance and method 'levels'
>
>     Hi Charles,
>
>     I for one am under the same understanding, and I see it as far
>     more granular than just Bronze, Silver or Gold plateaus, but that
>     rather, through the accumulation of points (by doing good things)
>     you can advance from Bronze, to Silver to Gold - not for
>     individual pages, but rather **for your site**.  (I've come to
>     conceptualize it as similar to your FICO score, which numerically
>     improves or degrades over time, yet your score is still always
>     inside of a "range" from Bad to Excellent: increasing your score
>     from 638 to 687 is commendable and a good stretch, yet you are
>     still only - and remain - in the "Fair" range, so stretch harder
>     still).
>
>     [alt: a semi-circle graph showing the 4 levels of FICO scoring:
>     Bad, Fair, Good, and Excellent, along with the range of score
>     values associated to each section. Bad is a range of 300 points to
>     629 points, Fair ranges from 630 to 689 points, Good ranges from
>     690 to 719 points, and excellent ranges from 720 to 850 points.]
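
The range idea can be sketched as a simple lookup, using only the four ranges given in the image description above. The names `BANDS` and `band_for` are hypothetical; how an accessibility score would actually be banded is not defined anywhere in this thread.

```python
# Ranges as listed in the image description: (low, high, label), inclusive.
BANDS = [
    (300, 629, "Bad"),
    (630, 689, "Fair"),
    (690, 719, "Good"),
    (720, 850, "Excellent"),
]


def band_for(score):
    """Return the label of the range that contains the score."""
    for low, high, label in BANDS:
        if low <= score <= high:
            return label
    raise ValueError(f"score {score} is outside the 300-850 scale")


# The example from the message: a 49-point improvement within one band.
print(band_for(638), band_for(687))
```

Both 638 and 687 map to the same label, which is the point of the analogy: the number moves continuously while the named level changes only at the range boundaries.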
>
>     I've also arrived at the notion that your score is never going to
>     be a "one-and-done" numeric value, but that your score will change
>     based on the most current data available* (in part because we all
>     know that web sites [sic] are living, breathing, organic things,
>     with content changes being pushed on a regular - in some cases
>     daily or hourly - basis).
>
>     This then also leads me to conclude that your "Accessibility
>     Score" will be a floating-point total, with those points being
>     impacted not only by specific "techniques", but equally (if not
>     more importantly) by functional outcomes. And so the model of:
>
>       * /Bronze: EITHER provide AD or transcript/
>       * /Silver: provide AD and transcript/
>       * /Gold: Provide live transcript or live AD./
>
>     ...feels rather simplistic to me. Much of our documentation
>     *speaks of scores* (which I perceive to be numeric in nature),
>     while what Alastair is proposing is simply Good, Better, Best -
>     with no actual "score" involved.
>
>     Additionally, nowhere in Alastair's metric is there a measurement
>     for "quality" of the caption, transcript or audio description
>     (should there be? I believe yes), nor for that matter (in this
>     particular instance) a recognition of the two very varied
>     approaches to providing 'support assets' to the video: in-band or
>     out-of-band (where in-band = the assets are bundled inside of the
>     MP4 wrapper, versus out-of-band, where captions and Audio
>     Descriptions are declared via the <track> element.) From a
>     "functional" perspective, providing the assets in-band, while
>     slightly harder to do production-wise, is a more robust technique
>     (for lots of reasons), so... do we reward authors with a "better"
>     score if they use the in-band method? And if yes, how many more
>     "points" do they get (and why that number?) If no, why not? For
>     transcripts, does providing the transcript as structured HTML earn
>     you more points over providing the transcript as a .txt file?  A
>     PDF? (WCAG 2.x doesn't seem to care about that) Should it?
>
>     (* This is already a very long email, so I will just state that I
>     have some additional ideas about stale-dating data as well, as I
>     suspect a cognitive walk-through result from 4 years ago likely
>     has little-to-no value today...)
>
>     ******************
>
>     In fact, if we're handing out points, how many points **do** you
>     get for the minimal functional requirement for "Accessible Media"
>     (aka "Bronze"), and what do I need to do to increase my score to
>     Silver (not on a single asset, but across the "range" of content -
>     a.k.a. pages - scoped by your conformance claim) versus Gold?
>
>     Do you get the same number of points for ensuring that the
>     language of the page has been declared (which to my mind is the
>     easiest SC to meet) - does providing the language of the document
>     have the same impact on users as ensuring that Audio Descriptions
>     are present and accurate? If (like me) you believe one to be far
>     more important than the other, how many points do either
>     requirement start with (as a representation of "perfect" for that
>     requirement)? For that matter, do we count up or down in our
>     scoring (counting up = minimal score that improves, counting down
>     = maximum score that degrades)?
>
>     (ProTip: I'd also revisit the MAUR
>     <https://www.w3.org/TR/media-accessibility-reqs/>
>     for ideas on how to improve your score for Accessible Media, which
>     is more than just captions and audio description).
>
>     Then, of course, is the conundrum of "page scoring" versus "site
>     scoring", where a video asset is (likely) displayed on a "page",
>     and perhaps there are multiple videos on multiple pages, with
>     accessibility support ranging from "Pretty good" on one example,
>     to "OMG that is horrible" on another example... how do we score
>     that on a site-level basis? If I have 5 videos on my site, and one
>     has no captions, transcripts or Audio Descriptions (AD), two have
>     captions and no AD or transcripts, one has captions and a
>     transcript but no AD, and one has all the required bits (caption,
>     AD, transcript)... what's my score? Am I Gold, Bronze, or Silver?
>     Why?
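
To make the five-video question concrete, here is a small sketch that applies the Bronze/Silver levelling quoted earlier in this message to each video. Everything beyond that levelling is an assumption: the `media_level` function and the `"None"` label are hypothetical, captions are ignored because the quoted levels mention only AD and transcripts, and Gold is left out because no video in the example has live AD or a live transcript.

```python
from collections import Counter


def media_level(has_ad, has_transcript):
    """Per-video level under the quoted Bronze/Silver rules (Gold omitted)."""
    if has_ad and has_transcript:
        return "Silver"
    if has_ad or has_transcript:
        return "Bronze"
    return "None"


# The five videos from the paragraph above: (captions, AD, transcript).
videos = [
    (False, False, False),  # nothing provided
    (True, False, False),   # captions only
    (True, False, False),   # captions only
    (True, False, True),    # captions and transcript
    (True, True, True),     # captions, AD and transcript
]

levels = [media_level(ad, transcript) for _captions, ad, transcript in videos]
print(levels)
print(Counter(levels))
```

Each video gets a level on its own, but nothing in the quoted rules says how to combine three unlevelled videos, one Bronze and one Silver into a single site-level result, which is exactly the open question.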
>
>     And if I clean up 3 of those five videos above, but leave the
>     other two as-is, do I see an increase in my score? If yes, by how
>     much? Why? Do I get more points for cleaning up the video that
>     lacks AD _and_ transcript versus not as many points for cleaning
>     up the video that just needs audio descriptions? Does adding
>     audio descriptions accrue more points than just adding a
>     transcript? Can points, as numeric values, also include decimal
>     points? (i.e. 16.25 'points' out of a maximum number available of
>     25)? Is this the path we are on?
>
>     *Scoring is everything* if we are moving to a Good, Better, Best
>     model for all of our web accessibility conformance reporting.
>     Saying you are at "Silver", without knowing explicitly how you got
>     there will be a major hurdle that we'll need to be able to explain.
>
>     It is for these reasons that I have volunteered to help work on
>     the conformance model, as I am of the opinion that all the other
>     migration work will eventually run into this scoring issue as a
>     major blocker: no matter which existing SC I consider, I soon
>     arrive at variants of the questions above (and more), all related
>     to scalability, techniques, impact on different user-groups, and
>     our move from page conformance reporting to site conformance
>     reporting, and a sliding scale of "points" that we've yet to
>     tackle - points that will come to represent Bronze, Silver and Gold.
>
>     JF
>
>     On Fri, Jun 21, 2019 at 12:53 PM Hall, Charles (DET-MRM)
>     <Charles.Hall@mrm-mccann.com <mailto:Charles.Hall@mrm-mccann.com>>
>     wrote:
>
>         I understand the logical parallel.
>
>         However, my understanding (perhaps influenced by my own
>         intent) of the point system is not directly proportional to
>         the number of features (supported by methods) added or by the
>         difficulty associated with adding them, but instead based on
>         meeting functional needs. In this example, transcription,
>         captioning and audio description (recorded) may all be
>         implemented but still only have sufficient points to earn
>         silver. While addressing the content itself to be more
>         understandable by people with cognitive issues or
>         intersectional needs would be required for sufficient points
>         to earn gold. The difference being people and not methods.
>
>         Am I alone in this view?
>
>         *Charles Hall* // Senior UX Architect
>
>         *From: *Alastair Campbell <acampbell@nomensa.com
>         <mailto:acampbell@nomensa.com>>
>         *Date: *Friday, June 21, 2019 at 12:01 PM
>         *To: *Silver Task Force <public-silver@w3.org
>         <mailto:public-silver@w3.org>>
>         *Subject: *[EXTERNAL] Conformance and method 'levels'
>         *Resent-From: *Silver Task Force <public-silver@w3.org
>         <mailto:public-silver@w3.org>>
>         *Resent-Date: *Friday, June 21, 2019 at 12:01 PM
>
>         Hi everyone,
>
>         I think this is a useful thread to be aware of when thinking
>         about conformance and how different methods might be set at
>         different levels:
>
>         https://github.com/w3c/wcag/issues/782
>
>
>         It is about multimedia access, so the 1.2.x section in WCAG
>         2.x. You might think that it is fairly straightforward as the
>         solutions are fairly cut & dried (captions, transcripts, AD etc.)
>
>         However, the tricky bit is at what level you require different
>         solutions.
>
>         If you had a guideline such as “A user does not need to see in
>         order to understand visual multimedia content”, then Patrick’s
>         levelling in one of the comments
>         <https://github.com/w3c/wcag/issues/782#issuecomment-504038948>
>         makes sense:
>
>           * Bronze: EITHER provide AD or transcript
>           * Silver: provide AD and transcript
>           * Gold: Provide live transcript or live AD.
>
>         I raise this as if you read the thread, you’ll see how the
>         levels impacted the drafting of the guidelines, and I think
>         we’ll have a similar (or more complex?) dynamic for the
>         scoring in Silver, and how methods are drafted.
>
>         Kind regards,
>
>         -Alastair
>
>         -- 
>
>         www.nomensa.com
>         <http://www.nomensa.com/>
>         / @alastc
>
>
>     -- 
>
>     *John Foliot* | Principal Accessibility Strategist | W3C AC
>     Representative
>     Deque Systems - Accessibility for Good
>     deque.com
>     <http://deque.com/>
>
>
> -- 
> *John Foliot* | Principal Accessibility Strategist | W3C AC 
> Representative
> Deque Systems - Accessibility for Good
> deque.com <http://deque.com/>
>

-- 
Detlev Fischer
Testkreis
Werderstr. 34, 20144 Hamburg

Mobil +49 (0)157 57 57 57 45

http://www.testkreis.de
Consulting, testing and training for accessible websites
Received on Monday, 24 June 2019 13:26:46 UTC