Re: Scoring and Dashboards from John Foliot on 2020-05-13 (public-silver@w3.org from May 2020)

From: John Foliot <john.foliot@deque.com>
Date: Wed, 13 May 2020 10:06:48 -0500
To: jake abma <jake.abma@gmail.com>
Cc: Rachael Montgomery <rachael@accessiblecommunity.org>, Silver TF <public-silver@w3.org>, WCAG <w3c-wai-gl@w3.org>
Message-ID: <CAKdCpxyTdhhSRoKaHYaRRpqv8=36NHBLDT1CNQg4M1Qte+rBQw@mail.gmail.com>

Hi Jake,

Thanks for this. I will attempt to reply to each bullet point:

- *All re-used / included test results need to be re-evaluated if older
than a year and only be used if they are still actual (like a car /
elevator check)*

Yes, this is essentially what I am advocating for. The "simple" mechanical
tests (automated) are often nothing more than a click of a button, so
getting 'accurate' test results that way is, today, relatively trivial. The
harder part is the more sophisticated/complex tests that
require user-testing. Like the examples you provided (car/elevator), there
will be times when, whether or not they still "work", certain techniques will
become less stable/useful/supported over time, and (I argue) should be less
"valuable" to your overall score.

In our Requirements Document we note:  "*A flexible structure enables
greater scaling of new methods to meet guidelines. It also allows for the
expiration of outdated methods to meet guidelines. A flexible process of
updating methods enables the overall guidance to keep pace with technology.*"
 (source:
https://w3c.github.io/silver/requirements/index.html#oppotunities_maintenance)
Thus, I argue that one way of "expiring" those outdated methods is to make
them less valuable over time, especially if they are not being retested
with the same rigor as the initial testing (which, without a motivating
factor, will likely be the case more often than not).
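
To make the idea concrete, here is a minimal sketch (in Python) of one way a
test result could lose value over time. Everything in it is an assumption for
illustration only: the function name depreciated_value, the straight-line
decay, and the 730-day default shelf life (loosely echoing the 2-year period
floated earlier in this thread). None of it is agreed by the WG or part of any
Silver draft.

```python
from datetime import date


def depreciated_value(score: float, tested_on: date, as_of: date,
                      shelf_life_days: int = 730) -> float:
    """Reduce a test result's value in a straight line, reaching zero
    once the result is older than its (hypothetical) shelf life."""
    # Results dated in the future are treated as brand new.
    age_days = max((as_of - tested_on).days, 0)
    # Fraction of the shelf life still remaining, clamped at zero.
    remaining = max(1.0 - age_days / shelf_life_days, 0.0)
    return score * remaining


if __name__ == "__main__":
    # A user test scored 100 on 1 Jan 2019, looked at on later dates.
    for when in (date(2019, 1, 1), date(2020, 1, 1), date(2022, 1, 1)):
        print(when, depreciated_value(100.0, date(2019, 1, 1), when))
```

A regulator or tool vendor could swap in a different curve or shelf life; the
only point of the sketch is that the same time-stamped result contributes less
to the score the longer it goes without being re-run.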


- *Constant small changes need to be judged in every new conformance claim
and if they are responsible for re-designs in time periods shorter than a
year, the re-evaluation needs to be shorter, like every quarter*

So, one of the things that became apparent to me is that the scope of
"testing" also varies depending on need and role. When we spoke about
'scoping' last week, the scenarios being presented were from the context of
the content creator at production time. That is certainly one kind of test
and report (claim).

However, there are other types of claims and needs, including a "legal
conformance claim" that different territories may mandate (and I think we
need to give that space enough room to find its own level-set). Some
countries may require annual reports, other countries may choose a 6-month
time-frame, and yet another country may peg it at 36 months. No matter
what, our scoring mechanism should be able to accommodate those milestone
dates using the same methodology.

The third scoring scenario, the one I am attempting to address/recognize
here, is the on-going monitoring of a web-property's overall "accessibility
health", which is what all of the Dashboard examples I previously provided
are attempting to address. I will further assert that this is a really
beneficial way of advancing our cause, as issues are noted when they crop
up, and not simply "discovered" during the annual [sic] audit process.


- *Tests need to be based on / checked for the technology present / used
and available at the time of the conformance claim*

On the surface that makes sense. I will note that one of the other
advantages of a mechanism that supports a "running score" is that these
gaps will show up sooner, and will likely be remediated far quicker, simply
due to the enhanced visibility that the dashboard view provides. In the
visual examples I previously provided
<https://docs.google.com/document/d/1PgmVS0s8_klxvV2ImZS1GRXHwUgKkoXQ1_y6RBMIZQw/edit#heading=h.s2f3av3tk0j8>,
the screen capture of Level Access' AMP dashboard shows that the mythical
Acme.gov site is set for a monthly 'crawl', so presumably issues that
cropped up in "April 2020" will be found and reported in "May 2020".

*The net result is that issues that impact our users are found faster, and
remediated faster, which HAS to be a positive net benefit.*


- *After re-designs claims are not valid anymore and need to be done again
for a new conformance claim*

While I agree with that in principle, this is also where I will suggest
that a decision like that is squarely in the regulators' camp: our scoring
mechanism is based on test results run at the date of the "time-stamp" (as you
noted), and scoped to the content tested (which appears to have consensus
already, but TBC). One outcome of letting the content owners do their own
scoping is that, whether we as a WG want to recognize it or not, the
regulatory environment is going to want to scope it at the 'domain/site'
level (i.e. "w3.org", as opposed to "w3.org/WAI/roles/policy-makers/").
However, with a robust and flexible scoring mechanism, that should not matter
as much to us: from my perspective, testing frequently and remediating issues
on an ongoing basis is the better path forward no matter what.


- *Applications with lots of changes in a short amount of time, like blogs
etc., need special attention, like an exception or special approach to be
discussed.*

While *change frequency* will need to be addressed, I believe the issue is
more closely related to the two types of 'traditional' issues we see:
mechanical/code-based issues, and editorial/author-based issues.

Of the two, the trickier category is the author-based issues (which, not
coincidentally, is where many of the ongoing COGA issues are found today).
Interestingly enough, if we can get testing there 'right', on-going
Dashboard Monitor reports will help surface author issues at each "scan",
and organizations can certainly time their scans to meet their own internal
needs; whether they run their scan monthly, weekly, daily or hourly will be
their choice.

That's because they would be using their "running tally score" to make
things better, and NOT just to issue a report.
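
As a rough sketch of what such a "running tally" could look like, assuming
each test result carries a time-stamp and a shelf life after which it counts
for nothing: the names below (TestResult, current_value, running_score,
scan_dates), the straight-line decay and the simple average are all
hypothetical choices made for illustration, not anything the WG has agreed.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterator, List


@dataclass
class TestResult:
    name: str
    score: float          # 0.0 to 1.0, as measured on tested_on
    tested_on: date       # the time-stamp of the test run
    shelf_life_days: int  # hypothetical window before the result is stale


def current_value(result: TestResult, as_of: date) -> float:
    """Value of one result on a given day, decaying linearly to zero."""
    age_days = max((as_of - result.tested_on).days, 0)
    return result.score * max(1.0 - age_days / result.shelf_life_days, 0.0)


def running_score(results: List[TestResult], as_of: date) -> float:
    """Average of the depreciated values; 0.0 when there are no results."""
    if not results:
        return 0.0
    return sum(current_value(r, as_of) for r in results) / len(results)


def scan_dates(start: date, end: date, every_days: int) -> Iterator[date]:
    """Yield the dates of each scheduled scan, inclusive of both ends."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=every_days)


if __name__ == "__main__":
    results = [
        TestResult("automated checks", 1.0, date(2020, 5, 1), 30),
        TestResult("user testing", 1.0, date(2019, 5, 2), 730),
    ]
    # Weekly scans: the tally drifts down unless tests are re-run.
    for day in scan_dates(date(2020, 5, 1), date(2020, 5, 29), 7):
        print(day, round(running_score(results, day), 3))
```

The same functions work for any reporting date, so an annual, 6-month or
36-month legal report would simply be running_score evaluated on that
milestone date.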

JF

On Wed, May 13, 2020 at 5:52 AM jake abma <jake.abma@gmail.com> wrote:

> Hi John / Rachael,
>
> See this one; it matches my previous mail.
>
> This is already how this is required for the public sector in Europe
> (government and municipalities that have to inspect and report on their
> applications every "X" period)
> So we can look at how this is done and align the needs for claims with
> that approach.
>
> Jake
>
>
> On Tue, May 12, 2020 at 18:57 John Foliot <john.foliot@deque.com> wrote:
>
>> Hi Rachael,
>>
>> > If we dictate vs acknowledge depreciation, how do you propose we
>> address organizations' differences in development and deployment
>> environments and cycles?
>>
>> We don't. We simply state that in the W3C's WCAG 3.0 conformance model,
>> those higher-order 'user-tests' will need to be re-run after "X" period of
>> time *if you want to report using the W3C model.*
>>
>> I acknowledge that it will likely be an arbitrary decision in some ways,
>> but I think that collectively we can arrive at a consensus period of time
>> (I'll suggest 2 years, based in large part on the fact that our WG has
>> previously agreed to publish updates every 2 years, but that is just a
>> suggestion).
>>
>> Presuming that legislators take-up our new specification (which is STILL
>> not a given), organizations will adapt to the new reality as part of their
>> legal obligations. They can manage that as they see fit, based on their
>> ecosystem.
>>
>> JF
>>
>> On Tue, May 12, 2020 at 11:25 AM Rachael Montgomery <
>> rachael@accessiblecommunity.org> wrote:
>>
>>> John,
>>>
>>> If we dictate vs acknowledge depreciation, how do you propose we address
>>> organizations' differences in development and deployment environments and
>>> cycles? What is an appropriate time frame that works for everyone
>>> (internal, external, agile, waterfall, etc)?
>>>
>>> This is the step I personally can’t figure out on anything but an
>>> organization-by-organization basis.   I understand your reasoning behind
>>> this conversation but I am not sure how to resolve the variability in a way
>>> that lets us create normative guidance in this area.
>>>
>>> Regards,
>>>
>>> Rachael
>>> On May 12, 2020, 12:00 PM -0400, John Foliot <john.foliot@deque.com>,
>>> wrote:
>>>
>>> Hi Jake and Rachael,
>>>
>>> First, I am proposing 'depreciation' and NOT 'deterioration', a subtle
>>> but important distinction.
>>>
>>> Like automobiles (which depreciate over time), I am simply arguing that
>>> any test, from the simplest mechanical test to the most sophisticated
>>> cognitive walkthrough / user-path / testing with PwD, will over time be
>>> increasingly less accurate, and thus less valuable.
>>>
>>> To extend the analogy:
>>>
>>>    - I have a three year old Honda.
>>>    - My wife owns a 1998 Ford F-150 pickup.
>>>    - Both are road-worthy (interesting tid-bit: to renew my license
>>>    plate, I have to have the vehicles 'smog-tested' every 2 years - why?
>>>    Because the 2-year old test is no longer relevant AND NEEDS TO BE UPDATED)
>>>    - The resale value of my Honda is roughly 75% of what I paid for it
>>>    <https://usedfirst.com/cars/honda/>. The resale value of my wife's
>>>    pickup? about a thousand bucks
>>>    <https://www.kbb.com/ford/f150-regular-cab/1998/short-bed/?vehicleid=6490&mileage=142654&modalview=false&intent=trade-in-sell&pricetype=trade-in&condition=good&options=6431637%7Ctrue>
>>>    (if we're lucky).
>>>
>>> EVERYTHING depreciates over time, whether that is an automobile, or
>>> 'Functional User Test results'. Thus, if those test results contribute to
>>> the site "score", that diminished value will need to be accounted for as
>>> part of the larger overall scoring mechanism.
>>>
>>> Failing that, the opposite outcome is that those Functional User Tests *will
>>> be run exactly once*, at project start, and likely never run again:
>>> never mind that over time the actual user experience may degrade for a
>>> variety of reasons.
>>>
>>> *Failing to stale-date these types of user-tests over time is to
>>> essentially encourage organizations to not bother re-running those tests
>>> post launch.*
>>>
>>> Now, it could be argued that setting a 'stale date' should remain with
>>> the legislators, and that is fair (to an extent). Those legislators however
>>> are looking to us (as Subject Matter Experts) to help them 'figure that
>>> out', and our ability to help legislators understand our specification and
>>> conformance models will contribute directly to their uptake (or lack of
>>> uptake) *BY* those same legislators.
>>>
>>>
>>> *> For a tool that is built in and used in an intranet the speed it
>>> becomes inaccessible will be much slower (perhaps years) than a public
>>> website with content updated daily and new code releases every week
>>> (perhaps days).*
>>>
>>> While the 'intranet tool' (your content) may depreciate at varying
>>> rates, there is also the relationship (and interaction) between your
>>> content and the User Agent Stack, which will "age" at the same rate for ALL
>>> content. (Microsoft's Edge Browser of 2018 is very different from the Edge
>>> Browser v.2020 for example). Since these types of tests are seeking to
>>> measure the "ability" of the user to complete or perform a function, it
>>> takes both content AND tools to achieve that, and how those tools work with
>>> the content is a critical part of the "ability" calculation. So these tests
>>> are on both content and tools combined (with shades of "Accessibility
>>> supported" in there for good measure.)
>>>
>>> Use Case/Strawman: A web page with a complex graphic created in 2005
>>> uses @longdesc to provide the longer textual description, and got a
>>> 'passing score' because it used an acceptable technique (for the day).
>>> However in 2020, @longdesc has ZERO support on Apple devices, and so for
>>> the intended primary audience of that content, they cannot access it today,
>>> even though it has been provided by the author using an approved technique.
>>> Do you still agree that the page deserves a passing score in 2020, because
>>> it had a passing score in 2005? (If no, why not?)
>>>
>>>
>>> *> How a tool vendor places it in a dashboard is totally up to the tool
>>> vendor.*
>>>
>>> No argument, and as part of this discussion I've also supplied 11
>>> different examples
>>> <https://docs.google.com/document/d/1PgmVS0s8_klxvV2ImZS1GRXHwUgKkoXQ1_y6RBMIZQw/edit#heading=h.wk6s27klxqr7>
>>> of *HOW* different vendors are dealing with this (each in their own way). I
>>> no more want to prescribe how vendors communicate the *overall
>>> accessibility health of a web property* than I do any other specific
>>> Success Criteria, but via strong anecdotal evidence I've also supplied
>>> <https://lists.w3.org/Archives/Public/w3c-wai-gl/2020AprJun/0345.html>,
>>> we know that industry both needs and wants dashboards and a 'running score'
>>> post launch of any specific piece of web content - and the dashboard
>>> examples I've provided are the evidence and proof that this is what
>>> industry is seeking today (otherwise why would almost every vendor out
>>> there today be offering a dashboard?)
>>>
>>> My concern is that failing to account for depreciation means that our
>>> scoring system is only applicable as part of the software development life
>>> cycle (SDLC), but does not work as well in the larger context that industry
>>> is seeking: ongoing accessibility conformance *monitoring,* which is
>>> what our industry is asking for.
>>>
>>> JF
>>>
>>> On Tue, May 12, 2020 at 3:01 AM jake abma <jake.abma@gmail.com> wrote:
>>>
>>>> Oh, and we're not the body to judge when something "deteriorates".
>>>>
>>>> If at all, that is up to the tester / auditor OR if wished up to the
>>>> dashboard maker / settings.
>>>>
>>>> On Tue, May 12, 2020 at 09:58 jake abma <jake.abma@gmail.com> wrote:
>>>>
>>>>> I don't see any issues here by adding dates / time stamps to a
>>>>> conformance claim.
>>>>>
>>>>> - First of all for the specific conformance claim / report
>>>>> - If other reports are included with another time stamp, mention it
>>>>> (also the time stamp and which part it is)
>>>>> - The responsibility is up to the "conformance claimer" if he chooses
>>>>> a report to include but didn't check if it's still actual.
>>>>>
>>>>> We only provide guidance for how to test and score and ask for time
>>>>> stamps.
>>>>>
>>>>> How a tool vendor places it in a dashboard is totally up to the tool
>>>>> vendor.
>>>>>
>>>>> Cheers!
>>>>> Jake
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, May 11, 2020 at 18:10 John Foliot <john.foliot@deque.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> During our calls last week, the use-case of monitoring conformance
>>>>>> dashboards was raised.
>>>>>>
>>>>>> One important need for *on-going score calculation* will be for
>>>>>> usage in these scenarios. After a bit of research, it appears that many
>>>>>> different accessibility conformance tools are today offering this
>>>>>> feature/functionality already.
>>>>>>
>>>>>> Please see:
>>>>>>
>>>>>>
>>>>>> https://docs.google.com/document/d/1PgmVS0s8_klxvV2ImZS1GRXHwUgKkoXQ1_y6RBMIZQw/edit?usp=sharing
>>>>>>
>>>>>> ...for examples that I was able to track down. (Note, some examples
>>>>>> today remain at the page level - for example Google Lighthouse - whereas
>>>>>> other tools are offering composite or aggregated views of 'sites' or at
>>>>>> least 'directories' [sic].)
>>>>>>
>>>>>> It is in scenarios like this that I question the 'depreciation' of
>>>>>> user-testing scores over time (in the same way that new cars depreciate
>>>>>> when you drive them off the lot, and continue to do so over the life of the
>>>>>> vehicle).
>>>>>>
>>>>>> Large organizations are going to want up-to-date dashboards, which
>>>>>> mechanical testing can facilitate quickly, but the more complex and
>>>>>> labor-intensive tests will be run infrequently over the life-cycle of a
>>>>>> site or web-content, and I assert that this infrequency will have an impact
>>>>>> on the 'score': user-test data that is 36 months old will likely be 'dated'
>>>>>> over that time-period, and in fact may no longer be accurate.
>>>>>>
>>>>>> Our scoring mechanism will need to address that situation.
>>>>>>
>>>>>> JF
>>>>>> --
>>>>>> *John Foliot* | Principal Accessibility Strategist | W3C AC
>>>>>> Representative
>>>>>> Deque Systems - Accessibility for Good
>>>>>> deque.com
>>>>>> "I made this so long because I did not have time to make it shorter."
>>>>>> - Pascal
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>
>>

-- 
*John Foliot* | Principal Accessibility Strategist | W3C AC Representative
Deque Systems - Accessibility for Good
deque.com
"I made this so long because I did not have time to make it shorter." -
Pascal
Received on Wednesday, 13 May 2020 15:07:40 UTC