Re: Questions about the Silver scoring process from Rachael Bradley Montgomery on 2020-07-14 (public-silver@w3.org from July 2020)

From: Rachael Bradley Montgomery <rachael@accessiblecommunity.org>
Date: Tue, 14 Jul 2020 12:16:53 -0400
To: John Foliot <john.foliot@deque.com>
Cc: Detlev Fischer <detlev.fischer@testkreis.de>, Silver TF <public-silver@w3.org>
Message-ID: <CAL+jyY+ugA2XYdC6qch+BFp4HA=NvBy8wgTaDQTe4LLSuVCM+w@mail.gmail.com>
+1 to defining the unit used in each test. Otherwise there will be too much
variation between testers. The unit definition should also be built into tools.

In this approach, the tests that do not apply should be skipped since the
scores are averages. That is, they should not count for or against the score.
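A minimal sketch of the skipping rule, assuming each test result is a pass
percentage from 0 to 100 and that a not-applicable test is recorded as None.
The function name and the None convention are illustrative assumptions, not
part of the proposal.

```python
from typing import List, Optional


def average_applicable(results: List[Optional[float]]) -> Optional[float]:
    """Average per-test pass percentages, ignoring not-applicable tests.

    A result of None marks a test that does not apply (hypothetical
    convention), so it counts neither for nor against the average.
    """
    applicable = [r for r in results if r is not None]
    if not applicable:
        # Nothing applied, so there is no score to report.
        return None
    return sum(applicable) / len(applicable)


# Two applicable tests and one skipped test (e.g. captions with no media).
view_results = [100.0, 50.0, None]
view_average = average_applicable(view_results)
```

With this convention the skipped test leaves the average at 75.0 rather than
pulling it up or down.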

That said, some tests should be rewritten from their current state so that
they are included. For example, as the SC is written right now, a view with
no timeout would cause the test to be skipped. But having no timeout is
likely the best-case scenario, so it should be included as a 100% pass.
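As a rough illustration of the difference, assuming a timeout test is scored
as passing timeouts divided by total timeouts in the view. The function and
the count_absence_as_pass flag are hypothetical names used only for this
sketch.

```python
from typing import Optional


def timeout_test_score(
    timeouts_total: int,
    timeouts_passing: int,
    count_absence_as_pass: bool,
) -> Optional[float]:
    """Score a timeout test for one view as a pass percentage.

    When the view has no timeouts, either skip the test (None) or treat
    the absence of a timeout as the best case and return 100.0.
    """
    if timeouts_total == 0:
        return 100.0 if count_absence_as_pass else None
    return 100.0 * timeouts_passing / timeouts_total


# As the SC is written today: no timeout means the test is skipped.
skipped = timeout_test_score(0, 0, count_absence_as_pass=False)
# Rewritten test: no timeout is included as a 100% pass.
included = timeout_test_score(0, 0, count_absence_as_pass=True)
```

Here skipped is None and included is 100.0, so only the rewritten test
contributes to the average.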

As Jake noted, these details are incredibly important. I believe that these
details will vary based on the structure we pick, so I am in favor of us
picking a structure first and then working down, to avoid rewriting the
tests repeatedly.

I am also not attached to this structure but wanted to flesh it out for
discussion. A different approach may use additive or subtractive scores, but
that would vary considerably from this proposal.
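To make the contrast concrete, here is a small sketch comparing the averaging
approach with one possible subtractive scheme, using John's media example
quoted below. The subtractive formula (start at 100 and deduct up to 10 points
per test in proportion to its shortfall) is purely an assumption for
illustration. Nobody in this thread has proposed these numbers.

```python
from typing import Dict


def averaged_score(results: Dict[str, float]) -> float:
    """Average of per-test pass percentages (the approach in this proposal)."""
    return sum(results.values()) / len(results)


def subtractive_score(
    results: Dict[str, float],
    start: float = 100.0,       # assumed starting score
    max_penalty: float = 10.0,  # assumed maximum deduction per test
) -> float:
    """Start from a full score and deduct points for each test's shortfall."""
    penalty = sum(max_penalty * (100.0 - pct) / 100.0 for pct in results.values())
    return max(0.0, start - penalty)


# A view whose content only half reflows, with and without accessible media.
no_media = {"reflow": 50.0}
with_media = {"reflow": 50.0, "captions": 100.0}

avg_no_media = averaged_score(no_media)        # 50.0
avg_with_media = averaged_score(with_media)    # 75.0
sub_no_media = subtractive_score(no_media)     # 95.0
sub_with_media = subtractive_score(with_media) # 95.0
```

Under averaging, adding fully accessible media raises the score from 50.0 to
75.0. Under this subtractive sketch both views score 95.0, and the media view
would only lose points if its captions were less than 100%.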

Regards,

Rachael

On Tue, Jul 14, 2020 at 12:05 PM John Foliot <john.foliot@deque.com> wrote:

> Rachael also writes:
>
> > These would then be skipped for purposes of scoring.
>
> Skipped, or given full credit?
>
> In your use-case, *IF* a page has accessible media, it could garner
> *more* points than if the page never had a video, and I don't see how that
> benefits anyone - it is a potential (if complicated) way of gaming the
> system: add an accessible media component to the screen to regain points
> lost because my screen content doesn't reflow...
>
> I know we've discussed 'additive' scores previously, but have we fully
> evaluated 'subtractive' scoring as well? It would certainly address this
> use case (i.e. a screen with a fully accessible media experience has the
> same score as a screen with no media, but "loses" points if the media
> experience is less than 100%)
>
> Thoughts?
>
> JF
>
> On Tue, Jul 14, 2020 at 10:54 AM John Foliot <john.foliot@deque.com>
> wrote:
>
>> Rachael writes:
>>
>> > Is the unit a word, sentence, div, paragraph, portion of the screen,
>> > etc?
>>
>> Exactly! This is why scoping cannot be left to the author, it needs to be
>> defined in our spec.
>>
>> I will assert that all tests need to be run on "the screen" (aka "a
>> view") to address what Rachael called 'non-interference', but that
>> higher level tests (*Accessibility in Practice*; I am not a fan of
>> *adjectival* as a term of art, in part because it does not make sense
>> when you look at the definition of that term in Merriam Webster
>> <https://www.merriam-webster.com/dictionary/adjectival>) will be a
>> collection of screens/views that comprise a task or path.
>>
>> We need to define and evaluate both.
>>
>> JF
>>
>> On Tue, Jul 14, 2020 at 10:33 AM Rachael Bradley Montgomery <
>> rachael@accessiblecommunity.org> wrote:
>>
>>> Hello,
>>>
>>> I've responded with my thoughts in line. If the answers are unclear,
>>> please let me know. Some of these do clarify the limits of this approach so
>>> thank you for calling these out.
>>>
>>> I appreciate the ongoing dialog as we have limited time in meetings. :)
>>>
>>> Rachael
>>>
>>> On Tue, Jul 14, 2020 at 11:03 AM Detlev Fischer <
>>> detlev.fischer@testkreis.de> wrote:
>>>
>>>> Hi all,
>>>> as there was not enough time to discuss the scoring process, I will
>>>> raise some questions here which I hope will clarify what is intended in
>>>> this draft version.
>>>>
>>>> Slide 9 of presentation linked to in Minutes
>>>> https://www.w3.org/2020/01/28-silver-minutes.html
>>>>
>>>> 1. Identify the components and associated views needed for users to
>>>> complete the path
>>>>
>>>> DF: If I understand this correctly, this means that if I have a path
>>>> that traverses 7 views (say, from 1-shopping cart to 2-specify billing
>>>> address to 3-specify shipping address to 4-specify payment method to
>>>> 5-enter CC details to 6-review purchase details and confirm - to
>>>> 7-confirmation of purchase) - all these views that are part of the path
>>>> are now lumped together and there is no fine-grained score on a
>>>> particular view within the path?
>>>>
>>> RB: Each view is scored individually but all the scores are grouped
>>> together for purposes of the conformance scores.
>>>
>>>
>>>> 2. Run all level 1 tests for all views within a path
>>>>
>>>> DF: This would mean PASS/FAIL rating on each view of the path against
>>>> each 2.X SC - what is unclear is how the percentage comes in for less
>>>> than perfect views - say, when rating against 1.3.1, your payment
>>>> details form has one field where the label is not correctly referenced
>>>> (but some placeholder is there to make this less of a show stopper),
>>>> the others are fine - is that a subjective judgement? A quantitative
>>>> judgement? How do you determine whether 1.3.1 (or whatever that becomes)
>>>> is 90% met, 60% met (or any other figure)?
>>>>
>>> RB: A clarification: I think we need to see how a page would look in
>>> the current model and new model. I used SC as example "tests" in the
>>> template to let us cross reference the two models conceptually but they are
>>> an imperfect representation because they should be tests and right now,
>>> most SC include multiple tests. In the template, I included the current
>>> Pass, Fail, and Not Present so we could look at both approaches.
>>>
>>> I originally started this approach with each test being a pass/fail.
>>> Having tried both testing approaches, testing pass/fail is much, much
>>> easier.
>>>
>>> This does not allow for the % concept though unless we roll the
>>> pass/fails into %. So I tried this using an approach where each test would
>>> be scored individually by %. The percent is the number of passes divided by
>>> the number of instances in the view. This is pretty easy with links but hard
>>> to determine with tests like reflow or other content-based tests. Is the
>>> unit a word, sentence, div, paragraph, portion of the screen, etc?
>>>
>>>
>>>> 3. Note all failures on components needed to complete the path
>>>>
>>>> DF: Whether something counts as a failure is often not easy to
>>>> determine. Note that 1.3.1 despite its huge scope knows only two
>>>> Failures. So there is significant subjectivity in determining whether,
>>>> say,  a missing programmatic link of a label while a placeholder
>>>> provides a less-than-perfect hint at the content required for the field
>>>> should be registered as a FAIL of 1.3.1 (or whatever) - and that
>>>> situation is pervasive in actual testing.
>>>>
>>>
>>> RB: In my opinion, for this to work, the tests need to be as granular as
>>> possible, preferably with clearly stated passing and failing criteria.
>>> There should also be a clear relationship to only one functional outcome.
>>> This will result in a large number of tests. Each test would need to
>>> reference which technologies it applies to. Tests that do not apply are
>>> not counted as part of the average.
>>>
>>>
>>>>
>>>> 4. Note the % tests passed for each view (total passed/total in view)
>>>>
>>>> DF: So here we have some granularity for parts of the path? And an
>>>> aggregate value? One issue tackled in complete processes is that
>>>> aggregation can be misleading: if one part of a path fails completely,
>>>> the rest can be accessible but user time is wasted just as much (or
>>>> worse) than if the entire thing was inaccessible.
>>>>
>>>
>>> RB: The approach of addressing both the component level path and the
>>> views was trying to address this. Perhaps it doesn't?
>>>
>>>>
>>>> 5. Note tests that are not applicable
>>>>
>>>> DF: I don't understand that.
>>>>
>>>
>>> RB: Some tests won't apply, such as captions when no media is present.
>>> Testing would note that these are not applicable within this path. These
>>> would then be skipped for purposes of scoring.
>>>
>>>
>>>>
>>>> 6. Average all the tests for a guideline for an overall %
>>>>
>>>> DF: I take it that this is the average across all component views of a
>>>> path? See caveat above...
>>>>
>>> Yes.
>>>
>>>
>>>> 7. Score each guideline based on % of tests passed
>>>> 100% - 3
>>>> 75-99% - 2
>>>> 50-74% - 1
>>>> 0-50% - 0
>>>>
>>>> 8. Average the score of all guidelines to a single decimal point
>>>> If average score = 3, run level 2a and/or 2b tests
>>>>
>>>> DF: So you would only proceed with running the 'softer' tests if the
>>>> 'harder' level 1 tests are perfect (100%)? I don't think this is
>>>> intended...
>>>>
>>> For day-to-day testing, all three types of tests should be addressed,
>>> but for a conformance claim, that is what is intended. Since there is
>>> some rounding, there is some room for imperfections but not a lot. I will
>>> add a note that this needs to be explored more. There are other ways to
>>> balance this but the risk of running the higher level tests is that it
>>> would add bias towards one disability over another. For example, if
>>> usability tests within a guideline that supports cognitive disabilities
>>> raised that score to a 4 or 5 but a guideline that supports visual
>>> disabilities was still at a 1, the overall score would look more
>>> accessible while being inaccessible for screen reader users.
>>>
>>>
>>>> If 90% or greater of level 2a or 2b tests pass, increase the guideline
>>>> score to a 4
>>>> If 90% or greater of both 2a and 2b tests pass, increase the guideline
>>>> score to a 5
>>>>
>>>> DF: Depending on the answer above (does this only happen when 100% - 3,
>>>> which will be a rare outcome) the question is whether any of the
>>>> failures will prevent further tests on level 2a / 2b?
>>>>
>>>> Calculate overall and functional category scores
>>>>
>>>> DF: Not clear to me at the moment..
>>>>
>>>> Overall = average of all guideline scores
>>>> Each functional category = average of related guideline scores
>>>>
>>>
>>>
>>>> --
>>>> Detlev Fischer
>>>> DIAS GmbH
>>>> (Testkreis is now part of DIAS GmbH)
>>>>
>>>> Mobil +49 (0)157 57 57 57 45
>>>>
>>>> http://www.dias.de
>>>> Beratung, Tests und Schulungen für barrierefreie Websites
>>>>
>>>>
>>>>
>>>
>>> --
>>> Rachael Montgomery, PhD
>>> Director, Accessible Community
>>> rachael@accessiblecommunity.org
>>>
>>> "I will paint this day with laughter;
>>> I will frame this night in song."
>>>  - Og Mandino
>>>
>>>
>>
>> --
>> *John Foliot* | Principal Accessibility Strategist | W3C AC
>> Representative
>> Deque Systems - Accessibility for Good
>> deque.com
>> "I made this so long because I did not have time to make it shorter." -
>> Pascal "links go places, buttons do things"
>>
>>
>>
>>
>

-- 
Rachael Montgomery, PhD
Director, Accessible Community
rachael@accessiblecommunity.org

"I will paint this day with laughter;
I will frame this night in song."
 - Og Mandino
Received on Tuesday, 14 July 2020 16:17:18 UTC