Re: Questions about the Silver scoring process from John Foliot on 2020-07-14 (public-silver@w3.org from July 2020)

From: John Foliot <john.foliot@deque.com>
Date: Tue, 14 Jul 2020 11:04:31 -0500
To: Rachael Bradley Montgomery <rachael@accessiblecommunity.org>
Cc: Detlev Fischer <detlev.fischer@testkreis.de>, Silver TF <public-silver@w3.org>
Message-ID: <CAKdCpxzxSFcftNVWFDZ3eVOdbN1oX=+DfSs=HtUKnPMUqA14ww@mail.gmail.com>

Rachael also writes:

> These would then be skipped for purposes of scoring.

Skipped, or given full credit?

In your use-case, *IF* a page has accessible media, it could garner
*more* points than if the page never had a video, and I don't see how that
benefits anyone - it is a potential (if complicated) way of gaming the
system: add an accessible media component to the screen to gain points lost
because my screen content doesn't reflow...
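
To make the concern concrete, here is a rough sketch of how the "skipped" and "full credit" readings differ. It is only an illustration: the test names, the per-test fractions, and the `view_score` helper are assumptions made up for this example, not anything defined in the draft.

```python
from statistics import mean

def view_score(results, na_policy="skip"):
    """Average the per-test pass fractions for one view.

    results maps a (hypothetical) test name to a fraction in 0.0-1.0,
    or to None when the test is not applicable to the view.
    na_policy is "skip" (leave N/A tests out of the average) or
    "full_credit" (count N/A tests as 1.0).
    """
    values = []
    for value in results.values():
        if value is None:
            if na_policy == "full_credit":
                values.append(1.0)
        else:
            values.append(value)
    # A view with nothing to test is treated as having nothing wrong.
    return mean(values) if values else 1.0

# Same failing reflow on both screens; only the media differs.
no_media = {"reflow": 0.0, "links": 1.0, "captions": None}
with_media = {"reflow": 0.0, "links": 1.0, "captions": 1.0}

print(view_score(no_media))                  # N/A skipped
print(view_score(with_media))                # accessible media added
print(view_score(no_media, "full_credit"))   # N/A given full credit
```

With N/A tests skipped, the screen that adds an accessible video averages higher than the same screen without one; with full credit, the two screens come out equal.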

I know we've discussed 'additive' scores previously, but have we fully
evaluated 'subtractive' scoring as well? It would certainly address this
use case (i.e. a screen with a fully accessible media experience has the
same score as a screen with no media, but "loses" points if the media
experience is less than 100%)
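
As a minimal sketch of what I mean by 'subtractive' (the 100-point ceiling, the equal per-test weighting, and the test names are all assumptions for illustration, not a proposal for actual values):

```python
def subtractive_score(results, max_points=100.0):
    """Start from max_points and deduct for each shortfall.

    results maps a (hypothetical) test name to a fraction in 0.0-1.0,
    or to None when the test is not applicable to the screen.
    Each test can cost at most an equal share of max_points.
    """
    if not results:
        return max_points
    per_test = max_points / len(results)
    score = max_points
    for value in results.values():
        if value is None:
            continue  # not applicable: nothing to deduct
        score -= per_test * (1.0 - value)
    return score

no_media = {"reflow": 0.0, "links": 1.0, "captions": None}
full_media = {"reflow": 0.0, "links": 1.0, "captions": 1.0}
partial_media = {"reflow": 0.0, "links": 1.0, "captions": 0.5}

print(subtractive_score(no_media))
print(subtractive_score(full_media))
print(subtractive_score(partial_media))
```

Here the screen with no media and the screen with fully accessible media land on the same score, and only the partially accessible media screen drops below them, so adding media can never recover points lost elsewhere.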

Thoughts?

JF

On Tue, Jul 14, 2020 at 10:54 AM John Foliot <john.foliot@deque.com> wrote:

> Rachael writes:
>
> > Is the unit a word, sentence, div, paragraph, portion of the screen, etc?
>
> Exactly! This is why scoping cannot be left to the author, it needs to be
> defined in our spec.
>
> I will assert that all tests need to be run on "the screen" (aka "a view")
> to address what Rachael called 'non-interference', and/but that higher
> level tests (*Accessibility in Practice* -- I am not a fan of *adjectival*
> as a term-of-art -- in part because it does not make sense when you look
> at the definition of that term in Merriam Webster
> <https://www.merriam-webster.com/dictionary/adjectival>) will be a
> collection of screens/views that comprise a task or path.
>
> We need to define and evaluate both.
>
> JF
>
> On Tue, Jul 14, 2020 at 10:33 AM Rachael Bradley Montgomery <
> rachael@accessiblecommunity.org> wrote:
>
>> Hello,
>>
>> I've responded with my thoughts in line. If the answers are unclear,
>> please let me know. Some of these do clarify the limits of this approach so
>> thank you for calling these out.
>>
>> I appreciate the ongoing dialog as we have limited time in meetings. :)
>>
>> Rachael
>>
>> On Tue, Jul 14, 2020 at 11:03 AM Detlev Fischer <
>> detlev.fischer@testkreis.de> wrote:
>>
>>> Hi all,
>>> as there was not enough time to discuss the scoring process, I will
>>> raise some questions here which I hope will clarify what is intended in
>>> this draft version.
>>>
>>> Slide 9 of presentation linked to in Minutes
>>> https://www.w3.org/2020/01/28-silver-minutes.html
>>>
>>> 1. Identify the components and associated views needed for users to
>>> complete the path
>>>
>>> DF: If I understand this correctly, this means that if I have a path
>>> that traverses 7 views (say, from 1-shopping cart to 2-specify billing
>>> address to 3-specify shipping address to 4-specify payment method to
>>> 5-enter CC details to 6-review purchase details and confirm - to
>>> 7-confirmation of purchase) - all these views that are part of the path
>>> are now lumped together and there is no fine-grained score on a
>>> particular view within the path?
>>>
>>> RB: Each view is scored individually but all the scores are grouped
>> together for purposes of the conformance scores.
>>
>>
>>> 2. Run all level 1 tests for all views within a path
>>>
>>> DF: This would mean PASS/FAIL rating on each view of the path against
>>> each 2.X SC - what is unclear is how the percentage comes in for less
>>> than perfect views - say, when rating against 1.3.1, your payment
>>> details form has one field where the label is not correctly referenced
>>> (but some placeholder is there to make this less of a show stopper), the
>>> others are fine - is that a subjective judgement? A quantitative
>>> judgement? How do you determine whether 1.3.1 (or whatever that becomes)
>>> is 90% met, 60% met (or any other figure)?
>>>
>>> RB:  A clarification: I think we need to see how a page would look in
>> the current model and new model. I used SC as example "tests" in the
>> template to let us cross reference the two models conceptually but they are
>> an imperfect representation because they should be tests and right now,
>> most SC include multiple tests. In the template, I included the current
>> Pass, Fail, and Not Present so we could look at both approaches.
>>
>>  I originally started this approach with each test being a pass/fail.
>> Having tried both testing approaches, testing pass/fail is much much
>> easier.
>>
>> This does not allow for the % concept though unless we roll the
>> pass/fails into %. So I tried this using an approach where each test would be
>> scored individually by %. The percent is the number of passes divided by the
>> number of instances in the view. This is pretty easy with links but hard to
>> determine with tests like reflow or other content based tests. Is the unit
>> a word, sentence, div, paragraph, portion of the screen, etc?
>>
>>
>> 3. Note all failures on components needed to complete the path
>>>
>>> DF: Whether something counts as a failure is often not easy to
>>> determine. Note that 1.3.1 despite its huge scope knows only two
>>> Failures. So there is significant subjectivity in determining whether,
>>> say,  a missing programmatic link of a label while a placeholder
>>> provides a less-than-perfect hint at the content required for the field
>>> should be registered as a FAIL of 1.3.1 (or whatever) - and that
>>> situation is pervasive in actual testing.
>>>
>>
>> RB: In my opinion, for this to work, the tests need to be as granular as
>> possible, preferably with clearly stated passing and failing criteria.
>> There should also be a clear relationship to only one functional outcome.
>> This will result in a large number of tests. Each test would need to
>> reference which technologies it applied to. Tests that do not apply are
>> not counted as part of the average.
>>
>>
>>>
>>> 4. Note the % tests passed for each view (total passed/total in view)
>>>
>>> DF: So here we have some granularity for parts of the path? And an
>>> aggregate value? One issue tackled in complete processes is that
>>> aggregation can be misleading: if one part of a path fails completely,
>>> the rest can be accessible but user time is wasted just as much as (or
>>> worse than) if the entire thing was inaccessible.
>>>
>>
>> RB: The approach of addressing both the component level path and the
>> views was trying to address this. Perhaps it doesn't?
>>
>>>
>>> 5. Note tests that are not applicable
>>>
>>> DF: I don't understand that.
>>>
>>
>> RB: Some tests won't apply such as captions when no media is present.
>> Testing would note that these are not applicable within this path. These
>> would then be skipped for purposes of scoring.
>>
>>
>>>
>>> 6. Average all the tests for a guideline for an overall %
>>>
>>> DF: I take it that this is the average across all component views of a
>>> path? See caveat above...
>>>
>>> Yes.
>>
>>
>>> 7. Score each guideline based on % of tests passed
>>> 100% - 3
>>> 75-99% - 2
>>> 50-74% - 1
>>> 0-50% - 0
>>>
>>> 8. Average the score of all guidelines to a single decimal point
>>> If average score = 3, run level 2a and/or 2b tests
>>>
>>> DF: So you would only proceed with running the 'softer' tests if the
>>> 'harder' level 1 tests are perfect (100%)? I don't think this is
>>> intended...
>>>
>>> For day to day testing, all three types of tests should be addressed but
>> for a conformance claim, that is what is intended. Since there is some
>> rounding, there is some room for imperfections but not a lot. I will add a
>> note that this needs to be explored more. There are other ways to balance
>> this but the risk of running the higher level tests is that it would add
>> bias towards one disability over another. For example, if usability tests
>> within a guideline that supports cognitive disabilities push that up to a 4 or 5
>> but a guideline that supports visual disabilities was still at a 1, the
>> overall score would look more accessible while being inaccessible for
>> screen reader users.
>>
>>
>>> If 90% or greater of level 2a or 2b tests pass, increase the guideline
>>> score to a 4
>>> If 90% or greater of both 2a and 2b tests pass, increase the guideline
>>> score to a 5
>>>
>>> DF: Depending on the answer above (does this only happen when 100% - 3,
>>> which will be a rare outcome) the question is whether any of the
>>> failures will prevent further tests on level 2a / 2b?
>>>
>>> Calculate overall and functional category scores
>>>
>>> DF: Not clear to me at the moment..
>>>
>>> Overall = average of all guideline scores
>>> Each functional category = average of related guideline scores
>>>
>>
>>
>>> --
>>> Detlev Fischer
>>> DIAS GmbH
>>> (Testkreis is now part of DIAS GmbH)
>>>
>>> Mobil +49 (0)157 57 57 57 45
>>>
>>> http://www.dias.de
>>> Beratung, Tests und Schulungen für barrierefreie Websites
>>>
>>>
>>>
>>
>> --
>> Rachael Montgomery, PhD
>> Director, Accessible Community
>> rachael@accessiblecommunity.org
>>
>> "I will paint this day with laughter;
>> I will frame this night in song."
>>  - Og Mandino
>>
>>
>
> --
> *John Foliot* | Principal Accessibility Strategist | W3C AC
> Representative
> Deque Systems - Accessibility for Good
> deque.com
> "I made this so long because I did not have time to make it shorter." -
> Pascal "links go places, buttons do things"
>
>
>
>

-- 
*John Foliot* | Principal Accessibility Strategist | W3C AC Representative
Deque Systems - Accessibility for Good
deque.com
"I made this so long because I did not have time to make it shorter." -
Pascal "links go places, buttons do things"
Received on Tuesday, 14 July 2020 16:05:22 UTC