Re: Questions about the Silver scoring process

Hello,

I've responded with my thoughts inline. If any of the answers are unclear,
please let me know. Some of these questions do clarify the limits of this
approach, so thank you for calling them out.

I appreciate the ongoing dialog as we have limited time in meetings. :)

Rachael

On Tue, Jul 14, 2020 at 11:03 AM Detlev Fischer <detlev.fischer@testkreis.de>
wrote:

> Hi all,
> as there was not enough time to discuss the scoring process, I will
> raise some questions here which I hope will clarify what is intended in
> this draft version.
>
> Slide 9 of the presentation linked to in the minutes:
> https://www.w3.org/2020/01/28-silver-minutes.html
>
> 1. Identify the components and associated views needed for users to
> complete the path
>
> DF: If I understand this correctly, this means that if I have a path
> that traverses 7 views (say, from 1-shopping cart to 2-specify billing
> address to 3-specify shipping address to 4-specify payment method to
> 5-enter CC details to 6-review purchase details and confirm to
> 7-confirmation of purchase), all the views that are part of the path
> are now lumped together and there is no fine-grained score on a
> particular view within the path?
>
RB: Each view is scored individually, but all the scores are grouped
together for purposes of the conformance scores.


> 2. Run all level 1 tests for all views within a path
>
> DF: This would mean a PASS/FAIL rating on each view of the path against
> each 2.X SC. What is unclear is how the percentage comes in for
> less-than-perfect views. Say, when rating against 1.3.1, your payment
> details form has one field where the label is not correctly referenced
> (but some placeholder is there to make this less of a show stopper) and
> the others are fine: is that a subjective judgement? A quantitative
> judgement? How do you determine whether 1.3.1 (or whatever that becomes)
> is 90% met, 60% met (or any other figure)?
>
RB: A clarification: I think we need to see how a page would look in both
the current model and the new model. I used SCs as example "tests" in the
template to let us cross-reference the two models conceptually, but they
are an imperfect representation because the unit should be tests, and
right now most SCs include multiple tests. In the template, I included the
current Pass, Fail, and Not Present options so we could look at both
approaches.

I originally started this approach with each test being a pass/fail.
Having tried both testing approaches, pass/fail testing is much, much
easier.

Pass/fail does not allow for the % concept, though, unless we roll the
pass/fails up into a %. So I also tried an approach where each test would
be scored individually as a %. The percent is the number of passes divided
by the number of instances in the view. This is pretty easy with links but
hard to determine with tests like reflow or other content-based tests. Is
the unit a word, a sentence, a div, a paragraph, a portion of the screen,
etc.?
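
To make the arithmetic concrete, here is a minimal Python sketch of that
per-test percentage (the function name and numbers are invented, and the
choice of "instance" unit is exactly the open question above):

    # Hypothetical sketch: a test's score within a view is the number of
    # passing instances divided by the total instances of the tested target.
    def test_percent(passes: int, instances: int) -> float:
        if instances == 0:
            raise ValueError("not applicable: no instances in this view")
        return passes / instances

    # E.g. a payment form with 10 fields, 9 of whose labels are correctly
    # referenced:
    print(test_percent(passes=9, instances=10))  # 0.9 -> the test is 90% met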


> 3. Note all failures on components needed to complete the path
>
> DF: Whether something counts as a failure is often not easy to
> determine. Note that 1.3.1, despite its huge scope, has only two
> documented Failures. So there is significant subjectivity in determining
> whether, say, a missing programmatic association of a label (while a
> placeholder provides a less-than-perfect hint at the content required
> for the field) should be registered as a FAIL of 1.3.1 (or whatever),
> and that situation is pervasive in actual testing.
>

RB: In my opinion, for this to work, the tests need to be as granular as
possible, preferably with clearly stated passing and failing criteria.
There should also be a clear relationship to only one functional outcome.
This will result in a large number of tests. Each test would need to
reference which technologies it applies to. Tests that do not apply are
not counted as part of the average.
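
A sketch of what such granular tests might look like, assuming an invented
data model in which each test declares the technologies it applies to and
not-applicable tests (see question 5 below) are excluded from the average:

    from dataclasses import dataclass
    from statistics import mean
    from typing import Optional

    @dataclass
    class TestResult:
        test_id: str
        technologies: frozenset          # technologies the test applies to
        percent: Optional[float]         # None when the test is not applicable

    def guideline_average(results: list, tech: str) -> float:
        # Average only the tests that apply to this technology and content.
        applicable = [r.percent for r in results
                      if tech in r.technologies and r.percent is not None]
        return mean(applicable)

    results = [
        TestResult("labels-programmatic", frozenset({"html"}), 0.9),
        TestResult("captions-prerecorded", frozenset({"html"}), None),  # no media
        TestResult("pdf-tags", frozenset({"pdf"}), 0.5),  # different technology
    ]
    print(guideline_average(results, "html"))  # 0.9 -- N/A tests are skipped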


>
> 4. Note the % tests passed for each view (total passed/total in view)
>
> DF: So here we have some granularity for parts of the path? And an
> aggregate value? One issue tackled in complete processes is that
> aggregation can be misleading: if one part of a path fails completely,
> the rest can be accessible, but user time is wasted just as much as (or
> worse than) if the entire thing were inaccessible.
>

RB: Scoring both the component-level path and the individual views was
intended to address this. Perhaps it doesn't?
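
DF's caveat, restated with made-up numbers for the seven-view path from
question 1: a plain average across views can look good even when one view
blocks the whole path, which suggests reporting (or gating on) the worst
view as well:

    view_scores = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]  # view 7 fails entirely

    average = sum(view_scores) / len(view_scores)
    print(round(average, 2))  # 0.86 -- the path looks largely accessible
    print(min(view_scores))   # 0.0  -- yet the purchase cannot be completed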

>
> 5. Note tests that are not applicable
>
> DF: I don't understand that.
>

RB: Some tests won't apply, such as captions when no media is present.
Testing would note that these are not applicable within this path, and
they would then be skipped for purposes of scoring.


>
> 6. Average all the tests for a guideline for an overall %
>
> DF: I take it that this is the average across all component views of a
> path? See caveat above...
>
RB: Yes.


> 7. Score each guideline based on % of tests passed
> 100% - 3
> 75-99% - 2
> 50-74% - 1
> 0-49% - 0
>
> 8. Average the score of all guidelines to a single decimal point
> If average score = 3, run level 2a and/or 2b tests
>
> DF: So you would only proceed with running the 'softer' tests if the
> 'harder' level 1 tests are perfect (100%)? I don't think this is intended...
>
RB: For day-to-day testing, all three types of tests should be addressed,
but for a conformance claim, that is what is intended. Since there is some
rounding, there is some room for imperfection, but not a lot. I will add a
note that this needs to be explored more. There are other ways to balance
this, but the risk of running the higher-level tests regardless is that it
would add bias towards one disability over another. For example, if
usability tests within a guideline that supports cognitive disabilities
raised that guideline to a 4 or 5 while the guidelines that support visual
disabilities were still at a 1, the overall score would look more
accessible while remaining inaccessible for screen reader users.
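
A Python sketch of steps 7 and 8 under those band edges, including the
2a/2b bumps quoted just below (the helper names are mine, and the guard
assuming only a guideline at 3 can be bumped is my reading, not stated in
the draft):

    def band(percent: float) -> int:
        # Step 7: map a guideline's level 1 pass percentage to a 0-3 score.
        if percent >= 1.00:
            return 3
        if percent >= 0.75:
            return 2
        if percent >= 0.50:
            return 1
        return 0

    def average_score(guideline_percents: list) -> float:
        # Step 8: average the banded scores to a single decimal point.
        scores = [band(p) for p in guideline_percents]
        return round(sum(scores) / len(scores), 1)

    def bump(score: int, pct_2a: float, pct_2b: float) -> int:
        # Raise a guideline's score when 90%+ of level 2a/2b tests pass.
        if score == 3 and pct_2a >= 0.9 and pct_2b >= 0.9:
            return 5
        if score == 3 and (pct_2a >= 0.9 or pct_2b >= 0.9):
            return 4
        return score

    avg = average_score([1.0, 1.0, 1.0, 0.98])
    print(avg)  # 2.8 -- one 98% guideline bands to 2 and blocks the 2a/2b tests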


> If 90% or greater of level 2a or 2b tests pass, increase the guideline
> score to a 4
> If 90% or greater of both 2a and 2b tests pass, increase the guideline
> score to a 5
>
> DF: Depending on the answer above (does this only happen when the level
> 1 score is a perfect 100% = 3, which will be a rare outcome?), the
> question is whether any of the failures will prevent further tests on
> level 2a / 2b?
>
> Calculate overall and functional category scores
>
> DF: Not clear to me at the moment...
>
RB: Overall = average of all guideline scores. Each functional category =
average of the related guideline scores.
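
For that last step, a small worked example with invented guideline names,
scores, and category mappings (the real mapping would come from the
functional categories in the draft); note how a category average can show
exactly the imbalance described above:

    from statistics import mean

    guideline_scores = {"clear-language": 4, "captions": 3,
                        "structured-content": 1, "text-alternatives": 2}
    categories = {
        "cognitive": ["clear-language", "structured-content"],
        "vision": ["structured-content", "text-alternatives", "captions"],
    }

    overall = mean(guideline_scores.values())
    by_category = {c: mean(guideline_scores[g] for g in gs)
                   for c, gs in categories.items()}
    # Overall is 2.5; cognitive averages 2.5 while vision sits at 2.0.
    print(round(overall, 1), by_category)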
>


> --
> Detlev Fischer
> DIAS GmbH
> (Testkreis is now part of DIAS GmbH)
>
> Mobil +49 (0)157 57 57 57 45
>
> http://www.dias.de
> Consulting, testing and training for accessible websites
>
>
>

-- 
Rachael Montgomery, PhD
Director, Accessible Community
rachael@accessiblecommunity.org

"I will paint this day with laughter;
I will frame this night in song."
 - Og Mandino
