Re: wptrunner and how to handle ref tests from James Graham on 2014-07-01 (public-test-infra@w3.org from July to September 2014)

From: James Graham <james@hoppipolla.co.uk>
Date: Tue, 01 Jul 2014 12:33:03 +0100
To: public-test-infra@w3.org
Message-ID: <53B29C6F.8010903@hoppipolla.co.uk>

On 01/07/14 01:22, Dirk Pranke wrote:
>
> On Mon, Jun 30, 2014 at 5:06 PM, Anton Modella Quintana (Plain
> Concepts Corporation) <v-antonm@microsoft.com
> <mailto:v-antonm@microsoft.com>> wrote:
>
>     Hello public-test-infra,
>
>     As Erika said previously [1], Microsoft is working on adding
>     support for IE to wptrunner and contributing back as much as we
>     can. While we created our first internal prototype, one of the
>     problems we found was the ref tests. Some of them were failing
>     just because the antialias on a curve was different depending on
>     the browser. I don't think those tests should fail.
>     To mitigate the number of false negatives we tested different
>     approaches and at the end we decided to use ImageMagick, its
>     compare tool and a fuzz factor [2]. Basically we compare how
>     different the two images are and if we get a factor equal to or less
>     than 0.015, then we pass the test. This value is experimental and
>     it is the best we got after trying different algorithms and
>     factors. I've attached a few images for you to better see how even
>     if the images are not exactly equal, the test should pass (at
>     least in this example).
>
>     Some concerns about this approach:
>     * It has a dependency on ImageMagick (we could implement the
>     algorithm to remove this dependency if needed)
>     * There might be some tests where the factor should be tweaked or
>     even disabled. This number could even change depending on the
>     browser we are testing
>
>     So what does public-test-infra think of this?
>
> I believe that I have seen similar sorts of reftest failures in Blink
> and WebKit over the years as well, though I'm not sure if we have them
> currently (we probably do).

I know we have similar problems with Mozilla reftests. I think our
current solution is simply to quantify the maximum number of pixels that
can be different. I was hoping we could avoid solving this for
web-platform-tests, but maybe that's over-optimistic. Do you have a list
of the tests that are giving incorrect results without the use of
ImageMagick?
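
As a rough illustration of the "maximum number of differing pixels" idea, here is a minimal sketch. It is not the Mozilla reftest harness code; the function names, the `max_pixels` parameter, and the representation of a screenshot as a list of rows of (R, G, B) tuples are all assumptions made for the example.

```python
def count_differing_pixels(test_pixels, ref_pixels):
    """Count pixels that are not identical between two same-sized images.

    Each image is a list of rows; each row is a list of (R, G, B) tuples.
    This representation is an assumption made for the sketch.
    """
    if len(test_pixels) != len(ref_pixels):
        raise ValueError("images have different heights")
    differing = 0
    for test_row, ref_row in zip(test_pixels, ref_pixels):
        if len(test_row) != len(ref_row):
            raise ValueError("images have different widths")
        for test_px, ref_px in zip(test_row, ref_row):
            if test_px != ref_px:
                differing += 1
    return differing


def reftest_passes(test_pixels, ref_pixels, max_pixels=0):
    """Pass when at most max_pixels pixels differ (hypothetical threshold).

    The default of 0 keeps the strict behaviour: any difference fails.
    """
    return count_differing_pixels(test_pixels, ref_pixels) <= max_pixels
```

With a default of zero, a test would have to opt in explicitly to any tolerance, so existing exact-match tests would be unaffected.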

> I would be a bit sad to pull in a dependency on ImageMagick given that
> it is in Perl, but presumably different platforms can do different
> things as need be.

That requires us to understand the algorithm, to the level we can
reimplement it. I'm not sure we currently have that level of
understanding of what ImageMagick does.

I also have principled worries about this approach. For example, it's
pretty clear that most tests showing a green square shouldn't show *any*
differences; a single red pixel might be a fail. Similarly, but more
worryingly, we could have a test where a small difference in one area
would be fine (e.g. due to antialiasing differences on a curve), but a
small difference in another area would be a problem.

I suppose the sophisticated solution is to require that where we allow
any difference at all, we check in a mask image that identifies the
pixels that are allowed to differ (and, potentially, how much they are
allowed to differ by). I'm not sure if this is too much effort for the
amount of benefit, however.
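
To make the mask idea concrete, a minimal sketch follows. Nothing like this exists in wptrunner as far as this message says; the function name, the image representation (lists of rows of (R, G, B) tuples), and the encoding of the mask as a per-pixel maximum allowed channel difference are all assumptions for illustration.

```python
def pixels_outside_mask(test_pixels, ref_pixels, mask):
    """Return (x, y) coordinates where the images differ more than allowed.

    mask has the same dimensions as the images. Each entry is the largest
    per-channel difference tolerated at that pixel; 0 means the pixel must
    match exactly. This encoding is hypothetical.
    """
    if not (len(test_pixels) == len(ref_pixels) == len(mask)):
        raise ValueError("images and mask have different heights")
    failures = []
    for y, (test_row, ref_row, mask_row) in enumerate(
        zip(test_pixels, ref_pixels, mask)
    ):
        if not (len(test_row) == len(ref_row) == len(mask_row)):
            raise ValueError("images and mask have different widths")
        for x, (test_px, ref_px, allowed) in enumerate(
            zip(test_row, ref_row, mask_row)
        ):
            # Largest difference across the colour channels of this pixel.
            diff = max(abs(a - b) for a, b in zip(test_px, ref_px))
            if diff > allowed:
                failures.append((x, y))
    return failures
```

A test would pass only if the returned list is empty, so a small antialiasing difference inside the masked region is tolerated while a single stray red pixel elsewhere still fails.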
Received on Tuesday, 1 July 2014 11:33:32 UTC