Re: Identifying unstable tests (was: Review of tests upstreamed by implementors) from Dirk Pranke on 2013-03-21 (public-test-infra@w3.org from January to March 2013)

From: Dirk Pranke <dpranke@chromium.org>
Date: Thu, 21 Mar 2013 15:38:50 -0700
To: James Graham <jgraham@opera.com>
Cc: Tobie Langel <tobie@w3.org>, Robin Berjon <robin@w3.org>, public-test-infra <public-test-infra@w3.org>
Message-ID: <CAEoffTAHxa8X4MRbF09TJpoE+EsZwaZBNZAoOX_QAiZjmnn4Fw@mail.gmail.com>

On Thu, Mar 21, 2013 at 8:31 AM, James Graham <jgraham@opera.com> wrote:

>
>
> On Thu, 21 Mar 2013, Tobie Langel wrote:
>
>> On Thursday, March 21, 2013 at 2:56 PM, Robin Berjon wrote:
>>
>>> On 21/03/2013 14:11 , James Graham wrote:
>>>
>>>> One approach Opera have used with success is to implement a "quarantine"
>>>> system in which each test is run a large number of times, say 200, and
>>>> tests that don't give consistent results are sent back to a human for
>>>> further analysis. Any W3C test-running tool should have this ability so
>>>> that we can discover (some of) the tests that have problematically
>>>> random behaviour before they are merged into the testsuite. In addition
>>>> we should make a list of points for authors and reviewers to use so that
>>>> they avoid or reject known-unstable patterns (see e.g. [1]).
>>>>
>>>
>>> That doesn't sound too hard to do. At regular intervals, we could:
>>>
>>> • List all pull requests through the GH API.
>>> • For each of those:
>>> • Check out a fresh copy of the repo
>>> • Apply the pull request locally
>>> • Run all tests (ideally using something that has multiple browsers,
>>> but since we're looking for breakage even just PhantomJS or something
>>> like it would already weed out trouble).
>>> • Report issues.
>>>
>>> It's a bit of work, but it's doable.
>>>
>> Yes, such a system is planned and budgeted.
>>
>> I hadn't thought about using it to find unstable tests, but that should
>> be easy enough to set up. A cron job could go through the results, identify
>> flaky tests and file bugs.
>>
>
> So, I'm not exactly clear what you're proposing, but experience suggests
> that the best way to identify flaky tests is upfront, by running the test
> multiple (hundreds of) times before it is used as part of a test run.
> Trying to use historical result data to identify flaky tests sounds
> appealing, but it is much more complex since both the test and the UA may
> change between runs. That doesn't mean it's impossible, but I strongly
> recommend implementing the simple approach first.
>
>
FWIW, WebKit has invested a fair amount of time in tracking test flakiness
over time. The initial up-front beating probably identifies many problems,
but we often find cases where tests can be flaky on different machine
configurations (e.g., runs fine on a big workstation, but not on a VM) or
can be flaky when the tests accidentally introduce side effects into the
environment or the test executables.

It's a significant source of pain, unfortunately.
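
For reference, the up-front approach quoted above (run each test a large
number of times, say 200, and send anything with inconsistent results back
to a human) could be sketched roughly as follows. This is only an
illustration, not Opera's or WebKit's actual tooling: `quarantine`,
`run_test`, and the result strings are hypothetical names, and `run_test`
stands in for whatever actually drives a browser.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def quarantine(
    test_ids: Iterable[str],
    run_test: Callable[[str], str],  # hypothetical runner: test id -> result string
    runs: int = 200,                 # "a large number of times, say 200"
) -> Dict[str, Dict[str, int]]:
    """Run each test `runs` times and return those with inconsistent results.

    The returned dict maps each unstable test id to a tally of the
    distinct results it produced, for a human to analyse further.
    """
    unstable: Dict[str, Dict[str, int]] = {}
    for test_id in test_ids:
        # Count how often each distinct result was seen for this test.
        outcomes = Counter(run_test(test_id) for _ in range(runs))
        # More than one distinct result means the test is not stable.
        if len(outcomes) > 1:
            unstable[test_id] = dict(outcomes)
    return unstable
```

A check like this only sees flakiness that shows up on the machine it runs
on, so it would not by itself catch the configuration-dependent cases
(workstation vs. VM) or the side-effect cases described above.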

-- Dirk

Received on Thursday, 21 March 2013 22:39:38 UTC