Re: Identifying unstable tests (was: Review of tests upstreamed by implementors)

On Thu, 21 Mar 2013, Tobie Langel wrote:

> On Thursday, March 21, 2013 at 2:56 PM, Robin Berjon wrote:
>> On 21/03/2013 14:11 , James Graham wrote:
>>> One approach Opera have used with success is to implement a "quarantine"
>>> system in which each test is run a large number of times, say 200, and
>>> tests that don't give consistent results are sent back to a human for
>>> further analysis. Any W3C test-running tool should have this ability so
>>> that we can discover (some of) the tests that have problematically
>>> random behaviour before they are merged into the testsuite. In addition
>>> we should make a list of points for authors and reviewers to use so that
>>> they avoid or reject known-unstable patterns (see e.g. [1]).
>>
>> That doesn't sound too hard to do. At regular intervals, we could:
>>
>> • List all pull requests through the GH API.
>> • For each of those:
>>   • Check out a fresh copy of the repo.
>>   • Apply the pull request locally.
>>   • Run all tests (ideally using something that has multiple browsers,
>>     but since we're looking for breakage even just PhantomJS or
>>     something like it would already weed out trouble).
>>   • Report issues.
>>
>> It's a bit of work, but it's doable.
> Yes, such a system is planned and budgeted.
>
> I hadn't thought about using it to find unstable tests, but that should 
> be easy enough to set up. A cron job could go through the results, 
> identify flaky tests and file bugs.

So, I'm not exactly clear what you're proposing, but experience suggests 
that the best way to identify flaky tests is upfront, by running each test 
many times (hundreds of runs) before it is used as part of a test run. 
Trying to use historical result data to identify flaky tests sounds 
appealing, but it is much more complex since both the test and the UA may 
change between runs. That doesn't mean it's impossible, but I strongly 
recommend implementing the simple approach first.
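
To make the simple approach concrete, here is a rough Python sketch; the 
run_once() hook is hypothetical (it would drive a real browser against the 
test URL), and fake_run() at the bottom just simulates a flaky subtest:

from collections import defaultdict

def find_unstable_subtests(run_once, runs=200):
    """run_once() executes the test file once and returns a dict mapping
    each subtest name to its status ("PASS", "FAIL", ...)."""
    results = defaultdict(set)               # subtest name -> observed statuses
    for _ in range(runs):
        for name, status in run_once().items():
            results[name].add(status)
    # A subtest is unstable if it produced more than one distinct status.
    return sorted(name for name in results if len(results[name]) > 1)

if __name__ == "__main__":
    import random
    def fake_run():                           # stands in for a real browser run
        return {"stable subtest": "PASS",
                "flaky subtest": random.choice(["PASS", "FAIL"])}
    print(find_unstable_subtests(fake_run))   # almost surely ['flaky subtest']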

> The more complex question is what should be done with those tests from 
> the time they are identified as problematic until they're fixed, and how 
> this information should be conveyed downstream.

In the case of files where all the tests are unstable it's easy: back out 
the whole file. In the case of files with multiple tests of which only some 
are unstable, things are more complicated. In the simplest case one might 
be able to apply a patch to back out just the unstable subtest (obviously 
that requires human work). I wonder if there are more complex cases where 
simply backing out the test is undesirable?
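
Roughly, the decision looks like this (names here are hypothetical, not a 
proposal):

def triage(all_subtests, unstable_subtests):
    """Decide what to do with a test file given its unstable subtests."""
    if not unstable_subtests:
        return "keep as-is"
    if set(unstable_subtests) == set(all_subtests):
        return "back out the whole file"      # the easy case above
    return "patch out the unstable subtests"  # needs a human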

Robin is going to kill me for this, but if we had manifest files rather 
than trying to store all the test metadata in the test name, we could 
store, for each parent test URL, a list of child test names that are known 
to be unstable, so that vendors would know to skip those when looking at 
the results.
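
For the sake of argument, that could be as small as a JSON file keyed on 
the parent test URL; the format and paths below are entirely made up:

import json

# Hypothetical manifest: parent test URL -> child test names known unstable.
MANIFEST = json.loads("""
{
  "/example/parent-test-1.html": ["subtest that races a timeout"],
  "/example/parent-test-2.html": ["subtest A", "subtest B"]
}
""")

def filter_results(test_url, results):
    """Drop subtests the manifest flags as unstable, so downstream
    comparisons only look at the stable ones."""
    unstable = set(MANIFEST.get(test_url, []))
    return {name: status for name, status in results.items()
            if name not in unstable}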

Received on Thursday, 21 March 2013 15:31:51 UTC