Re: Identifying unstable tests (was: Review of tests upstreamed by implementors) from Tobie Langel on 2013-03-21 (public-test-infra@w3.org from January to March 2013)

From: Tobie Langel <tobie@w3.org>
Date: Thu, 21 Mar 2013 16:50:58 +0100
To: James Graham <jgraham@opera.com>
Cc: Robin Berjon <robin@w3.org>, Dirk Pranke <dpranke@chromium.org>, public-test-infra <public-test-infra@w3.org>
Message-ID: <3ECC13707BF74D6FA328547EEECBE1C9@w3.org>

On Thursday, March 21, 2013 at 4:31 PM, James Graham wrote:
> On Thu, 21 Mar 2013, Tobie Langel wrote:
> > On Thursday, March 21, 2013 at 2:56 PM, Robin Berjon wrote:
> > > On 21/03/2013 14:11 , James Graham wrote:
> > > > One approach Opera have used with success is to implement a "quarantine"
> > > > system in which each test is run a large number of times, say 200, and
> > > > tests that don't give consistent results are sent back to a human for
> > > > further analysis. Any W3C test-running tool should have this ability so
> > > > that we can discover (some of) the tests that have problematically
> > > > random behaviour before they are merged into the testsuite. In addition
> > > > we should make a list of points for authors and reviewers to use so that
> > > > they avoid or reject known-unstable patterns (see e.g. [1]).
> > >  
> > > That doesn't sound too hard to do. At regular intervals, we could:
> > >  
> > > • List all pull requests through the GH API.
> > > • For each of those:
> > >   • Check out a fresh copy of the repo
> > >   • Apply the pull request locally
> > >   • Run all tests (ideally using something that has multiple browsers,
> > >     but since we're looking for breakage even just PhantomJS or something
> > >     like it would already weed out trouble).
> > >   • Report issues.
> > >  
> > > It's a bit of work, but it's doable.
> > Yes, such a system is planned and budgeted.
> >  
> > I hadn't thought about using it to find unstable tests, but that should  
> > be easy enough to set up. A cron job could go through the results,  
> > identify flaky tests and file bugs.
>  
> So, I'm not exactly clear what you're proposing, but experience suggests  
> that the best way to identify flaky tests is upfront, by running the test  
> multiple (hundreds of) times before it is used as part of a test run.  
> Trying to use historical result data to identify flaky tests sounds  
> appealing, but it is much more complex since both the test and the UA may  
> change between runs. That doesn't mean it's impossible, but I strongly  
> recommend implementing the simple approach first.

Noted.
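
For reference, the up-front approach described above (run each test many times and flag any that disagree with themselves) might look roughly like the sketch below. This is only an illustration under stated assumptions: `run_test` is a hypothetical callable that runs one test in some UA and returns a result string such as "PASS" or "FAIL"; no such function exists in any W3C tool that this thread refers to. The default of 200 runs is the figure James mentions.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def find_unstable(
    tests: Iterable[str],
    run_test: Callable[[str], str],  # hypothetical: runs one test, returns its result
    runs: int = 200,                 # "a large number of times, say 200"
) -> Dict[str, Counter]:
    """Run every test `runs` times and return those with inconsistent results.

    The returned mapping goes from test name to a Counter of the distinct
    results seen, so a human can judge how flaky each test is.
    """
    unstable: Dict[str, Counter] = {}
    for test in tests:
        outcomes = Counter(run_test(test) for _ in range(runs))
        # A stable test yields exactly one distinct result across all runs.
        if len(outcomes) > 1:
            unstable[test] = outcomes
    return unstable
```

Anything this returns would go back to a human for analysis before the test is merged. Because every run uses the same test and the same UA, it avoids the problem with historical data, where both may have changed between runs.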
> > The more complex question is what should be done with those tests from  
> > the time they are identified as problematic until they're fixed. And how  
> > should this information be conveyed downstream?
>  
> In the case of files where all the tests are unstable it's easy; back out  
> the test. In the case of files with multiple tests of which only some are  
> unstable, things are more complicated. In the simplest case one might be  
> able to apply a patch to back out that subtest (obviously that requires  
> human work).

That makes sense.
> I wonder if there are more complex cases where simply backing  
> out the test is undesirable?

I would think not.
> Robin is going to kill me for this, but if we had manifest files rather  
> than trying to store all the test metadata in the test name, we could  
> store a list of child test names for each parent test url that are known  
> to be unstable so that vendors would know to skip those when looking at  
> the results.

I'm not opposed to the idea of manifest files.
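
To make that concrete, here is a rough sketch of how a vendor might consume such a manifest. Everything about the format is an assumption made for illustration: the JSON layout, the example URL and the subtest name are hypothetical, since no manifest format has been agreed on.

```python
import json
from typing import Dict, List

# Hypothetical manifest: parent test URL -> child test names known to be unstable.
MANIFEST_JSON = """
{
  "/dom/example-parent.html": ["subtest that sometimes times out"]
}
"""

# Results are modelled as: parent test URL -> {child test name: status}.
Results = Dict[str, Dict[str, str]]


def drop_unstable(results: Results, manifest: Dict[str, List[str]]) -> Results:
    """Return a copy of `results` without the child tests listed as unstable."""
    filtered: Results = {}
    for url, subtests in results.items():
        skip = set(manifest.get(url, []))
        filtered[url] = {
            name: status for name, status in subtests.items() if name not in skip
        }
    return filtered


manifest = json.loads(MANIFEST_JSON)
```

With this, a file with only some unstable subtests could stay in the testsuite while downstream consumers ignore the known-bad ones until they are fixed.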

--tobie

Received on Thursday, 21 March 2013 15:51:08 UTC