- From: Dirk Pranke <dpranke@chromium.org>
- Date: Thu, 21 Mar 2013 15:38:50 -0700
- To: James Graham <jgraham@opera.com>
- Cc: Tobie Langel <tobie@w3.org>, Robin Berjon <robin@w3.org>, public-test-infra <public-test-infra@w3.org>
- Message-ID: <CAEoffTAHxa8X4MRbF09TJpoE+EsZwaZBNZAoOX_QAiZjmnn4Fw@mail.gmail.com>
On Thu, Mar 21, 2013 at 8:31 AM, James Graham <jgraham@opera.com> wrote:
> On Thu, 21 Mar 2013, Tobie Langel wrote:
>> On Thursday, March 21, 2013 at 2:56 PM, Robin Berjon wrote:
>>> On 21/03/2013 14:11, James Graham wrote:
>>>> One approach Opera have used with success is to implement a "quarantine"
>>>> system in which each test is run a large number of times, say 200, and
>>>> tests that don't give consistent results are sent back to a human for
>>>> further analysis. Any W3C test-running tool should have this ability so
>>>> that we can discover (some of) the tests that have problematically
>>>> random behaviour before they are merged into the test suite. In addition,
>>>> we should make a list of points for authors and reviewers to use so that
>>>> they avoid or reject known-unstable patterns (see e.g. [1]).
>>>
>>> That doesn't sound too hard to do. At regular intervals, we could:
>>>
>>> • List all pull requests through the GH API.
>>> • For each of those:
>>>   • Check out a fresh copy of the repo.
>>>   • Apply the pull request locally.
>>>   • Run all tests (ideally using something that has multiple browsers,
>>>     but since we're looking for breakage, even just PhantomJS or
>>>     something like it would already weed out trouble).
>>>   • Report issues.
>>>
>>> It's a bit of work, but it's doable.
>>
>> Yes, such a system is planned and budgeted.
>>
>> I hadn't thought about using it to find unstable tests, but that should
>> be easy enough to set up. A cron job could go through the results,
>> identify flaky tests, and file bugs.
>
> So, I'm not exactly clear what you're proposing, but experience suggests
> that the best way to identify flaky tests is upfront, by running the test
> multiple (hundreds of) times before it is used as part of a test run.
> Trying to use historical result data to identify flaky tests sounds
> appealing, but it is much more complex, since both the test and the UA may
> change between runs. That doesn't mean it's impossible, but I strongly
> recommend implementing the simple approach first.

FWIW, WebKit has invested a fair amount of time in tracking test flakiness
over time. The initial up-front beating probably identifies many problems,
but we often find cases where tests are flaky only on particular machine
configurations (e.g., they run fine on a big workstation but not on a VM),
or become flaky when tests accidentally introduce side effects into the
environment or the test executables. It's a significant source of pain,
unfortunately.

-- Dirk
Received on Thursday, 21 March 2013 22:39:38 UTC
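
A minimal sketch of the "quarantine" idea discussed in this thread: run each candidate test a few hundred times and flag any test whose results differ between runs, so a human can look at it before it joins the test suite. The `run_test` callable and the result strings here are hypothetical placeholders for whatever harness actually drives the browser, not part of any existing W3C tooling; and, as Dirk notes, a stable result on one machine configuration does not guarantee stability elsewhere.

```python
from collections import Counter


def quarantine_check(test_id, run_test, iterations=200):
    """Run one test `iterations` times and report whether its result is stable.

    `run_test` is assumed to be a callable that executes the test in a
    browser and returns a hashable outcome such as "PASS", "FAIL", or
    "TIMEOUT".
    """
    results = Counter(run_test(test_id) for _ in range(iterations))
    is_stable = len(results) == 1
    return is_stable, dict(results)


def find_flaky_tests(test_ids, run_test, iterations=200):
    """Return {test_id: result_counts} for tests with inconsistent results."""
    flaky = {}
    for test_id in test_ids:
        stable, counts = quarantine_check(test_id, run_test, iterations)
        if not stable:
            # Inconsistent results: send this test back to a human for analysis.
            flaky[test_id] = counts
    return flaky
```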