- From: James Graham <jgraham@opera.com>
- Date: Thu, 21 Mar 2013 16:31:17 +0100 (CET)
- To: Tobie Langel <tobie@w3.org>
- cc: Robin Berjon <robin@w3.org>, Dirk Pranke <dpranke@chromium.org>, public-test-infra <public-test-infra@w3.org>
- Message-ID: <alpine.DEB.2.02.1303211623230.7756@sirius>
On Thu, 21 Mar 2013, Tobie Langel wrote:

> On Thursday, March 21, 2013 at 2:56 PM, Robin Berjon wrote:
>> On 21/03/2013 14:11, James Graham wrote:
>>> One approach Opera have used with success is to implement a
>>> "quarantine" system in which each test is run a large number of times,
>>> say 200, and tests that don't give consistent results are sent back to
>>> a human for further analysis. Any W3C test-running tool should have
>>> this ability so that we can discover (some of) the tests that have
>>> problematically random behaviour before they are merged into the
>>> testsuite. In addition we should make a list of points for authors and
>>> reviewers to use so that they avoid or reject known-unstable patterns
>>> (see e.g. [1]).
>>
>> That doesn't sound too hard to do. At regular intervals, we could:
>>
>> • List all pull requests through the GH API.
>> • For each of those:
>>   • Check out a fresh copy of the repo.
>>   • Apply the pull request locally.
>>   • Run all tests (ideally using something that has multiple browsers,
>>     but since we're looking for breakage, even just PhantomJS or
>>     something like it would already weed out trouble).
>>   • Report issues.
>>
>> It's a bit of work, but it's doable.
>
> Yes, such a system is planned and budgeted.
>
> I hadn't thought about using it to find unstable tests, but that should
> be easy enough to set up. A cron job could go through the results,
> identify flaky tests and file bugs.

So, I'm not exactly clear what you're proposing, but experience suggests
that the best way to identify flaky tests is upfront, by running each test
many (hundreds of) times before it is used as part of a test run (a rough
sketch of such a check appears at the end of this message). Trying to use
historical result data to identify flaky tests sounds appealing, but it is
much more complex, since both the test and the UA may change between runs.
That doesn't mean it's impossible, but I strongly recommend implementing
the simple approach first.

> The more complex question is what should be done with those tests from
> the time they are identified as problematic until they're fixed. And how
> should this information be conveyed downstream.

In the case of files where all the tests are unstable, it's easy: back out
the test. In the case of files with multiple tests of which only some are
unstable, things are more complicated. In the simplest case one might be
able to apply a patch to back out that subtest (obviously that requires
human work). I wonder if there are more complex cases where simply backing
out the test is undesirable?

Robin is going to kill me for this, but if we had manifest files rather
than trying to store all the test metadata in the test name, we could
store a list of child test names for each parent test URL that are known
to be unstable, so that vendors would know to skip those when looking at
the results.
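For illustration, a minimal sketch of the kind of cron job Robin describes
above: list open pull requests through the GitHub API, check each one out
into a fresh clone, run the tests and report problems. This is only a
sketch under assumptions; the repository name, the "run_tests" and
"report" hooks, and the overall shape of the job are placeholders, not an
existing service.

    # Hypothetical PR-checking cron job (sketch only). "run_tests" and
    # "report" stand in for whatever harness and issue-filing mechanism
    # the eventual service would use.
    import json
    import subprocess
    import tempfile
    import urllib.request

    REPO = "w3c/web-platform-tests"  # assumed repository name

    def open_pull_requests():
        # List open pull requests via the public GitHub API.
        url = "https://api.github.com/repos/%s/pulls?state=open" % REPO
        with urllib.request.urlopen(url) as response:
            return json.loads(response.read().decode("utf-8"))

    def check_pull_request(pr_number, run_tests, report):
        # Fresh clone per pull request, so runs don't contaminate each other.
        with tempfile.TemporaryDirectory() as workdir:
            subprocess.check_call(["git", "clone",
                                   "https://github.com/%s.git" % REPO, workdir])
            # Fetch the pull request head and check it out locally.
            subprocess.check_call(["git", "fetch", "origin",
                                   "pull/%d/head:pr" % pr_number], cwd=workdir)
            subprocess.check_call(["git", "checkout", "pr"], cwd=workdir)
            failures = run_tests(workdir)      # placeholder harness hook
            if failures:
                report(pr_number, failures)    # e.g. comment on the PR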
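To make the quarantine idea concrete, here is a minimal sketch of an
upfront stability check: run each new or changed test a large number of
times and flag any test whose per-subtest results vary between runs.
Everything here is illustrative; in particular "run_test" stands in for
whatever harness actually loads a test in a browser and returns a mapping
of subtest name to status, and is not an existing API.

    # Hypothetical stability ("quarantine") checker: run each test N times
    # and report any test whose subtest results are not identical across runs.
    RUNS = 200

    def find_unstable_tests(test_urls, run_test, runs=RUNS):
        unstable = {}
        for url in test_urls:
            seen = {}  # subtest name -> set of statuses observed so far
            for _ in range(runs):
                for subtest, status in run_test(url).items():
                    seen.setdefault(subtest, set()).add(status)
            # A subtest that ever produced more than one status is unstable.
            flaky = [name for name, statuses in seen.items()
                     if len(statuses) > 1]
            if flaky:
                unstable[url] = flaky
        return unstable

Anything returned by such a check would go back to a human for analysis
rather than straight into the testsuite.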
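And, for the manifest idea in the last paragraph, a sketch of what a
per-test entry and a downstream filter might look like. The manifest
layout, the example test URL and subtest name, and the "stable_results"
helper are all hypothetical, not an agreed format.

    # Purely illustrative manifest entry: per parent test URL, a list of
    # subtest names known to be unstable, so downstream runners can skip
    # or ignore them when comparing results.
    EXAMPLE_MANIFEST = {
        "/dom/nodes/Node-cloneNode.html": {
            "unstable_subtests": [
                "cloneNode on a document with a browsing context"  # made up
            ]
        }
    }

    def stable_results(results, manifest_entry):
        # Drop results for subtests the manifest flags as unstable.
        # `results` maps subtest name -> status for one parent test URL.
        skip = set(manifest_entry.get("unstable_subtests", []))
        return {name: status for name, status in results.items()
                if name not in skip}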
Received on Thursday, 21 March 2013 15:31:51 UTC