- From: Tobie Langel <tobie@w3.org>
- Date: Thu, 21 Mar 2013 16:50:58 +0100
- To: James Graham <jgraham@opera.com>
- Cc: Robin Berjon <robin@w3.org>, Dirk Pranke <dpranke@chromium.org>, public-test-infra <public-test-infra@w3.org>
On Thursday, March 21, 2013 at 4:31 PM, James Graham wrote:

> On Thu, 21 Mar 2013, Tobie Langel wrote:
> > On Thursday, March 21, 2013 at 2:56 PM, Robin Berjon wrote:
> > > On 21/03/2013 14:11, James Graham wrote:
> > > > One approach Opera have used with success is to implement a
> > > > "quarantine" system in which each test is run a large number of
> > > > times, say 200, and tests that don't give consistent results are
> > > > sent back to a human for further analysis. Any W3C test-running
> > > > tool should have this ability so that we can discover (some of) the
> > > > tests that have problematically random behaviour before they are
> > > > merged into the testsuite. In addition we should make a list of
> > > > points for authors and reviewers to use so that they avoid or
> > > > reject known-unstable patterns (see e.g. [1]).
> > >
> > > That doesn't sound too hard to do. At regular intervals, we could:
> > >
> > > • List all pull requests through the GH API.
> > > • For each of those:
> > >   • Check out a fresh copy of the repo
> > >   • Apply the pull request locally
> > >   • Run all tests (ideally using something that has multiple
> > >     browsers, but since we're looking for breakage, even just
> > >     PhantomJS or something like it would already weed out trouble).
> > >   • Report issues.
> > >
> > > It's a bit of work, but it's doable.
> >
> > Yes, such a system is planned and budgeted.
> >
> > I hadn't thought about using it to find unstable tests, but that should
> > be easy enough to set up. A cron job could go through the results,
> > identify flaky tests and file bugs.
>
> So, I'm not exactly clear what you're proposing, but experience suggests
> that the best way to identify flaky tests is upfront, by running the test
> multiple (hundreds of) times before it is used as part of a test run.
> Trying to use historical result data to identify flaky tests sounds
> appealing, but it is much more complex, since both the test and the UA
> may change between runs. That doesn't mean it's impossible, but I
> strongly recommend implementing the simple approach first.

Noted.

> > The more complex question is what should be done with those tests from
> > the time they are identified as problematic until they're fixed. And
> > how should this information be conveyed downstream.
>
> In the case of files where all the tests are unstable, it's easy; back
> out the test. In the case of files with multiple tests of which only some
> are unstable, things are more complicated. In the simplest case one might
> be able to apply a patch to back out that subtest (obviously that
> requires human work).

That makes sense.

> I wonder if there are more complex cases where simply backing out the
> test is undesirable?

I would think not.

> Robin is going to kill me for this, but if we had manifest files rather
> than trying to store all the test metadata in the test name, we could
> store a list of child test names for each parent test URL that are known
> to be unstable, so that vendors would know to skip those when looking at
> the results.

I'm not opposed to the idea of manifest files.

--tobie
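
P.S. For what it's worth, here's a rough sketch of the quarantine-style stability check described above: run each test a few hundred times and flag anything whose results differ between runs. `run_test`, the run count, and the test names below are all placeholders, not part of any existing tool.

```python
# Sketch of a quarantine check: run each test many times and flag any test
# whose outcomes are not identical across runs. run_test is a stand-in for
# whatever actually drives a browser and returns a result string.

import random
from collections import Counter


def find_unstable(tests, run_test, runs=200):
    """Return {test: outcome counts} for tests with more than one outcome."""
    unstable = {}
    for test in tests:
        counts = Counter(run_test(test) for _ in range(runs))
        if len(counts) > 1:
            unstable[test] = dict(counts)
    return unstable


if __name__ == "__main__":
    # Stand-in runner: one deliberately flaky test, one stable one.
    def fake_runner(test):
        if test == "dom/flaky.html":
            return random.choice(["PASS", "FAIL"])
        return "PASS"

    print(find_unstable(["dom/flaky.html", "dom/stable.html"], fake_runner))
```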
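
And a similarly rough sketch of the periodic pull-request check Robin outlines: list open pull requests through the GitHub API, fetch each one into a fresh clone, and run the suite there. The repository name and the `./run-tests.sh` runner are placeholders for whatever we end up using.

```python
# Sketch only: list open PRs, fetch each into a fresh clone, run the tests.

import json
import subprocess
import tempfile
import urllib.request

REPO = "w3c/web-platform-tests"  # placeholder repository name


def open_pull_requests():
    url = f"https://api.github.com/repos/{REPO}/pulls?state=open"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def check_pull_request(number):
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.check_call(
            ["git", "clone", f"https://github.com/{REPO}.git", workdir])
        # GitHub exposes each pull request as a ref that can be fetched.
        subprocess.check_call(
            ["git", "fetch", "origin", f"pull/{number}/head:pr"], cwd=workdir)
        subprocess.check_call(["git", "checkout", "pr"], cwd=workdir)
        # Placeholder for whatever actually runs the suite (PhantomJS, etc.).
        return subprocess.call(["./run-tests.sh"], cwd=workdir)


if __name__ == "__main__":
    for pr in open_pull_requests():
        status = check_pull_request(pr["number"])
        print(pr["number"], "OK" if status == 0 else "FAILED")
```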
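
Finally, one possible shape for the manifest idea: per test URL, a list of child test names known to be unstable, which a consumer of the results can skip. The layout and all names here are purely illustrative.

```python
# Hypothetical manifest shape: for each test URL, the child test names known
# to be unstable, plus a helper that filters them out of a result set.

EXAMPLE_MANIFEST = {
    "/dom/nodes/Node-cloneNode.html": {
        "unstable": ["deep clone of a template element"]
    }
}


def filter_results(results, manifest):
    """Drop (url, subtest, status) entries whose subtest is marked unstable."""
    return [
        (url, subtest, status)
        for url, subtest, status in results
        if subtest not in manifest.get(url, {}).get("unstable", [])
    ]


if __name__ == "__main__":
    results = [
        ("/dom/nodes/Node-cloneNode.html",
         "deep clone of a template element", "FAIL"),
        ("/dom/nodes/Node-cloneNode.html", "basic clone", "PASS"),
    ]
    print(filter_results(results, EXAMPLE_MANIFEST))
```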
Received on Thursday, 21 March 2013 15:51:08 UTC