- From: James Graham <jgraham@opera.com>
- Date: Thu, 21 Mar 2013 14:11:01 +0100 (CET)
- To: Robin Berjon <robin@w3.org>
- cc: Dirk Pranke <dpranke@chromium.org>, public-test-infra <public-test-infra@w3.org>
On Thu, 21 Mar 2013, Robin Berjon wrote:

>> Another challenge lies in making it absolutely painless to pull new
>> tests and run them. Fixing this one first (which is what I believe James
>> was largely alluding to) will go a long way to building up the trust
>> (and make it easier to build it up further).
>
> I actually think that that's the first challenge to get the virtuous cycle
> kick-started. In the first runs it's likely that there will be problems, the
> least we can do is to make performing a test run as easy to set up as
> possible.
>
> Any requirements you can provide in this area would be extremely helpful.

So I can provide some input here, based on Opera's experience plus past discussions I've had with people working on other engines.

First of all, to address the earlier point about test quality: it is true that there are more stringent requirements for quality when running tests in automated test systems, particularly if they have to run on a wide variety of devices. In particular, noise in such systems is a killer for productivity and has to be eliminated. One approach Opera have used with success is to implement a "quarantine" system in which each test is run a large number of times, say 200, and tests that don't give consistent results are sent back to a human for further analysis. Any W3C test-running tool should have this ability, so that we can discover (some of) the tests that have problematically random behaviour before they are merged into the testsuite. In addition, we should make a list of points for authors and reviewers to use so that they avoid or reject known-unstable patterns (see e.g. [1]).
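To make that concrete, here is a minimal sketch of the kind of quarantine step I mean. The run_test callable, the shape of its result, and the fixed count of 200 runs are placeholders for illustration, not a description of Opera's actual system:

    def quarantine(tests, run_test, runs=200):
        """Split tests into stable and unstable by brute-force repetition.

        run_test(test) is assumed to return a hashable summary of the
        outcome, e.g. a (status, message) tuple; any variation across
        runs marks the test as unstable.
        """
        stable, unstable = [], []
        for test in tests:
            results = {run_test(test) for _ in range(runs)}
            (stable if len(results) == 1 else unstable).append(test)
        return stable, unstable

Anything that ends up in the unstable list goes back to a human for analysis rather than into the merged testsuite.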
In terms of the effect of the number of tests on performance, I don't think we need to be terribly concerned, at least on desktop (on devices all problems are harder, so we should concentrate on the easy cases first). Looking at a randomly chosen test run, I can see that Opera's system ran almost 6000 javascript-based test files containing a total of 230,000 tests in a little under 4 machine-hours, giving a time per test of about 62ms. To give an idea of the overhead per test, a different type of test with a single javascript test per file ran at about 250ms/test. This is on rather old hardware, so I would expect others to do better right now. In conclusion, tests are individually quick to run and easily distributed across multiple machines, so there is little point in putting large numbers of man-hours into optimising the performance of tests (e.g. by trying to eliminate all duplication).

I think that covers the technical problems that might stop people wanting to run the tests at all, although I would be interested to hear about any other technical objections. There is also the problem that people might simply not see the value in putting any effort into running the tests or fixing the problems that they find. This is a complex and slightly separate problem, which is at least partly a PR issue: web authors should come to regard releasing buggy implementations of specs as unacceptable, and W3C tests should be regarded as the best way to find out whether implementations are buggy. Working out how to do that without compromising the quality of the tests is a rather different discussion.

Assuming that implementors actually want to import and run the tests, there are a number of practical issues that they face. The first is simply that they must sync the external repository with the one in which they keep their tests. That's pretty trivial if you run git and pretty much a headache if you don't; so, for most vendors at the moment, it's a headache.

Once you have imported the tests, it must be possible to tell which files you should actually load and what system should be used to get results (i.e. given a file, is it a reftest, is it a testharness test, is it a manual test, is it a support file? Is the url to load the file actually the url to the test, or is there a query/fragment part? Is there an optional query/fragment part that changes the details of the test?). There have been several suggestions for solutions to these problems, but there is no actual solution at the moment (I have put a sketch of the kind of manifest I have in mind in a P.S. at the end of this mail).

Many vendors' systems are designed around the assumption that "all tests must pass" and, for the rare cases where tests don't pass, one is expected to manually annotate the test as failing. This is problematic if you suddenly import 10,000 tests for a feature that you haven't implemented yet, or even 100 tests of which 27 fail. I don't have a good solution for this other than "don't design your test system like that" (which is rather late). I presume the answer will look something like a means of auto-marking tests as expected-fail on their first run after import.

We also have the problem that many of the tests simply won't run in vendors' systems. Tests that require an extra server to be set up (e.g. websockets tests) are a particular problem, but they are rare. More problematic is that many people can't run tests that depend on Apache+PHP (because they run all the servers on the individual test node and don't have Apache+PHP in that environment). Unless everyone is happy to deploy something as heavyweight as Apache+PHP, we may need to standardise on a different solution for tests that require custom server-side logic. Based on previous discussions, this would likely be a custom Python-based server, with special features for testing (I believe Chrome/WebKit already have something like this?).

One final issue (that I can think of right now ;) is that it must be possible for everyone to *run* the tests and get results out. This should in theory be rather easy, since one can implement a custom testharnessreport.js for javascript tests, and people already know how to run reftests. But sometimes the details of people's testing systems are very specialised in strange ways, so this can be a larger barrier than you might assume. In addition to getting people *running* tests, we need more work on how to get people *writing* tests and *submitting* tests by default. That is probably a different topic.

FWIW, because it seems to be rather easy to get people to attend meetings and rather hard to get people to do actual work, it might be a good idea to organise a "testing workshop" in which relevant people from vendors sit down, actually figure out what solutions they want to these problems, and then proceed to do the implementation work (think of it as "test the web forward: vendor automation edition").

[1] https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding_intermittent_oranges
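P.S. To make the manifest question a bit more concrete, here is a sketch of the sort of per-test metadata I have in mind. It is purely illustrative: the field names are not agreed anywhere, the paths are made up, and the real thing would presumably be generated by a tool rather than written by hand:

    MANIFEST = {
        "dom/example-test.html": {"type": "testharness"},
        "css21/example-reftest.html": {"type": "reftest",
                                       "ref": "css21/example-ref.html"},
        "css21/example-manual.html": {"type": "manual"},
        "common/utils.js": {"type": "support"},
        "history/example-test.html": {"type": "testharness",
                                      "url": "history/example-test.html?run=2"},
    }

    def tests_to_run(manifest):
        """Yield (url, type, ref) for everything that is actually a test,
        skipping support files and using the test url (which may include a
        query/fragment) rather than the bare file path."""
        for path, data in sorted(manifest.items()):
            if data["type"] == "support":
                continue
            yield data.get("url", path), data["type"], data.get("ref")

The point is just that a runner should be able to answer "what do I load, and how do I judge the result?" without having to open and parse every file.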