Re: Review of tests upstreamed by implementors from James Graham on 2013-03-21 (public-test-infra@w3.org from January to March 2013)

From: James Graham <jgraham@opera.com>
Date: Thu, 21 Mar 2013 14:11:01 +0100 (CET)
To: Robin Berjon <robin@w3.org>
cc: Dirk Pranke <dpranke@chromium.org>, public-test-infra <public-test-infra@w3.org>
Message-ID: <alpine.DEB.2.02.1303211242520.7756@sirius>

On Thu, 21 Mar 2013, Robin Berjon wrote:

>> Another challenge lies in making it absolutely painless to pull new
>> tests and run them. Fixing this one first (which is what I believe James
>> was largely alluding to) will go a long way to building up the trust
>> (and make it easier to build it up further).
>
> I actually think that that's the first challenge to get the virtuous cycle 
> kick-started. In the first runs it's likely that there will be problems, the 
> least we can do is to make performing a test run as easy to set up as 
> possible.
>
> Any requirements you can provide in this area would be extremely helpful.

So I can provide some input here based on Opera's experience plus past 
discussions I've had with people working on other engines.

First of all, to address the earlier point about test quality: it is true 
that there are more stringent requirements for quality when running tests 
in automated test systems, particularly if they have to run on a wide 
variety of devices. In particular, noise in such systems is a killer to 
productivity and has to be eliminated.

One approach Opera have used with success is to implement a "quarantine" 
system in which each test is run a large number of times, say 200, and 
tests that don't give consistent results are sent back to a human for 
further analysis. Any W3C test-running tool should have this ability so that we 
can discover (some of) the tests that have problematically random 
behaviour before they are merged into the testsuite. In addition we should 
make a list of points for authors and reviewers to use so that they 
avoid or reject known-unstable patterns (see e.g. [1]).
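
A minimal sketch of the quarantine idea, not Opera's actual implementation: 
`run_test` is a hypothetical callable that takes a test id and returns a 
status string, and the function name and return shape are assumptions made 
for illustration.

```python
from collections import Counter


def quarantine(run_test, test_ids, runs=200):
    """Run each test `runs` times; return those with inconsistent results.

    `run_test` is a hypothetical callable taking a test id and returning a
    status string such as "PASS" or "FAIL".
    """
    unstable = {}
    for test_id in test_ids:
        # Count how often each distinct status was observed.
        outcomes = Counter(run_test(test_id) for _ in range(runs))
        if len(outcomes) > 1:
            # More than one distinct status: hand this test to a human.
            unstable[test_id] = dict(outcomes)
    return unstable
```

A test that only appears in the returned mapping occasionally is still worth 
a look; 200 runs reduce the chance of missing a rare failure but cannot 
eliminate it.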

In terms of the effect of number of tests on performance, I don't think we 
need to be terribly concerned, at least on desktop (on devices all 
problems are harder, so we should concentrate on the easy cases first). 
Looking at a randomly chosen test run, I can see that Opera's system ran 
almost 6000 javascript-based test files containing a total of 230,000 
tests in a little under 4 machine hours, giving a time per test of about 
62ms. To give an idea of the overhead per test, a different type of test 
with a single javascript test per file ran in about 250ms/test. This is on 
rather old hardware, so I would expect others to do better right now. In 
conclusion, tests are individually quick to run and easily distributed 
across multiple machines, so there is little point in putting large 
numbers of man hours into optimising the performance of tests (e.g. by 
trying to eliminate all duplication).
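
The arithmetic behind those figures can be checked with a few lines. This 
uses the rounded numbers quoted above (4 machine hours, 6000 files, 230,000 
tests), so the results are upper bounds; the even split across machines in 
`wall_clock_minutes` is an assumption, not a measurement.

```python
machine_hours = 4       # "a little under 4", so treat this as an upper bound
total_tests = 230_000
test_files = 6_000      # "almost 6000"

total_ms = machine_hours * 3600 * 1000

ms_per_test = total_ms / total_tests   # roughly 62.6 ms
ms_per_file = total_ms / test_files    # 2400 ms, i.e. about 2.4 s per file


def wall_clock_minutes(machines):
    # Assumes files are spread evenly and there is no scheduling overhead.
    return machine_hours * 60 / machines
```

Under those assumptions ten machines would finish the same run in about 24 
minutes of wall-clock time.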

I think that covers the technical problems that might stop people wanting 
to run the tests at all, although I would be interested to hear about any 
other technical objections to this. There is also the problem that people 
might simply not see the value in putting any effort into running the 
tests or fixing the problems that they find. This is a complex and 
slightly separate problem, which is at least partly a PR issue; it should 
be considered unacceptable by authors to release buggy implementations of 
specs, and W3C tests should be considered the best way to find out if 
implementations are buggy. Working out how to do that without compromising 
the quality of the tests is a rather different discussion.

Assuming that implementors actually want to import and run the tests, 
there are a number of practical issues that they face. The first is simply 
that they must sync the external repository with the one in which they 
keep their tests. That's pretty trivial if you run git and pretty much a 
headache if you don't. So for most vendors at the moment it's a headache.

Once you have imported the tests, it must be possible to tell which files 
you should actually load and what system should be used to get results 
(i.e., given a file, is it a reftest, is it a testharness test, is it a manual 
test, is it a support file? Is the url to load the file actually the url 
to the test or is there a query/fragment part? Is there an optional 
query/fragment part that changes the details of the test?). There have 
been several suggestions for solutions to these problems, but there is no 
actual solution at the moment.
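
To make those questions concrete, here is one possible shape for an answer: 
a manifest keyed by URL rather than by file. The format, field names and 
paths below are all hypothetical and are not one of the existing proposals.

```python
from urllib.parse import urlsplit

# Hypothetical manifest: one entry per loadable URL, not per file, so the
# same file can appear several times with different query strings.
MANIFEST = [
    {"url": "/dom/nodes/append.html", "type": "testharness"},
    {"url": "/css/box/margin.html", "type": "reftest",
     "ref": "/css/box/margin-ref.html"},
    {"url": "/html/forms/input.html?type=date", "type": "testharness"},
    {"url": "/html/forms/input.html?type=time", "type": "testharness"},
    {"url": "/ui/drag.html", "type": "manual"},
    {"url": "/common/utils.js", "type": "support"},
]

# Test types an unattended runner can execute.
RUNNABLE_TYPES = {"testharness", "reftest"}


def automated_tests(manifest):
    """Yield (type, url, file path) for entries an automated runner loads."""
    for entry in manifest:
        if entry["type"] in RUNNABLE_TYPES:
            # The path identifies the file; the full URL identifies the test.
            yield entry["type"], entry["url"], urlsplit(entry["url"]).path
```

With this shape, manual tests and support files are skipped without any 
guessing from file names, and the two `input.html` entries count as separate 
tests backed by one file.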

Many vendors' systems are designed around the assumption that "all tests 
must pass" and, for the rare cases where tests don't pass, one is expected 
to manually annotate the test as failing. This is problematic if you 
suddenly import 10,000 tests for a feature that you haven't implemented 
yet. Or even 100 tests of which 27 fail. I don't have a good solution for 
this other than "don't design your test system like that" (which is rather 
late). I presume the answer will look something like a means of 
auto-marking tests as expected-fail on their first run after import.
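
A rough sketch of that auto-marking step, assuming results and expectations 
are both plain mappings from test id to status. The function and data shapes 
are hypothetical and not taken from any vendor's system.

```python
def update_expectations(expected, results):
    """Compare a run against stored expectations.

    `expected` maps test id -> status from an earlier run and is updated in
    place; `results` maps test id -> status from this run. Returns the tests
    whose status differs from the stored expectation.
    """
    unexpected = {}
    for test_id, status in results.items():
        if test_id not in expected:
            # First run after import: accept the observed status as baseline.
            expected[test_id] = status
        elif expected[test_id] != status:
            # Known test that changed: report (expected, actual).
            unexpected[test_id] = (expected[test_id], status)
    return unexpected
```

On the first run after importing 100 tests of which 27 fail, nothing is 
reported and the 27 are recorded as expected failures; only later changes in 
either direction show up.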

We also have the problem that many of the tests simply won't run in 
vendors' systems. Tests that require an extra server to be set up (e.g. 
websockets tests) are a particular problem, but they are rare. More 
problematic is that many people can't run tests that depend on Apache+PHP 
(because they run all the servers on the individual test node and don't 
have Apache+PHP in that environment). Unless everyone is happy to deploy 
something as heavyweight as Apache+PHP, we may need to standardise on a 
different solution for tests that require custom server-side logic. Based 
on previous discussions, this would likely be a custom Python-based 
server, with special features for testing (I believe Chrome/WebKit already 
have something like this?).
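
To give a feel for how small such a server could start out, here is a sketch 
using only the Python standard library. It is not the Chrome/WebKit tool or 
any agreed design; the `handler` registry, the `/echo-status` endpoint and 
`start_server` are made up for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlsplit

# Hypothetical registry mapping URL paths to Python handler functions.
HANDLERS = {}


def handler(path):
    """Decorator that registers a function as the handler for `path`."""
    def register(func):
        HANDLERS[path] = func
        return func
    return register


@handler("/echo-status")
def echo_status(query):
    # Respond with whatever status code the test asks for, e.g. ?code=404.
    code = int(query.get("code", ["200"])[0])
    return code, {"Content-Type": "text/plain"}, b"status %d" % code


class TestRequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parts = urlsplit(self.path)
        func = HANDLERS.get(parts.path)
        if func is None:
            self.send_error(404)
            return
        status, headers, body = func(parse_qs(parts.query))
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test-run output quiet


def start_server(port=0):
    """Start the server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), TestRequestHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because it runs in-process on the test node with no external dependencies, 
something along these lines avoids the Apache+PHP deployment problem, at the 
cost of rewriting any existing PHP-based tests.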

One final issue (that I can think of right now ;) is that it must be 
possible for everyone to *run* the tests and get results out. This should 
in theory be rather easy since one can implement a custom 
testharnessreport.js for javascript tests, and people already know how to 
run reftests. But sometimes the details of people's testing systems are 
very specialised in strange ways so this can be a larger barrier than you 
might assume.

In addition to getting people *running* tests, we need more work on how to 
get people *writing* tests and *submitting* tests by default. That is 
probably a different topic.

FWIW, because it seems to be rather easy to get people to attend meetings 
and rather hard to get people to do actual work, it might be a good idea 
to organise a "testing workshop" in which relevant people from vendors sit 
down and try to actually figure out what solutions they want to these 
problems and then proceed to actually do the implementation work (think of 
it as "test the web forward: vendor automation edition").

[1] https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding_intermittent_oranges

Received on Thursday, 21 March 2013 13:11:35 UTC