- From: James Graham <james@hoppipolla.co.uk>
- Date: Tue, 27 Aug 2013 11:51:05 +0100
- To: Dirk Pranke <dpranke@chromium.org>
- CC: "public-test-infra@w3.org" <public-test-infra@w3.org>
On 23/08/13 18:48, Dirk Pranke wrote:

> The way Blink and WebKit's test harness works (for their own checked-in tests, not the W3C's), we walk a directory looking for files that have specific file extensions and aren't in directories with particular names. All (and only) the matches are tests; end of story. References can be found either by filename convention or (in a few rare cases not really used today) by parsing a reftest manifest. I think we really only support the manifest format for feature completeness and under the belief that we would need it sooner or later when importing the W3C's tests. It was (and remains) a fairly controversial practice compared to using filename conventions.
>
> We handle the timeout problem as follows: First, we expect tests to be fast (sub-second), and we don't expect them to time out regularly (since running tests that time out doesn't really give you much of a signal and wastes a lot of time).
>
> Second, we pick a default timeout. In Blink, this is 6 seconds, which works well for 99% (literally) of the tests on a variety of hardware (this number could probably be a couple of seconds higher or lower without much impact), but the number is adjusted based on the build configuration (debug builds are slower) and platform (Android is slower). Third, we have a separate manifest-ish file for marking a subset of tests as Slow, and they get a 30s timeout. In WebKit, we have a much longer default timeout (30s) and don't use Slow markers at all.
>
> There is no build step, and no parsing of tests on the fly at test run time (except as part of the actual test execution, of course). It works well, and any delay caused by scanning for files or dealing with timeouts is a small (1-3%) part of the total test run.

It is worth noting that there are a few differences between running a testsuite that is specifically designed for one browser and a testsuite that is intended for use across multiple products.

With a specifically-designed testsuite it is usually expected that all tests will pass. This is somewhat reasonable, as one only writes tests for one's own browser that correspond to implemented features in that browser. Even then, it is quite common to need some extra information in the tests to mark known failures corresponding to bugs that haven't been fixed yet.

When one is importing tests, it isn't reasonable to expect all the tests to pass, or for them all to be for features that you have actually implemented. For a certain class of test, the only way of detecting a failure is to wait for a timeout. For example, if you didn't implement setTimeout and a test tried to check that setting a timer worked, you would have to wait until the harness timeout for the test to fail. If that timeout is set to a very high value (say 30s for all tests), waiting for it makes the testsuite prohibitively slow. Therefore being able to choose an appropriate timeout for each test seems much more important for imported testsuites, since they are much more likely to hit the slow cases than implementation-specific testsuites.
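To make that last point concrete, here is a rough sketch (Python; the manifest contents, paths, and numbers are invented purely for illustration and don't describe any existing runner) of what picking a per-test timeout, rather than using a single harness-wide value, might look like:

    # Illustrative only: an invented "slow tests" record plus a default
    # timeout, roughly in the spirit of the Blink scheme described above.
    DEFAULT_TIMEOUT = 6   # seconds; enough for the vast majority of fast tests
    SLOW_TIMEOUT = 30     # seconds; only for tests explicitly marked as slow

    # Hypothetical external record of tests known to need more time.
    SLOW_TESTS = {
        "dom/events/huge-event-tree.html",
        "css3-transitions/many-properties.html",
    }

    def timeout_for(test_path, debug_build=False):
        """Pick the timeout (in seconds) for a single test."""
        timeout = SLOW_TIMEOUT if test_path in SLOW_TESTS else DEFAULT_TIMEOUT
        if debug_build:
            # Debug builds run slower, so scale the timeout rather than
            # marking more tests as slow (the factor here is made up).
            timeout *= 3
        return timeout

    print(timeout_for("dom/historical.html"))                    # 6
    print(timeout_for("css3-transitions/many-properties.html"))  # 30

The point is simply that a test which can only fail by timing out then costs whatever its own metadata says it should, rather than 30s every single time.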
Since tests are expected to fail, and since updating tests should be easy, test runners designed around the assumption "all tests must pass" need additional data stating which tests are, in fact, known failures. Putting this data inside the test files themselves is not very sane, as it is hard to read and write these files automatically and doing so is likely to lead to merge conflicts.

Such data will therefore have to go in some sort of external manifest, and keeping it up to date for a specific implementation probably implies an elaborate "build step": a special test run that records the failures in a known-good build and updates the manifest somehow. I'm not clear on all the details here, and indeed this seems like one of the principal challenges in running W3C tests on vendor infrastructure, since the process I just described is both complex to implement and racy. If this is taken as a requirement, avoiding the part of the update process where you update metadata for files that changed since your last import seems like a relatively small win. If, on the other hand, you have some process in mind that avoids the need for a complex synchronization of the expected failures, I would be intrigued to hear it.
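To be a bit more concrete about the "build step" I described above, a rough sketch (Python; the manifest name, result strings, and fake runner are invented for illustration only, not taken from any real harness) might look like this:

    import json

    def update_expectations(test_paths, run_test, manifest_path="expected-failures.json"):
        """Re-run the suite against a known-good build and record every
        non-passing result, so later runs can tell regressions apart from
        known failures. `run_test` stands in for whatever actually drives
        the browser and returns "PASS", "FAIL" or "TIMEOUT"."""
        expected = {}
        for test in test_paths:
            result = run_test(test)
            if result != "PASS":
                expected[test] = result
        # This is where the raciness comes in: if the tests or the
        # implementation change between this run and the next import,
        # the recorded expectations are stale by the time they land.
        with open(manifest_path, "w") as f:
            json.dump(expected, f, indent=2, sort_keys=True)
        return expected

    # Tiny fake runner so the sketch is self-contained.
    fake_results = {
        "dom/historical.html": "PASS",
        "workers/shared-worker-basic.html": "FAIL",
        "timers/set-timeout-basic.html": "TIMEOUT",
    }
    print(update_expectations(fake_results, fake_results.get))

Even in this toy form the problem is visible: the recorded expectations are only as good as the build and tree state they were generated against.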
Received on Tuesday, 27 August 2013 10:52:47 UTC