Re: Review of tests upstreamed by implementors

Hi James,

thanks for delving into the details here.

On 21/03/2013 14:11, James Graham wrote:
> One approach Opera have used with success is to implement a "quarantine"
> system in which each test is run a large number of times, say 200, and
> tests that don't give consistent results are sent back to a human for
> further analysis. Any W3C test-running tool should have this ability so
> that we can discover (some of) the tests that have problematically
> random behaviour before they are merged into the testsuite. In addition
> we should make a list of points for authors and reviewers to use so that
> they avoid or reject known-unstable patterns (see e.g. [1]).

That doesn't sound too hard to do. At regular intervals, we could:

• List all pull requests through the GH API.
• For each of those:
   • Check out a fresh copy of the repo
   • Apply the pull request locally
   • Run all tests (ideally using something that supports multiple 
browsers, but since we're looking for breakage, even just PhantomJS or 
something like it would already weed out trouble).
   • Report issues.

It's a bit of work, but it's doable.
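
To make that concrete, here's a rough sketch of what that loop might 
look like in Python (using the requests library and git on the command 
line); the repo slug, the scratch directory, and the run_tests() / 
report_issue() hooks are placeholders for whatever runner and 
reporting we actually settle on:

    # Hypothetical sketch: poll open pull requests, apply each one to a
    # fresh checkout, and run the suite repeatedly to catch instability.
    import subprocess
    import requests

    REPO = "w3c/web-platform-tests"       # assumption: upstream repo slug
    WORKDIR = "/tmp/wpt-quarantine"       # assumption: scratch location

    def run_tests(path):
        # Placeholder: invoke whatever runner we pick (PhantomJS or
        # similar) and return a summary string of pass/fail results.
        raise NotImplementedError("plug the actual runner in here")

    def report_issue(pr, results):
        # Placeholder: comment on the pull request, send mail, etc.
        print("unstable or failing results on PR #%d" % pr["number"])

    def open_pull_requests():
        # List open pull requests through the GitHub API.
        url = "https://api.github.com/repos/%s/pulls?state=open" % REPO
        return requests.get(url).json()

    def check_pull_request(pr):
        # Fresh copy of the repo, then apply the pull request locally.
        subprocess.check_call(["rm", "-rf", WORKDIR])
        subprocess.check_call(["git", "clone",
                               "https://github.com/%s.git" % REPO, WORKDIR])
        subprocess.check_call(["git", "fetch", "origin",
                               "pull/%d/head:pr" % pr["number"]], cwd=WORKDIR)
        subprocess.check_call(["git", "checkout", "pr"], cwd=WORKDIR)
        # 200 runs, per the quarantine idea above; flag inconsistency.
        results = [run_tests(WORKDIR) for _ in range(200)]
        if len(set(results)) > 1:
            report_issue(pr, results)

    for pr in open_pull_requests():
        check_pull_request(pr)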

> Assuming that implementors actually want to import and run the tests,
> there are a number of practical issues that they face. The first is
> simply that they must sync the external repository with the one in which
> they keep their tests. That's pretty trivial if you run git and pretty
> much a headache if you don't. So for most vendors at the moment it's a
> headache.

Just a silly thought (I may be missing something), but does that sync 
really need to use git? I mean, presumably it's a read-only sync that 
only needs to happen once a day or so (I would think it's not a big 
deal if runs are 24h behind the repo). In that case, if git is an 
issue, one can just grab and unpack:

     https://github.com/w3c/web-platform-tests/archive/master.zip
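
For instance, a once-a-day cron job along these lines would do, 
sketched here in Python 3; the destination path is just a placeholder:

    # Hypothetical sketch: daily read-only sync with no git involved.
    import io
    import zipfile
    import urllib.request

    SNAPSHOT = "https://github.com/w3c/web-platform-tests/archive/master.zip"
    DEST = "/var/tests/web-platform-tests"   # assumption: local location

    data = urllib.request.urlopen(SNAPSHOT).read()
    zipfile.ZipFile(io.BytesIO(data)).extractall(DEST)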

> Once you have imported the tests, it must be possible to tell which
> files you should actually load and what system should be used to get
> results (i.e., given a file, is it a reftest, is it a testharness test, is it
> a manual test, is it a support file? Is the url to load the file
> actually the url to the test or is there a query/fragment part? Is there
> an optional query/fragment part that changes the details of the test?).
> There have been several suggestions for solutions to these problems, but
> there is no actual solution at the moment.

So we discussed this before but didn't reach a conclusion. I think 
that, as much as possible, this information should be simple to 
extract. For reftests and assets, a naming convention is IMHO the best 
approach (and I have a list of files that don't produce any 
testharness output).
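
For illustration, extraction from naming conventions could be as 
simple as the sketch below; the specific conventions it assumes (a 
-ref suffix for reftest references, support/ or resources/ directories 
for assets, testharness.js inclusion for script tests) are just 
examples, not something we've agreed on:

    # Hypothetical sketch: classify files purely from names and contents.
    # The conventions used here are assumptions, for illustration only.
    import os

    def classify(path):
        name = os.path.basename(path)
        if "/support/" in path or "/resources/" in path:
            return "support"
        if name.rsplit(".", 1)[0].endswith("-ref"):
            return "reftest-reference"
        with open(path, errors="replace") as f:
            if "testharness.js" in f.read():
                return "testharness"
        return "manual-or-unknown"

    for root, dirs, files in os.walk("web-platform-tests"):
        for name in files:
            if name.endswith((".html", ".htm", ".xht", ".svg")):
                path = os.path.join(root, name)
                print(path, classify(path))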

For query strings, can we not just avoid them? It seems better not to 
have to special-case this. I know that the parsing tests fall into 
this category, but would it be terribly complicated to have three 
entry-point test files that call the same code with the right 
variation of parameters rather than relying on a query string?

> Based on previous discussions, this would likely be a
> custom Python-based server, with special features for testing (I believe
> Chrome/WebKit already have something like this?).

We've had this discussion a few times — can vendors tell us what they 
would be willing to run? That way we can pick some common denominator 
and run with it.
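
As a baseline for that common denominator: a plain static server over 
the checkout is a few lines of standard-library Python, sketched 
below; whatever test-specific features are needed would have to be 
layered on top of something like this, and the port and path are 
placeholders:

    # Minimal baseline sketch: serve the checkout with nothing but the
    # standard library. Port and directory are placeholders.
    import os
    from http.server import HTTPServer, SimpleHTTPRequestHandler

    os.chdir("web-platform-tests")        # assumption: local checkout path
    HTTPServer(("", 8000), SimpleHTTPRequestHandler).serve_forever()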

> In addition to getting people *running* tests, we need more work on how
> to get people *writing* tests and *submitting* tests by default. That is
> probably a different topic.

Yeah, let's keep that separate. Besides, if we can really and truly 
promise developers that when they contribute a test, it will get run 
by implementors, that provides a huge incentive to contribute.

> FWIW, because it seems to be rather easy to get people to attend
> meetings and rather hard to get people to do actual work, it might be a
> good idea to organise a "testing workshop" in which relevant people from
> vendors sit down and try to actually figure out what solutions they want
> to these problems and then proceed to actually do the implementation
> work (think of it as "test the web forward: vendor automation edition").

Yes, we've done meetings in the past in which we mostly hacked on 
stuff, and they've generally been rather successful. They're also good 
because even if you don't finish what you started during the meeting 
(you rarely do), once someone has started hacking on something they 
tend to continue afterwards. Certainly more so than after just talking 
about it.

-- 
Robin Berjon - http://berjon.com/ - @robinberjon
