Re: Test review procedure

On Fri, Mar 18, 2011 at 4:24 AM, James Graham <jgraham@opera.com> wrote:
> b) A substantial reason for the above seems to be a lack of interest from
> browser vendors in performing review. So far most review has been performed,
> as far as I can tell, by individuals. Having a process that people aren't
> interested in contributing to seems problematic. Having said that, I wonder
> if people are actually using the HTMLWG tests as part of their regression
> test suites yet. Perhaps if they start to use the tests there will be more
> incentive to perform review. It is also possible that they will only look at
> the tests they fail and ignore the ones they pass.

How will using the tests as part of regression suites encourage
review?  Regression tests, by definition, just encode the browser's
current behavior, and the only time you have to think about one is
when its result changes unexpectedly.  I'd expect implementers to
review these tests mainly when they implement a new feature that
already has a standard test suite (so they don't have to write their
own), or when users or organizations advertise or otherwise point out
their test failures.

> c) In spite of the above, the review that has happened has uncovered a
> non-trivial number of bugs.

Yes, which is why we need review.  But I'm not convinced that letting
tests remain hidden in submission/ with no reliable path to review is
a good strategy.

> d) The CSS 2.1 testsuite has had a lot of churn just before PR because
> people suddenly started taking it seriously and reporting problems. This is
> not good for anyone and we should try to avoid the same thing happening.

Having the same level of churn in submission/ doesn't seem
particularly better.  We should instead encourage people to take the
tests seriously at an earlier point, and the only way I see to do that
is to advertise them more broadly, in some fashion.

> e) Trying to address d) by having continually updated implementation reports
> has a number of bad side effects; it encourages the erroneous use of the
> testsuite as a metric of quality when it is in a largely incomplete state.
> This in turn can increase the pressure on potential contributors to submit
> only tests that cast their favoured browser in a good light.

Evidently some contributors will do that anyway -- as far as I've
noticed, Microsoft has only submitted tests that IE9 passes.  That's
fine as long as the tests are correct, and as long as we don't report
percentages.  David Baron has pointed out on this list that it's a bad
idea to report percentages as long as any implementer can submit
tests.

David suggested that instead we only report whether browsers have
passed all tests for a given feature.  That would encourage each
implementer to submit tests that other browsers fail, without
particularly encouraging them to submit low-quality or duplicate
tests.  This is a good thing, as long as the tests are correct.  If
they aren't, then the other browsers can point that out and get the
tests removed.

> So my conclusion is that the process as-is doesn't work optimally, but I am
> not clear what the best way forward is.

How about something like this:

1) Divide the approved tests up into reasonably coarse features, and
report which browsers pass all the tests for each feature -- i.e.,
have no known failures for that feature.  Make it clear that these are
the only official results and that the Testing TF doesn't endorse
posting percentages or posting results for only a subset of the tests.
But advertise the test results prominently, to encourage implementers
to improve their scores.  (A rough sketch of how such a report might
be generated follows this list.)

2) When someone wants to submit a set of tests for approval, the set
should be posted for feedback.  The post should say which browsers
pass all the tests and which do not.  If the tests are for an
already-tested feature, it should also say whether any browser that
currently passes all the tests for that feature would now fail.  Any
tests with no objections standing after thirty days should be
approved.

3) If anyone has an objection to an approved test, and the objection
can't be addressed immediately, the test should be un-approved.
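
Here's the rough sketch promised in point 1, in TypeScript.  The
Result shape and passingBrowsersByFeature function are hypothetical --
nothing here is part of any existing harness -- but it shows the key
property: a browser is listed for a feature only if it passes every
approved test under that feature, and no percentages are computed at
all.

  // Hypothetical result record: one row per (feature, test, browser) run.
  interface Result {
    feature: string;
    test: string;
    browser: string;
    pass: boolean;
  }

  // For each feature, return the set of browsers that pass *all* of
  // its tests.  Deliberately no percentages: a browser is either on
  // the list or it isn't.
  function passingBrowsersByFeature(
    results: Result[]
  ): Map<string, Set<string>> {
    const ran = new Map<string, Set<string>>();     // browsers with any result
    const failed = new Map<string, Set<string>>();  // browsers with a failure

    for (const r of results) {
      if (!ran.has(r.feature)) {
        ran.set(r.feature, new Set<string>());
        failed.set(r.feature, new Set<string>());
      }
      ran.get(r.feature)!.add(r.browser);
      if (!r.pass) {
        failed.get(r.feature)!.add(r.browser);
      }
    }

    const report = new Map<string, Set<string>>();
    for (const [feature, browsers] of ran) {
      report.set(
        feature,
        new Set([...browsers].filter(b => !failed.get(feature)!.has(b)))
      );
    }
    return report;
  }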

The advantages of this procedure would be:

* It's hard to game.  Either you pass all the tests or you don't.  You
can't make your competitors look worse than you unless no one can find
any legitimate bugs in your implementation *and* you can find at least
one legitimate bug in theirs -- in which case you deserve the
advantage.
* Once an implementation passes all the tests for a feature, its
vendor has a strong incentive to review any new tests it fails and to
fix the implementation so that it passes them.
* Once an implementation passes all the tests for a feature, competing
vendors have a strong incentive to write more tests that uncover bugs
in that implementation.
* Once an implementation passes all the tests for a feature, its
competitors have an incentive not just to improve their own
implementations incrementally, but to make them pass *all* the tests.
* Random people will be encouraged to write tests, since their tests
have a clear path to acceptance.

Disadvantages would be:

* You could still try to game the system by dividing the tests up in a
way that makes you look good.  If we require the features to be
reasonably coarse, or set general standards for how the tests are
divided up -- say, one report for each DOM property or method -- this
would still be relatively hard, particularly compared to reporting
percentages.
* There's not much incentive to fix easy bugs in the implementation of
a feature if there's even one hard bug that will still make you fail.
For instance, IE and Opera mishandle nulls in DOM strings, so they
fail at least one of my reflection tests for every single reflected
attribute.  That's a legitimate bug, but it's unimportant to authors
and probably a pain to fix.  Nonetheless, until they fix it, fixing
other DOM-related bugs won't actually improve their test score.  (A
sketch of the kind of check involved is below.)
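
Here's that sketch -- a hypothetical illustration in TypeScript, not
one of the actual submitted tests.  It assumes the usual WebIDL
conversion rule: assigning null to a plain DOMString reflecting
attribute such as title stringifies it to "null" rather than clearing
it, which is roughly the sort of case the failing browsers get wrong.

  // Hypothetical check, not an actual submitted test.
  const a = document.createElement("a");

  // The cast is only to satisfy TypeScript's typing of .title as a
  // string; the underlying JS test assigns null directly.
  (a as any).title = null;

  console.assert(a.getAttribute("title") === "null",
    'the content attribute should become the literal string "null"');
  console.assert(a.title === "null",
    'the IDL getter should return the literal string "null"');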

Overall, I think this system lines up all the incentives in exactly
the right place as soon as at least one browser passes all the tests
for a given feature.  From then on, its competitors will want to pass
all the tests for that feature too, and to submit more tests that it
fails, while the passing browser will want to review carefully any new
tests it does fail.  It's not a great system before at least one
browser passes all the tests for a feature, but I think it's better
than the alternatives proposed so far.

On Fri, Mar 18, 2011 at 10:46 AM, James Graham <jgraham@opera.com> wrote:
> It occurs to me that one way around this would be to make tables of the
> number of browsers failing each test, but not list the browsers that fail in
> each case. Such an approach would have a number of advantages:
>
> It would be easy to identify tests that failed in multiple browsers, which
> are the most likely to be problematic
>
> It would require people examining the tests to rerun for themselves rather
> than trusting the submitted results.
>
> It would not lend itself to use as a (misleading) browser comparison metric.

What incentives would this give anyone to submit, review, or conform
to tests?  The point of tests isn't just to help browsers improve
their regression suites; it's to get them to conform.
