Re: HTML Testing Task Force Conf Call Agenda 10/5/2010

On 14/10/10 15:06, Kris Krueger wrote:
> <krisk>  james you seemed concerned about some of the non-rendering cases being manual
> <krisk>  they will fit into the test runner - just like the getElementsByClassName tests
> <jgraham>  I am concerned about all manual tests
> <krisk>  I understand your concern - though the non-rendering cases are not manual
> <jgraham>  The ones that reported a result using javascript are not a problem
> <jgraham>  I mean from the point of view of automation
> <jgraham>  The ones that require human interaction are a problem. CSS currently has a huge problem with this
> <jgraham>  They have thousands of unautomated tests, which is blocking progress on making implementation reports and may delay them getting to Rec.
> <jgraham>  So they are having a last-minute attempt to automate tests
> <jgraham>  We don't want to be in that situation in a few years time
> <jgraham>  Also, from the point of view of a browser maker, tests that require manual work are more effort, and more prone to break, than automated tests. So they are a big cost
> <jgraham>  That's assuming that you have a system for running such tests in a semi-automated way, which not all browser makers have
> <krisk>  I think the other problem with the css group is all the test work was done at the very end
> <jgraham>  (as it turns out, Opera do have such a system, and we still want to avoid using it wherever possible)
> <krisk>  a lot of the tests in the css suite have been around for a long time
> <jgraham>  The upshot of all of this is that I am loath to approve tests that are not automated and do not come with a clear explanation of why they *cannot* be automated
> <krisk>  not sure why running them and doing implementation reports had to wait till the very end
> <jgraham>  You have to run them at the very end because the testsuite will change right up to the very end
> <jgraham>  and implementations will keep changing
> <krisk>  But the churn goes way down...like any really big software project
> <jgraham>  The churn on implementations might not
> <jgraham>  I mean the testsuite might converge gracefully
> <jgraham>  (or might not, anyone could dump thousands of tests at the last minute)
> <krisk>  dumping 1000's of tests at the last minute is not good
> <jgraham>  But implementations move at a steady pace and we can assume it will be the bleeding edge implementations that we need for the IR
> <jgraham>  Even if things work out for the IR, the QA point is IMHO more significant
> <krisk>  Did someone from google run all the css tests in 3 days?
> <jgraham>  Don't know
> <krisk>  yep - look in the css lists - it was done by tab atkins
> <jgraham>  In any case, 3 days is an insane amount of time for a single run on a single platform
> <jgraham>  We need tests that we can run continuously on a variety of platforms and configurations
> <krisk>  really - surely google can afford to hire a person for less than a week once every few years to get a spec to rec
> <jgraham>  I'm not sure what the relevance of what Google can afford to do is
> <krisk>  my point of view is that the cost of reviewing the tests is far bigger than the cost to run the tests
> <jgraham>  The point is that it's not in anyone's best interests to have tests that are cumbersome to run
> <jgraham>  Because a test is written once and run thousands of times
> <jgraham>  If we want implementations that actually pass all the tests then we need them to be easy to run
> <jgraham>  So people can make changes without regressing other tests
> <jgraham>  Just running the tests is only half the battle; having the interoperable implementations is the other (harder) part
> <jgraham>  That's why I am more concerned about the QA aspect than the IR aspect
> <krisk>  So does opera have a hard time running the css tests?
> <krisk>  I would think that you (Opera) would have found bugs in your implementation a while back
> <krisk>  especially from tests that have been submitted years ago
> <jgraham>  Yes. We have a system that allows us to automate running them after the first time, but getting the initial data about what is a pass and what is a fail is difficult and error-prone. Then small changes in the implementation can invalidate a lot of the data and require all the manual effort to be redone
> <jgraham>  We are actively trying to replace the manual CSS tests with reftests where possible
> <jgraham>  We don't want to have to lead a similar cleanup operation again
> <krisk>  so can't you build your automation system in parallel with the html5 wg progress?
> <jgraham>  The problem with the automation system is fundamental
> <jgraham>  It depends on screenshots and there are a lot of things that can cause a screenshot to change other than a failed test
> <krisk>  sure, that is software development
> <krisk>  when a product churns - regressions can blow up 1000's of tests (for example a fault in the network stack)
> <jgraham>  But it's not a regression
> <jgraham>  It can just be a change
> <jgraham>  Different rounding behaviour across platforms

I think the most important comment from the above for Opera is that we 
are /far/ more concerned with the usefulness of tests for QA than we are 
about getting the spec to REC (well, PR really).

To give some more background to what happened with the CSS tests:

We were running, until recently, a very old version of the MS CSS 2.1 
tests, predating even their submission to the CSS WG. Why hadn't we 
updated before? Updating is expensive, so there's a strong incentive to 
avoid doing it too often.

The cost of updating the tests was two-fold: the names of the tests 
changed when they were submitted to the CSS WG, which means our 
knowledge of which screenshots were previously passes doesn't help; and 
the number of changed tests means we'd have to label new screenshots 
regardless.

When we eventually did update our copy of the testsuite, we couldn't 
reuse any of the data from the previous screenshots, so we had ~10k 
screenshots that needed to be labeled, per platform. In the end, we had 
seven people labeling screenshots for three days.

But that cost doesn't necessarily end there: if we wish to start running 
our automated testing system on more products, we have to go through and 
label all 10k tests for each new platform (in all probability, the 
rendering won't be identical, due to font rendering, rounding 
differences, etc.). This creates a high barrier to entry for using our 
full automated test-suite, especially for products which have a single 
QA allocated to them, where three days labeling screenshots can be hard 
to justify.

And the cost doesn't end there either: every time we change behaviour, 
we can potentially have to relabel large numbers of screenshots. 
Equally, font-rendering changes make library (or OS) upgrades expensive 
to do…

The biggest issue for us is not, as Kris suggested, dealing with 
regressions: it's the avoidable cost of labeling screenshots, and the 
time spent labeling any changes on multiple products and platforms. Yes, 
even with reftests we will still have to invest time investigating 
regressions reported by the regression tracking system, but we then 
don't have a large cost for running automated testing on a new 
product/platform, a large cost for making system upgrades of the testing 
platform, or a large cumulative cost of labeling screenshots every time 
there is any behaviour change (a regression, or simply a change that 
gives a different passing screenshot).
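To make the contrast concrete, here is a minimal sketch (purely 
illustrative; the names and structure do not reflect our actual 
infrastructure) of why the screenshot-baseline model accumulates 
labeling cost while a reftest does not: the former needs a 
human-approved hash per test per platform, which any rendering change 
invalidates, whereas the latter only compares two pages rendered by the 
same build.

// Illustrative sketch only -- not Opera's real system.
var crypto = require("crypto");

function hashOf(pixels) {
  return crypto.createHash("sha1").update(pixels).digest("hex");
}

// Screenshot-baseline model: a human must have approved a hash for this
// exact (test, platform) pair; any rendering change breaks the lookup
// and sends the test back to a human for relabeling.
function screenshotTestPasses(approvedHashes, test, platform, screenshot) {
  return approvedHashes[test + "@" + platform] === hashOf(screenshot);
}

// Reftest model: the test and its reference are rendered by the same
// build on the same platform, so no stored baseline (and no relabeling)
// is needed.
function reftestPasses(renderPage, testUrl, referenceUrl) {
  return hashOf(renderPage(testUrl)) === hashOf(renderPage(referenceUrl));
}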

Ultimately, as I have stated before, our preference for all testsuites 
is this: tests that can report pass/fail back from JS are preferable to 
everything else (and a single harness is preferable, as then we only 
have to hack one harness to send the results back to our automated 
testing system). If that isn't possible, reftests should be used for all 
visual tests. Visual tests that aren't reftests should, by and large, 
only exist to make sure the basic references render correctly, with 
everything else following through a chain of passes from one reftest to 
the next. Finally, genuinely interactive manual tests should be avoided 
at more or less any cost: they're not going to be run very often 
(because manual testing is expensive), which leads to far wider 
regression windows and massively increases the cost of dealing with 
them.
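For concreteness, a self-reporting test might look something like the 
sketch below. This assumes a testharness.js-style harness (as used for 
the getElementsByClassName tests Kris mentioned); the script paths are 
illustrative and should point at wherever the harness actually lives:

<!DOCTYPE html>
<title>getElementsByClassName: simple lookup</title>
<script src="resources/testharness.js"></script>
<script src="resources/testharnessreport.js"></script>
<div id="log"></div>
<span class="foo"></span>
<span class="foo"></span>
<script>
// The harness collects the result, so an automated runner can harvest
// pass/fail without a human ever looking at the page.
test(function() {
  assert_equals(document.getElementsByClassName("foo").length, 2);
}, "getElementsByClassName finds both spans");
</script>

A reftest, by contrast, makes no assertion in script; it declares a 
reference page (the file name here is made up) that must render 
identically, so a runner can compare the two renderings without any 
stored, human-labeled baseline:

<!DOCTYPE html>
<title>Reftest: the 'background' shorthand paints a green square</title>
<link rel="match" href="green-square-ref.html">
<div style="width: 100px; height: 100px; background: green"></div>

where green-square-ref.html produces the same rendering using only the 
'background-color' longhand.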

As for why IRs have to wait until the very end: tests change and 
implementations change. We can't just re-run the changed tests, because 
we don't know that an implementation change hasn't altered the result of 
an unchanged test. Ideally, we should be able to produce IRs by simply 
dumping the data from our automated testing system (which, for all but 
around 500 tests, is what was done for the CSS 2.1 IR); we want to be 
running the tests continuously anyway, and there the cost of running 
matters far more than it would if we were only running them once for an 
IR.

-- 
Geoffrey Sneddon — Opera Software
<http://gsnedders.com>
<http://opera.com>
