Re: On testing HTML from Philip Taylor on 2009-09-22 (public-html@w3.org from September 2009)

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Tue, 22 Sep 2009 16:39:45 +0100
To: Philippe Le Hegaret <plh@w3.org>
CC: public-html@w3.org
Message-ID: <4AB8EFC1.70004@cam.ac.uk>

Philippe Le Hegaret wrote:
> A few of us got together recently with the idea of improving the state
> of Web browser testing at W3C. Since this Group is discussing the
> creation of an effort for the purpose of testing the HTML specification,
> this is relevant here as well:
> [...]

(I assume you meant to link to http://omocha.w3.org/ somewhere in this 
email.)

My first impression is that this sounds great! It seems to be focusing 
on what I see as perhaps the most important goal (improving 
interoperability between implementations), and perhaps the most 
important challenge (scaling the process to cope with the complexity of 
modern specs and the necessary depth of testing).


Apologies for some long rambling thoughts:

I like automation - if there are going to be hundreds or thousands of test 
cases, I expect the overall effort will be minimised if each test case 
is as simple as possible to write, review and run, even if that 
requires a great deal of automation tool support. That also means that 
once the tools are developed, adding a new test case is very cheap, so 
people have fewer excuses not to write tests.

When writing some HTML5 canvas tests a while ago 
(<http://philip.html5.org/tests/canvas/suite/tests/>; I don't have much 
experience with writing other tests so my perspective is biased towards 
this), the approach I took was to eliminate almost all boilerplate from 
the hand-written input for each test, and move the complexity into a 
Python tool that converts them into executable code. So there's a single 
hand-written file, about ten thousand lines long, containing test case 
specifications like:

- name: 2d.drawImage.3arg
  testing:
    - 2d.drawImage.defaultsource
    - 2d.drawImage.defaultdest
  images:
    - red.png
    - green.png
  code: |
    ctx.drawImage(document.getElementById('green.png'), 0, 0);
    ctx.drawImage(document.getElementById('red.png'), -100, 0);
    ctx.drawImage(document.getElementById('red.png'), 100, 0);
    ctx.drawImage(document.getElementById('red.png'), 0, -50);
    ctx.drawImage(document.getElementById('red.png'), 0, 50);

    @assert pixel 0,0 ==~ 0,255,0,255;
    @assert pixel 99,0 ==~ 0,255,0,255;
    @assert pixel 0,49 ==~ 0,255,0,255;
    @assert pixel 99,49 ==~ 0,255,0,255;
  expected: green

using YAML syntax, giving: a name (in an arbitrary but useful 
hierarchy); a list of named spec sentences (described in a separate 
file) that this test is testing conformance to; a list of images to load 
before running the test; some JS code to execute, with some special 
syntax that means "the pixel at 0,0 must be approximately (+/- 2) equal 
to rgba(0,255,0,255)"; and then a specification of the expected output 
image, either the common keyword "green" or else some Python+Pycairo 
code that generates the image.
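
To give a rough idea of the kind of preprocessing involved, here is a
minimal sketch of how the "@assert pixel" lines could be rewritten into
plain JS. It is not the actual tool code: the regular expression, the
function name expand_asserts and the JS helper name assertPixelApprox
are all assumptions made up for illustration, and only the "==~" form
shown above is handled.

```python
import re

# Matches directives such as: @assert pixel 99,49 ==~ 0,255,0,255;
# (Hypothetical pattern; only the "==~" form from the example is covered.)
ASSERT_PIXEL = re.compile(
    r"@assert pixel (\d+),(\d+) ==~ (\d+),(\d+),(\d+),(\d+);"
)


def expand_asserts(code, tolerance=2):
    """Rewrite each '@assert pixel' directive into a call to a JS helper."""
    def to_js(match):
        x, y, r, g, b, a = match.groups()
        # assertPixelApprox is a hypothetical JS helper name.
        return "assertPixelApprox(%s, %s, [%s, %s, %s, %s], %d);" % (
            x, y, r, g, b, a, tolerance)
    return ASSERT_PIXEL.sub(to_js, code)
```

Ordinary JS lines such as the ctx.drawImage calls pass through
unchanged, so the hand-written input stays free of boilerplate.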

(Reftest-style precise comparison is unsuitable for all the canvas 
tests: browsers have freedom in antialiasing and rounding etc, so their 
output won't precisely match the expected output image, hence the 
approximate tests of a small set of pixels.)
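
The approximate comparison itself is simple. As a sketch (the function
names and the get_pixel callback are hypothetical, and the default
tolerance of 2 just mirrors the "+/- 2" mentioned above):

```python
def pixel_approx_equal(actual, expected, tolerance=2):
    """True if every RGBA channel is within `tolerance` of the expected value."""
    return all(abs(a - e) <= tolerance for a, e in zip(actual, expected))


def check_samples(get_pixel, samples, tolerance=2):
    """Return the sample points whose colour falls outside the tolerance.

    get_pixel(x, y) is a hypothetical callback returning an (r, g, b, a)
    tuple; samples is a list of ((x, y), (r, g, b, a)) pairs.
    """
    return [(x, y) for (x, y), expected in samples
            if not pixel_approx_equal(get_pixel(x, y), expected, tolerance)]
```

An empty list from check_samples would mean every sampled pixel was
close enough to its expected colour.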

Then there's a thousand lines of Python and JS to transform these into 
executable tests (each becomes a standalone file containing lots of 
information about the test, and another visually-simpler version for 
giving an overview of lots of tests simultaneously, and another 
visually-simplest version for easy pass/fail verification, and also a 
Mozilla Mochitest version) and to quickly collect results from browsers 
(it detects results automatically where possible, and otherwise requires 
the user to press 'y'/'n' if two images look similar/different) and then 
combine the results into 
<http://philip.html5.org/tests/canvas/suite/tests/results.html>.

I think this approach has been quite effective so far - there's enough 
commonality between canvas tests that they can all fit into this 
framework without stretching it too much, and it made it easy for me to 
write hundreds of tests and to scan through the source file and update 
tests when the spec changed, and the Python tool was easily 
adapted(/hacked) to generate test files in a new format for Mozilla's 
automated testing system. Much of the tool code is of no value for 
anything except canvas tests, but that's okay because it's sufficiently 
valuable for canvas tests.


One difficulty is that I'm the only person who can update tests, and 
also there's no process for review. Ideally it would be easy for other 
people to make and deploy changes without any involvement from me. 
Making use of a standardised centralised test suite system would be 
great, because I'm too lazy to write any of that myself. But the people 
editing tests should be editing the YAML source file, not any kind of 
boilerplateful processed output. So I guess the test suite system would 
have to incorporate the canvas-specific Python processing tool. That 
sounds potentially complex and nasty, but I don't see any other way to 
achieve the goal of maximally simplifying test case development, so 
maybe it's inevitable.

That is probably the most serious design decision I see for the test 
suite system - should it have a single standard test case format for 
every test in the whole universe, or hundreds of different formats with 
their own processing tools that output a common format, or hundreds of 
different formats with no common format and each with their own 
test-runners, or something in between? And if it's anything other than 
the first option, how will the processing tools be written and executed 
and maintained?

(With my current approach, there's also the difficulty that nobody but 
me understands the test format or the processing code, since they're 
somewhat idiosyncratic, but hopefully that could be resolved if there 
were some simplification and documentation...)


A few random comments about / potential additions (if people agree) to 
<http://omocha.w3.org/wiki/wishes>:

Is avoidance of test duplication a goal? e.g. if two people 
independently developed test suites for the same section of the spec, 
would it be best to just stick all the test cases into the official test 
suite (which is easy to do, and ensures as many requirements as possible 
are tested, though some will be tested twice (and every test needs to be 
reviewed and maintained)), or is it best to carefully merge them so 
every test case is distinct and necessary? The same situation occurs if 
e.g. a canvas test suite tests that videos can be drawn onto it, and a 
video test suite tests that it can be drawn onto a canvas.

Duplicates aren't useful; the question is whether they are harmful, to 
an extent that makes de-duplication worthwhile. I have no data or 
experience to know the best balance, but it seems like something there 
should eventually be a clear policy on.

It probably should always be possible to point people at a URL of a 
single test, that executes the test in their browser and lets them know 
if it's passed - that's very useful when submitting bug reports or 
discussing bugs. Ideally no test should rely on an external test-runner 
(though it should have one of those too).

The "under review" -> {"approved", "rejected"} approach doesn't quite 
seem adequate, because specs will change (while in CR, or while in Rec 
with errata) and approved tests might become invalid, but it would be 
wasteful to send every single test back to "under review" just because 
of a single change in the spec. Maybe there needs to be some "approved 
and probably still valid since the spec changed but it shouldn't affect 
this" and "approved and possibly invalid since the spec changed and 
might affect this but nobody has reviewed it again carefully yet" 
states, or similar, for things that just need a quick check before being 
considered "approved" again.
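
As a sketch of what I mean (the state names and the rule are only an
assumption about how it might work, not a proposal for the exact
design): if each test lists the spec sentences it covers, as the canvas
tests do in their "testing" field, then a spec change could move
approved tests into one of the two intermediate states automatically.

```python
from enum import Enum


class ReviewState(Enum):
    # The first three come from the wishes page; the last two are the
    # hypothetical intermediate states described above.
    UNDER_REVIEW = "under review"
    APPROVED = "approved"
    REJECTED = "rejected"
    APPROVED_PROBABLY_VALID = "approved, spec changed, probably unaffected"
    APPROVED_POSSIBLY_INVALID = "approved, spec changed, possibly affected"


APPROVED_LIKE = {
    ReviewState.APPROVED,
    ReviewState.APPROVED_PROBABLY_VALID,
    ReviewState.APPROVED_POSSIBLY_INVALID,
}


def state_after_spec_change(state, tested_sentences, changed_sentences):
    """Return the new state of a test after some spec sentences changed."""
    if state not in APPROVED_LIKE:
        return state  # unreviewed or rejected tests are left alone
    affected = bool(set(tested_sentences) & set(changed_sentences))
    if affected or state is ReviewState.APPROVED_POSSIBLY_INVALID:
        return ReviewState.APPROVED_POSSIBLY_INVALID
    return ReviewState.APPROVED_PROBABLY_VALID
```

A quick human check would then move a test from either intermediate
state back to "approved" (or to "rejected").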

Performance is not stated as a desire (except as a consequence of 
parallelism), but it probably should be. E.g. originally the canvas 
tests were imported into Mozilla's automated system as a load of 
individual files, but they were merged into a single giant file 
containing all the tests to minimise the page-loading overhead. The 
faster the tests are, the more likely they are to be run, so it seems an 
important concern.

Some tests are not strictly either script-verified or human-verified. 
E.g. most of the canvas tests use getImageData to automatically verify 
the output, but in some cases that might not work (a browser might not 
implement getImageData at all, or might have bugs that prevent it 
working in certain obscure situations). It doesn't seem helpful to 
penalise browsers for getImageData bugs when the test is meant to be 
testing something totally unrelated, so the tests dynamically fall back 
on human verification. It would be nice to retain support for that, 
instead of requiring tests to be predefined as either automatic or manual.
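
The fallback logic is roughly the following, sketched here in Python
rather than the JS the tests actually use. The names run_check,
auto_check and ask_human are hypothetical, and treating any exception
from the automatic check as "cannot verify automatically" is a
simplifying assumption.

```python
def run_check(auto_check, ask_human):
    """Try automatic verification first; fall back to a human if it cannot run.

    auto_check() returns True/False, or raises if the automatic check
    itself is unavailable (e.g. getImageData is not implemented).
    ask_human() returns True if the person judges the images similar.
    Returns a (passed, how) pair so results record how they were obtained.
    """
    try:
        return auto_check(), "automatic"
    except Exception:
        # The automatic path is broken for reasons unrelated to the
        # feature under test, so ask the user instead of failing.
        return ask_human(), "manual"
```

Recording "automatic" or "manual" alongside the result means a test
doesn't have to be classified as one or the other in advance.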

> Regards,
> 
> Philippe

-- 
Philip Taylor
pjt47@cam.ac.uk
Received on Tuesday, 22 September 2009 15:40:27 UTC