- From: Philip Taylor <pjt47@cam.ac.uk>
- Date: Tue, 22 Sep 2009 16:39:45 +0100
- To: Philippe Le Hegaret <plh@w3.org>
- CC: public-html@w3.org
Philippe Le Hegaret wrote:
> A few of us got together recently with the idea of improving the state
> of Web browser testing at W3C. Since this Group is discussing the
> creation of an effort for the purpose of testing the HTML specification,
> this is relevant here as well:
> [...]
(I assume you meant to link to http://omocha.w3.org/ somewhere in this
email.)
My first impression is that this sounds great! It seems to be focusing
on what I see as perhaps the most important goal (improving
interoperability between implementations), and perhaps the most
important challenge (scaling the process to cope with the complexity of
modern specs and the necessary depth of testing).
Apologies for some long rambling thoughts:
I like automation - if there are going to be hundreds or thousands of test
cases, I expect the overall effort will be minimised if each test case is
as simple as possible to write, review and run, even if that requires a
great deal of automation tool support. It also means that once the tools
are developed, adding a new test case is very cheap, so people have fewer
excuses not to write tests.
When writing some HTML5 canvas tests a while ago
(<http://philip.html5.org/tests/canvas/suite/tests/>; I don't have much
experience with writing other tests so my perspective is biased towards
this), the approach I took was to eliminate almost all boilerplate from
the hand-written input for each test, and move the complexity into a
Python tool that converts them into executable code. So there's a single
hand-written file, about ten thousand lines long, containing test case
specifications like:
- name: 2d.drawImage.3arg
  testing:
    - 2d.drawImage.defaultsource
    - 2d.drawImage.defaultdest
  images:
    - red.png
    - green.png
  code: |
    ctx.drawImage(document.getElementById('green.png'), 0, 0);
    ctx.drawImage(document.getElementById('red.png'), -100, 0);
    ctx.drawImage(document.getElementById('red.png'), 100, 0);
    ctx.drawImage(document.getElementById('red.png'), 0, -50);
    ctx.drawImage(document.getElementById('red.png'), 0, 50);
    @assert pixel 0,0 ==~ 0,255,0,255;
    @assert pixel 99,0 ==~ 0,255,0,255;
    @assert pixel 0,49 ==~ 0,255,0,255;
    @assert pixel 99,49 ==~ 0,255,0,255;
  expected: green
using YAML syntax, giving: a name (in an arbitrary but useful
hierarchy); a list of named spec sentences (described in a separate
file) that this test is testing conformance to; a list of images to load
before running the test; some JS code to execute, with some special
syntax that means "the pixel at 0,0 must be approximately (+/- 2) equal
to rgba(0,255,0,255)"; and then a specification of the expected output
image, either the common keyword "green" or else some Python+Pycairo
code that generates the image.
(Reftest-style precise comparison is unsuitable for all the canvas
tests: browsers have freedom in antialiasing, rounding, etc., so their
output won't precisely match the expected output image, hence the
approximate tests of a small set of pixels.)
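To give a flavour of what the conversion does with the @assert syntax (a
simplified illustration - the assertPixelApprox helper name is invented for
this example rather than being the tool's real output), the @assert lines
get rewritten into ordinary JS calls, roughly like:

import re

# Expand "@assert pixel x,y ==~ r,g,b,a;" into a call to a JS helper
# that reads the pixel back with getImageData and allows a tolerance.
ASSERT_PIXEL = re.compile(
    r"@assert pixel (\d+),(\d+) ==~ (\d+),(\d+),(\d+),(\d+);")

def expand_asserts(code):
    def repl(m):
        x, y, r, g, b, a = m.groups()
        return ("assertPixelApprox(canvas, %s, %s, [%s, %s, %s, %s], 2);"
                % (x, y, r, g, b, a))
    return ASSERT_PIXEL.sub(repl, code)

# expand_asserts("@assert pixel 0,0 ==~ 0,255,0,255;")
# -> "assertPixelApprox(canvas, 0, 0, [0, 255, 0, 255], 2);"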
Then there's a thousand lines of Python and JS to transform these into
executable tests and to collect the results. Each test becomes a standalone
file containing lots of information about the test, plus a visually simpler
version for getting an overview of lots of tests simultaneously, a visually
simplest version for easy pass/fail verification, and a Mozilla Mochitest
version. The result-collection side quickly gathers results from browsers:
it detects results automatically where possible, otherwise asks the user to
press 'y'/'n' depending on whether two images look similar or different,
and then combines the results into
<http://philip.html5.org/tests/canvas/suite/tests/results.html>.
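For a rough feel of the conversion step, here's a heavily simplified sketch
(not the actual tool: it assumes PyYAML, the template and output layout are
invented for illustration, and the real generated pages carry much more
information, load the listed images first, and so on):

import os
import yaml

# Emit one standalone HTML page per test from the hand-written YAML file.
# Assertion expansion (as sketched earlier) would be applied to test['code'].
TEMPLATE = """<!DOCTYPE html>
<title>%(name)s</title>
<canvas id="c" width="100" height="50"></canvas>
<script>
var canvas = document.getElementById('c');
var ctx = canvas.getContext('2d');
%(code)s
</script>
"""

def generate(source_path):
    with open(source_path) as f:
        tests = yaml.safe_load(f)
    os.makedirs('tests', exist_ok=True)
    for test in tests:
        with open('tests/%s.html' % test['name'], 'w') as out:
            out.write(TEMPLATE % {'name': test['name'], 'code': test['code']})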
I think this approach has been quite effective so far: there's enough
commonality between canvas tests that they can all fit into this framework
without stretching it too much; it made it easy for me to write hundreds of
tests, and to scan through the source file and update tests when the spec
changed; and the Python tool was easily adapted(/hacked) to generate test
files in a new format for Mozilla's automated testing system. Much of the
tool code is of no value for anything except canvas tests, but that's okay
because it's sufficiently valuable for canvas tests.
One difficulty is that I'm the only person who can update tests, and
also there's no process for review. Ideally it would be easy for other
people to make and deploy changes without any involvement from me.
Making use of a standardised centralised test suite system would be
great, because I'm too lazy to write any of that myself. But the people
editing tests should be editing the YAML source file, not any kind of
boilerplateful processed output. So I guess the test suite system would
have to incorporate the canvas-specific Python processing tool. That
sounds potentially complex and nasty, but I don't see any other way to
achieve the goal of maximally simplifying test case development, so
maybe it's inevitable.
That is probably the most serious design decision I see for the test
suite system: should it have a single standard test case format for
every test in the whole universe, or hundreds of different formats with
their own processing tools that output a common format, or hundreds of
different formats with no common format and each with its own
test-runner, or something in between? And if it's anything other than
the first option, how will the processing tools be written, executed
and maintained?
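(Purely to illustrate the middle option, not as a proposal: each format's
own processing tool could emit a manifest in some agreed shape that a single
generic runner consumes, so the runner never has to understand the
hand-written source formats. The field names here are invented.)

import json

# Hypothetical shared output shape that every per-format tool could emit.
manifest = {
    "suite": "canvas",
    "tests": [
        {"id": "2d.drawImage.3arg",
         "url": "tests/2d.drawImage.3arg.html",
         "verification": "automatic"},   # or "manual"
    ],
}

with open("canvas.manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)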
(With my current approach, there's also the difficulty that nobody but
me understands the test format or the processing code, since they're
somewhat idiosyncratic, but hopefully that could be resolved if there
was some simplification and documentation...)
A few random comments about / potential additions (if people agree) to
<http://omocha.w3.org/wiki/wishes>:
Is avoidance of test duplication a goal? E.g. if two people
independently developed test suites for the same section of the spec,
would it be best to just stick all the test cases into the official test
suite (which is easy to do, and ensures as many requirements as possible
are tested, though some will be tested twice, and every test still needs
to be reviewed and maintained), or is it best to carefully merge them so
every test case is distinct and necessary? The same situation occurs if
e.g. a canvas test suite tests that videos can be drawn onto a canvas,
and a video test suite tests that a video can be drawn onto a canvas.
Duplicates aren't useful; the question is whether they are harmful to an
extent that makes de-duplication worthwhile. I have no data or
experience to know the best balance, but it seems like something there
should eventually be a clear policy on.
It should probably always be possible to point people at the URL of a
single test that executes in their browser and tells them whether it
passed - that's very useful when submitting bug reports or discussing
bugs. Ideally no test should rely on an external test-runner (though it
should have one of those too).
The "under review" -> {"approved", "rejected"} approach doesn't quite
seem adequate, because specs will change (while in CR, or while in Rec
with errata) and approved tests might become invalid, but it would be
wasteful to send every single test back to "under review" just because
of a single change in the spec. Maybe there needs to be some "approved
and probably still valid since the spec changed but it shouldn't affect
this" and "approved and possibly invalid since the spec changed and
might affect this but nobody has reviewed it again carefully yet"
states, or similar, for things that just need a quick check before being
considered "approved" again.
Performance is not stated as a desire (except as a consequence of
parallelism), but it probably should be. E.g. originally the canvas
tests were imported into Mozilla's automated system as a load of
individual files, but they were merged into a single giant file
containing all the tests to minimise the page-loading overhead. The
faster the tests are, the more likely they are to be run, so it seems an
important concern.
Some tests are not strictly either script-verified or human-verified.
E.g. most of the canvas tests use getImageData to automatically verify
the output, but in some cases that might not work (a browser might not
implement getImageData at all, or might have bugs that prevent it
working in certain obscure situations). It doesn't seem helpful to
penalise browsers for getImageData bugs when the test is meant to be
testing something totally unrelated, so the tests dynamically fall back
on human verification. It would be nice to retain support for that,
instead of requiring tests to be predefined as either automatic or manual.
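(As a rough sketch of that fallback - simplified, and with the function
shape invented for illustration rather than copied from the real tests -
the verification code the Python tool emits is along these lines:)

# JS emitted into a generated test (simplified sketch; names invented).
PIXEL_CHECK_JS = """
function checkPixel(canvas, x, y, expected, tolerance, onResult) {
    var data;
    try {
        data = canvas.getContext('2d').getImageData(x, y, 1, 1).data;
    } catch (e) {
        // getImageData is missing or broken for reasons unrelated to what
        // this test is about, so ask a human instead of reporting a failure.
        onResult('manual');
        return;
    }
    for (var i = 0; i < 4; i++) {
        if (Math.abs(data[i] - expected[i]) > tolerance) {
            onResult('fail');
            return;
        }
    }
    onResult('pass');
}
"""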
> Regards,
>
> Philippe
--
Philip Taylor
pjt47@cam.ac.uk