Re: HTML5 Test Development Estimation

(apologies to people on public-html-testsuite who now have this reply 
many times.)

On 2013-08-01 03:08, Tobie Langel wrote:

> On Wednesday, July 31, 2013 at 12:35 AM, Kevin Kershaw wrote:

>> First off, our team is specifically looking at building tests for 
>> designated subsections of HTML5 section 4.8. We originally identified 
>> video, audio, track, and media elements in our scope but added the 
>> source element and Dimension Attributes because of the tight coupling 
>> we see between these. Also, we’ve excluded some Media Element 
>> subsections (e.g. MediaControl) for our initial work. We started a 
>> “bottom-up” analysis of the target sections, working to identify what 
>> looked to us to be “testable” requirements in the spec. The 
>> subsections of the spec itself divide up pretty nicely by individual 
>> paragraphs. That is, each paragraph usually lists one or more 
>> potential test conditions. We did some basic tabulation of
>> requirements within each paragraph to come up with a count of 
>> potential tests. I’ve included the spreadsheet we constructed to 
>> assist this process in this email. That sheet is pretty 
>> self-explanatory but if you have questions, I’m more than happy to 
>> answer. Our analysis was done by several different engineers, each of 
>> whom had slightly different ideas about how to count “tests” but the 
>> goal here was to produce an approximation, not a perfectly accurate 
>> list.
> It's great someone took the time for this bottom-up approach which
> will help validate our initial assumptions.

I don't know if it's intentional, but this spreadsheet hasn't been 
forwarded to public-test-infra.

For the purposes of comparison, there is already a submission of a 
large number of media tests written by Simon Pieters at Opera [1]. I'm 
not sure if it covers exactly the same section of the spec that you are 
interested in, but it contains 672 files, which is > 672 tests (some 
files contain more than one test, and there seem to be relatively few 
support files). It does contain some testing of the IDL sections, but 
this makes sense as a) it likely predates idlharness.js and b) 
idlharness.js can't test everything.
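
As an aside on the file/test counts above: a single testharness.js 
file can contain any number of test() calls, which is why the number 
of tests exceeds the number of files. An illustrative sketch (my own, 
not a file from the submission):

<!DOCTYPE html>
<title>multiple tests per file (sketch)</title>
<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
<script>
// Each test() call below counts as a separate test.
test(function() {
  assert_true(document.createElement("video") instanceof
              HTMLMediaElement);
}, "video is a media element");

test(function() {
  assert_true(document.createElement("audio") instanceof
              HTMLMediaElement);
}, "audio is a media element");
</script>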

One useful thing you could do as a first step would be to review these 
tests. That would have several positive effects: it would allow you to 
compare your estimation methodology against an existing submission, it 
would reduce the number of tests that you have to write, and it would 
provide a good guide to the style of tests written by someone with a 
great deal of experience testing browser products and using the W3C 
infrastructure.

If you do decide to go ahead with this review, I would strongly suggest 
that you consider using the critic tool [2]: the submission is rather 
large, and critic has a number of features that will make the review 
easier, notably the ability to mark which files have been reviewed and 
which issues have been addressed. On the other hand, if you prefer to 
use the GitHub UI, that is OK as well.

To review the media tests using critic, you will need to set up a 
filter marking yourself as a reviewer for 
"html/semantics/embedded-content-0/media-elements/". If you need help 
with critic (or indeed anything else), please ask me on #testing.

> The process you describe above seems sound, and I was at first quite
> surprised by the important difference between the output of the two
> methodologies. That is, until I looked at the estimated time you
> consider an engineer is going to take to write a test: 8 hours. We've
> accounted for 1h to write a test and 15 minutes to review it.

I would be interested to know what kind of thing you have in mind when 
you talk about a "test". Is it a normal javascript test or reftest of 
the kind that we are used to running on desktop browsers? I know that 
tests for consumer devices are sometimes more complex to write because 
there are extra requirements, such as running on production hardware. 
Perhaps this kind of difference could account for the very different 
estimates of time per test?
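
For concreteness, by a javascript test I mean a testharness.js file of 
the kind sketched above, and by a reftest I mean a pair of files that 
must render identically. A made-up sketch (the file names and markup 
are mine, but note that the dimension attributes are in your scope):

<!-- video-dimensions.html: the test file -->
<!DOCTYPE html>
<title>dimension attributes set the rendered size</title>
<link rel="match" href="video-dimensions-ref.html">
<video width="100" height="100"></video>

<!-- video-dimensions-ref.html: the reference, which must render
     pixel-for-pixel the same as the test file -->
<!DOCTYPE html>
<video style="width: 100px; height: 100px;"></video>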

>> · We excluded tests of the IDL from both the W3C and CableLabs 
>> estimates under the assumption that the IDLHarness will generate IDL 
>> tests automatically.

idlharness.js can't autogenerate tests for everything interesting, but 
it might be a reasonable first approximation.
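
For the parts it can cover, usage looks something like the following 
sketch (the IDL here is heavily truncated and purely illustrative, not 
a copy of the spec's IDL):

<!DOCTYPE html>
<title>media element IDL tests (sketch)</title>
<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
<script src="/resources/WebIDLParser.js"></script>
<script src="/resources/idlharness.js"></script>
<script>
var idl_array = new IdlArray();
// Interfaces we depend on but don't test directly.
idl_array.add_untested_idls("interface HTMLElement {};");
// Truncated IDL for the interfaces under test.
idl_array.add_idls(
    "interface HTMLMediaElement : HTMLElement {" +
    "  DOMString canPlayType(DOMString type);" +
    "  readonly attribute unsigned short readyState;" +
    "};" +
    "interface HTMLVideoElement : HTMLMediaElement {" +
    "  readonly attribute unsigned long videoWidth;" +
    "};");
// Concrete objects to check the declared members against.
idl_array.add_objects({
    HTMLVideoElement: ["document.createElement('video')"]
});
idl_array.test();
</script>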

>> · We accounted for some tests around algorithms but believe that many 
>> algorithm steps, especially intermediate steps, do not require 
>> separate tests.

Algorithms are black boxes; the requirement is that the UA's behaviour 
is black-box indistinguishable from that of the algorithm. This makes 
it very difficult to tell how many tests are needed; for example, the 
HTML parser section of the spec essentially says "parsers must act as 
if they follow the following algorithm: [huge multi-step state 
machine]", and that requires thousands of tests. There are also tests 
one can write that correspond to the ordering of steps in an algorithm 
rather than to the explicit steps themselves. For example, if an 
algorithm first measures the length of some input list and then does 
something for each index up to that initial measurement, one can write 
tests to ensure that the measurement happens before any list item is 
accessed, that the expected thing happens if the list is mutated during 
iteration to become longer or shorter than the initial measurement, and 
so on.
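
As a contrived sketch of that last kind of test, here is roughly what 
the ordering check could look like with testharness.js; processList is 
a local stand-in for whatever UA operation the spec actually defines 
(in a real test you would invoke the operation under test instead):

<!DOCTYPE html>
<title>step-ordering test (sketch)</title>
<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
<script>
// Toy algorithm standing in for the one under test: snapshot the
// length, then visit each index up to that snapshot.
function processList(list) {
  var n = list.length;
  for (var i = 0; i < n; i++) {
    void list[i];
  }
}

test(function() {
  var accesses = [];
  // An array-like object whose getters record the order in which
  // the algorithm touches it.
  var list = {
    get length() { accesses.push("length"); return 2; },
    get 0() { accesses.push(0); return "a"; },
    get 1() { accesses.push(1); return "b"; }
  };
  processList(list);
  assert_array_equals(accesses, ["length", 0, 1],
                      "length must be read before any item");
}, "list length is measured before iteration begins");
</script>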

>> · We subtracted out the number of existing, approved tests in the 
>> GitHub repository that were associated with our target sections in 
>> order to come up with a count of “remaining” tests to be developed.

In this case there are a large number of unapproved tests, as discussed 
above.

>> · We assumed that a suitable test harness and driver will be 
>> available to run the set of developed tests. I understand there’s 
>> significant work to be done on that infrastructure but that’s not part 
>> of this little exercise.

Yeah. In particular, browser vendors typically have their own automation 
harnesses.

[1] https://github.com/w3c/web-platform-tests/pull/93
[2] https://critic.hoppipolla.co.uk/r/74
