
RE: HTML5 Test Development Estimation

From: Kevin Kershaw <K.Kershaw@cablelabs.com>
Date: Mon, 5 Aug 2013 16:33:34 +0000
To: James Graham <james@hoppipolla.co.uk>, Tobie Langel <tobie@w3.org>
CC: public-test-infra <public-test-infra@w3.org>, "public-html-media@w3.org" <public-html-media@w3.org>, "'public-html-testsuite@w3.org'" <public-html-testsuite@w3.org>, Takashi Hayakawa <T.Hayakawa@CableLabs.com>, Brian Otte <B.Otte@cablelabs.com>, Nishant Shah <N.Shah@cablelabs.com>
Message-ID: <E557E34E53296846B3E3EDF9A8640B192366B883@EXCHANGE.cablelabs.com>
Hi James -

Thanks for the feedback.  We really appreciate it and are happy to hear more from others on these mailing lists.

Let me address a couple of points in your email:

1) We're happy to (and will in the future) post questions like this to public-test-infra as well as public-html-media & public-html-testsuite.

2) We were aware of the tests in the Git directories under web-platform-tests/old-tests/submission.  Besides Opera, there look to be applicable media tests from Google and Microsoft as well.  We understood that all of these required review and some validation before they could enter the "approved" area, and it was our intention to review and use them as we could.  That said, we hadn't undertaken an extensive review of anything in this area when I wrote the previous email.  We've looked a bit more now.

WRT the Opera tests, I assume that they're good tests and correctly validate important parts of HTML5 media behavior.  I am concerned that there's no apparent traceability back to the spec-level requirements and not much embedded comment info in the test source.  In comparison, the tests under the Google and Microsoft submissions use href links to point back into the spec (although not always w/ the precision we'd like).  Without traceability, it's really tough to assess how much of the spec you have covered w/ testing.  Having some data about spec coverage is important to us.
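One lightweight way to get the kind of coverage data described above is to annotate each test with the spec anchor it exercises and tally tests per anchor. Below is a minimal sketch in plain JavaScript; the test names and anchor URLs are purely illustrative, not drawn from any actual submission:

```javascript
// Hypothetical test inventory: each entry names a test and the spec
// anchor it traces back to (anchors are illustrative examples).
const tests = [
  { name: "video.paused defaults to true",
    spec: "https://www.w3.org/TR/html5/embedded-content-0.html#dom-media-paused" },
  { name: "load() resets networkState",
    spec: "https://www.w3.org/TR/html5/embedded-content-0.html#dom-media-load" },
  { name: "load() aborts in-flight resource selection",
    spec: "https://www.w3.org/TR/html5/embedded-content-0.html#dom-media-load" },
];

// Tally tests per spec anchor so uncovered requirements stand out
// when the tally is compared against the full list of anchors.
const coverage = {};
for (const t of tests) {
  coverage[t.spec] = (coverage[t.spec] || 0) + 1;
}
console.log(coverage);
```

In web-platform-tests the same traceability is usually carried in the test file itself (e.g. a help link pointing at the spec section), which a script like this could harvest instead of a hand-maintained list.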

3) We'll look into the critic tool.

4) WRT "what is a test" and the time it takes to write one: I'm not married to the 8 hours and I'd be happy to change it if subsequent experience warrants.  The estimate is my approximation based on my experience w/ this kind of spec compliance testing.  For me, it's the aggregate of the time you spend figuring out what the spec is saying, coming up with a set of programmatic steps, like a story, that will validate an implementation of the spec's described behavior, coding up that story in the required language, and then testing the result.  Some tests take less than a day, sometimes you can clone a pattern to get lots of tests done quickly, but sometimes you might spend a couple of days covering a particularly gnarly scenario.

5) Algorithms were an interesting estimation case.  There was general agreement on our team that not all steps of an algorithm listed in the spec would have an associated test - at this point, no hard-and-fast rules, but use common sense.  For example, look for substeps whose results are exposed programmatically.  If they aren't, maybe the result can be inferred from some later step in the algorithm.  Or, where a state transition occurs, test by verifying attributes associated with the final state only.
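To illustrate that last heuristic, here is a minimal sketch in plain JavaScript (not testharness.js) of a toy resource-selection algorithm whose intermediate state transition is unobservable, so the test asserts only the attributes associated with the final state. The constants mirror the spec's networkState and error-code values, but the algorithm itself is deliberately simplified and hypothetical:

```javascript
// networkState values as defined for HTMLMediaElement.
const NETWORK_EMPTY = 0, NETWORK_LOADING = 2, NETWORK_NO_SOURCE = 3;

// Toy stand-in for a spec algorithm: it passes through an
// intermediate state that a test cannot reliably observe.
function runResourceSelection(media) {
  // Intermediate substep: transition to LOADING (not asserted on).
  media.networkState = NETWORK_LOADING;
  // Final substep: no usable source, so settle on NO_SOURCE and
  // record an error -- these are the programmatically visible results.
  if (media.sources.length === 0) {
    media.networkState = NETWORK_NO_SOURCE;
    media.error = { code: 4 /* MEDIA_ERR_SRC_NOT_SUPPORTED */ };
  }
  return media;
}

const media = runResourceSelection(
  { sources: [], networkState: NETWORK_EMPTY, error: null });
// Verify only the final-state attributes, not each intermediate substep.
console.log(media.networkState, media.error.code);
```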

Let us know if you think I'm really off base on any of this.

Thanks & regards,  

Kevin Kershaw
CableLabs



-----Original Message-----
From: James Graham [mailto:james@hoppipolla.co.uk] 
Sent: Thursday, August 01, 2013 5:15 AM
To: Tobie Langel
Cc: Kevin Kershaw; public-test-infra; public-html-media@w3.org; 'public-html-testsuite@w3.org'; Takashi Hayakawa; Brian Otte; Nishant Shah
Subject: Re: HTML5 Test Development Estimation

(apologies to people on public-html-testsuite who now have this reply many times.)

On 2013-08-01 03:08, Tobie Langel wrote:

> On Wednesday, July 31, 2013 at 12:35 AM, Kevin Kershaw wrote:

>> First off, our team is specifically looking at building tests for 
>> designated subsections of HTML5 section 4.8. We originally identified 
>> video, audio, track, and media elements in our scope but added the 
>> source element and Dimension Attributes because of the tight coupling 
>> we see between these. Also, we’ve excluded some Media Element 
>> subsections (e.g. MediaControl) for our initial work. We started a 
>> “bottom-up” analysis of the target sections, working to identify what 
>> looked to us to be “testable” requirements in the spec. The 
>> subsections of the spec itself divide up pretty nicely by individual 
>> paragraphs. That is, each paragraph usually lists one or more 
>> potential test conditions. We did some basic tabulation of
>> requirements within each paragraph to come up with a count of 
>> potential tests. I’ve included the spreadsheet we constructed to 
>> assist this process in this email. That sheet is pretty 
>> self-explanatory but if you have questions, I’m more than happy to 
>> answer. Our analysis was done by several different engineers, each of 
>> whom had slightly different ideas about how to count “tests” but the 
>> goal here was to produce an approximation, not a perfectly accurate 
>> list.
> It's great someone took the time for this bottom-up approach which 
> will help validate our initial assumptions.

I don't know if it's intentional, but this spreadsheet hasn't been forwarded to public-test-infra.

For the purposes of comparison, there is already a submission of a large number of media tests written by Simon Pieters at Opera [1]. I'm not sure if it covers exactly the same section of the spec that you are interested in, but it contains 672 files, which is > 672 tests (some files contain more than one test, and there seem to be relatively few support files). It does contain some testing of the IDL sections, but this makes sense as a) it likely predates idlharness.js and b) idlharness.js can't test everything.

One useful thing that you could do would be to review these tests as a first step. This would have several positive effects: it would allow you to compare your estimation methodology against an existing submission, it would reduce the number of tests that you have to write, and it would provide a good guide to the style of tests written by someone with a great deal of experience testing browser products and using the W3C infrastructure.

If you do decide to go ahead with this review, I would strongly suggest that you consider using the critic tool [2], since the submission is rather large, and critic has a number of features that will make this easier, notably the ability to mark which files have been reviewed and which issues have been addressed. On the other hand, if you prefer to use the GitHub UI, that is OK as well.

To review media tests using critic, you will need to set up a filter marking yourself as a reviewer for "html/semantics/embedded-content-0/media-elements/". If you need help with critic (or indeed anything else), please ask me on #testing.

> The process you describe above seems sound, and I was at first quite 
> surprised by the important difference between the output of the two 
> methodologies. That is, until I looked at the estimated time you 
> consider an engineer is going to take to write a test: 8 hours. We've 
> accounted for 1h to write a test and 15 minutes to review it.

I would be interested to know what kind of thing you are thinking of when you talk about a "test". Is it a normal javascript/reftest of the kind that we are used to running on desktop browsers? I know that tests for consumer devices are sometimes more complex to write because there are extra requirements, such as running on production hardware.
Perhaps this kind of difference could account for the very different estimates of time per test?

>> · We excluded tests of the IDL from both the W3C and CableLabs 
>> estimates under the assumption that the IDLHarness will generate IDL 
>> tests automatically.

idlharness.js can't autogenerate all the interesting tests, but it might be a reasonable first approximation.

>> · We accounted for some tests around algorithms but believe that many 
>> algorithm steps, especially intermediate steps, do not require 
>> separate tests.

Algorithms are black boxes; the requirement is that the UA behaviour is black-box indistinguishable from the algorithm. But this makes it very difficult to tell how many tests are needed; for example, the HTML parser section of the spec is basically "parsers must act as if they follow the following algorithm: [huge multi-step state machine]", and that requires thousands of tests. There are also tests one can write that correspond to the ordering of steps in an algorithm rather than to the explicit steps themselves. For example, if an algorithm first measures the length of some input list and then does something for each index up to the initial measurement, one can write tests to ensure that the measurement happens before the list is accessed, that the expected thing happens if the list is mutated during iteration to be longer or shorter than the initial measurement, and so on.
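That ordering case can be sketched in a few lines of plain JavaScript. The function below is a hypothetical stand-in for a spec algorithm that measures a list's length before iterating; the test mutates the list during iteration and verifies that only the initially measured items were processed:

```javascript
// Stand-in for a spec algorithm: measure the list length first,
// then process each index up to that initial measurement only.
function processUpToInitialLength(list, callback) {
  const initialLength = list.length; // measurement happens before iteration
  const processed = [];
  for (let i = 0; i < initialLength; i++) {
    processed.push(callback(list[i], list));
  }
  return processed;
}

const items = ["a", "b"];
const seen = processUpToInitialLength(items, (item, list) => {
  list.push(item + "'"); // mutate the list mid-iteration
  return item;
});

// Only the two items present at the initial measurement were
// processed, even though the list has since grown to four entries.
console.log(seen, items.length);
```

A conforming implementation and a non-conforming one (re-reading the length each pass) produce observably different results here, which is exactly the black-box distinguishability such a test relies on.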

>> · We subtracted out the number of existing, approved tests in the 
>> GitHub repository that were associated with our target sections in 
>> order to come up with a count of “remaining” tests to be developed.

In this case there are a large number of unapproved tests, as discussed above.

>> · We assumed that a suitable test harness and driver will be 
>> available to run the set of developed tests. I understand there’s 
>> significant work to be done on that infrastructure but that’s not 
>> part of this little exercise.

Yeah. In particular browser vendors typically have their own automation harnesses.

[1] https://github.com/w3c/web-platform-tests/pull/93

[2] https://critic.hoppipolla.co.uk/r/74


Received on Monday, 5 August 2013 16:34:08 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 6 January 2015 20:33:00 UTC