Coverage analysis from Robin Berjon on 2013-02-11 (public-html-testsuite@w3.org from February 2013)

From: Robin Berjon <robin@w3.org>
Date: Mon, 11 Feb 2013 16:47:39 +0100
To: "'public-html-testsuite@w3.org'" <public-html-testsuite@w3.org>
CC: public-test-infra <public-test-infra@w3.org>
Message-ID: <5119129B.7010305@w3.org>

Hi all,

a couple of weeks ago we had a meeting about testing. One of the things 
that came out of it was that it would be helpful to get a feel for the 
coverage level that we have for specs, and for larger specs to have that 
coverage per section, along with other measures to contrast the number 
of tests with.

I've now done this analysis for the HTML and Canvas specs (I would have 
done Microdata too, but it doesn't seem to have approved tests yet).

You can see it here, but be warned that you might not understand it 
without reading the notes below:

     http://w3c-test.org/html-testsuite/master/tools/coverage/

I'm copying public-test-infra; in case anyone wants to do the same for 
other specs I'd be happy to collaborate. If people think it would be 
useful to provide such data on a regular basis, we can certainly 
automate it. Note that for this purpose having the data in one big repo 
would help.

Some notes:

• I used the master specs, which means that this data is actually for 
5.1 rather than 5.0. I can of course run the same to target the 5.0 CR 
(and will). It makes no difference to the script.

• I'm not claiming that all the metrics shown are useful. I'm including 
them because they were reasonably easy to extract (the hard part here is 
actually figuring out what's a section in the spec's body). Mike 
suggested that "number of examples" could also be used, which I think is 
an idea worth exploring.

• The metrics work this way:
   - number of words: I'm basically splitting on a simplistic idea of 
word boundary. I don't think it matters because we're not doing NLP.
   - RFC2119: I'm looking for both must and should, and giving them 
equal weight. It could be argued that one could disregard should, but it 
could equally be argued that any manner of optionality actually requires 
more testing.
   - algorithm steps: I'm counting "ol li". I think this is actually one 
of the most useful metrics.
   - IDL item: I remove empty lines, comments, lines that just close a 
structure (e.g. };) and then just count the lines. I could do something 
more complex based on a parser, but I don't think it would give 
different results.

• Some parts are weird: I essentially remove every section that is 
marked as "non-normative". In some cases (e.g. the introduction) all 
subsections of a section are non-normative, but the section itself isn't 
marked that way. I'll fix my algorithm to further remove sections that 
are left just having a title. I'll also special-case things like 
references and acknowledgements that aren't marked as NN but should be 
removed.

• The non-normative removal is rather simple too. Any section that is 
flagged as non-normative gets removed, as do examples, IDL fragments 
(restated from the complete thing), and "DOMintro" stuff.

• I index specifications at a maximum section depth of 3 (this matches 
the directory depth used in the test suite). The first form on the page 
allows you to get a higher-level view.

• I picked *completely* arbitrary thresholds for deciding whether the 
various metrics are flagged good or bad. You can change them in the form.

• Canvas is looking good even with relatively stringent settings. HTML 
less so :)
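
For anyone wanting to try the same on another spec, the four metrics 
described in the notes above could be sketched roughly as follows. This 
is not the actual coverage script, only an illustration under several 
assumptions: the names SectionScanner, count_idl_items and 
section_metrics are made up for this sketch, a "word" is taken to be any 
run of word characters, must/should are matched case-insensitively, and 
the IDL for a section is assumed to be passed in separately as plain text.

```python
import re
from html.parser import HTMLParser


class SectionScanner(HTMLParser):
    """Collects text and counts <li> elements that have an <ol> ancestor.

    Hypothetical helper: this approximates the "ol li" selector by
    tracking how many <ol> elements are currently open.
    """

    def __init__(self):
        super().__init__()
        self.ol_depth = 0
        self.algorithm_steps = 0
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self.ol_depth += 1
        elif tag == "li" and self.ol_depth > 0:
            self.algorithm_steps += 1

    def handle_endtag(self, tag):
        if tag == "ol" and self.ol_depth > 0:
            self.ol_depth -= 1

    def handle_data(self, data):
        self.text_parts.append(data)


def count_idl_items(idl):
    """Count IDL lines, skipping blanks, comments and closing lines."""
    # Assumption: comments are /* ... */ blocks or // to end of line.
    idl = re.sub(r"/\*.*?\*/", "", idl, flags=re.DOTALL)
    count = 0
    for line in idl.splitlines():
        line = re.sub(r"//.*", "", line).strip()
        # Skip empty lines and lines that only close a structure, e.g. "};"
        if not line or re.fullmatch(r"[\]})]+;?", line):
            continue
        count += 1
    return count


def section_metrics(html, idl=""):
    """Return the four per-section metrics for one section's markup."""
    scanner = SectionScanner()
    scanner.feed(html)
    scanner.close()
    text = " ".join(scanner.text_parts)
    return {
        # Simplistic word boundary: any run of word characters.
        "words": len(re.findall(r"\w+", text)),
        # must and should are given equal weight.
        "rfc2119": len(re.findall(r"\b(?:must|should)\b", text, re.IGNORECASE)),
        "algorithm_steps": scanner.algorithm_steps,
        "idl_items": count_idl_items(idl),
    }
```

Calling section_metrics(section_html, section_idl) on each section, 
after the non-normative parts have been stripped, would give one row of 
numbers to set against the test count for that section. A real run would 
also need the section-splitting step, which is the hard part and is not 
attempted here.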

-- 
Robin Berjon - http://berjon.com/ - @robinberjon

Received on Monday, 11 February 2013 15:47:52 UTC