Re: Coverage analysis

On Feb 11, 2013, at 12:04 PM, Robin Berjon wrote:

> {snip}
> 
> Here, we couldn't use the same script for all specs because it has to understand specific conventions for things that are examples, non-normative, etc. In many cases, we could probably handle this without PhantomJS. For the HTML spec (and all specs derived directly from that source) we're looking at such a markup nightmare (the sections aren't marked up as such, you essentially have to resort to DOM Ranges to extract content usefully) that PhantomJS really is the only option.
> 
> I think there's no reason to despair though. If the HTML spec could be analysed, then others will be easier. For instance, all the required information is clearly marked up. We should be able to have a small number of spec "styles" that we can set up and use.
> 
> The output from that is spec-data-*.json. Assuming we can solve the above issue (which we can, it's just a bit of work), this too can be automated and dumped to a DB.

Robin,

Take a look at the spec parser I wrote for Shepherd[1]. It was designed to find all the anchors (to match test rel=help links against), but it also finds all the sections, identifies non-normative sections, and classifies anchor types ('section', 'dfn', 'abbr', 'other'). It finds all the sections in the HTML5 spec just fine (AFAICT), along with SVG and all the CSS specs I've thrown at it so far. It stores all the data in a DB (independent of Shepherd), and Shepherd has a JSON API for getting the spec data:
Sections only:
http://test.csswg.org/shepherd/api/spec?spec=html5&sections=1
All Anchors:
http://test.csswg.org/shepherd/api/spec?spec=html5&anchors=1
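
For anyone wanting to pull this into another tool, here is a rough sketch of a client for the two queries above. Only the endpoint and the spec, sections and anchors parameters come from the URLs above; the function names are made up for illustration, and since the shape of the returned JSON isn't described here, the sketch just decodes it and hands it back:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example URLs above.
API_BASE = "http://test.csswg.org/shepherd/api/spec"


def spec_url(spec, sections=False, anchors=False):
    """Build a query URL for one spec (hypothetical helper name)."""
    params = [("spec", spec)]
    if sections:
        params.append(("sections", 1))
    if anchors:
        params.append(("anchors", 1))
    return API_BASE + "?" + urlencode(params)


def fetch_spec_data(spec, **flags):
    """Fetch and decode the JSON response.

    The structure of the result is not assumed here; inspect it
    before relying on any particular keys.
    """
    with urlopen(spec_url(spec, **flags), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling spec_url("html5", sections=True) rebuilds the "Sections only" URL, and fetch_spec_data("html5", anchors=True) would request the full anchor list.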

It should be fairly straightforward to extend this to gather the additional data you're scraping from the specs. My thinking was that we should host a common DB for all the spec data for the other tools to use.

Peter

[1] http://hg.csswg.org/dev/shepherd/file/tip/python/SynchronizeSpec.py

Received on Monday, 11 February 2013 22:27:13 UTC