Re: Spec parser

On May 29, 2013, at 1:25 AM, Tobie Langel wrote:

> On Wednesday, May 29, 2013 at 10:19 AM, Linss, Peter wrote:
>>>> It wouldn't be very hard to make a stand-alone spec manager web app if you want a canonical instance running somewhere in W3C space, it would also be trivial to install a full Shepherd instance and simply not use the rest of it… (Shepherd is pretty much self-installing and self-maintaining these days, you just need a LAMP stack and a handful of Python libraries, like html5lib). I'd be happy to help with either.
>>> 
>>> I was hoping for something slightly more portable: a script I could pipe text to and which would return some tree I could then transform to JSON.
>> 
>> I could factor out the parser itself from the rest of the script, the bulk of the parser really just walks a DOM tree and the DB access is fairly isolated. I'm at the TAG f2f this week and the CSS f2f next, I might be able to scrape some time out next week but it'll probably be after.
> 
> That would be really sweet. Given the core value of this script is how it handles the gruesome discrepancies between specs, it would be great to be able to simply reuse it (and contribute to it) rather than reinvent the wheel.

Done. I pulled the spec parsing bits into [1]; all the Shepherd-specific code remains in SynchronizeSpec.py. There are no dependencies on any other Shepherd code.

To use it, instantiate a SpecificationParser (with an optional ui object if you want debug/warning messages), then call parseSpec() and/or parseDraft() (in that order). Afterwards call postProcess() (to connect the section structure and aggregate the statistics) and then getRootSection(). From the root section you can traverse the tree of section objects, each of which also has a list of anchors.
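Roughly, that looks like the sketch below (the import path, the draft URL, and the attribute names for a section's children and anchors are only illustrative — check specificationparser.py for the real ones):

    from specificationparser import SpecificationParser

    parser = SpecificationParser()   # optionally pass a ui object for debug/warning output
    parser.parseSpec('http://www.w3.org/TR/css3-background/Overview.html')
    parser.parseDraft('http://dev.w3.org/csswg/css3-background/Overview.html')   # optional, after parseSpec()
    parser.postProcess()             # connect the section structure and aggregate the statistics

    def walk(section, depth=0):
        print('%s%s' % ('  ' * depth, section))            # a section object
        for anchor in section.anchors:                      # each section has a list of anchors (attribute name assumed)
            print('%s%s' % ('  ' * (depth + 1), anchor))
        for child in section.sections:                      # child sections (attribute name assumed)
            walk(child, depth + 1)

    walk(parser.getRootSection())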

For both parseSpec() and parseDraft() it's best to pass the full URL of the spec, including the file name of the main page (e.g. http://www.w3.org/TR/css3-background/Overview.html). Passing the file name is optional, but it helps prevent some anchor duplication in multi-page specs. (For multi-page specs you only need to call it for the main page; it will find and load all the other pages.) You can also pass an optional callback which gets a dict of all the HTTP headers from the GET of the initial page; returning False from the callback prevents the spec from being parsed (Shepherd uses this to compare the last-modified date, etc…)
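For example (just a sketch — whether the callback is the second positional argument, and the exact casing of the header keys in the dict, may differ from this):

    last_seen = 'Tue, 04 Jun 2013 06:00:00 GMT'   # e.g. remembered from a previous run

    def headerCallback(headers):
        # skip re-parsing if the spec hasn't changed since the last run
        # (the header key casing here is a guess)
        if headers.get('last-modified') == last_seen:
            return False
        return True

    parser.parseSpec('http://www.w3.org/TR/css3-background/Overview.html', headerCallback)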

If you call both parseSpec() and parseDraft(), the result will be the union of the sections and anchors from both (those present only in the draft are flagged as such); where the same sections/anchors exist in both, the stats will come from the draft.
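So, for instance, you could pull out the draft-only material after traversing (the 'draftOnly' attribute name below is just a placeholder — check the section objects for whatever the flag is actually called):

    def draftOnlySections(section, found=None):
        # collect sections flagged as present only in the draft
        if found is None:
            found = []
        if getattr(section, 'draftOnly', False):   # hypothetical attribute name
            found.append(section)
        for child in section.sections:             # child sections, as above
            draftOnlySections(child, found)
        return found

    print(draftOnlySections(parser.getRootSection()))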

One caveat: the HTML parser uses a StringReader() helper class to interface with the older version of html5lib that's currently on our server. If you're using the current version of html5lib you may need to remove that (I factored the parsing into a separate method so it can easily be overridden).
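Overriding it would look something like this (the method name 'parseHTML' below is a placeholder — use whatever the factored-out method is actually called in specificationparser.py):

    import html5lib
    from specificationparser import SpecificationParser

    class ModernSpecificationParser(SpecificationParser):
        def parseHTML(self, text):   # placeholder name for the factored-out parse method
            # current html5lib accepts strings/bytes directly, no StringReader() shim needed
            return html5lib.parse(text, treebuilder='dom')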

Peter



[1] http://hg.csswg.org/dev/shepherd/file/tip/python/shepherd/specificationparser.py

Received on Wednesday, 5 June 2013 06:15:41 UTC