RE: Converting MDN Compat Data to JSON

Heya,

 

If you can wait until tomorrow, I can put up a slightly improved Node.js
version of the old script. I'll need to get some sleep first, as it's
already a bit late here in Germany. ;)

 

It basically generates a list of pages (via tags), grabs the compat tables
in HTML format and converts them to a nice JS object. That's already working
for basic tables; I'll add support for prefixes. That <input> table is a
beast, though!
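
To make the "HTML table to JS object" step concrete, here is a minimal sketch for a basic table. It is not the actual script: the real tool uses a jQuery-like library, whereas this stand-in uses plain regular expressions so it runs without dependencies. The names parseCompatTable and stripTags, the table layout (feature name in the first column, one browser per header cell) and the sample values are all assumptions made for illustration.

```javascript
// Hypothetical helper: removes markup from a cell and collapses whitespace.
// HTML entities are left undecoded in this sketch.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim();
}

// Hypothetical helper: turns one simple compat table into a plain object of
// the form { feature: { browser: cellText } }. Assumes the first row is the
// header and the first column holds the feature name.
function parseCompatTable(tableHtml) {
  const rowPattern = /<tr\b[^>]*>[\s\S]*?<\/tr>/gi;
  const cellPattern = /<(t[hd])\b[^>]*>[\s\S]*?<\/\1>/gi;

  const rows = (tableHtml.match(rowPattern) || []).map((row) =>
    (row.match(cellPattern) || []).map(stripTags)
  );
  if (rows.length === 0) return {};

  const [header, ...body] = rows;
  const browsers = header.slice(1);
  const result = {};
  for (const cells of body) {
    const feature = cells[0];
    result[feature] = {};
    browsers.forEach((browser, i) => {
      // Missing cells become null rather than undefined.
      result[feature][browser] = cells[i + 1] !== undefined ? cells[i + 1] : null;
    });
  }
  return result;
}

// Made-up sample data, not taken from MDN.
const sample = `
<table class="compat-table">
  <tr><th>Feature</th><th>BrowserA</th><th>BrowserB</th></tr>
  <tr><td>Basic support</td><td>1.0</td><td>4.0 <span>-webkit</span></td></tr>
</table>`;

console.log(parseCompatTable(sample));
```

Tables with rowspan/colspan or nested markup, such as the <input> one, would need more than this.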

 

The main thing left to do here is converting this internal JS object into
something that resembles the spec Doug linked... and some 'minor' things,
like taming the <input> compat table Janet linked and maybe finding a better
way to generate the list of pages. Tag lists are limited to 500 entries;
after removing duplicates from the lists for ['CSS', 'HTML5', 'API',
'WebAPI'], I counted about 1.2k pages or so.
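
The de-duplication across tag lists can be sketched as follows. The function name mergeTagLists and the input shape (one array of page identifiers per tag) are assumptions for illustration, not the tool's actual data structures.

```javascript
// Hypothetical helper: merges per-tag page lists into one list without
// duplicates, keeping the order in which pages were first seen.
function mergeTagLists(listsByTag) {
  const seen = new Set();
  for (const pages of Object.values(listsByTag)) {
    for (const page of pages) {
      seen.add(page);
    }
  }
  return [...seen];
}

// Made-up page identifiers; a page tagged both 'API' and 'WebAPI' is kept once.
const pages = mergeTagLists({
  CSS: ['css/page-one', 'css/page-two'],
  API: ['api/page-three'],
  WebAPI: ['api/page-three', 'api/page-four'],
});
console.log(pages.length);
```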

 

No worries, no harm has been or will be done to the MDN servers! There are
sensible delays between requests and the tool has caches, so no MDN requests
are actually needed to work on the HTML -> JS conversion or the conversion
to WPD format. I'll bundle a filled cache of politely requested data, so
there should be enough data to work on to start with. :)
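
One way the "delays plus cache" idea could look is sketched below. It is not the actual tool: createPoliteFetcher, fetchPage and delayMs are hypothetical names, and the page fetcher is passed in so that the cache logic can be tried without touching MDN at all.

```javascript
// Hypothetical sketch: wraps a page-fetching function so that requests run
// one at a time, with a pause between them, and repeated URLs are served
// from a cache. `cache` can be a pre-filled Map, e.g. loaded from a bundled
// cache file.
function createPoliteFetcher(fetchPage, delayMs, cache = new Map()) {
  let queue = Promise.resolve();
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

  return function get(url) {
    // Cache hit: no request and no waiting.
    if (cache.has(url)) return Promise.resolve(cache.get(url));

    const result = queue.then(async () => {
      // The cache may have been filled while this lookup was queued.
      if (cache.has(url)) return cache.get(url);
      const body = await fetchPage(url);
      cache.set(url, body);
      return body;
    });

    // Pause after each queued lookup, whether it succeeded or failed,
    // before the next queued lookup may start.
    queue = result.then(() => sleep(delayMs), () => sleep(delayMs));
    return result;
  };
}
```

With a pre-filled cache, work on the conversion steps never reaches fetchPage; only URLs missing from the cache would trigger a real request.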

 

Pat, can I interest you in working on the HTML parser or the conversion to
WPD format? The HTML parsing is really easy, as it's written in JS and uses
a jQuery-like library, as you will see from my code. Catch me here or on
IRC; if I'm not responding, I'm probably still sleeping! ;)

 

-fro

 

From: ptressel@uw.edu [mailto:ptressel@uw.edu] On Behalf Of Pat Tressel
Sent: Sunday, 12 January 2014 01:10
To: Doug Schepers
Cc: WebPlatform Community
Subject: Re: Converting MDN Compat Data to JSON

 

 

Unfortunately, MDN doesn't expose their compatibility data as JSON, so we'll
need to convert their HTML tables into JSON that matches our data model [2].
We already have a script that collects the data (again, as HTML tables) from
their site, but we need someone who can reformat and normalize that data.

The language used for this task is not important: it could be JavaScript,
Python, Ruby, Perl, PHP, or even C. I believe that the best approach may use
RegEx, but there might be a better way.

 

... 

I'd be inclined to use Python and Beautiful Soup.  The latter works on
intact web pages -- I'm not sure about isolated elements, but it would be
simple enough to tack on a minimal set of <html>, <head>, <body> tags.
...

(I'd be equally inclined to use JavaScript and jQuery if I were set up to
use them outside a browser. ...

 

The "right" tool is probably XSLT.  But it would probably be faster to get
it working in Python / Beautiful Soup.  ;-)

-- Pat 

 

Received on Sunday, 12 January 2014 01:16:42 UTC