Re: Converting MDN Compat Data to JSON from Pat Tressel on 2014-01-11 (public-webplatform@w3.org from January 2014)

From: Pat Tressel <ptressel@myuw.net>
Date: Sat, 11 Jan 2014 07:41:19 -0800
To: Doug Schepers <schepers@w3.org>
Cc: WebPlatform Community <public-webplatform@w3.org>
Message-ID: <CABT-+2qe33Sm4Pia1XkF07_49iu2V1nGP8xAhbng+pO-WEjg5g@mail.gmail.com>

Doug --

> We need some coding help to convert HTML tables into JSON, for our
> compatibility data project.
>
> As I've explained elsewhere [1], we have several goals for our browser
> compatibility information:
>
> 1) collect the most accurate data we can, from multiple trusted sources
> 2) store the data all in JSON, available for anyone to use via our API
> 3) use a MediaWiki extension to automatically populate the right pages
> with their relevant data
>
> We've made some progress on this, such as developing a data model [2], but
> gotten stalled approaching the holidays. I'd like to find help to bring us
> across the finish line.
>
> We should do this in multiple passes. The first pass will simply be to
> populate the pages with at least one source of data; the best match for our
> page structure is MDN.
>
> Unfortunately, MDN doesn't expose their compatibility data as JSON, so
> we'll need to convert their HTML tables into JSON that matches our data
> model [2]. We already have a script that collects the data (again, as HTML
> tables) from their site, but we need someone who can reformat and normalize
> that data.
>
> The language used for this task is not important: it could be JavaScript,
> Python, Ruby, Perl, PHP, or even C. I believe that the best approach may
> use RegEx, but there might be a better way.
>
> So, I'm asking you all to help in one of a few ways:
>
> 1) If you think you might know how to do this, and have time and energy to
> see it through, please let us know!
>
> 2) If you think you might know someone who can help, please introduce us!
>
> 3) If you can't do the task, nor know someone who could, please help me
> refine this message so we can put the call out, explaining what we are
> doing and what we need.
>
> [1] http://lists.w3.org/Archives/Public/public-webplatform-tests/2013OctDec/0000.html
> [2] http://www.ronaldmansveld.nl/webplatform/compat_tables_datamodel.html
>
> Regards-
> -Doug
>

Does the script you have already crawl the site and pull out the tables
intact?

It looks like the MDN compatibility info is easy to find in their
pages.  (I spot-checked their HTML, CSS, and JavaScript references, and all
seem to have very regular structures.)  The desktop and mobile tables have
id="compat-desktop" and id="compat-mobile", respectively.  Not all pages
have both desktop and mobile tables, though.  All the pages I looked at
had only a "Basic support" row -- I wonder if some have additional rows.

I'd be inclined to use Python and Beautiful Soup.  The latter works on
intact web pages -- I'm not sure about isolated elements, but it would be
simple enough to tack on a minimal set of <html>, <head>, <body> tags.
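
As a rough illustration of the table-to-JSON step, here is a minimal
sketch.  It uses the html.parser module from Python's standard library
instead of Beautiful Soup, purely so that it runs without extra installs.
The class and function names, the output shape, and the sample cell values
are all made up for illustration; the output does not follow the data
model in [2], and the sketch assumes every <tr>, <th>, and <td> is
explicitly closed.

```python
import json
from html.parser import HTMLParser

# Element ids mentioned above, mapped to hypothetical output keys.
TARGET_IDS = {"compat-desktop": "desktop", "compat-mobile": "mobile"}


class CompatTableParser(HTMLParser):
    """Collect the rows found inside elements carrying a target id."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = {}         # "desktop"/"mobile" -> list of rows
        self._key = None         # output key of the table being read
        self._container = None   # tag name of the element with the id
        self._depth = 0          # nesting depth of that tag name
        self._row = None         # cells of the current row
        self._cell = None        # text pieces of the current cell

    def handle_starttag(self, tag, attrs):
        if self._key is None:
            key = TARGET_IDS.get(dict(attrs).get("id"))
            if key:
                self._key, self._container, self._depth = key, tag, 1
                self.tables[key] = []
            return
        if tag == self._container:
            self._depth += 1
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if self._key is None:
            return
        if tag in ("th", "td") and self._cell is not None:
            # Join the text pieces and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[self._key].append(self._row)
            self._row = None
        if tag == self._container:
            self._depth -= 1
            if self._depth == 0:
                self._key = self._container = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def compat_tables_to_dict(html):
    """Return {table: {feature: {browser: support}}} for one page."""
    parser = CompatTableParser()
    parser.feed(html)
    parser.close()
    result = {}
    for key, rows in parser.tables.items():
        if not rows:
            continue
        header, *body = rows
        browsers = header[1:]    # first header cell labels the feature
        result[key] = {
            row[0]: dict(zip(browsers, row[1:])) for row in body if row
        }
    return result


# Invented sample markup; a real page may be structured differently.
SAMPLE = """
<div id="compat-desktop">
  <table class="compat-table">
    <tr><th>Feature</th><th>Chrome</th><th>Firefox (Gecko)</th></tr>
    <tr><td>Basic support</td><td>1.0</td><td>1.0 (1.7)</td></tr>
  </table>
</div>
"""

if __name__ == "__main__":
    print(json.dumps(compat_tables_to_dict(SAMPLE), indent=2))
```

A page with no mobile table simply produces no "mobile" key, which covers
the case of pages that have only one of the two tables.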

Or, we could operate on the original pages.  I'm a bit hesitant to run a
crawler without permission and a look at MDN's robots.txt, so if you've
already got the complete set of tables, that may be better.  (I recently
ran wget to download a student's work from their web site, and apparently
that violated their hosting site's robots.txt, and the site blocked me!
Don't want that to happen again...)
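
For the robots.txt check, a minimal sketch using urllib.robotparser from
Python's standard library might look like the following.  The helper name,
the user-agent string, and the example rules are hypothetical; the rules
are NOT MDN's actual robots.txt, and passing this check is no substitute
for asking permission.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Invented rules for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("https://example.org/docs/page",
                "https://example.org/private/page"):
        print(url, may_fetch(EXAMPLE_ROBOTS, "my-crawler", url))
```

Against a live site, RobotFileParser can also load the file itself via
set_url() followed by read(), instead of being handed the text.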

(I'd be equally inclined to use JavaScript and jQuery if I were set up to
use them outside a browser.  Just last week I finally found out why I could
not run Windows Script Host versions of VBScript and JavaScript -- my
anti-virus software had damaged the relevant registry entries.  Fixed now.
I'd be interested to hear if folks are running JS / jQuery as a script
outside the browser, and which JS they use.  But that would be off-topic
here.)

-- Pat

Received on Saturday, 11 January 2014 15:41:47 UTC