Compat Table Data from MDN

Hi, folks–

We originally thought that we had gotten all the compat table data from 
MDN, using Frozenice's importer [1], but it turns out that data was 
incomplete.

The importer relied on the feeds from various topic tags (e.g. HTML, 
CSS, SVG), but those feeds were limited to around 500 pages. So, Pat 
Tressel volunteered to create a crawler/scraper (based on Apache Nutch) 
that would retrieve the full list of pages that have compatibility 
information; she made some progress with this, but ran into deployment 
problems (if I understand correctly).

I talked today with Frozenice, Pat, and Renoir, and we came up with a 
plan that doesn't get all the data, but does get most of what we're 
interested in for the short term.

I visited the index pages for each of the main topics that seem to have 
compat tables: CSS (including properties and selectors), HTML Elements, 
HTML Attributes, SVG Elements, SVG Attributes, DOM Interfaces, 
JavaScript APIs, and JavaScript. I manually collected all the URLs to 
the pages for those topics (thanks to a clever console hack from Renoir 
that made it a snap), and collated them into a single list (see 
attached file 'page-list.txt').
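
For anyone who wants to repeat the collection step on another index 
page, here is a rough sketch of the general idea. To be clear, this is 
not Renoir's actual console snippet, which isn't reproduced here; the 
function name collectPageUrls and the example URL prefix are made up 
for illustration.

```javascript
// Hypothetical helper, NOT Renoir's actual console snippet.
// Given a list of href strings, keep the unique ones that start with
// `prefix`, dropping any #fragment so a page isn't listed twice.
function collectPageUrls(hrefs, prefix) {
  const seen = new Set();
  for (const href of hrefs) {
    const url = href.split("#")[0];
    if (url.startsWith(prefix)) {
      seen.add(url);
    }
  }
  return Array.from(seen).sort();
}

// In a browser console on an index page, usage might look like this
// (the prefix below is an assumed example, adjust per topic):
//   const hrefs = Array.from(document.querySelectorAll("a[href]"), a => a.href);
//   console.log(collectPageUrls(hrefs,
//     "https://developer.mozilla.org/en-US/docs/Web/CSS/").join("\n"));
```

The printed output can then be pasted straight into a text file, one 
URL per line.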

We can use this list of pages as a poor man's crawler for Frozenice's 
importer. (Fro says "I think we only need to get rid of 
https://github.com/webplatform/mdn-compat-importer/blob/master/index.js#L25 
and put the master list into reader.links instead".)
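
As a minimal sketch of what that change might involve, the snippet 
below reads a list file into an array of links. It assumes the list has 
one URL per line, and it assumes reader.links accepts a plain array of 
URL strings; I haven't checked either against the importer code. The 
names parsePageList and loadLinks are hypothetical.

```javascript
const fs = require("fs");

// Assumption: the page list has one URL per line. Blank lines and
// surrounding whitespace are ignored, and duplicates are dropped
// while keeping the original order.
function parsePageList(text) {
  const links = [];
  const seen = new Set();
  for (const line of text.split(/\r?\n/)) {
    const url = line.trim();
    if (url !== "" && !seen.has(url)) {
      seen.add(url);
      links.push(url);
    }
  }
  return links;
}

// Hypothetical wiring: the real shape of `reader` in the importer
// may differ from what is assumed here.
function loadLinks(path) {
  return parsePageList(fs.readFileSync(path, "utf8"));
}

// reader.links = loadLinks("page-list.txt");
```

Dropping duplicates up front should keep the importer from fetching 
the same page twice if a URL shows up under more than one topic.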

Pat did more research on long-tail MDN pages that may be candidates for 
other useful compat-table info (and also on pages that turned out to be 
dead ends), which I'm attaching as 'seed-page-list.txt'. We should look 
more at that for the next phases.

So, for the first phase, we'll go with the pages listed in the 
page-list attachment; please go through them and flag any pages that 
clearly don't belong, or any that are conspicuously missing.

Hopefully, we can have meaningful results in a week or two.

[1] https://github.com/webplatform/mdn-compat-importer

Regards-
-Doug

Received on Wednesday, 21 May 2014 23:30:06 UTC