RE: Converting MDN Compat Data to JSON

Well, the WebPlatform organization (https://github.com/webplatform) seems like the logical choice. There are already several projects there, and more will come.

 

I’m sure Doug will figure something out. :)

 

-fro

 

From: Max Polk [mailto:maxpolk@gmail.com] 
Sent: Sunday, 12 January 2014 15:14
To: Pat Tressel
Cc: Webplatform List; frozenice@frozenice.de; Doug Schepers
Subject: Re: Converting MDN Compat Data to JSON

 

We are in need of a place, perhaps in github, for all the tools we are building.  I have some Python and shell scripting in a private repo, but webplatform needs a dedicated space for it, and someone to keep it organized, so others can use the tools for future needs.

Is there already such a place?

On Jan 12, 2014 7:21 AM, Pat Tressel <ptressel@myuw.net> wrote:

Hi, David!
 

If you can wait until tomorrow, I can put a NodeJS version of the old script up, well, slightly improved. I’ll need to get some sleep first, as it’s already a bit late here in Germany. ;)

 

Sure.  ;-)

 

It basically generates a list of pages (via tags), grabs the compat tables in HTML format and converts them to a nice JS object. 

That’s already working for basic tables, I’ll add support for prefixes. 
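As a rough illustration of that conversion step, a table-to-object transform might look something like this. This is a sketch only: it uses a regex-based stand-in rather than the jQuery-like library the actual tool uses, handles only the simplest basic-support tables, and the function name and output shape are assumptions, not the tool's real format.

```javascript
// Sketch: turn one simple MDN-style compat table (as an HTML string)
// into a plain JS object. A real tool would use a proper parser;
// this regex-based stand-in only handles the simplest tables.
function parseCompatTable(html) {
  const rows = html.match(/<tr>[\s\S]*?<\/tr>/g) || [];
  const cellText = (row) =>
    (row.match(/<t[hd]>[\s\S]*?<\/t[hd]>/g) || []).map((c) =>
      c.replace(/<[^>]+>/g, "").trim()
    );
  // First row is the header (Feature, Chrome, Firefox, ...),
  // remaining rows are feature entries.
  const [header, ...body] = rows.map(cellText);
  return body.map((cells) => {
    const entry = { feature: cells[0], support: {} };
    header.slice(1).forEach((browser, i) => {
      entry.support[browser] = cells[i + 1];
    });
    return entry;
  });
}

// Example input, modeled loosely on MDN's basic-support tables.
const table =
  "<table><tr><th>Feature</th><th>Chrome</th><th>Firefox</th></tr>" +
  "<tr><td>Basic support</td><td>1.0</td><td>1.0</td></tr></table>";
console.log(parseCompatTable(table));
// → [ { feature: 'Basic support', support: { Chrome: '1.0', Firefox: '1.0' } } ]
```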

 

Why don't we look at it first, in case it is possible to avoid the work -- see below...
 

That <input> table is a beast, though!

 

At least <input> has a table that's in the expected format.  Here's one without a table:

https://developer.mozilla.org/en-US/docs/Web/API/Document

Ooo, this one has footnotes to its compatibility table...

https://developer.mozilla.org/en-US/docs/Web/API/Element

Could write out lists of pages without the table, and of pages that have the table but don't match expectations in some other way.

 

The main thing left to do here is converting this internal JS object into something that resembles the spec Doug linked... and some ‘minor’ things, like taming the <input> compat table Janet linked and maybe finding a better way to generate the list of pages (tag lists are limited to 500 entries; after removing duplicates from the lists for ['CSS', 'HTML5', 'API', 'WebAPI'], I counted about 1.2k pages).
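The duplicate-removal step mentioned there can be sketched as follows. The tag names come from the message above, but the URL shapes and the merge function itself are illustrative assumptions, not the tool's actual code:

```javascript
// Sketch: merge the per-tag page lists (each capped at 500 entries by
// the tag listing) into one list of unique page URLs.
function mergePageLists(listsByTag) {
  const seen = new Set();
  const merged = [];
  for (const tag of Object.keys(listsByTag)) {
    for (const url of listsByTag[tag]) {
      if (!seen.has(url)) {
        seen.add(url);
        merged.push(url);
      }
    }
  }
  return merged;
}

// Illustrative data: 4 entries across two tags, 3 unique pages.
const lists = {
  CSS: ["/docs/Web/CSS/color", "/docs/Web/API/Document"],
  API: ["/docs/Web/API/Document", "/docs/Web/API/Element"],
};
console.log(mergePageLists(lists).length); // 3
```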

 

I was about to ask "What does tag mean in this context?"  :D  But I see the "Tags" section (i.e. class = tag-list) at the bottom of pages.  So, you're not crawling the relevant parts of the site looking for pages with an <a> tag with href #Browser_compatibility?

No worries, no harm has been or will be done to the MDN servers! There are sensible delays between requests, and the tool has caches, so no MDN requests are actually needed to work on the HTML -> JS conversion or the conversion to WPD format. I’ll bundle a filled cache with gracefully requested data, so there should be enough data to work on for the start. :)
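The caching behavior described here might be sketched like so. The names are illustrative (the real tool's interface isn't shown in the thread), and the delay between live requests is omitted to keep the sketch synchronous:

```javascript
// Sketch: wrap whatever function performs the live MDN request so that
// repeated work on the HTML -> JS conversion never re-fetches a page.
// The real tool also sleeps between live requests; that part is
// omitted here for simplicity.
function withCache(fetchFn) {
  const cache = new Map();
  return function cachedFetch(url) {
    if (!cache.has(url)) cache.set(url, fetchFn(url));
    return cache.get(url);
  };
}

// Stand-in fetcher so the example runs offline.
let liveRequests = 0;
const getPage = withCache((url) => {
  liveRequests++;
  return "<html>stub for " + url + "</html>";
});

getPage("/docs/Web/API/Document");
getPage("/docs/Web/API/Document"); // second call is served from cache
console.log(liveRequests); // 1
```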

 

Pat, can I interest you in working on the HTML parser or the conversion to WPD format? The HTML parsing is really easy: it’s written in JS and uses a jQuery-like library, as you’ll see from my code.

 

Maybe I'm not understanding something, as it seems we should not need to do any parsing.  If we get a page with XMLHttpRequest, then its responseXML is a document object, which is already parsed.  We could use document methods or actual jQuery ;-) at that point to select the elements containing the compatibility info.  Or we could use XSLT to both select and reformat the info, but that would probably be harder than writing procedural code for the reformatting.  Ok, cancel the XSLT suggestion.

So, what am I missing?  Are we just using "parse" in different senses?
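The selection step Pat describes might look roughly like this. To stay runnable without a browser DOM or any parser library, this sketch works on the raw HTML string; with an actual responseXML document object you would instead use real document methods (e.g. getElementById("Browser_compatibility") and walk to the following table). The function name and heading markup are assumptions:

```javascript
// Sketch: given a fetched page's HTML, pull out just the table that
// follows the "Browser compatibility" heading anchor.
function extractCompatSection(html) {
  const start = html.indexOf('id="Browser_compatibility"');
  if (start === -1) return null; // page has no compat section at all
  const tableStart = html.indexOf("<table", start);
  const tableEnd = html.indexOf("</table>", tableStart);
  if (tableStart === -1 || tableEnd === -1) return null; // heading but no table
  return html.slice(tableStart, tableEnd + "</table>".length);
}

// Illustrative page fragment with the expected heading + table shape.
const page =
  '<h2 id="Browser_compatibility">Browser compatibility</h2>' +
  "<table><tr><th>Feature</th></tr></table>";
console.log(extractCompatSection(page) !== null); // true
console.log(extractCompatSection("<p>no compat info</p>")); // null
```

Pages like the Document one linked above, which have no table at all, would come back null here, which matches the idea of writing out lists of non-conforming pages.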

 

Catch me here or on IRC; if I’m not responding, I’m probably still sleeping! ;)

 

Your nick is in the channel but I won't ping you just yet, as it's 4am my time, so I'm about to go fall over....zzzz...

 

-- Pat

Received on Sunday, 12 January 2014 14:46:06 UTC