Re: Converting MDN Compat Data to JSON from Pat Tressel on 2014-01-12 (public-webplatform@w3.org from January 2014)

From: Pat Tressel <ptressel@myuw.net>
Date: Sun, 12 Jan 2014 04:21:16 -0800
To: David Kirstein <frozenice@frozenice.de>
Cc: Doug Schepers <schepers@w3.org>, WebPlatform Community <public-webplatform@w3.org>
Message-ID: <CABT-+2qZdC+LoZhNrRrBHqG8wDJEe+wgVWySL3nZh5fsUxWd3A@mail.gmail.com>

Hi, David!


> If you can wait until tomorrow, I can put a NodeJS version of the old
> script up, well slightly improved. I’ll need to get some sleep first, as
> it’s already a bit late here in Germany. ;)
>

Sure.  ;-)


> It basically generates a list of pages (via tags), grabs the compat tables
> in HTML format and converts them to a nice JS object.
>
> That’s already working for basic tables, I’ll add support for prefixes.
>

Why don't we look at it first, in case it is possible to avoid the work --
see below...


> That <input> table is a beast, though!
>

At least <input> has a table that's in the expected format.  Here's one
without a table:

https://developer.mozilla.org/en-US/docs/Web/API/Document

Ooo, this one has footnotes to its compatibility table...

https://developer.mozilla.org/en-US/docs/Web/API/Element

We could write out lists of pages without the table, and of pages that have
the table but don't match expectations in some other way.
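A rough sketch of that kind of report, in case it helps. Everything here is
an assumption for illustration: the function name classifyPages is made up,
and so are the guesses that compat tables carry a "compat-table" class and
that a well-formed one has a "Feature" header cell. It also uses crude string
matching rather than a real DOM, just to keep it self-contained.

```javascript
// Hypothetical helper: bucket pages by what their compat section looks like.
// The "compat-table" class and the "Feature" header are assumed markers,
// not something confirmed against the MDN markup.
function classifyPages(pages) {
  const report = { noTable: [], unexpected: [], ok: [] };
  for (const { url, html } of pages) {
    // Find the start of the first table that looks like a compat table.
    const start = html.search(/<table[^>]*class="[^"]*compat-table/);
    if (start === -1) {
      report.noTable.push(url);
      continue;
    }
    // Only inspect that table, not the rest of the page.
    const end = html.indexOf("</table>", start);
    const table = end === -1 ? html.slice(start) : html.slice(start, end);
    // Treat a missing "Feature" header cell as "not the expected format".
    const looksNormal = /<th[^>]*>\s*Feature\s*<\/th>/i.test(table);
    (looksNormal ? report.ok : report.unexpected).push(url);
  }
  return report;
}
```

The two lists of interest would then be report.noTable and report.unexpected.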


> The main thing left to do here is converting this internal JS object into
> something that resembles the spec Doug linked... and some ‘minor’ things,
> like taming the <input> compat table Janet linked and maybe find a better
> way to generate the list of pages (tag lists are limited to 500 entries,
> removing duplicates from the lists of ['CSS', 'HTML5', 'API', 'WebAPI'], I
> counted about 1.2k pages or so).
>

I was about to ask "What does tag mean in this context?"  :D  But I see the
"Tags" section (i.e. class = tag-list) at the bottom of pages.  So, you're
not crawling the relevant parts of the site looking for pages with an <a>
tag with href #Browser_compatibility?
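To make the two approaches concrete, here is a minimal sketch. The function
names are hypothetical, the input shapes are my assumption (a map of tag name
to an array of page URLs, and raw page HTML as a string), and the anchor
check is a plain regex, not a real DOM query.

```javascript
// Approach 1 (tag lists): union the per-tag URL lists, dropping duplicates.
// A Set keeps the first occurrence of each URL in insertion order.
function mergeTagLists(listsByTag) {
  return [...new Set(Object.values(listsByTag).flat())];
}

// Approach 2 (crawl): keep a page only if it links to, or defines, the
// #Browser_compatibility anchor.
function hasCompatSection(html) {
  return /(?:href="[^"]*#|id=")Browser_compatibility"/.test(html);
}
```

For example, mergeTagLists({ CSS: ["/a", "/b"], HTML5: ["/b", "/c"] }) gives
["/a", "/b", "/c"].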

> No worries, no harm has or will be done to the MDN servers! There are
> sensible delays between requests and the tool has caches, so there are
> actually no MDN-requests needed to work on the HTML -> JS conversion or the
> conversion to WPD format. I’ll bundle a filled cache with gracefully
> requested data, so there should be enough data to work on for the start. :)
>

> Pat, can I interest you in working on the HTML parser or the conversion to
> WPD format? The HTML parsing is really easy, as it’s written in JS and uses
> a jQuery-like library, as you will see from my code.
>

Maybe I'm not understanding something, as it seems we should not need to do
any parsing.  If we get a page with XMLHttpRequest, then its responseXML is
a document object, which is already parsed.  We could use document methods
or actual jQuery ;-) at that point to select the elements containing the
compatibility info.  Or we could use XSLT to both select and reformat the
info, but that would probably be harder than writing procedural code for
the reformatting.  Ok, cancel the XSLT suggestion.

So, what am I missing?  Are we just using "parse" in different senses?
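Roughly what I have in mind, as a sketch only. It assumes a browser
environment (where same-origin or CORS rules apply), the function name
fetchCompatTables is made up, and the "table.compat-table" selector is a
guess at the markup. One detail I am sure of: for an HTML page, responseXML
is only filled in when responseType is set to "document". The third argument
is just a seam so a stand-in can be passed where XMLHttpRequest doesn't
exist.

```javascript
// Fetch a page as an already-parsed document and hand back the elements
// that look like compat tables. XHR defaults to the browser's
// XMLHttpRequest; pass a stand-in constructor outside a browser.
function fetchCompatTables(url, onDone, XHR = XMLHttpRequest) {
  const xhr = new XHR();
  xhr.open("GET", url);
  // Without this, responseXML stays null for text/html responses.
  xhr.responseType = "document";
  xhr.onload = () => {
    const doc = xhr.responseXML;
    // Assumed selector; an empty list means nothing matched or no document.
    onDone(doc ? Array.from(doc.querySelectorAll("table.compat-table")) : []);
  };
  xhr.onerror = () => onDone([]);
  xhr.send();
}
```

From there it would just be a matter of walking the rows and cells of each
returned table element.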

> Catch me here or on IRC; if I’m not responding, I’m probably still
> sleeping! ;)
>

Your nick is in the channel but I won't ping you just yet, as it's 4am my
time, so I'm about to go fall over....zzzz...

-- Pat

Received on Sunday, 12 January 2014 12:21:43 UTC