
Re: Converting MDN Compat Data to JSON

From: Pat Tressel <ptressel@myuw.net>
Date: Fri, 9 May 2014 07:05:50 -0700
Message-ID: <CABT-+2rOk-wbDU2ZaS+-_HFhW8p26bxaGL-_NEGCpg0rE7-nUw@mail.gmail.com>
To: Doug Schepers <schepers@w3.org>
Cc: David Kirstein <frozenice@frozenice.de>, Renoir Boulanger <renoir@w3.org>, WebPlatform Community <public-webplatform@w3.org>

> It seems like Pat has run into problems with the Apache Nutch
> crawler/scraper that he proposed to use to gather the MDN compat data.

Is there a different Pat?  ;-)

I've been saying we should ditch Nutch since it became clear we'd still
have to read the data out of HBase even after we fetch it.  Renoir is very
busy and doesn't have time to set up a place for me to run it on Linux.

> Would you be able to modify your Node.js script to collect the data in a
> way that compensates for the issues we've encountered with Kuma?
> This is blocking progress on finishing the compat-table project, which we
> need to complete to announce the CSS property pages.

IMO we should do this from Node.js.  The big advantage is that the fetched
pages can be inserted directly into the cache.  A simple crawler is easy to
write (up to properly obeying the polite crawling rules, and those are not
that hard), *but* we don't even need to do that:  there are several Node.js
crawler packages -- I posted links to some promising ones in the other
thread, IIRC.
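To give a sense of how little the "polite" part takes, here is a minimal
sketch of a crawl queue that dedupes URLs and enforces a per-host delay.
This is hypothetical illustration code, not one of the packages linked in
the other thread, and the class and method names are my own invention:

```javascript
// Hypothetical sketch: a crawl frontier that enforces per-host politeness.
// Real crawling would also need robots.txt handling and error retries.
class PoliteQueue {
  constructor(delayMs) {
    this.delayMs = delayMs;        // minimum gap between hits to one host
    this.seen = new Set();         // URLs already queued (dedupe)
    this.queue = [];               // pending URLs, in discovery order
    this.lastFetch = new Map();    // host -> timestamp of last fetch
  }

  // Queue a URL unless we have already seen it.
  add(url) {
    if (!this.seen.has(url)) {
      this.seen.add(url);
      this.queue.push(url);
    }
  }

  // Return the next URL whose host has waited long enough, or null
  // if every queued host is still inside its politeness window.
  next(now) {
    for (let i = 0; i < this.queue.length; i++) {
      const host = new URL(this.queue[i]).host;
      const last = this.lastFetch.get(host);
      if (last === undefined || now - last >= this.delayMs) {
        this.lastFetch.set(host, now);
        return this.queue.splice(i, 1)[0];
      }
    }
    return null;
  }
}
```

The caller would loop, fetch whatever `next()` returns, and feed newly
discovered links back through `add()`; the existing crawler packages wrap
exactly this kind of loop for you.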

Note that even after the crawl, there are going to be some more non-standard
pages that the current parsing may not deal with -- I came across some
"interesting" ones while collecting a set of seed pages for the crawl.

-- Pat
Received on Friday, 9 May 2014 14:06:18 UTC
