Re: Converting MDN Compat Data to JSON

Hi, Pat–

Sorry, I must have misunderstood you; I thought you said you had gotten 
busy and weren't able to work on this for now. My apologies if I got the 
wrong message, and it would be great if you are able to carry this forward!

Since this is a one-time thing, I don't much care if it's Apache Nutch, 
a custom Node.js library, a premade crawler, or a gaggle of well-trained 
4-year-olds that actually does the crawling and scraping… and I don't 
much mind who does it, either… I'm just interested in getting it done 
quickly.

Let's go with whatever technique gets us the results by, say, the 
end of next week… do you think you can help there?

I'm sure we'll have funky output, but that's the next phase.

Regards-
-Doug

On 5/9/14 10:05 AM, Pat Tressel wrote:
>
>
>     It seems like Pat has run into problems with the Apache Nutch
>     crawler/scraper that he proposed to use to gather the MDN compat data.
>
>
> Is there a different Pat?  ;-)
>
> I've been saying we should ditch Nutch since it became clear we'd still
> have to read the data out of HBase even after we fetch it.  Renoir is
> very busy and doesn't have time to set up a place for me to run it on Linux.
>
>     Would you be able to modify your Node.js script to collect the data
>     in a way that compensates for the issues we've encountered with Kuma?
>
>     This is blocking progress on finishing the compat-table project,
>     which we need to complete to announce the CSS property pages.
>
>
> IMO we should do this from Node.js.  The big advantage is that the pages
> can be inserted in the cache.  A simple crawler is easy to write (up to
> properly obeying the polite crawling rules, and those are not that hard)
> *but* we don't even need to do that:  There are several Node.js crawler
> packages -- I posted links to some promising ones in the other thread, IIRC.
>
> Note that even after the crawl, there are going to be some more
> non-standard pages that may not be handled by the current parsing -- I came
> across some "interesting" ones while collecting a set of seed pages for the crawl.
>
> -- Pat

Received on Friday, 9 May 2014 16:38:17 UTC