Re: Web Crawl Regexes for RDFa from Niklas Lindström on 2011-11-13 (public-rdfa-wg@w3.org from November 2011)

From: Niklas Lindström <lindstream@gmail.com>
Date: Sun, 13 Nov 2011 18:09:49 +0100
To: Manu Sporny <msporny@digitalbazaar.com>
Cc: RDFa WG <public-rdfa-wg@w3.org>
Message-ID: <CADjV5jfV2g-0wAHSewFPfXX=fccOTZrJYXQu01BSa280m=z45w@mail.gmail.com>

Hi!

I've been thinking a bit about this. While we might get somewhere
using regexps, they have to get quite complex to handle the random
order in which attributes appear, combined with our need to match
*missing* attributes (such as "@typeof and @property on the same
element, but not any other RDFa property"). Also the engine must treat
them as multiline to handle elements with linebreaks between or within
attributes.
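
To illustrate the kind of complexity I mean, here is a rough Python
sketch of one such pattern, using lookaheads so that attribute order
does not matter. The list of "other" RDFa attributes and the name
find_typeof_property_tags are assumptions made for this example only.
It is also fooled by a ">" or by attribute-like text inside an
attribute value, which is part of the problem.

```python
import re

# Sketch only: match an opening tag that has both @typeof and @property,
# in any order, and none of the other RDFa attributes listed below.
# The "other attributes" list is an assumption for illustration.
TAG = re.compile(
    r"""<[a-zA-Z][\w:-]*                # element name
        (?=[^>]*\stypeof\s*=)           # @typeof somewhere in the tag
        (?=[^>]*\sproperty\s*=)         # @property somewhere in the tag
        (?![^>]*\s(?:about|rel|rev|resource|content|datatype)\s*=)
        [^>]*>                          # rest of the tag; [^>] also spans linebreaks
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_typeof_property_tags(html):
    """Return every opening tag in html that matches the pattern above."""
    return TAG.findall(html)
```

Since [^>] and \s both match newlines, no separate multiline flag is
needed here, but every further question we want to ask would need its
own pattern along these lines.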

I'm not saying it can't be done, but I'm wondering if the EC2 Hadoop
setup can be leveraged to do something a bit more structured.

The Amazon Elastic MapReduce tutorials mention means for running
Python, Ruby or PHP in the map step, so I expect it might be. Perhaps
using xsltproc (with the "--html" option, or with a tidy in front of
it) is possible as well. I chose that (since it is very fast) to make a
simple example. The result is an XSLT which at the moment creates TSV
lines with statistics for each element using RDFa (attributes used, is
there an active hanging rel, etc.). This could be piped to a reduce
algorithm for computing answers to the questions we need, or be
adapted to something more directly usable.
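
For comparison, the same map/reduce idea can be sketched in Python,
which the tutorials say can run in the map step. This is not the XSLT
from the gist: it uses the standard library's html.parser instead, only
records which RDFa attributes each element uses (it does not track
hanging rels), and the attribute list and the names map_document and
reduce_lines are assumptions made for this example.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed set of RDFa attributes to report on; adjust as needed.
RDFA_ATTRS = {"about", "content", "datatype", "prefix", "property",
              "rel", "resource", "rev", "typeof", "vocab"}


class RDFaStats(HTMLParser):
    """Collects one (tag, attribute combination) row per element using RDFa."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        used = sorted({name for name, _ in attrs if name in RDFA_ATTRS})
        if used:
            self.rows.append((tag, ",".join(used)))


def map_document(html):
    """Map step: return one TSV line per element that uses RDFa."""
    parser = RDFaStats()
    parser.feed(html)
    parser.close()
    return ["%s\t%s" % row for row in parser.rows]


def reduce_lines(lines):
    """Reduce step: count how often each attribute combination occurs."""
    counts = Counter()
    for line in lines:
        _tag, combo = line.rstrip("\n").split("\t")
        counts[combo] += 1
    return counts
```

In a streaming job, map_document would be fed each page and its lines
written to stdout, with reduce_lines applied to the collected output.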

I put this as a gist here:

    https://gist.github.com/1362314

(I've run the script against a local copy of the RDFa testsuite,
downloaded using the RDFLib test script [1].)

Just a thought.

Best regards,
Niklas

[1]: http://code.google.com/p/rdflib/source/browse/test/rdfa/run_w3c_rdfa_testsuite.py

On Tue, Nov 8, 2011 at 5:40 PM, Manu Sporny <msporny@digitalbazaar.com> wrote:
> I started a page for the new Web Crawl Regexes that will measure RDFa usage
> in the wild, and give us a better idea if the RDFa Lite changes we're
> thinking of making will break existing content out there:
>
> The page is hosted in the Data Driven Standards WG wiki, so you'll have to
> join that group if you want to edit the wiki:
>
> http://www.w3.org/community/data-driven-standards/wiki/Data-in-html-crawl-design
>
> There isn't much there right now, but it's a start. The plan is to turn
> these regexes into a Hadoop map/reduce job and run it on the Amazon Elastic
> Map Reduce infrastructure on the Common Crawl dataset (5 billion web pages,
> tens of terabytes of web page data).
>
> -- manu
>
> --
> Manu Sporny (skype: msporny, twitter: manusporny)
> Founder/CEO - Digital Bazaar, Inc.
> blog: Standardizing Payment Links - Why Online Tipping has Failed
> http://manu.sporny.org/2011/payment-links/
>
>

Received on Sunday, 13 November 2011 17:10:46 UTC