- From: Niklas Lindström <lindstream@gmail.com>
- Date: Sun, 13 Nov 2011 18:09:49 +0100
- To: Manu Sporny <msporny@digitalbazaar.com>
- Cc: RDFa WG <public-rdfa-wg@w3.org>
Hi!

I've been thinking a bit about this. While we might get somewhere using regexps, they would have to get quite complex to handle the arbitrary order in which attributes appear, combined with our need to match *missing* attributes (such as "@typeof and @property on the same element, but not any other RDFa property"). Also, the engine must treat them as multiline to handle elements with line breaks between or within attributes.

I'm not saying it can't be done, but I'm wondering if the EC2 Hadoop setup can be leveraged to do something a bit more structured. The Amazon Elastic MapReduce tutorials mention means for running Python, Ruby or PHP in the map step, so I expect it might be. Perhaps using xsltproc (with the "--html" option, or with a tidy in front of it) is possible as well.

I chose xsltproc (since it is very fast) to make a simple example. The result is an XSLT which at the moment creates TSV lines with statistics for each element using RDFa (attributes used, whether there is an active hanging rel, etc.). This could be piped to a reduce algorithm for computing answers to the questions we need, or be adapted to something more directly usable.

I put this as a gist here:

https://gist.github.com/1362314

(I've run the script against a local copy of the RDFa testsuite, downloaded using the RDFLib test script [1].)

Just a thought.

Best regards,
Niklas

[1]: http://code.google.com/p/rdflib/source/browse/test/rdfa/run_w3c_rdfa_testsuite.py

On Tue, Nov 8, 2011 at 5:40 PM, Manu Sporny <msporny@digitalbazaar.com> wrote:
> I started a page for the new Web Crawl Regexes that will measure RDFa usage
> in the wild, and give us a better idea if the RDFa Lite changes we're
> thinking of making will break existing content out there:
>
> The page is hosted in the Data Driven Standards WG wiki, so you'll have to
> join that group if you want to edit the wiki:
>
> http://www.w3.org/community/data-driven-standards/wiki/Data-in-html-crawl-design
>
> There isn't much there right now, but it's a start. The plan is to turn
> these regexes into a Hadoop map/reduce job and run it on the Amazon Elastic
> Map Reduce infrastructure on the Common Crawl dataset (5 billion web pages,
> tens of terabytes of web page data).
>
> -- manu
>
> --
> Manu Sporny (skype: msporny, twitter: manusporny)
> Founder/CEO - Digital Bazaar, Inc.
> blog: Standardizing Payment Links - Why Online Tipping has Failed
> http://manu.sporny.org/2011/payment-links/
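
As a rough illustration of the structured map/reduce idea in the reply above, here is a minimal Python sketch. It is not the XSLT from the gist and makes several assumptions: the names RDFA_ATTRS, RDFaElementMapper, map_page and reduce_lines are hypothetical, the attribute list is a guess at what one might want to count, and the TSV layout (tag name, then the sorted RDFa attribute names) is invented for the example. The map step parses each page with the standard library HTML parser, so attribute order and line breaks inside a tag do not matter; the reduce step counts how often each attribute combination occurs.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed set of attributes to track; adjust to the questions being asked.
RDFA_ATTRS = frozenset([
    "about", "content", "datatype", "prefix", "property",
    "rel", "resource", "rev", "typeof", "vocab",
])


class RDFaElementMapper(HTMLParser):
    """Collects one TSV line per start tag that carries a tracked attribute."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        used = sorted({name for name, _ in attrs if name in RDFA_ATTRS})
        if used:
            self.lines.append(f"{tag}\t{','.join(used)}")


def map_page(html):
    """Map step: return TSV lines 'tag<TAB>attr1,attr2,...' for one page."""
    mapper = RDFaElementMapper()
    mapper.feed(html)
    mapper.close()
    return mapper.lines


def reduce_lines(lines):
    """Reduce step: count elements per exact attribute combination."""
    counts = Counter()
    for line in lines:
        _tag, used = line.split("\t")
        counts[used] += 1
    return counts


if __name__ == "__main__":
    page = (
        '<div typeof="foaf:Person"\n'
        '     property="foaf:knows">'
        '<span property="foaf:name">Alice</span></div><p>no RDFa here</p>'
    )
    for combo, n in sorted(reduce_lines(map_page(page)).items()):
        print(f"{combo}\t{n}")
```

Because the reduce key is the exact, sorted attribute set, a question such as "@typeof and @property on the same element, but no other tracked attribute" becomes a lookup of the key "property,typeof". Note that attributes like rel and content also occur on plain HTML elements, so a real job would likely need a stricter notion of "using RDFa" than this sketch has.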
Received on Sunday, 13 November 2011 17:10:46 UTC