Web Crawl Regexes for RDFa

I started a page for the new Web Crawl Regexes that will measure RDFa 
usage in the wild and give us a better idea of whether the RDFa Lite 
changes we're thinking of making will break existing content out there.

The page is hosted in the Data Driven Standards WG wiki, so you'll have 
to join that group if you want to edit the wiki:

http://www.w3.org/community/data-driven-standards/wiki/Data-in-html-crawl-design

There isn't much there right now, but it's a start. The plan is to turn 
these regexes into a Hadoop map/reduce job and run it on Amazon Elastic 
MapReduce against the Common Crawl dataset (5 billion web pages, tens 
of terabytes of web page data).
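
To give a feel for the map/reduce shape of such a job, here is a minimal 
Python sketch. It is not the set of regexes from the wiki page: the 
attribute list, the two patterns, and the function names (map_page, 
reduce_counts, count_rdfa_usage) are all assumptions made up for 
illustration, and the regexes are deliberately rough (they can be fooled 
by attribute values that contain "=" or ">"). It runs in-process rather 
than on Hadoop.

```python
import re
from collections import Counter

# Assumed, illustrative patterns; the real crawl regexes may differ.
# TAG_RE grabs each opening tag; ATTR_RE looks for RDFa attribute names in it.
TAG_RE = re.compile(r"<[a-zA-Z][^>]*>")
ATTR_RE = re.compile(
    r"\s(about|property|typeof|vocab|prefix|resource|datatype)\s*=",
    re.IGNORECASE,
)


def map_page(html):
    """Map step: emit (attribute, 1) once per distinct attribute on a page."""
    seen = set()
    for tag in TAG_RE.findall(html):
        for match in ATTR_RE.finditer(tag):
            seen.add(match.group(1).lower())
    for attr in sorted(seen):
        yield attr, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values per attribute name."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_rdfa_usage(pages):
    """Run map then reduce over an iterable of HTML strings."""
    pairs = (pair for html in pages for pair in map_page(html))
    return reduce_counts(pairs)
```

Calling count_rdfa_usage() on a list of HTML strings returns the number 
of pages on which each attribute appears at least once. Counting pages 
rather than raw occurrences is a design choice of this sketch, so that a 
single heavily marked-up page does not dominate the totals.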

-- manu

-- 
Manu Sporny (skype: msporny, twitter: manusporny)
Founder/CEO - Digital Bazaar, Inc.
blog: Standardizing Payment Links - Why Online Tipping has Failed
http://manu.sporny.org/2011/payment-links/

Received on Tuesday, 8 November 2011 16:40:57 UTC