
Web Crawl Regexes for RDFa

From: Manu Sporny <msporny@digitalbazaar.com>
Date: Tue, 08 Nov 2011 11:40:17 -0500
Message-ID: <4EB95B71.8070306@digitalbazaar.com>
To: RDFa WG <public-rdfa-wg@w3.org>
I started a page for the new Web Crawl Regexes that will measure RDFa 
usage in the wild and give us a better idea of whether the RDFa Lite 
changes we're thinking of making will break existing content:

http://www.w3.org/community/data-driven-standards/wiki/Data-in-html-crawl-design

The page is hosted in the Data Driven Standards Community Group wiki, 
so you'll have to join that group if you want to edit it.

There isn't much there right now, but it's a start. The plan is to turn 
these regexes into a Hadoop map/reduce job and run it on the Amazon 
Elastic Map Reduce infrastructure on the Common Crawl dataset (5 billion 
web pages, tens of terabytes of web page data).
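
To make the map/reduce shape concrete, here is a minimal in-process 
sketch in Python. It is not the actual job: the regexes, the feature 
names, the function names, and the sample pages are all assumptions 
made up for illustration, and the real patterns are the ones on the 
wiki page. The map step emits one (feature, 1) pair per page where a 
pattern matches, and the reduce step sums those pairs per feature.

```python
import re
from collections import Counter

# Hypothetical patterns, for illustration only. They are NOT the regexes
# from the wiki page; they just look for an attribute inside a start tag.
PATTERNS = {
    "rdfa_property": re.compile(r'<[^>]+\sproperty\s*=\s*["\']', re.IGNORECASE),
    "rdfa_typeof": re.compile(r'<[^>]+\stypeof\s*=\s*["\']', re.IGNORECASE),
    "rdfa_about": re.compile(r'<[^>]+\sabout\s*=\s*["\']', re.IGNORECASE),
}


def map_page(html):
    """Map step: emit (feature, 1) once per feature found in the page."""
    for name, pattern in PATTERNS.items():
        if pattern.search(html):
            yield name, 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each feature key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run(pages):
    """Chain map and reduce over an iterable of HTML strings."""
    pairs = (pair for html in pages for pair in map_page(html))
    return reduce_counts(pairs)


# Made-up sample input standing in for crawled pages.
SAMPLE_PAGES = [
    '<div about="#me" typeof="foaf:Person">'
    '<span property="foaf:name">Alice</span></div>',
    '<p>No structured data here.</p>',
    '<span property="dc:title">Title</span>',
]

if __name__ == "__main__":
    print(run(SAMPLE_PAGES))
```

Counting pages rather than individual matches keeps the reduce output 
easy to read as "how many pages use this feature". On Hadoop the same 
two functions would be split into separate mapper and reducer 
programs, with the framework handling the grouping by key.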

-- manu

-- 
Manu Sporny (skype: msporny, twitter: manusporny)
Founder/CEO - Digital Bazaar, Inc.
blog: Standardizing Payment Links - Why Online Tipping has Failed
http://manu.sporny.org/2011/payment-links/
Received on Tuesday, 8 November 2011 16:40:57 GMT
