
Web Crawl Regexes for RDFa

From: Manu Sporny <msporny@digitalbazaar.com>
Date: Tue, 08 Nov 2011 11:40:17 -0500
Message-ID: <4EB95B71.8070306@digitalbazaar.com>
To: RDFa WG <public-rdfa-wg@w3.org>
I started a page for the new Web Crawl Regexes that will measure RDFa 
usage in the wild and give us a better idea of whether the RDFa Lite 
changes we're thinking of making will break existing content out there:

The page is hosted in the Data Driven Standards WG wiki, so you'll have 
to join that group if you want to edit the wiki:


There isn't much there right now, but it's a start. The plan is to turn 
these regexes into a Hadoop map/reduce job and run it on Amazon Elastic 
MapReduce against the Common Crawl dataset (5 billion web pages, tens of 
terabytes of web page data).
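
As a rough illustration of the map/reduce shape described above, here is a
minimal Python sketch that counts how many pages use each RDFa attribute.
Everything in it is an assumption made for illustration: the attribute list,
the regex patterns, and the names map_page and reduce_counts are hypothetical
and are not the regexes from the wiki page, and the job runs in-process rather
than on Hadoop.

```python
import re
from collections import Counter

# Hypothetical attribute list (assumption): the RDFa Lite attributes.
# The real regexes on the wiki page may look quite different.
RDFA_ATTRS = ("vocab", "typeof", "property", "resource", "prefix")

# One pattern per attribute: the attribute name preceded by whitespace
# inside a tag and followed by "=". Deliberately crude, like any
# regex-based scan of HTML.
PATTERNS = {
    attr: re.compile(r"<[^>]*\s" + attr + r"\s*=", re.IGNORECASE)
    for attr in RDFA_ATTRS
}


def map_page(html):
    """Map step: emit (attribute, 1) once for each attribute the page uses."""
    for attr, pattern in PATTERNS.items():
        if pattern.search(html):
            yield (attr, 1)


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per attribute."""
    totals = Counter()
    for attr, count in pairs:
        totals[attr] += count
    return totals


def run_job(pages):
    """Run map then reduce over an in-memory list of pages."""
    emitted = (pair for page in pages for pair in map_page(page))
    return reduce_counts(emitted)


if __name__ == "__main__":
    sample_pages = [
        '<div vocab="http://schema.org/" typeof="Person">'
        '<span property="name">Manu</span></div>',
        "<p>No markup here</p>",
        '<span property="title">Example</span>',
    ]
    for attr, pages_using in sorted(run_job(sample_pages).items()):
        print(attr, pages_using)
```

Counting pages rather than individual occurrences keeps the reducer a plain
sum, which is the part that would carry over to a real Hadoop job; the
in-memory run_job helper only stands in for the framework.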

-- manu

Manu Sporny (skype: msporny, twitter: manusporny)
Founder/CEO - Digital Bazaar, Inc.
blog: Standardizing Payment Links - Why Online Tipping has Failed
Received on Tuesday, 8 November 2011 16:40:57 UTC
