- From: Manu Sporny <msporny@digitalbazaar.com>
- Date: Sun, 20 Nov 2011 14:30:06 -0500
- To: Data-Driven Standards <public-data-driven-standards@w3.org>
Hi All,

I thought I'd outline what the RDF Web Applications Working Group (working on RDFa), the Data in HTML Task Force (working on RDFa and Microdata) and the HTML Working Group (working on RDFa and Microdata) hope to accomplish over the next several weeks via the Data Driven Standards Community Group. The background for why this group exists can be found here:

http://www.w3.org/community/data-driven-standards/2011/11/07/launch/

The first goal we have for this group is to take some measurements on how many pages use RDFa, Microdata and Microformats, how often each attribute/format is used, and which features of each language are used most and least often. This data will help guide how RDFa and Microdata should be modified before they become official W3C Recommendations.

I have been in touch with the folks at Common Crawl, who will be helping us do the first crawl. For those not familiar with Common Crawl, this is a good introduction to the non-profit service:

http://www.commoncrawl.org/

At the moment, Common Crawl has a fairly fresh index of the Web from 2011. The data is roughly 40TB and is stored on Amazon S3; to access it, you have to use an Amazon EC2 instance to read from S3 storage. Typically, a map-reduce job is run on the data via Hadoop (on Amazon EC2 or Amazon Elastic MapReduce). Processing 40TB of data costs between $100 and $200 per run, so it's important for us to have our ducks in a row before we do the full crawl. My company, Digital Bazaar, will fund the initial crawl (with, hopefully, a few others pitching in for the first and subsequent crawls).

The first crawl will look for the frequency of Microdata/RDFa/Microformats documents on the Web, along with usage data for each attribute/Microformats class. This crawl will help determine whether RDFa Lite 1.1 is going to break backwards compatibility in an unacceptable way and will give us some usage figures on all three languages. A wiki has been created to outline the types of tests we intend to run:

http://www.w3.org/community/data-driven-standards/wiki/Data-in-html-crawl-design

We are currently waiting on some very simple example source code from the Common Crawl folks that will help us write the Hadoop map/reduce job. While we wait on that, we're going to try to get the 80legs.com folks involved as well. Having two data sources will help us understand how much of the Web needs to be crawled in order to get solid usage data.

For those interested in how this stuff looks in practice (and who know Python), Michael Noll has a good intro to the subject here:

http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

So, I guess the first question is - are there any Hadoop gurus on this mailing list?

-- manu

--
Manu Sporny (skype: msporny, twitter: manusporny)
Founder/CEO - Digital Bazaar, Inc.
blog: Standardizing Payment Links - Why Online Tipping has Failed
http://manu.sporny.org/2011/payment-links/
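[Editor's sketch, not part of the original message: to make the proposed measurement concrete, here is a minimal Hadoop Streaming mapper in Python in the spirit of the tutorial linked above. It assumes each stdin line is the raw HTML of one crawled page, which glosses over the fact that Common Crawl's data is stored as ARC records on S3 and would need a proper record reader in front of it. The attribute and class lists, key names, and naive regex matching are all illustrative assumptions, not the test design from the wiki.]

#!/usr/bin/env python
"""Hypothetical Hadoop Streaming mapper (mapper.py) - illustrative only.

Emits tab-separated (key, count) pairs for a summing reducer:
  docs.total, docs.rdfa, docs.microdata, docs.microformats  -> document counts
  rdfa.<attr>, microdata.<attr>, microformats.<class>       -> usage counts
"""
import re
import sys

# Illustrative attribute/class lists; not the group's final list.
RDFA_ATTRS = ['about', 'property', 'typeof', 'resource', 'prefix', 'vocab', 'datatype']
MICRODATA_ATTRS = ['itemscope', 'itemprop', 'itemtype', 'itemid', 'itemref']
MICROFORMAT_CLASSES = ['vcard', 'vevent', 'hentry', 'hreview', 'hresume', 'hrecipe']

def emit(key, count):
    # Hadoop Streaming expects one tab-separated key/value pair per line.
    sys.stdout.write('%s\t%d\n' % (key, count))

for line in sys.stdin:
    # Naive string/regex matching, not a real HTML parse; good enough for a sketch.
    html = line.lower()
    emit('docs.total', 1)
    found = {'rdfa': 0, 'microdata': 0, 'microformats': 0}

    for attr in RDFA_ATTRS:
        n = len(re.findall(r'\b%s\s*=' % attr, html))
        if n:
            emit('rdfa.%s' % attr, n)
            found['rdfa'] += n

    for attr in MICRODATA_ATTRS:
        n = len(re.findall(r'\b%s\b' % attr, html))
        if n:
            emit('microdata.%s' % attr, n)
            found['microdata'] += n

    for cls in MICROFORMAT_CLASSES:
        n = len(re.findall(r'class\s*=\s*["\'][^"\']*\b%s\b' % cls, html))
        if n:
            emit('microformats.%s' % cls, n)
            found['microformats'] += n

    for lang, n in found.items():
        if n:
            emit('docs.%s' % lang, 1)

[A matching reducer would simply sum the counts per key, exactly as in the word-count example from the tutorial linked above; dividing the per-language document counts by docs.total then gives the frequency figures the crawl is after.]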
Received on Sunday, 20 November 2011 19:30:37 UTC