Data in HTML Crawl

Hi All,

I thought I'd outline what the RDF Web Applications Working Group
(working on RDFa), the Data in HTML Task Force (working on RDFa and
Microdata) and the HTML Working Group (working on RDFa and Microdata)
hope to accomplish over the next several weeks via the Data Driven
Standards Community Group.

The background for why this group exists can be found here:

http://www.w3.org/community/data-driven-standards/2011/11/07/launch/

The first goal for this group is to take some measurements of how many
pages use RDFa, Microdata and Microformats, how often each
attribute/format is used, and which features of each language are used
most and least often. This data will help guide how RDFa and Microdata
should be modified before they become official W3C Recommendations.

I have been in touch with the folks at Common Crawl, who will be helping
us to do the first crawl. For those not familiar with Common Crawl, this
is a good introduction to the non-profit service:

http://www.commoncrawl.org/

At the moment, Common Crawl has a fairly fresh index of the Web from
2011. The dataset is roughly 40TB and is stored on Amazon S3. To access
the data, you have to read it from S3 using an Amazon EC2 instance.
Typically, a map-reduce job is run on the data via Hadoop (on Amazon
EC2 or Amazon Elastic MapReduce). Processing all 40TB costs between
$100 and $200 per run, so it's important for us to have our ducks in a
row before we do the full crawl.
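
For concreteness, here is a minimal sketch (in Python, using the boto
library) of the kind of S3 read an EC2 instance would do before we wire
anything into Hadoop. The bucket name and key prefix below are
placeholders rather than the real Common Crawl locations; those details
will come from the Common Crawl folks:

    import boto

    # Placeholder names -- the real bucket and key prefix will come from
    # the Common Crawl folks; these are purely illustrative.
    BUCKET_NAME = 'example-common-crawl-bucket'
    KEY_PREFIX = 'crawl-segments/'

    # Uses the AWS credentials configured on the EC2 instance.
    conn = boto.connect_s3()
    bucket = conn.get_bucket(BUCKET_NAME)

    # List a few crawl files and pull one down locally so we can
    # experiment before paying for a full Hadoop run.
    for key in bucket.list(prefix=KEY_PREFIX):
        print key.name, key.size
        key.get_contents_to_filename('/tmp/sample-crawl-file')
        break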

My company, Digital Bazaar, will fund the initial crawl (hopefully with
a few others pitching in for the first and subsequent crawls).

The first crawl will measure the frequency of
Microdata/RDFa/Microformats documents on the Web, along with usage data
for each attribute/Microformats class. It will help determine whether
RDFa Lite 1.1 is going to break backwards compatibility in an
unacceptable way and will give us some usage figures for all three
languages. A wiki has been created to outline the types of tests we
intend to run:

http://www.w3.org/community/data-driven-standards/wiki/Data-in-html-crawl-design
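
To make those tests a bit more concrete, here is a rough sketch of the
kind of per-document check I have in mind (in Python). The attribute
and class lists are illustrative and non-exhaustive, naive string
matching will over-count (the markers can appear in scripts, comments
or unrelated attribute values), and a real job would want to parse the
markup; the actual lists and rules should come out of the wiki
discussion:

    import re

    # Illustrative (non-exhaustive) marker lists; the real lists should
    # fall out of the test design on the wiki.
    RDFA_ATTRS = ['property', 'typeof', 'about', 'vocab', 'prefix',
                  'resource']
    MICRODATA_ATTRS = ['itemscope', 'itemprop', 'itemtype', 'itemid',
                       'itemref']
    MICROFORMAT_CLASSES = ['vcard', 'vevent', 'hentry', 'hreview',
                           'hrecipe']

    def classify(html):
        """Return (counter-name, count) pairs for one HTML document."""
        counts = []
        lowered = html.lower()
        for attr in RDFA_ATTRS:
            hits = len(re.findall(r'\b%s\s*=' % attr, lowered))
            if hits:
                counts.append(('rdfa.' + attr, hits))
        for attr in MICRODATA_ATTRS:
            hits = len(re.findall(r'\b%s\b' % attr, lowered))
            if hits:
                counts.append(('microdata.' + attr, hits))
        for cls in MICROFORMAT_CLASSES:
            pattern = r'class\s*=\s*["\'][^"\']*\b%s\b' % cls
            hits = len(re.findall(pattern, lowered))
            if hits:
                counts.append(('microformats.' + cls, hits))
        return counts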

We are currently waiting on some very simple example source code from
the Common Crawl folks that will help us write the Hadoop map/reduce
job. While we are waiting on that, we're going to try to get the
80legs.com folks involved as well. Having two data sources will help us
understand how much of the Web needs to be crawled in order to get
solid usage data.

For those interested in how this stuff looks in practice (and who know
Python), Michael Noll has a good intro to the subject here:

http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
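
Following that pattern, the whole job could be as small as a mapper
that emits one line per marker found and a reducer that sums the
counts. A sketch, assuming the classify() helper from earlier in this
mail is saved as classify.py and shipped with the job, and assuming
(purely for illustration) one HTML document per input line; the real
record format will depend on how Common Crawl packages the pages:

    #!/usr/bin/env python
    # mapper.py -- emits "counter-name<TAB>count" for each marker found.
    import sys
    from classify import classify  # the sketch above, as classify.py

    for line in sys.stdin:
        for name, hits in classify(line):
            print '%s\t%d' % (name, hits)

    #!/usr/bin/env python
    # reducer.py -- sums the counts per marker name. Hadoop Streaming
    # sorts the mapper output by key before it reaches the reducer.
    import sys

    current_key, total = None, 0
    for line in sys.stdin:
        key, count = line.rstrip('\n').split('\t', 1)
        if key != current_key:
            if current_key is not None:
                print '%s\t%d' % (current_key, total)
            current_key, total = key, 0
        total += int(count)
    if current_key is not None:
        print '%s\t%d' % (current_key, total)

The two scripts would then be handed to the Hadoop streaming jar with
the usual -input/-output/-mapper/-reducer options, exactly as the
tutorial shows.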

So, I guess the first question is: are there any Hadoop gurus on this
mailing list?

-- manu

-- 
Manu Sporny (skype: msporny, twitter: manusporny)
Founder/CEO - Digital Bazaar, Inc.
blog: Standardizing Payment Links - Why Online Tipping has Failed
http://manu.sporny.org/2011/payment-links/
