Re: ANN: - Offering 3.2 billion quads of current RDFa, Microdata and Microformat data extracted from 65.4 million websites

Hi Dan, Peter:

I think we are in agreement - the CommonCrawl project is nice and useful. I am not questioning that at all.
However, Web Data Commons uses the CommonCrawl corpus, extracts all structured data found, and then

1. advocates the results as a kind of definitive statistics on "data on the Web" [1, 2], and
2. fuels hopes that others can rather directly use the extracted data for meaningful applications; quote:

"Web Data Commons thus enables you to use the data without needing to crawl the Web yourself."

I claim that

1. making any meaningful statements about the amount and structure of RDFa, Microdata, and Microformat markup, or
2. building applications on top of that data

requires a fundamentally different crawling approach, one that reaches the deep detail pages with potentially low PageRank, and that the current CommonCrawl is unsuited for this, because it typically misses exactly those deep detail pages that contain the interesting markup.

A good indicator of the bias in the data is that it finds
- 3.4 million instances of gd:Organization
but just
- 619 k instances of gr:Product (slide 8). eBay alone has 1 million products with markup, BestBuy an additional 450 k, Wayfair 2 million, 15 million, and so forth, and one would typically expect 100 to 10,000 times more products than companies in such a dataset.
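To make the bias concrete, here is a back-of-the-envelope check using the figures above; the 100x to 10,000x products-per-company range is the heuristic stated in the text, not a measured value:

```python
# Rough check of the class-ratio bias discussed above.
# Instance counts are taken from the message (slide 8); the expected
# products-per-company range is the 100x-10,000x heuristic from the text.

organizations = 3_400_000   # gd:Organization instances found
products = 619_000          # gr:Product instances found

observed_ratio = products / organizations   # products per organization
expected_min = 100                          # lower bound of the heuristic

# How far the observed ratio falls below even the low end of expectations:
shortfall = expected_min / observed_ratio

print(f"observed products per organization: {observed_ratio:.2f}")
print(f"at least {shortfall:,.0f}x below the expected lower bound")
```

Even against the most conservative end of the heuristic, the crawl captures several hundred times fewer products than one would expect, which is consistent with the deep detail pages being missed.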

Both CommonCrawl and Web Data Commons are useful exercises, but I sorely miss any discussion of the limited usefulness of the data, or any mention of the limitations of the absolute numbers, in the paper and on the Web page.



On Apr 17, 2012, at 7:22 PM, Dan Brickley wrote:

> On 17 April 2012 18:56, Peter Mika <> wrote:
>> Hi Martin,
>> It's not as simple as that, because PageRank is a probabilistic algorithm (it includes random jumps between pages), and I wouldn't expect a site to include 2M links on a single page (that would be one very long webpage).
>> But again to reiterate the point, search engines would want to make sure that they index the main page more than they would want to index the detail pages.
>> You can do a site query to get a rough estimate of the ranking without a query string:
>> You will see that most of the pages are category pages. If you go to 2nd page and onward you will see an estimate of 1900 pages indexed.
>> Of course, I agree with you that a search engine focused on structured data, especially if domain-specific, might want to reach all the pages and index all the data. I'm just saying that current search engines don't, and CommonCrawl is mostly trying to approximate them (if I understand correctly what they are trying to do).
> According to
> "What do you intend to do with the crawled content?
> Our mission is to democratize access to web information by producing
> and maintaining an open repository of web crawl data that is
> universally accessible. We store the crawl data on Amazon's S3
> service, allowing it to be bulk downloaded as well as directly
> accessed for map-reduce processing in EC2."
> No mention of search as such. I'd imagine they're open to suggestions,
> and that the project (and crawl) could take various paths as it
> evolves. (With corresponding influence on the stats...).
> Our problem here is in figuring out what can be taken from such stats
> to help guide linked data vocabulary creation and management. Maybe
> others will do deeper focussed crawls, who knows? But it's great to
> see this focus on stats lately, I hope others have more to share.
> Dan

martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

phone:   +49-(0)89-6004-4217
fax:     +49-(0)89-6004-4620
www: (group) (personal)
skype:   mfhepp 
twitter: mfhepp

Check out GoodRelations for E-Commerce on the Web of Linked Data!
* Project Main Page:

Received on Tuesday, 17 April 2012 18:56:56 UTC