
Re: ANN: WebDataCommons.org - Offering 3.2 billion quads of current RDFa, Microdata and Microformat data extracted from 65.4 million websites

From: Martin Hepp <martin.hepp@ebusiness-unibw.org>
Date: Mon, 26 Mar 2012 16:11:44 +0200
Cc: <public-vocabs@w3.org>
Message-Id: <D3826532-DCC0-4600-92E5-AC2252537E98@ebusiness-unibw.org>
To: Chris Bizer <chris@bizer.de>
Dear Chris, all:

Thanks for your hard work on this; it will surely provide input for further research.

I want to stress, though, that the absolute numbers found are NOT representative of the Web as a whole, likely because of the limited coverage of the CommonCrawl corpus.

Just a simple example: The RDFa extractor details page

http://s3.amazonaws.com/webdatacommons-2/stats/top_classes_for_extractor_html-rdfa.html

says the extractors found 23,825 entities of type gr:Offering.

In comparison, the few sites listed at

http://wiki.goodrelations-vocabulary.org/Datasets

alone account for ca. 25,000,000 entities of that type, i.e. about 1,000 times more.

I am not criticizing your valuable work; I only want to prevent people from drawing incorrect conclusions from the preliminary data, because the crawl does not seem to be a *representative* or anywhere near *complete* corpus. So do not take the numbers at face value without appropriate corrections for bias in the sample.

My suspicion is that the CommonCrawl crawler limits the crawling depth per site and thus does not harvest the article detail pages that typically hold the real RDFa markup. But that is just a guess.
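For anyone who wants to sanity-check such per-class counts against one of the published dumps, a minimal sketch in Python follows. It counts distinct subjects typed as gr:Offering in N-Quads input. All URIs and sample quads below are invented for illustration, and the naive whitespace split would not handle literals containing spaces; a real analysis should use a proper N-Quads parser.

```python
# Count distinct gr:Offering entities in N-Quads data (sketch).
# Assumption: each rdf:type quad fits on one line and terms contain
# no internal whitespace, which holds for the type statements here.
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
GR_OFFERING = "<http://purl.org/goodrelations/v1#Offering>"

# Invented sample data in N-Quads form: subject predicate object graph .
sample_nquads = """\
<http://shop.example/offer/1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/goodrelations/v1#Offering> <http://shop.example/p1> .
<http://shop.example/offer/2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/goodrelations/v1#Offering> <http://shop.example/p2> .
<http://shop.example/offer/1> <http://purl.org/goodrelations/v1#name> "Widget" <http://shop.example/p1> .
"""

def count_offerings(lines):
    """Return the number of distinct subjects typed as gr:Offering."""
    subjects = set()
    for line in lines:
        parts = line.split()
        if len(parts) >= 4 and parts[1] == RDF_TYPE and parts[2] == GR_OFFERING:
            subjects.add(parts[0])
    return len(subjects)

print(count_offerings(sample_nquads.splitlines()))  # 2
```

Comparing such a count on the Web Data Commons files with the figures a large publisher reports for its own dataset makes the undercounting visible directly.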

Best wishes

Martin Hepp


On Mar 22, 2012, at 9:12 PM, Chris Bizer wrote:

> Hi all,
>  
> we are happy to announce WebDataCommons.org, a joint project of Freie Universität Berlin and the Karlsruhe Institute of Technology to extract all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public.
>  
> WebDataCommons.org provides the extracted data for download in the form of RDF-quads. In addition, we produce basic statistics about the extracted data.
>  
> Up till now, we have extracted data from two Common Crawl web corpora: One corpus consisting of 2.5 billion HTML pages dating from 2009/2010 and a second corpus consisting of 1.4 billion HTML pages dating from February 2012.
>  
> The 2009/2010 extraction resulted in 5.1 billion RDF quads which describe 1.5 billion entities and originate from 19.1 million websites.
> The February 2012 extraction resulted in 3.2 billion RDF quads which describe 1.2 billion entities and originate from 65.4 million websites.
>  
> More detailed statistics about the distribution of formats, entities and websites serving structured data, as well as the growth between 2009/2010 and 2012, are provided on the project website:
>  
> http://webdatacommons.org/
>  
> It is interesting to see from the statistics that RDFa and Microdata deployment has grown a lot over the last years, but that Microformat data still makes up the majority of the structured data that is embedded into HTML pages (when looking at the number of quads as well as the number of websites).
>  
> We hope that Web Data Commons will be useful to the community by:
> + easing access to Microdata, Microformat and RDFa data, as you no longer need to crawl the Web yourself in order to get access to a fair portion of the structured data that is currently available on the Web.
> + laying the foundation for the more detailed analysis of the deployment of the different technologies.
> + providing seed URLs for focused Web crawls that dig deeper into the websites that offer a specific type of data.
>  
> Web Data Commons is a joint effort of Christian Bizer and Hannes Mühleisen (Web-based Systems Group at Freie Universität Berlin) and Andreas Harth and Steffen Stadtmüller (Institute AIFB at the Karlsruhe Institute of Technology).
>  
> Lots of thanks to:
> + the Common Crawl project for providing their great web crawl and thus enabling the Web Data Commons project.
> + the Any23 project for providing their great library of structured data parsers.
> + the PlanetData and the LOD2 EU research projects which supported the extraction.
>  
> For the future, we plan to update the extracted datasets on a regular basis as new Common Crawl corpora become available. We also plan to provide the extracted data in the form of CSV tables for common entity types (e.g. product, organization, location, ...) in order to make it easier to mine the data.
>  
> Cheers,
>  
> Christian Bizer, Hannes Mühleisen, Andreas Harth and Steffen Stadtmüller
>  
>  
> --
> Prof. Dr. Christian Bizer
> Web-based Systems Group
> Freie Universität Berlin
> +49 30 838 55509
> http://www.bizer.de
> chris@bizer.de
>  

--------------------------------------------------------
martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail:  hepp@ebusiness-unibw.org
phone:   +49-(0)89-6004-4217
fax:     +49-(0)89-6004-4620
www:     http://www.unibw.de/ebusiness/ (group)
         http://www.heppnetz.de/ (personal)
skype:   mfhepp 
twitter: mfhepp

Check out GoodRelations for E-Commerce on the Web of Linked Data!
=================================================================
* Project Main Page: http://purl.org/goodrelations/
Received on Monday, 26 March 2012 14:12:14 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Tuesday, 22 May 2012 06:49:00 GMT