Re: ANN: WebDataCommons.org releases 7.3 billion quads RDFa, Microdata and Microformat data originating from 2.29 million pay-level-domains from Christian Bizer on 2012-12-12 (public-lod@w3.org from December 2012)

From: Christian Bizer <chris@bizer.de>
Date: Wed, 12 Dec 2012 18:26:57 +0100
To: Aaron Bradley <aaranged@gmail.com>
CC: Christian Bizer <chris@bizer.de>, public-lod@w3.org, semantic-web@w3.org, Public Vocabs <public-vocabs@w3.org>
Message-ID: <50C8BE61.4070300@bizer.de>
Hi Aaron,

While I also suspect that the number of websites that offer RDFa and 
microdata has grown between 2009 and 2012, this cannot be validly 
concluded directly from the data.

The problem is two-fold. On the one hand, the number of pages that 
are included in the Common Crawl per website depends on the page rank 
of the pages, meaning that for some sites thousands of pages are 
included, for other sites only a small set. In both cases, the included 
pages are only a sample of the pages provided by the site.
Thus, drawing any conclusions from the absolute number of pages with 
data, absolute number of entities and number of triples is not valid.

What makes more sense is to compare the number of websites that support 
a specific format, because behind each website is the decision by some 
administrator to include specific markup in the HTML code.

The question is what you count as a website. For the 2009/2010 release, 
we counted all sub-domains as separate websites (which makes Wordpress a 
couple of million websites). For the 2012 release we only counted 
pay-level-domains as separate websites. Thus, the domain counts are not 
directly comparable.
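
To illustrate how much the two counting schemes can differ, here is a minimal Python sketch. It is not the WDC statistics script. The host names and the tiny hard-coded suffix set are made-up assumptions, and a real analysis would need the full Public Suffix List to determine pay-level-domains correctly.

```python
# Hypothetical, hand-picked suffixes for illustration only;
# a real run needs the full Public Suffix List.
PUBLIC_SUFFIXES = {"com", "org", "co.uk"}


def pay_level_domain(host: str) -> str:
    """Return the longest matching public suffix plus the label in front of it."""
    labels = host.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that "co.uk" is matched before a shorter suffix would be.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[i - 1:])
    return host.lower()  # no known suffix: fall back to the full host name


def count_websites(hosts):
    """Count websites per sub-domain (2009/2010 scheme) and per pay-level-domain (2012 scheme)."""
    by_subdomain = len({h.lower() for h in hosts})
    by_pld = len({pay_level_domain(h) for h in hosts})
    return by_subdomain, by_pld


# Made-up example hosts.
hosts = [
    "alice.wordpress.com",
    "bob.wordpress.com",
    "www.example.org",
    "shop.example.co.uk",
]
print(count_websites(hosts))
```

With these four hosts, the sub-domain scheme sees four websites, while the pay-level-domain scheme sees three, because both Wordpress blogs collapse into one domain.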

So, the minimum that would need to be done to draw a conclusion would be 
to download the 2009/2010 WDC dataset and run our current statistics 
script over it (the script is in the subversion repository) so that you 
get comparable domain counts.

To be sure that the sets of websites included in Common Crawl 2009/2010 
and Common Crawl 2012 match to a sufficiently high extent, one would also 
need to derive both sets from the Common Crawl and compare them. We have 
the list of all 40 million pay-level-domains covered by the Common Crawl 
2012 but we don't have the list for the 2009/2010 crawl. When we touch 
the Common Crawl 2009/2010 the next time, we will also generate and 
publish this list. But if you want quicker results, you would need to 
get your hands dirty yourself and invest the $100 Amazon fees to scan 
over the dataset.
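
Once both domain lists exist, comparing them is straightforward. The following Python sketch shows one possible way to do it. The function name, the overlap measures and the example domains are assumptions for illustration and are not part of the WDC tooling.

```python
def overlap_stats(domains_a, domains_b):
    """Compare two collections of pay-level-domains and report how far they overlap."""
    a, b = set(domains_a), set(domains_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        # Jaccard index: shared domains relative to all domains seen in either crawl.
        "jaccard": len(shared) / len(union) if union else 0.0,
        # Fraction of each crawl's domains that also appear in the other crawl.
        "share_of_a": len(shared) / len(a) if a else 0.0,
        "share_of_b": len(shared) / len(b) if b else 0.0,
    }


# Made-up domain lists standing in for the two crawls.
crawl_2010 = ["a.com", "b.org", "c.net"]
crawl_2012 = ["b.org", "c.net", "d.de", "e.com"]

stats = overlap_stats(crawl_2010, crawl_2012)
print(stats)
```

If the overlap turns out to be low, one option would be to restrict the format counts to the shared domains before comparing the two releases.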

Cheers,

Chris


Am 11.12.2012 19:07, schrieb Aaron Bradley:
> Thanks for this Christian and Robert.
>
> I've been comparing the 2009/2010 corpus against this, and have 
> noticed the comparative growth of RDFa and microdata (proportionally).
>
> Is this a valid apples-to-apples comparison?  That is, do the two data 
> sets (2009/2010 and Aug. 2012) reference the same domain set, or might 
> a comparison of the two data sets be ill-advised because of the 
> variance in items selected in each case?
>
>
> On Tue, Dec 11, 2012 at 6:48 AM, Christian Bizer <chris@bizer.de> wrote:
>
>     Hi all,
>
>     more and more websites embed structured data describing for
>     instance products, people, organizations, places, events, resumes,
>     and cooking recipes into their HTML pages using markup formats
>     such as RDFa, Microdata and Microformats.
>
>     The Web Data Commons project extracts all Microformat, Microdata
>     and RDFa data from the Common Crawl web corpus, the largest and
>     most up-to-date web corpus that is currently available to the
>     public, and provides the extracted data for download. In addition,
>     we calculate and publish statistics about the deployment of the
>     different formats as well as the vocabularies that are used
>     together with each format.
>
>     Today, we are happy to announce the release of a new
>     WebDataCommons dataset.
>
>     The dataset has been extracted from the latest version of the
>     Common Crawl. This August 2012 version of the
>     Common Crawl contains over 3 billion HTML pages which originate
>     from over 40 million websites (pay-level-domains).
>
>     Altogether we discovered structured data within 369 million HTML
>     pages contained in the Common Crawl corpus (12.3%). The pages
>     containing structured data originate from 2.29 million websites
>     (5.65%).  Approximately 519 thousand of these websites use RDFa,
>     while 140 thousand websites use Microdata. Microformats are used
>     on 1.7 million websites.
>
>     Basic statistics about the extracted dataset as well as the
>     vocabularies that are used together with each encoding format are
>     found at:
>
>     http://www.webdatacommons.org/2012-08/stats/stats.html
>
>     Additional statistics that analyze top-level domain distribution
>     and the popularity of the websites covered by the Common Crawl, as
>     well as the topical domains of the embedded data are found at:
>
>     http://www.webdatacommons.org/2012-08/stats/additional_stats.html
>
>     The overall size of the August 2012 WebDataCommons dataset is 7.3
>     billion quads. The dataset is split into 1,416 files each having a
>     size of around 100 MB. In order to make it easier to find data
>     from a specific website or top-level-domain, we provide indexes
>     about the location of specific data within the files.
>
>     In order to make it easy for third parties to investigate the
>     usage of different vocabularies and to generate seed-lists for
>     focused crawling endeavors, we provide a website-class-property
>     matrix for each format. The matrices indicate which vocabulary
>     term (class/property) is used by which website, so that you do
>     not need to download and scan the whole dataset to obtain this
>     information.
>
>     The extracted dataset and website-class-property matrix can be
>     downloaded from:
>
>     http://www.webdatacommons.org/2012-08/stats/how_to_get_the_data.html
>
>     Lots of thanks to:
>
>     + the Common Crawl project for providing their great web crawl and
>     thus enabling the Web Data Commons project.
>     + the Any23 project for providing their great library of
>     structured data parsers.
>     + the PlanetData and the LOD2 EU research projects for supporting
>     WebDataCommons.
>
>     Have fun with the new dataset.
>
>     Cheers,
>
>     Christian Bizer and Robert Meusel
>
>
>     -- 
>     Prof. Dr. Christian Bizer
>     Chair of Information Systems V
>     Web-based Systems Group
>     Universität Mannheim
>     B6, 26, Room B1.15
>     D-68131 Mannheim
>     Tel.: +49(0)621/181-2677
>     Fax.: +49(0)621/181-2682
>     Mail: chris@informatik.uni-mannheim.de
>     Web: www.bizer.de
>
Received on Wednesday, 12 December 2012 17:26:08 UTC