Re: ANN: WebDataCommons.org releases 7.3 billion quads RDFa, Microdata and Microformat data originating from 2.29 million pay-level-domains from Stéphane Corlosquet on 2012-12-11 (public-vocabs@w3.org from December 2012)

From: Stéphane Corlosquet <scorlosquet@gmail.com>
Date: Tue, 11 Dec 2012 16:56:16 -0500
To: Christian Bizer <chris@bizer.de>
Cc: public-lod@w3.org, semantic-web@w3.org, public-vocabs@w3.org
Message-ID: <CAGR+nnE=tL2aacZabBkhJx=6DuYp2oP4tg3-oVVcGP2XD9Z1Xg@mail.gmail.com>

Thanks for sharing these statistics and dataset.

The Extraction Statistic File Header (tab-separated file) [1] includes a
column 'referencedData' which is not explained in the list of columns
headers at [2]. What does it correspond to?
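
For anyone who wants to run the same check, here is a minimal sketch of how the header file can be compared against the documented column list. The function name `undocumented_columns` is mine, and apart from 'referencedData' the column names in the sample are placeholders I made up, not the real columns from [1] or [2].

```python
import csv
import io


def undocumented_columns(header_text, documented):
    """Return the header columns that do not appear in the documented list."""
    # The header file is tab-separated, so parse its first row with a tab delimiter.
    reader = csv.reader(io.StringIO(header_text), delimiter="\t")
    header = next(reader)
    known = set(documented)
    return [name for name in header if name not in known]


# Placeholder header line: only 'referencedData' is taken from the actual file,
# the other column names are hypothetical.
sample_header = "url\tformatCount\treferencedData\n"

# Hypothetical stand-in for the documented column list at [2].
documented = ["url", "formatCount"]

print(undocumented_columns(sample_header, documented))
```

In practice the contents of [1] would replace `sample_header`, and the column names listed at [2] would replace `documented`.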

Steph.

[1] http://www.webdatacommons.org/samples/pages.header.tab
[2] http://www.webdatacommons.org/2012-08/stats/how_to_get_the_data.html#toc3

On Tue, Dec 11, 2012 at 9:48 AM, Christian Bizer <chris@bizer.de> wrote:

> Hi all,
>
> more and more websites embed structured data describing for instance
> products, people, organizations, places, events, resumes, and cooking
> recipes into their HTML pages using markup formats such as RDFa, Microdata
> and Microformats.
>
> The Web Data Commons project extracts all Microformat, Microdata and RDFa
> data from the Common Crawl web corpus, the largest and most up-to-date web
> corpus that is currently available to the public, and provides the
> extracted data for download. In addition, we calculate and publish
> statistics about the deployment of the different formats as well as the
> vocabularies that are used together with each format.
>
> Today, we are happy to announce the release of a new WebDataCommons
> dataset.
>
> The dataset has been extracted from the latest version of the Common
> Crawl. This August 2012 version of the
> Common Crawl contains over 3 billion HTML pages which originate from over
> 40 million websites (pay-level-domains).
>
> Altogether we discovered structured data within 369 million HTML pages
> contained in the Common Crawl corpus (12.3%). The pages containing
> structured data originate from 2.29 million websites (5.65%).
> Approximately 519 thousand of these websites use RDFa, while 140 thousand
> websites use Microdata. Microformats are used on 1.7 million websites.
>
> Basic statistics about the extracted dataset as well as the vocabularies
> that are used together with each encoding format are found at:
>
> http://www.webdatacommons.org/2012-08/stats/stats.html
>
> Additional statistics that analyze top-level domain distribution and the
> popularity of the websites covered by the Common Crawl, as well as the
> topical domains of the embedded data are found at:
>
> http://www.webdatacommons.org/2012-08/stats/additional_stats.html
>
> The overall size of the August 2012 WebDataCommons dataset is 7.3 billion
> quads. The dataset is split into 1,416 files each having a size of around
> 100 MB. In order to make it easier to find data from a specific website or
> top-level-domain, we provide indexes about the location of specific data
> within the files.
>
> In order to make it easy for third parties to investigate the usage of
> different vocabularies and to generate seed-lists for focused crawling
> endeavors, we provide a website-class-property matrix for each format. The
> matrices indicate which vocabulary term (class/property) is used by which
> website, so you do not need to download and scan the whole dataset to
> obtain this information.
>
> The extracted dataset and website-class-property matrix can be downloaded
> from:
>
> http://www.webdatacommons.org/2012-08/stats/how_to_get_the_data.html
>
> Lots of thanks to:
>
> + the Common Crawl project for providing their great web crawl and thus
> enabling the Web Data Commons project.
> + the Any23 project for providing their great library of structured data
> parsers.
> + the PlanetData and the LOD2 EU research projects for supporting
> WebDataCommons.
>
> Have fun with the new dataset.
>
> Cheers,
>
> Christian Bizer and Robert Meusel
>
>
> --
> Prof. Dr. Christian Bizer
> Chair of Information Systems V
> Web-based Systems Group
> Universität Mannheim
> B6, 26, Room B1.15
> D-68131 Mannheim
> Tel.: +49(0)621/181-2677
> Fax.: +49(0)621/181-2682
> Mail: chris@informatik.uni-mannheim.de
> Web: www.bizer.de
>


-- 
Steph.

Received on Tuesday, 11 December 2012 21:56:49 UTC