RE: ANN: WebDataCommons.org releases 7.3 billion quads RDFa, Microdata and Microformat data originating from 2.29 million pay-level-domains

Hi Steph,

 

the header "referencedData" corresponds to the regex we use during the
extraction process to detect the presence of a microformat in an HTML page
[1]. The column states the format corresponding to the regex which matched
first.
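
As a rough illustration of this "first matching regex" behaviour, here is a minimal Python sketch. The pattern list, the format labels and the function name `first_matching_format` are assumptions made up for the example; they are not the regexes actually used in the extraction (those are described at [1]).

```python
import re

# Hypothetical detection patterns, checked in order. The real regexes used
# by the extraction are different; these only illustrate the mechanism.
DETECTORS = [
    ("html-rdfa", re.compile(r"(property|typeof|about|resource)\s*=", re.I)),
    ("html-microdata", re.compile(r"(itemscope|itemprop\s*=)", re.I)),
    ("html-mf-hcard", re.compile(r'''class\s*=\s*["'][^"']*\bvcard\b''', re.I)),
]

def first_matching_format(html):
    """Return the label of the first detector whose regex matches, or None."""
    for label, pattern in DETECTORS:
        if pattern.search(html):
            return label
    return None
```

With a scheme like this, a page containing several formats is reported only under the format whose regex is tried first, which is why the column holds a single value.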

 

Robert

 

[1] http://webdatacommons.org/#toc4

 

From: Stéphane Corlosquet [mailto:scorlosquet@gmail.com] 
Sent: Tuesday, 11 December 2012 22:56
To: Christian Bizer
Cc: public-lod@w3.org; semantic-web@w3.org; public-vocabs@w3.org
Subject: Re: ANN: WebDataCommons.org releases 7.3 billion quads RDFa,
Microdata and Microformat data originating from 2.29 million
pay-level-domains

 

Thanks for sharing these statistics and dataset.

 

The Extraction Statistic File Header (tab-separated file) [1] includes a
column 'referencedData' which is not explained in the list of columns
headers at [2]. What does it correspond to?

 

Steph.

 

[1] http://www.webdatacommons.org/samples/pages.header.tab

[2]
http://www.webdatacommons.org/2012-08/stats/how_to_get_the_data.html#toc3

 

On Tue, Dec 11, 2012 at 9:48 AM, Christian Bizer <chris@bizer.de> wrote:

Hi all,

more and more websites embed structured data describing for instance
products, people, organizations, places, events, resumes, and cooking
recipes into their HTML pages using markup formats such as RDFa, Microdata
and Microformats.

The Web Data Commons project extracts all Microformat, Microdata and RDFa
data from the Common Crawl web corpus, the largest and most up-to-date web
corpus that is currently available to the public, and provides the extracted
data for download. In addition, we calculate and publish statistics about
the deployment of the different formats as well as the vocabularies that are
used together with each format.

Today, we are happy to announce the release of a new WebDataCommons dataset.

The dataset has been extracted from the latest version of the Common Crawl.
This August 2012 version of the Common Crawl contains over 3 billion HTML
pages which originate from over 40 million websites (pay-level-domains).

Altogether we discovered structured data within 369 million HTML pages
contained in the Common Crawl corpus (12.3%). The pages containing
structured data originate from 2.29 million websites (5.65%).  Approximately
519 thousand of these websites use RDFa, while 140 thousand websites use
Microdata. Microformats are used on 1.7 million websites.
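
The shares quoted above are simple ratios, which can be checked with a short Python sketch. The inputs are the rounded figures from this announcement (the totals are only given as "over 3 billion" and "over 40 million"), so the results approximate the quoted percentages rather than reproduce them exactly. The helper name `share` is an assumption for illustration.

```python
def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

# Rounded figures from the announcement (assumed inputs, not exact counts).
pages_with_data = 369e6
pages_total = 3e9        # "over 3 billion"
sites_with_data = 2.29e6
sites_total = 40e6       # "over 40 million", so the true total is larger

print(f"pages: {share(pages_with_data, pages_total):.1f}%")
print(f"sites: {share(sites_with_data, sites_total):.2f}%")
```

Because the website total is a lower bound, the computed website share comes out slightly above the 5.65% stated in the text.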

Basic statistics about the extracted dataset as well as the vocabularies
that are used together with each encoding format are found at:

http://www.webdatacommons.org/2012-08/stats/stats.html

Additional statistics that analyze top-level domain distribution and the
popularity of the websites covered by the Common Crawl, as well as the
topical domains of the embedded data are found at:

http://www.webdatacommons.org/2012-08/stats/additional_stats.html

The overall size of the August 2012 WebDataCommons dataset is 7.3 billion
quads. The dataset is split into 1,416 files each having a size of around
100 MB. In order to make it easier to find data from a specific website or
top-level-domain, we provide indexes about the location of specific data
within the files.

In order to make it easy for third parties to investigate the usage of
different vocabularies and to generate seed-lists for focused crawling
endeavors, we provide a website-class-property matrix for each format. The
matrices indicate which vocabulary term (class/property) is used by which
website, so that you do not need to download and scan the whole dataset to
obtain this information.
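
To show how such a matrix could feed a seed list for focused crawling, here is a minimal Python sketch. The in-memory layout (website mapped to a set of terms), the example hosts and the function name `seed_list` are all assumptions for illustration; the published matrix files may be organized differently.

```python
# Hypothetical in-memory matrix: website -> set of vocabulary terms used.
# The hosts and the layout are made up for illustration only.
matrix = {
    "shop.example": {"http://schema.org/Product", "http://schema.org/name"},
    "events.example": {"http://schema.org/Event"},
    "blog.example": {"http://schema.org/Product"},
}

def seed_list(matrix, term):
    """Return the sorted websites that use the given class or property."""
    return sorted(site for site, terms in matrix.items() if term in terms)

# Websites that could seed a crawl focused on product data.
product_sites = seed_list(matrix, "http://schema.org/Product")
```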

The extracted dataset and website-class-property matrix can be downloaded
from:

http://www.webdatacommons.org/2012-08/stats/how_to_get_the_data.html

Many thanks to:

+ the Common Crawl project for providing their great web crawl and thus
enabling the Web Data Commons project.
+ the Any23 project for providing their great library of structured data
parsers.
+ the PlanetData and the LOD2 EU research projects for supporting
WebDataCommons.

Have fun with the new dataset.

Cheers,

Christian Bizer and Robert Meusel


-- 
Prof. Dr. Christian Bizer
Chair of Information Systems V
Web-based Systems Group
Universität Mannheim
B6, 26, Room B1.15
D-68131 Mannheim
Tel.: +49(0)621/181-2677
Fax.: +49(0)621/181-2682
Mail: chris@informatik.uni-mannheim.de
Web: www.bizer.de


-- 
Steph.

Received on Thursday, 13 December 2012 10:47:16 UTC