Fwd: ANN: WebDataCommons.org releases 7.3 billion quads RDFa, Microdata and Microformat data originating from 2.29 million pay-level-domains from Melvin Carvalho on 2012-12-11 (public-rww@w3.org from December 2012)

From: Melvin Carvalho <melvincarvalho@gmail.com>
Date: Tue, 11 Dec 2012 15:55:08 +0100
To: public-rww <public-rww@w3.org>
Message-ID: <CAKaEYhKFpeM21J=5_aRmZ9NFuthtErUqQccGSc8pCk6AOjcbcw@mail.gmail.com>
I'm a big fan of this initiative.  You can further imagine the web data
commons becoming searchable and a source of claims and assertions using the
open web assumption.

Don't forget that reverse search (e.g. Google) is just as popular as forward
search (e.g. directories).

Thanks to projects like WebDataCommons, it may be possible to imagine
the same thing with data in the near future.

---------- Forwarded message ----------
From: Christian Bizer <chris@bizer.de>
Date: 11 December 2012 15:48
Subject: ANN: WebDataCommons.org releases 7.3 billion quads RDFa, Microdata
and Microformat data originating from 2.29 million pay-level-domains
To: public-lod@w3.org, semantic-web@w3.org, public-vocabs@w3.org


Hi all,

more and more websites embed structured data describing for instance
products, people, organizations, places, events, resumes, and cooking
recipes into their HTML pages using markup formats such as RDFa, Microdata
and Microformats.

The Web Data Commons project extracts all Microformat, Microdata and RDFa
data from the Common Crawl web corpus, the largest and most up-to-date web
corpus that is currently available to the public, and provides the
extracted data for download. In addition, we calculate and publish
statistics about the deployment of the different formats as well as the
vocabularies that are used together with each format.

Today, we are happy to announce the release of a new WebDataCommons dataset.

The dataset has been extracted from the latest version of the Common Crawl.
This August 2012 version of the Common Crawl contains over 3 billion HTML
pages which originate from over 40 million websites (pay-level-domains).

Altogether we discovered structured data within 369 million HTML pages
contained in the Common Crawl corpus (12.3%). The pages containing
structured data originate from 2.29 million websites (5.65%).
Approximately 519 thousand of these websites use RDFa, while 140 thousand
websites use Microdata. Microformats are used on 1.7 million websites.
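
As a quick sanity check on the figures above, here is a minimal sketch that recomputes the shares. It assumes the rounded totals quoted in the announcement ("over 3 billion" pages, "over 40 million" websites), so the results only approximate the published percentages. The variable and function names are illustrative.

```python
# Rounded figures as quoted in the announcement (assumed exact for this check).
pages_total = 3_000_000_000
pages_with_data = 369_000_000
sites_total = 40_000_000
sites_with_data = 2_290_000

# Websites per format, as quoted (approximate).
sites_rdfa = 519_000
sites_microdata = 140_000
sites_microformats = 1_700_000


def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


page_share = share(pages_with_data, pages_total)
site_share = share(sites_with_data, sites_total)

# The per-format counts add up to more than the number of websites with
# structured data, so some websites must use more than one format.
overlap = sites_rdfa + sites_microdata + sites_microformats - sites_with_data

print(page_share, site_share, overlap)
```

With a round 40 million websites the second share comes out slightly above the published 5.65%, which is consistent with the real total being somewhat over 40 million. The per-format counts are themselves rounded, so the overlap figure is only indicative.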

Basic statistics about the extracted dataset as well as the vocabularies
that are used together with each encoding format are found at:

http://www.webdatacommons.org/2012-08/stats/stats.html

Additional statistics that analyze top-level domain distribution and the
popularity of the websites covered by the Common Crawl, as well as the
topical domains of the embedded data are found at:

http://www.webdatacommons.org/2012-08/stats/additional_stats.html

The overall size of the August 2012 WebDataCommons dataset is 7.3 billion
quads. The dataset is split into 1,416 files each having a size of around
100 MB. In order to make it easier to find data from a specific website or
top-level-domain, we provide indexes about the location of specific data
within the files.
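
To give an idea of what working with the quads could look like, here is a minimal sketch that reads N-Quads-style lines and counts quads per host. It assumes that the fourth element of each quad is the URL of the page the data was extracted from, and it uses the URL's hostname as a stand-in for the pay-level-domain (a proper mapping would need a public suffix list). The regular expression is a simplification, not a full N-Quads parser, and the sample lines are made up.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Simplified pattern: subject, predicate, object, graph IRI, final " .".
# The object may be an IRI, a blank node, or a literal.
QUAD = re.compile(r'^(\S+)\s+(<[^>]*>)\s+(.+?)\s+<([^>]*)>\s*\.\s*$')


def parse_quad(line):
    """Return (subject, predicate, object, graph_url) or None if no match."""
    m = QUAD.match(line)
    if m is None:
        return None
    return m.groups()


def quads_per_host(lines):
    """Count quads per hostname of the graph URL, skipping unparsable lines."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is None:
            continue
        host = urlsplit(quad[3]).hostname
        if host:
            counts[host] += 1
    return counts


# Hypothetical sample data, not taken from the actual dataset.
sample = [
    '_:n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Product> <http://shop.example.com/item/1> .',
    '_:n1 <http://schema.org/name> "Blue \\"mug\\"" '
    '<http://shop.example.com/item/1> .',
    '_:n2 <http://schema.org/name> "Alice" '
    '<http://people.example.org/alice> .',
    'not a quad',
]

counts = quads_per_host(sample)
print(counts.most_common())
```

For the real files, a dedicated N-Quads parser would be the safer choice; the sketch is only meant to show the shape of the data.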

In order to make it easy for third parties to investigate the usage of
different vocabularies and to generate seed-lists for focused crawling
endeavors, we provide a website-class-property matrix for each format. The
matrices indicate which vocabulary term (class/property) is used by which
website, so you do not need to download and scan the whole dataset to
obtain this information.
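
The idea of a website-class-property matrix, and how a seed list can be derived from it, can be sketched as follows. This is an illustration only: the file format of the published matrices is not described here, so the in-memory structure, the function names and the sample quads are all assumptions. Classes are taken from the objects of rdf:type statements, and properties from all other predicates.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_matrix(quads):
    """Map each website to the set of vocabulary terms used on it.

    quads: iterable of (subject, predicate, object, website) tuples.
    Each term is stored as ("class", iri) or ("property", iri).
    """
    matrix = defaultdict(set)
    for _subject, predicate, obj, site in quads:
        if predicate == RDF_TYPE:
            matrix[site].add(("class", obj))
        else:
            matrix[site].add(("property", predicate))
    return matrix


def seed_list(matrix, term):
    """Return the sorted list of websites that use the given term."""
    return sorted(site for site, terms in matrix.items() if term in terms)


# Hypothetical sample quads, already reduced to their website.
quads = [
    ("_:a", RDF_TYPE, "http://schema.org/Product", "shop.example.com"),
    ("_:a", "http://schema.org/name", "Mug", "shop.example.com"),
    ("_:b", RDF_TYPE, "http://schema.org/Person", "people.example.org"),
    ("_:c", RDF_TYPE, "http://schema.org/Product", "store.example.net"),
]

matrix = build_matrix(quads)
seeds = seed_list(matrix, ("class", "http://schema.org/Product"))
print(seeds)
```

A focused crawler interested in product data could then start from the websites in `seeds` instead of scanning every file.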

The extracted dataset and website-class-property matrix can be downloaded
from:

http://www.webdatacommons.org/2012-08/stats/how_to_get_the_data.html

Lots of thanks to:

+ the Common Crawl project for providing their great web crawl and thus
enabling the Web Data Commons project.
+ the Any23 project for providing their great library of structured data
parsers.
+ the PlanetData and the LOD2 EU research projects for supporting
WebDataCommons.

Have fun with the new dataset.

Cheers,

Christian Bizer and Robert Meusel


-- 
Prof. Dr. Christian Bizer
Chair of Information Systems V
Web-based Systems Group
Universität Mannheim
B6, 26, Room B1.15
D-68131 Mannheim
Tel.: +49(0)621/181-2677
Fax.: +49(0)621/181-2682
Mail: chris@informatik.uni-mannheim.de
Web: www.bizer.de
Received on Tuesday, 11 December 2012 14:55:50 UTC