
ANN: Release of class-specific subsets of WebDataCommons Schema.org Data

From: Robert Meusel <robert@informatik.uni-mannheim.de>
Date: Fri, 05 Dec 2014 12:07:13 +0100
Message-ID: <548191E1.90600@informatik.uni-mannheim.de>
To: public-vocabs@w3.org
Dear All,

The WebDataCommons team is happy to announce that we have released 
several class-specific subsets of the Schema.org Data contained in our 
Winter 2013 Microdata corpus [1]. We hope that providing these 
class-specific subsets for over 50 different Schema.org classes (such as 
product, event, or address) will make it easier for the community to 
explore and work with the data.

The different datasets, along with some statistics about the data, can be 
found here: 
http://webdatacommons.org/structureddata/2013-11/stats/schema_org_subsets.html

The subsets contain all instances of a specific class as well as all 
other data that is found on the webpages containing these instances. For 
example, a page containing data about a product might also contain 
reviews and offers for this product; a page containing data about an 
event might also contain data about the location of the event and the 
persons involved in the event. The data was originally extracted using 
Any23 [2] from the Winter 2013 crawl provided by the Common Crawl 
Foundation [3]. The extracted data is represented in N-Quads [4] format, 
meaning that the fourth element of each quad contains the URL of the 
webpage from which the data was extracted.
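
As an illustration of how the fourth element can be used, here is a minimal 
Python sketch that groups the statements of a subset by the webpage they 
were extracted from. It is not part of the WebDataCommons tooling; the 
names TERM, parse_quad and group_by_page, as well as the sample lines and 
example.com URLs, are made up for this example. The regular expression 
covers only the common cases (IRIs, simple blank node labels, literals with 
an optional datatype or language tag), so a full N-Quads parser is the 
safer choice for production use.

```python
import re
from collections import defaultdict

# One N-Quads term: an IRI, a blank node, or a literal that may carry a
# datatype or language tag. Simplified on purpose (assumption: blank node
# labels use only letters, digits, "_" and "-").
TERM = re.compile(
    r'<[^>]*>'
    r'|_:[A-Za-z0-9_-]+'
    r'|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?'
)


def parse_quad(line):
    """Return (subject, predicate, object, page_url), or None if the line
    does not consist of exactly four terms with an IRI in fourth position."""
    terms = TERM.findall(line)
    if len(terms) != 4 or not terms[3].startswith("<"):
        return None
    subject, predicate, obj, graph = terms
    return subject, predicate, obj, graph[1:-1]  # drop the enclosing < >


def group_by_page(lines):
    """Map each source page URL to the triples extracted from that page."""
    pages = defaultdict(list)
    for line in lines:
        quad = parse_quad(line)
        if quad is not None:
            subject, predicate, obj, page_url = quad
            pages[page_url].append((subject, predicate, obj))
    return dict(pages)


# Hypothetical sample data, not taken from the actual corpus.
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
SAMPLE = [
    "_:n1 " + RDF_TYPE + " <http://schema.org/Product> <http://example.com/item/1> .",
    '_:n1 <http://schema.org/Product/name> "Blue <b>Widget</b>"@en <http://example.com/item/1> .',
    "_:n2 " + RDF_TYPE + " <http://schema.org/Offer> <http://example.com/item/2> .",
    "",
]

if __name__ == "__main__":
    for page_url, triples in group_by_page(SAMPLE).items():
        print(page_url, len(triples))
```

Grouping on the page URL keeps an instance together with the other data 
found on the same page, for example a product with its offers and reviews.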

We thank the Common Crawl Foundation for providing their Web corpora.


Best Regards,

Chris, Heiko & Robert

[1] http://webdatacommons.org/structureddata/2013-11/stats/stats.html
[2] http://any23.apache.org
[3] http://commoncrawl.org
[4] http://www.w3.org/TR/n-quads/

-- 
Robert Meusel
Chair of Information Systems V
Web-based Systems Group
Universität Mannheim
B6, 26, Room C1.04
D-68159 Mannheim
Phone: +49 621 181 2648
Mail: robert@informatik.uni-mannheim.de
Web: dws.informatik.uni-mannheim.de
Received on Friday, 5 December 2014 11:07:38 UTC
