- From: Robert Meusel <robert@informatik.uni-mannheim.de>
- Date: Fri, 05 Dec 2014 13:48:23 +0100
- To: semantic-web@w3.org
- Message-ID: <5481A997.5070401@informatik.uni-mannheim.de>
Dear All,

The WebDataCommons team is happy to announce that we have released several class-specific subsets of the Schema.org data contained in our Winter 2013 Microdata corpus [1].

We hope that providing these topic-specific subsets for over 50 different Schema.org classes (like product, event, or address) will make it easier for the community to explore and work with the data.

The different datasets, along with some statistics about the data, can be found here:

http://webdatacommons.org/structureddata/2013-11/stats/schema_org_subsets.html

The subsets contain all instances of a specific class as well as all other data found on the webpages containing these instances. For example, a page containing data about a product might also contain reviews and offers for this product; a page containing data about an event might also contain data about the location of the event and the persons involved in it.

The data was originally extracted using Any23 [2] from the Winter 2013 crawl provided by the Common Crawl Foundation [3]. The extracted data is represented in N-Quads [4] format, meaning that the fourth element of each quad contains the URL of the webpage from which the data was extracted.

We thank the Common Crawl Foundation for providing their Web corpora.

Best Regards,
Chris, Heiko & Robert

[1] http://webdatacommons.org/structureddata/2013-11/stats/stats.html
[2] http://any23.apache.org
[3] http://commoncrawl.org
[4] http://www.w3.org/TR/n-quads/

-- 
Robert Meusel
Chair of Information Systems V
Web-based Systems Group
Universität Mannheim
B6, 26, Room C1.04
D-68159 Mannheim
Phone: +49 621 181 2648
Mail: robert@informatik.uni-mannheim.de
Web: dws.informatik.uni-mannheim.de
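As an illustration of the N-Quads layout described above, the following is a minimal sketch of how one might read the fourth element of each quad to find the source webpage and count quads per page. It is not part of the WebDataCommons tooling. The function names (`source_page`, `count_quads_per_page`), the sample lines, and the example.com / example.org URLs are all hypothetical, and the regular expression covers only the common term shapes (IRIs, blank nodes, plain and tagged literals), not the full N-Quads grammar.

```python
import re
from collections import defaultdict

# Simplified term patterns; an assumption for this sketch, not the full grammar.
IRI = r'<[^>]*>'
BNODE = r'_:\S+'
LITERAL = r'"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?'

# subject predicate object graph .
QUAD = re.compile(
    rf'^\s*({IRI}|{BNODE})\s+({IRI})\s+({IRI}|{BNODE}|{LITERAL})\s+({IRI})\s*\.\s*$'
)


def source_page(line):
    """Return the URL in the fourth position of a quad, or None if unparsable."""
    match = QUAD.match(line)
    if match is None:
        return None
    return match.group(4)[1:-1]  # strip the enclosing angle brackets


def count_quads_per_page(lines):
    """Count how many quads were extracted from each source webpage."""
    counts = defaultdict(int)
    for line in lines:
        url = source_page(line)
        if url is not None:
            counts[url] += 1
    return dict(counts)


# Hypothetical sample data, made up for illustration only.
SAMPLE = [
    '_:n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Product> <http://example.com/shop/item1.html> .',
    '_:n1 <http://schema.org/Product/name> "Blue Mug"@en '
    '<http://example.com/shop/item1.html> .',
    '_:n2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Event> <http://example.org/events/42> .',
]

if __name__ == "__main__":
    for page, n in count_quads_per_page(SAMPLE).items():
        print(page, n)
```

For large subset files, a full N-Quads parser is likely a safer choice than this regular expression, since it would also handle escapes and unusual terms correctly.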
Received on Friday, 5 December 2014 12:48:47 UTC