- From: Simon Spero <sesuncedu@gmail.com>
- Date: Fri, 5 Dec 2014 09:32:17 -0500
- To: Robert Meusel <robert@informatik.uni-mannheim.de>
- Cc: W3C Web Schemas Task Force <public-vocabs@w3.org>
- Message-ID: <CADE8KM5USa9oOZXdkWHFNbyVodMa8NiWT9VNMUb+QKnd5=iD8g@mail.gmail.com>
Robert -

I notice that, in at least some of the sample data, the predicate IRIs are formed by appending the property name to the class IRI, for example: "<http://schema.org/Corporation/employees>".

This conflicts with the schema.org magic subproperty rules. I believe that the propertyURI value for schema.org is "vocabulary", which would generate the correct predicate IRI <http://schema.org/employees>.

As long as there were no sdo magic subproperties in the original data, this should be simple to fix in post-processing.

See: <http://www.w3.org/TR/microdata-rdf/#introduction>

Simon

On Dec 5, 2014 6:10 AM, "Robert Meusel" <robert@informatik.uni-mannheim.de> wrote:

> Dear All,
>
> The WebDataCommons team is happy to announce that we have released several
> class-specific subsets of the Schema.org Data contained in our Winter 2013
> Microdata corpus [1]. We hope that providing those topic-specific subsets
> for over 50 different Schema.org classes (like product, event, or address)
> will make it easier for the community to explore and work with the data.
>
> The different datasets, along with some statistics about the data, can be
> found here:
> http://webdatacommons.org/structureddata/2013-11/stats/schema_org_subsets.html
>
> The subsets contain all instances of a specific class as well as all other
> data that is found on the webpages containing these instances. For example,
> a page containing data about a product might also contain reviews and
> offers for this product; a page containing data about an event might also
> contain data about the location of the event and the persons involved in
> the event. The data was originally extracted using Any23 [2] from the
> Winter 2013 crawl provided by the Common Crawl Foundation [3]. The
> extracted data is represented in N-Quads [4] format, meaning that the fourth
> element of each quad contains the URL of the webpage from which the data
> was extracted.
>
> We thank the Common Crawl Foundation for providing their Web corpora.
>
> Best Regards,
>
> Chris, Heiko & Robert
>
>
> [1] http://webdatacommons.org/structureddata/2013-11/stats/stats.html
> [2] http://any23.apache.org
> [3] http://commoncrawl.org
> [4] http://www.w3.org/TR/n-quads/
>
> --
> Robert Meusel
> Chair of Information Systems V
> Web-based Systems Group
> Universität Mannheim
> B6, 26, Room C1.04
> D-68159 Mannheim
> Phone: +49 621 181 2648
> Mail: robert@informatik.uni-mannheim.de
> Web: dws.informatik.uni-mannheim.de
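
The post-processing fix suggested in the reply (rewriting class-scoped predicates such as <http://schema.org/Corporation/employees> to <http://schema.org/employees>) could be sketched roughly as follows. This is only an illustration, not the WebDataCommons or Any23 code. The function names `fix_predicate` and `fix_quad` are hypothetical, and the pattern assumes that class names start with an upper-case letter, that property names start with a lower-case letter, and that the data contains no magic subproperties (deeper paths are left untouched).

```python
import re

# Assumption: a class-scoped predicate has exactly the shape
# <http://schema.org/ClassName/propertyName>; anything else is left alone.
CLASS_SCOPED = re.compile(
    r"^<http://schema\.org/[A-Z][A-Za-z0-9]*/([a-z][A-Za-z0-9]*)>$"
)


def fix_predicate(predicate: str) -> str:
    """Drop the class segment from a class-scoped schema.org predicate IRI."""
    match = CLASS_SCOPED.match(predicate)
    if match is None:
        return predicate
    return f"<http://schema.org/{match.group(1)}>"


def fix_quad(line: str) -> str:
    """Rewrite the predicate (second term) of one N-Quads line.

    Subjects and predicates contain no whitespace in N-Quads, so splitting
    off the first two terms is enough; the object and graph are kept as-is.
    """
    if line.lstrip().startswith("#"):
        return line  # comment line
    parts = line.split(None, 2)
    if len(parts) < 3:
        return line  # blank or malformed line, leave unchanged
    subject, predicate, rest = parts
    return f"{subject} {fix_predicate(predicate)} {rest}"
```

Applied line by line to a subset file, this would leave already-correct predicates and non-schema.org predicates unchanged.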
Received on Friday, 5 December 2014 14:32:43 UTC