Re: ANN: Release of class-specific subsets of WebDataCommons Schema.org Data

From: Simon Spero <sesuncedu@gmail.com>
Date: Fri, 5 Dec 2014 09:32:17 -0500
Message-ID: <CADE8KM5USa9oOZXdkWHFNbyVodMa8NiWT9VNMUb+QKnd5=iD8g@mail.gmail.com>
To: Robert Meusel <robert@informatik.uni-mannheim.de>
Cc: W3C Web Schemas Task Force <public-vocabs@w3.org>
Robert -
I notice that, in at least some of the sample data, the predicate IRIs are
formed by appending the property name to the class IRI - for example:
This conflicts with the schema.org magic sub property rules.

I believe that the propertyURI value for schema.org is "vocabulary", which
would generate the correct predicate IRI <http://schema.org/employees>.

As long as there were no sdo magic subproperties in the original data, this
should be simple to fix in post.
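
A rough sketch of what that post-processing could look like, assuming the
affected predicates have the shape <http://schema.org/Class/property> with a
capitalised class segment and a lower-case property segment. The function
names and the "Organization" class in the usage note are only illustrative,
and the heuristic cannot tell a genuine magic subproperty apart from a
malformed predicate, so it is only safe if none were in the original data.

```python
import re

# Assumption: a malformed predicate is <http://schema.org/Class/property>,
# i.e. a capitalised class segment followed by a lower-case property segment.
CONTEXTUAL = re.compile(
    r"^<http://schema\.org/[A-Z][A-Za-z0-9]*/([a-z][A-Za-z0-9]*)>$"
)


def fix_predicate(predicate: str) -> str:
    """Rewrite a class-scoped predicate IRI to the vocabulary-scoped form."""
    match = CONTEXTUAL.match(predicate)
    if match is None:
        return predicate  # already vocabulary-scoped, or not schema.org
    return f"<http://schema.org/{match.group(1)}>"


def fix_quad(line: str) -> str:
    """Fix the predicate (second term) of one N-Quads line.

    Subjects and predicates never contain whitespace, so splitting off the
    first two terms is enough; the object and graph are passed through as-is.
    """
    if line.lstrip().startswith("#"):
        return line  # comment line
    parts = line.split(None, 2)
    if len(parts) < 3:
        return line  # blank or incomplete line
    subject, predicate, rest = parts
    return f"{subject} {fix_predicate(predicate)} {rest}"
```

For instance, fix_predicate("<http://schema.org/Organization/employees>")
gives <http://schema.org/employees>, and predicates from other vocabularies
are left untouched.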

See: <http://www.w3.org/TR/microdata-rdf/#introduction>

On Dec 5, 2014 6:10 AM, "Robert Meusel" <robert@informatik.uni-mannheim.de> wrote:

> Dear All,
> The WebDataCommons team is happy to announce that we have released several
> class-specific subsets of the Schema.org Data contained in our Winter 2013
> Microdata corpus [1]. We hope that providing those topic-specific subsets
> for over 50 different Schema.org classes (like product, event, or address)
> will make it easier for the community to explore and work with the data.
> The different datasets, along with some statistics about the data, can be
> found here:
> http://webdatacommons.org/structureddata/2013-11/stats/schema_org_subsets.html
> The subsets contain all instances of a specific class as well as all other
> data that is found on the webpages containing these instances. For example,
> a page containing data about a product might also contain reviews and
> offers for this product; a page containing data about an event might also
> contain data about the location of the event and the persons involved in
> the event. The data was originally extracted using Any23 [2] from the
> Winter 2013 crawl provided by the Common Crawl Foundation [3]. The
> extracted data is represented in N-Quads [4] format, meaning that the fourth
> element of each quad contains the URL of the webpage from which the data
> was extracted.
> We thank the Common Crawl Foundation for providing their Web corpora.
> Best Regards,
> Chris, Heiko & Robert
> [1] http://webdatacommons.org/structureddata/2013-11/stats/stats.html
> [2] http://any23.apache.org
> [3] http://commoncrawl.org
> [4] http://www.w3.org/TR/n-quads/
> --
> Robert Meusel
> Chair of Information Systems V
> Web-based Systems Group
> Universität Mannheim
> B6, 26, Room C1.04
> D-68159 Mannheim
> Phone: +49 621 181 2648
> Mail: robert@informatik.uni-mannheim.de
> Web: dws.informatik.uni-mannheim.de
Received on Friday, 5 December 2014 14:32:43 UTC