Re: ANN: Release of class-specific subsets of WebDataCommons Schema.org Data from Simon Spero on 2014-12-05 (public-vocabs@w3.org from December 2014)

From: Simon Spero <sesuncedu@gmail.com>
Date: Fri, 5 Dec 2014 09:32:17 -0500
To: Robert Meusel <robert@informatik.uni-mannheim.de>
Cc: W3C Web Schemas Task Force <public-vocabs@w3.org>
Message-ID: <CADE8KM5USa9oOZXdkWHFNbyVodMa8NiWT9VNMUb+QKnd5=iD8g@mail.gmail.com>

Robert -
I notice that, in at least some of the sample data, the predicate IRIs are
formed by appending the property name to the class IRI - for example:
"<http://schema.org/Corporation/employees>".
This conflicts with the schema.org magic subproperty rules.

I believe that the propertyURI value for schema.org is "vocabulary", which
would generate the correct predicate IRI <http://schema.org/employees>.

As long as there were no sdo magic subproperties in the original data, this
should be simple to fix in post.
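
A minimal sketch of such a post-processing step, assuming the affected
predicates always have the shape <http://schema.org/Class/property> and that
no magic subproperties occur in the data. The function name fix_predicate and
the regular expression are illustrative only; they are not part of Any23 or
the WebDataCommons tooling.

```python
import re

# Assumption: a malformed predicate is the second term of an N-Quads line and
# looks like <http://schema.org/Class/property> (one class segment, one
# property segment). Anything else is left untouched.
_PREDICATE = re.compile(
    r'^(\s*(?:<[^>]*>|_:\S+)\s+)'                 # subject: IRI or blank node
    r'<http://schema\.org/[^/>\s]+/([^/>\s]+)>'   # predicate with class segment
)


def fix_predicate(line: str) -> str:
    """Drop the class segment from a schema.org predicate IRI, if present."""
    return _PREDICATE.sub(r'\1<http://schema.org/\2>', line, count=1)


if __name__ == "__main__":
    sample = ('_:b1 <http://schema.org/Corporation/employees> _:b2 '
              '<http://example.com/page> .')
    print(fix_predicate(sample))
```

Because the pattern is anchored at the start of the line, only the predicate
position is rewritten; a schema.org IRI in the object position is left alone.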

See: <http://www.w3.org/TR/microdata-rdf/#introduction>

Simon
On Dec 5, 2014 6:10 AM, "Robert Meusel" <robert@informatik.uni-mannheim.de>
wrote:

>  Dear All,
>
> The WebDataCommons team is happy to announce that we have released several
> class-specific subsets of the Schema.org Data contained in our Winter 2013
> Microdata corpus [1]. We hope that providing those topic-specific subsets
> for over 50 different Schema.org classes (like product, event, or address)
> will make it easier for the community to explore and work with the data.
>
> The different datasets, along with some statistics about the data, can be
> found here:
> http://webdatacommons.org/structureddata/2013-11/stats/schema_org_subsets.html
>
> The subsets contain all instances of a specific class as well as all other
> data that is found on the webpages containing these instances. For example,
> a page containing data about a product might also contain reviews and
> offers for this product; a page containing data about an event might also
> contain data about the location of the event and the persons involved in
> the event. The data was originally extracted using Any23 [2] from the
> Winter 2013 crawl provided by the Common Crawl Foundation [3]. The
> extracted data is represented in N-Quads [4] format, meaning that the fourth
> element of each quad contains the URL of the webpage from which the data
> was extracted.
>
> We thank the Common Crawl Foundation for providing their Web corpora.
>
> Best Regards,
>
> Chris, Heiko & Robert
>
>
>
> [1] http://webdatacommons.org/structureddata/2013-11/stats/stats.html
> [2] http://any23.apache.org
> [3] http://commoncrawl.org
> [4] http://www.w3.org/TR/n-quads/
>
> --
> Robert Meusel
> Chair of Information Systems V
> Web-based Systems Group
> Universität Mannheim
> B6, 26, Room C1.04
> D-68159 Mannheim
> Phone: +49 621 181 2648
> Mail: robert@informatik.uni-mannheim.de
> Web: dws.informatik.uni-mannheim.de
>
>

Received on Friday, 5 December 2014 14:32:43 UTC