Re: ANN: Release of class-specific subsets of WebDataCommons Schema.org Data

From: Robert Meusel <robert@informatik.uni-mannheim.de>
Date: Fri, 05 Dec 2014 17:12:55 +0100
Message-ID: <5481D987.2060504@informatik.uni-mannheim.de>
To: Simon Spero <sesuncedu@gmail.com>
CC: W3C Web Schemas Task Force <public-vocabs@w3.org>

Thank you for pointing this out. Yes, indeed: every property IRI in the 
dataset is extended by the type of the entity it belongs to. This comes 
from our extraction library, which was used to retrieve the RDF quads 
from the raw HTML pages.
The extension offers some benefits, which we exploited during our 
analysis of the data. For example, it is not necessary to load whole 
entities in order to explore which properties webmasters use for the 
different types. This made the processing much faster.

To remove this extension, you can recognize the type of the entity 
(always in the first line of a new bnode in the files) and replace the 
type prefix within the properties that follow by "http://schema.org", e.g.:

"http://schema.org/Corporation/name".replace("http://schema.org/Corporation", "http://schema.org")
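
Applied to a whole file, the replacement above could look roughly like the 
following Python sketch. It is only an illustration, not part of our 
tooling: the function name strip_type_prefix is made up, and it assumes 
that each line is one well-formed quad, that the rdf:type quad of an 
entity comes before its property quads, and that each entity has a single 
schema.org type.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
SCHEMA = "http://schema.org"


def strip_type_prefix(lines):
    """Yield N-Quads lines with type-extended predicates rewritten.

    Hypothetical helper: remembers the schema.org type seen for each
    subject and removes that type segment from later predicate IRIs.
    """
    types = {}  # subject term -> type IRI (without angle brackets)
    for line in lines:
        parts = line.split(None, 2)  # subject, predicate, remainder
        if len(parts) < 3:
            yield line  # blank or unexpected line: pass through untouched
            continue
        subj, pred, rest = parts
        if pred == RDF_TYPE:
            obj = rest.split(None, 1)[0]
            if obj.startswith("<" + SCHEMA + "/"):
                types[subj] = obj[1:-1]  # remember the type for this entity
        else:
            type_iri = types.get(subj)
            if type_iri and pred.startswith("<" + type_iri + "/"):
                # "<http://schema.org/Corporation/name>" -> "<http://schema.org/name>"
                pred = "<" + SCHEMA + pred[len(type_iri) + 1:]
                line = f"{subj} {pred} {rest}"
        yield line
```

Since the function is a generator, it can be fed an open file object and 
its output written line by line, so a large subset file does not have to 
be held in memory. Entities with more than one type would need a set of 
types per subject instead of a single value.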


On 05.12.2014 15:32, Simon Spero wrote:
> Robert -
> I notice that, in at least some of the sample data, the predicate 
> IRIs are formed by appending the property name to the class IRI, for 
> example:
> "<http://schema.org/Corporation/employees>".
> This conflicts with the schema.org magic subproperty rules.
> I believe that the propertyURI value for schema.org is "vocabulary", 
> which would generate the correct predicate IRI 
> <http://schema.org/employees>.
> As long as there were no sdo magic subproperties in the original data, 
> this should be simple to fix in post.
> See: <http://www.w3.org/TR/microdata-rdf/#introduction>
> Simon
> On Dec 5, 2014 6:10 AM, "Robert Meusel" 
> <robert@informatik.uni-mannheim.de> wrote:
>     Dear All,
>     The WebDataCommons team is happy to announce that we have released
>     several class-specific subsets of the Schema.org Data contained in
>     our Winter 2013 Microdata corpus [1]. We hope that providing those
>     topic-specific subsets for over 50 different Schema.org classes
>     (like product, event, or address) will make it easier for the
>     community to explore and work with the data.
>     The different datasets, along with some statistics about the data
>     can be found here:
>     http://webdatacommons.org/structureddata/2013-11/stats/schema_org_subsets.html
>     The subsets contain all instances of a specific class as well as
>     all other data that is found on the webpages containing these
>     instances. For example, a page containing data about a product
>     might also contain reviews and offers for this product; a page
>     containing data about an event might also contain data about the
>     location of the event and the persons involved in the event. The
>     data was originally extracted using Any23 [2] from the Winter 2013
>     crawl provided by the Common Crawl Foundation [3]. The extracted
>     data is represented in N-Quads [4] format, meaning that the fourth
>     element of each quad contains the URL of the webpage from which
>     the data was extracted.
>     We thank the Common Crawl Foundation for providing their Web corpora.
>     Best Regards,
>     Chris, Heiko & Robert
>     [1] http://webdatacommons.org/structureddata/2013-11/stats/stats.html
>     [2] http://any23.apache.org
>     [3] http://commoncrawl.org
>     [4] http://www.w3.org/TR/n-quads/
>     -- 
>     Robert Meusel
>     Chair of Information Systems V
>     Web-based Systems Group
>     Universität Mannheim
>     B6, 26, Room C1.04
>     D-68159 Mannheim
>     Phone: +49 621 181 2648
>     Mail: robert@informatik.uni-mannheim.de
>     Web: dws.informatik.uni-mannheim.de
Received on Friday, 5 December 2014 16:13:25 UTC
