- From: David Booth <david@dbooth.org>
- Date: Sat, 06 Apr 2013 17:18:35 -0400
- To: Eric Prud'hommeaux <eric@w3.org>
- CC: Dave Reynolds <dave.e.reynolds@gmail.com>, semantic-web@w3.org
On 04/06/2013 01:21 PM, Eric Prud'hommeaux wrote:
> What we'd like for "validation" is for JJC to label his notion of
> ice cream flavors and someone else to extend it in a way that a 3rd
> party can accept amanda:Chocolate but reject jjc:Choco999late. Any
> candidates or starting points?

I favor the approach of providing validation tests as a set of SPARQL queries against the RDF data:

- The simplest form would be to use an ASK query, which returns a true/false value, to indicate whether the test passed or failed. ASK is good for verifying the presence of expected data.

- For constraint checking, a better form is to use a CONSTRUCT query, using the SPIN constraint-checking style: http://spinrdf.org/spin.html#spin-constraint-construct CONSTRUCT is better for this because it can return information about the reason why the test failed, which is very helpful for debugging purposes. If the CONSTRUCT query returns nothing, the constraint is satisfied; any constructed triples describe the violations.

There are big benefits in using RDF and SPARQL for this purpose:

- The tests are resilient to the presence of extra information. This means that additional data, vocabularies and ontologies can be mixed in without affecting existing information access or tests.

- All tests are written in the same, common language, regardless of the underlying data model that they test. This makes it very easy to share and deploy new tests.

- Different constraints can be defined for different purposes, and kept separate from the data.

It is helpful to break validation into two kinds, depending on one's role as data producer or data consumer. Quoting from "RDF and SOA" (http://dbooth.org/2007/rdf-and-soa/rdf-and-soa-paper.htm#data-validation), these two kinds of validation are:

[[
- Model integrity (defined by the producer). This is to ensure that the instance makes sense: that it conforms to the producer's intent, which in part may be constrained by contractual obligations to consumers.
Since a data producer is responsible for generating the data it sends, it should supply a way to check model integrity. This validator may be useful to both producers and consumers. However, because the model may change over time (as it is versioned), the consumer must be sure to use the correct model-integrity validator for the instance data at hand -- not a validator intended for some other version -- which means that the instance data should indicate the model-integrity validator under which it was created.

- Suitability for use (defined by the consumer). This depends on the consuming application, so it will differ between producer and consumer and between different consumers. Since only the data consumer really knows how it will use the data it receives, it should supply a way to check suitability for use. This may also include integrity checks that are essential to this consumer, but to avoid unnecessary coupling it should avoid any other checks.
]]

Thus, different suitability-for-use checks can be defined by different data consumers. To my mind, this SPARQL-based approach is much more flexible than an OWL-like approach.

David Booth
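[A minimal sketch of the two query styles described above. The `flavor:` vocabulary and the specific class and property names are illustrative assumptions, not from the original message; the SPIN terms are from the vocabulary at http://spinrdf.org/spin#.]

```sparql
# Illustrative prefixes -- assumed for this sketch.
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX spin:   <http://spinrdf.org/spin#>
PREFIX flavor: <http://example.org/flavors#>

# Style 1: ASK -- returns true if the expected data is present,
# false otherwise.
ASK {
  ?order flavor:hasFlavor ?f .
  ?f rdf:type flavor:ApprovedFlavor .
}

# Style 2: SPIN-style CONSTRUCT -- builds one violation description
# per order whose flavor is not an approved flavor.  An empty result
# means the constraint is satisfied; any constructed triples say
# which resource failed, which helps debugging.
CONSTRUCT {
  _:v rdf:type spin:ConstraintViolation ;
      spin:violationRoot ?order .
}
WHERE {
  ?order flavor:hasFlavor ?f .
  FILTER NOT EXISTS { ?f rdf:type flavor:ApprovedFlavor . }
}
```

Run against data containing amanda:Chocolate (typed as an approved flavor) and jjc:Choco999late (not so typed), the CONSTRUCT query would report a violation only for the latter.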
Received on Saturday, 6 April 2013 21:19:07 UTC