Validation and documentation using SPIN templates examples and an extension for heuristics/expectations

Dear All,

I now see that there are two main desires from the community for the outcome of this WG process.
The first is documenting what the data should look like,
the second is validating that the data is correct.

My first messages were about validating that the data is correct; this one is about what the data should look like.
Some people have expressed the opinion that organizations already have a large infrastructure for validation but that
they need better documentation today.

My opinion is formed in large part by my experience in teaching RDF/SPARQL and OWL reasoning to interested novices.

SPIN as it was presented is not nice for the first but is really great for the second.
ICV is ok for the first and is good for the second.
ShEx just makes me sad... The readability of regular expressions with the verbosity of RDF is not a pleasant combination.
Resource shapes, I have only glanced at.

With a few examples I am going to try to explain the goals I currently think the WG should investigate (and investigating them should be part of the Charter) and how SPIN with templates
can achieve these goals. These examples are just for discussion and illustration purposes; they are not a complete proposal and do not have an implementation.

A problem with ShEx and ICV as they stand is that they can only express hard constraints, which makes documenting the why of these constraints hard.
SPIN can describe hard constraints and soft constraints/heuristics. For example, let's say we have some data about Formula 1 cars. We want to say that all cars have 1 driver and 4 or 6 wheels. This is a hard constraint, as shown below in SPIN/template and ShEx syntax.


@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix sp: <http://spinrdf.org/sp#> .
@prefix spin: <http://spinrdf.org/spin#> .
@prefix spl: <http://spinrdf.org/spl#> .
@prefix formula: <http://example.org/example_ontology_about_formula_one#> .

formula:Car a owl:Class ;
 spin:constraint [ a spl:Attribute ;
                   spl:predicate formula:driver ;
                   spl:valueType formula:Driver ;
                   spl:count 1 ] ;
 spin:constraint [ spl:union [ a spl:Attribute ;
                               spl:predicate formula:wheels ;
                               spl:valueType formula:Wheel ;
                               spl:count 4 ],
                             [ a spl:Attribute ;
                               spl:predicate formula:wheels ;
                               spl:valueType formula:Wheel ;
                               spl:count 6 ] ] .

So far this is straightforward and there is nothing unusual here.
With some fine tuning this could be improved, e.g. by removing a few redundant triples.
But it is quite consistent: one driver, 4 or 6 wheels. Here I try to do the same in ShEx.

<FormulaOneCarShape> { a formula:Car,
                      formula:driver @<DriverShape> ,
                      ( formula:wheels @<WheelShape>{4,4} |
                        formula:wheels @<WheelShape>{6,6} ) }
<DriverShape> { a formula:Driver }
<WheelShape> { a formula:Wheel }

The difference between SPIN and ShEx here is 14 lines versus 9 or 6, depending on layout.
SPIN is more explicit and does not need custom syntax,
i.e. it's plain RDF. ShEx is more compact but is not compatible in
any way with existing tools.
spl:union is not yet an existing SPIN template but I think it can be done.
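
To make the hard constraint concrete independently of either syntax, here is a minimal Python sketch of the check both snippets are meant to express: exactly one formula:driver and either 4 or 6 formula:wheels per car. It works on plain (subject, predicate, object) tuples rather than a real RDF store, the helper names (make_car, check_car) and the ex: identifiers are made up for illustration, and the value-type checks (formula:Driver, formula:Wheel) are left out to keep it short.

```python
def make_car(name, n_drivers, n_wheels):
    """Build illustrative triples for one car (hypothetical helper)."""
    triples = [(name, "rdf:type", "formula:Car")]
    triples += [(name, "formula:driver", f"{name}_driver{i}") for i in range(n_drivers)]
    triples += [(name, "formula:wheels", f"{name}_wheel{i}") for i in range(n_wheels)]
    return triples


def values(triples, subject, predicate):
    """Return the set of distinct objects for a subject/predicate pair."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def check_car(triples, car):
    """Return a list of hard-constraint violations for one car."""
    violations = []
    n_drivers = len(values(triples, car, "formula:driver"))
    if n_drivers != 1:
        violations.append(f"{car}: expected exactly 1 driver, found {n_drivers}")
    n_wheels = len(values(triples, car, "formula:wheels"))
    if n_wheels not in (4, 6):
        violations.append(f"{car}: expected 4 or 6 wheels, found {n_wheels}")
    return violations


if __name__ == "__main__":
    data = make_car("ex:car1", 1, 4) + make_car("ex:car2", 1, 5)
    for car in ("ex:car1", "ex:car2"):
        print(car, check_car(data, car) or "ok")
```

A car with 6 wheels passes this check just as easily as one with 4, which is exactly the limitation of a hard constraint: it cannot say that one of the two allowed cases is unusual.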

However, this example is rather minimal and only deals with constraints.
I suggest we extend this with soft constraints/heuristics that look like this.

formula:Car
 spin:constraint [  a heuristics:veryFewHave ;
                    heuristics:commonType :4WheelCar ;
                    heuristics:rareType :6WheelCar ;
                    rdfs:comment "The Tyrrell P34 had 4 front wheels and raced in 1976 and 1977, but it is the only known example" ;
                    rdfs:seeAlso <http://en.wikipedia.org/wiki/Tyrrell_P34> ] .

:4WheelCar rdfs:subClassOf formula:Car ;
 rdfs:subClassOf [ a owl:Restriction ;
                   owl:onProperty formula:wheels ;
                   owl:cardinality "4"^^xsd:nonNegativeInteger ] .

:6WheelCar rdfs:subClassOf formula:Car ;
 rdfs:subClassOf [ a owl:Restriction ;
                   owl:onProperty formula:wheels ;
                   owl:cardinality "6"^^xsd:nonNegativeInteger ] .

The idea here is that this allows us to identify the common case and the exceptional case, and to document both. As side benefits, heuristics for data quality control can be triggered for them, and optimizations become possible if e.g. Java code is generated from these expectations. In the example, while Formula 1 cars can have four or six wheels, the six-wheel case is very rare, and if you ever have a database/message filled with six-wheel Formula 1 cars you should probably investigate.
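
As a rough illustration of what such an expectation could do at data quality control time, here is a minimal Python sketch. It takes a mapping from car to wheel count and warns when the rare case is more frequent than expected relative to the common case. The function name very_few_have, the input format and the warning text are assumptions made for this sketch; the 5% threshold is the one I use in the template sketch further down in this message.

```python
def very_few_have(wheel_counts, common=4, rare=6, threshold=0.05):
    """Return a warning string if the rare case is suspiciously frequent, else None.

    wheel_counts maps a car identifier to its number of wheels.
    This is a soft check: it never rejects an individual car.
    """
    common_n = sum(1 for n in wheel_counts.values() if n == common)
    rare_n = sum(1 for n in wheel_counts.values() if n == rare)
    if rare_n == 0:
        return None  # nothing rare in the data, nothing to report
    # No common cases at all, or too many rare ones relative to the common ones.
    if common_n == 0 or rare_n / common_n > threshold:
        return (f"{rare_n} cars with {rare} wheels versus "
                f"{common_n} with {common}: please investigate")
    return None


if __name__ == "__main__":
    # One six-wheeler among a hundred four-wheelers is expected to pass quietly.
    mostly_normal = {f"ex:car{i}": 4 for i in range(100)}
    mostly_normal["ex:p34"] = 6
    print(very_few_have(mostly_normal))

    # A dataset dominated by six-wheelers should trigger the warning.
    suspicious = {f"ex:car{i}": 6 for i in range(10)}
    print(very_few_have(suspicious))
```

The point of the design is that every individual car in both datasets satisfies the hard constraint; only the distribution over the whole dataset is unexpected.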

You can see that I use OWL here instead of more shapes as OWL is a great existing technology to determine the type of an instance given knowledge about its properties. OWL anonymous classes will also solve the issue of "typeless" constraints, which I expect will be very rare. So for most users knowing OWL would not be a requirement.

One can imagine an extension to Manchester Syntax that can encode this as well as the examples given here.
But to be honest I would prefer the RDF syntax to be clean and straightforward for most uses. When I teach RDF, I always say everything can be expressed as triples; sometimes it's verbose and awkward but it always works. Every single time I introduce a new syntax I put up a barrier for adoption and understanding. This is why I personally do not like OWL Manchester Syntax: it puts in place an artificial barrier between data and ontologies and divides a community that should be united. In a two day course I spend the first day explaining RDF and SPARQL, and the second day Reasoning and OWL. If I use Manchester Syntax on the second day, I waste a lot of time and undermine my first day, which is why I use TopBraid Composer (free) and its RDF/Turtle views to explain OWL restrictions instead of Protégé.

I think all the heuristics constraints for expressing expected data distributions can be spin:templates
e.g. something like this (please excuse syntax/logic errors and typos)

heuristics:veryFewHave rdfs:subClassOf spin:Template ;
 spin:constraint [ a spl:Argument ;
                   rdfs:comment "The common super type" ;
                   spl:predicate heuristics:commonType ;
                   spl:valueType xsd:anyURI ] ;
 spin:constraint [ a spl:Argument ;
                   rdfs:comment "The rare type" ;
                   spl:predicate heuristics:rareType ;
                   spl:valueType xsd:anyURI ] ;
 spin:body [ a sp:Construct ;
             sp:text """CONSTRUCT {
              [] a heuristics:HeuristicsViolation ;
                 spin:violationRoot ?this ;
                 spin:violationPath rdf:type ;
                 rdfs:label ?label .
            } WHERE {
                # Count the instances of each type (the subjects of rdf:type ?type).
                BIND(spl:subjectCount(rdf:type, ?commonType) AS ?commonCount)
                BIND(spl:subjectCount(rdf:type, ?rareType) AS ?rareCount)
                # Report when the rare type is more than 5% of the common type.
                FILTER((?rareCount / ?commonCount) > 0.05)
                BIND(CONCAT("The type ", str(?rareType), " is more than 5% of ", str(?commonType)) AS ?label)
            }""" ] .


This heuristics ontology/template library of concepts for validation can of course be implemented using technologies other than SPIN. And while these templates should be standardized, they are not part of the "UI", for reasons of simple documentation and validation.

In conclusion, SPIN, in combination with its templates and reusing the existing OWL standard, is at least as user friendly as ShEx, and it has very good potential to document not just constraints but expectations. This shows that we can have both simple and expressive with one standard.

Sincere regards,
Jerven Bolleman
-------------------------------------------------------------------
Jerven Bolleman                        Jerven.Bolleman@isb-sib.ch
SIB Swiss Institute of Bioinformatics      Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
1211 Geneve 4,
Switzerland     www.isb-sib.ch - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------

Received on Thursday, 24 July 2014 08:15:44 UTC