Re: Validation and documentation using SPIN templates examples and an extension for heuristics/expectations from Dimitris Kontokostas on 2014-07-24 (public-rdf-shapes@w3.org from July 2014)

From: Dimitris Kontokostas <kontokostas@informatik.uni-leipzig.de>
Date: Thu, 24 Jul 2014 11:59:42 +0300
To: Jerven Bolleman <jerven.bolleman@isb-sib.ch>
Cc: "public-rdf-sha." <public-rdf-shapes@w3.org>
Message-ID: <CA+u4+a1A=M0=hFHGmhzYxidL1SpJZYvmGs_L0coft4MZmbYzwg@mail.gmail.com>
On Thu, Jul 24, 2014 at 11:15 AM, Jerven Bolleman <
jerven.bolleman@isb-sib.ch> wrote:

> Dear All,
>
> I now see that there are two main desires from the community for the
> outcome of this WG process.
> The first is documenting what the data should look like,
> the second is validating that the data is correct.
>
> My first messages were about validating that the data is correct; this
> one is about what the data should look like.
> Some people have expressed the opinion that organizations already have a
> large infrastructure for validation but that
> they need better documentation today.
>
> My opinion, formed in large part by my experience teaching
> RDF/SPARQL and OWL reasoning to interested novices, is as follows.
>
> SPIN as it was presented is not nice for the first but is really great for
> the second.
> ICV is ok for the first and is good for the second.
> ShEx, just makes me sad... The readability of regular expressions with the
> verbosity of RDF is not a pleasant combination.
> Resource shapes, I have only glanced at.
>
> With a few examples I am going to try to explain the goals I currently
> think the WG should investigate (and have that investigation goal be
> part of the Charter) and how SPIN with templates
> can achieve these goals. These examples are just for discussion and
> illustration purposes; they are not a complete proposal and do not have an
> implementation.
>
> A problem with ShEx and ICV as they stand is that they can only express hard
> constraints, which makes documenting the why of these constraints hard.
> SPIN can describe both hard constraints and soft constraints/heuristics. For
> example, let's say we have some data about Formula 1 cars. We want to say that
> all cars have 1 driver and 4 or 6 wheels. This is a hard constraint, as shown
> below in SPIN/template and ShEx syntax.
>
>
> prefix sp: <http://spinrdf.org/sp#>
> prefix spin: <http://spinrdf.org/spin#>
> prefix spl: <http://spinrdf.org/spl#>
> prefix owl: <http://www.w3.org/2002/07/owl#>
> prefix formula: <http://example.org/example_ontology_about_formula_one#>
>
> formula:Car a owl:Class ;
>  spin:constraint [ a spl:Attribute ;
>                    spl:predicate formula:driver ;
>                    spl:valueType formula:Driver ;
>                    spl:count 1 ] ;
>  spin:constraint [ spl:union [ a spl:Attribute ;
>                                spl:predicate formula:wheels ;
>                                spl:valueType formula:Wheel ;
>                                spl:count 4 ],
>                              [ a spl:Attribute ;
>                                spl:predicate formula:wheels ;
>                                spl:valueType formula:Wheel ;
>                                spl:count 6 ] ] .
> So far this is straightforward, with nothing unusual.
> With some fine tuning this could be improved, e.g. by removing a few redundant
> triples.
> But it is quite consistent: one driver, 4 or 6 wheels. Here I try to do
> the same in ShEx.
>
> <FormulaOneCarShape> { a formula:Car,
>                       formula:driver @<DriverShape> ,
>                       ( formula:wheels @<WheelShape>{4,4} |
>                         formula:wheels @<WheelShape>{6,6} ) }
> <DriverShape> { a formula:Driver }
> <WheelShape> { a formula:Wheel }
>
> The difference between SPIN and ShEx here is 14 lines versus 9 or 6,
> depending on layout.
> SPIN is more explicit and does not need custom syntax,
> i.e. it's plain RDF. ShEx is more compact but is not compatible in
> any way with existing tools.
> spl:union is not yet an existing SPIN template, but I think it can be done.
>
> However, this example is rather minimal and only deals with constraints.
> I suggest we extend this with soft/heuristics that look like this.
>
> formula:Car
>  spin:constraint [  a heuristics:veryFewHave ;
>                     heuristics:commonType :4WheelCar ;
>                     heuristics:rareType :6WheelCar ;
>                     rdfs:comment "The Tyrrell P34 had 4 front wheels and raced in 1976 and 1977, but it is the only known example" ;
>                     rdfs:seeAlso <http://en.wikipedia.org/wiki/Tyrrell_P34> ] .
>
> :4WheelCar rdfs:subClassOf formula:Car ;
>  rdfs:subClassOf [ a owl:Restriction ;
>                    owl:onProperty formula:wheels ;
>                    owl:cardinality "4"^^xsd:nonNegativeInteger ] .
>
> :6WheelCar rdfs:subClassOf formula:Car ;
>  rdfs:subClassOf [ a owl:Restriction ;
>                    owl:onProperty formula:wheels ;
>                    owl:cardinality "6"^^xsd:nonNegativeInteger ] .
>
> The idea here is that it allows us to identify the common case and the
> exceptional, and document those. With side benefits that heuristics for
> data quality control can be triggered for them as well as optimizations if
> e.g. java code is generated from these Expectations. In the example while
> formula one cars can have four or six wheels the 6 wheel case is very rare,
> and if you ever have a database/message filled with six wheel formula one
> cars you should probably investigate.
>

One note here (I'm not going into syntax): if we adopt the severity level
paradigm, this can easily be supported by both ShEx & SPIN with a rule
like this:
Rule X, "cars with six wheels are uncommon" @level warning (or notice)
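
To make the severity level idea concrete, here is a minimal sketch in Python. It is independent of both ShEx and SPIN; the names Severity, Finding and check_car are hypothetical and only serve the illustration. The hard constraints (1 driver, 4 or 6 wheels) are reported as errors, while "Rule X" is reported as a warning.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Severity(Enum):
    ERROR = "error"      # a hard constraint is violated
    WARNING = "warning"  # valid data, but uncommon enough to flag


@dataclass(frozen=True)
class Finding:
    rule: str
    severity: Severity
    message: str


def check_car(drivers: int, wheels: int) -> List[Finding]:
    """Check one car against the hard constraints and the soft 'Rule X'."""
    findings = []
    if drivers != 1:
        findings.append(Finding("driver-count", Severity.ERROR,
                                f"expected 1 driver, found {drivers}"))
    if wheels not in (4, 6):
        findings.append(Finding("wheel-count", Severity.ERROR,
                                f"expected 4 or 6 wheels, found {wheels}"))
    elif wheels == 6:
        # Allowed by the hard constraint, so only a warning is raised.
        findings.append(Finding("rule-x", Severity.WARNING,
                                "cars with six wheels are uncommon"))
    return findings
```

A caller could then decide per level what to do, for example reject data with errors but only log warnings.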



>
> You can see that I use OWL here instead of more shapes as OWL is a great
> existing technology to determine the type of an instance given knowledge
> about its properties. OWL anonymous classes will also solve the issue of
> "typeless" constraints, which I expect will be very rare. So for most users
> knowing OWL would not be a requirement.
>
> One can imagine an extension to Manchester Syntax that can encode this
> as well as the examples given here.
> But to be honest I would prefer the RDF syntax to be clean and
> straightforward for most uses. When I teach RDF, I always say everything can be
> expressed as triples; sometimes it's verbose and awkward, but it always works.
> Every single time I introduce a new syntax I put up a barrier for adoption
> and understanding. This is why I personally do not like OWL Manchester
> Syntax because it puts in place an artificial barrier between data and
> ontologies and divides a community that should be united. In a two day
> course I spend the first day explaining RDF
> and SPARQL, and the second day Reasoning and OWL. The second day I waste a
> lot of time when using Manchester Syntax and undermine my first day, which
> is why I use TopBraid Composer (free) and its RDF/Turtle views to explain
> owl:Restrictions instead of Protégé.
>
> I think all the heuristics constraints for expressing expected data
> distributions can be spin:templates
> e.g. something like this (please excuse syntax/logic errors and typos)
>
> heuristics:veryFewHave rdfs:subClassOf spin:Template ;
>  spin:constraint [ a spl:Argument ;
>                    rdfs:comment "The common super type" ;
>                    spl:predicate heuristics:commonType ;
>                    spl:valueType xsd:anyURI ] ;
>  spin:constraint [ a spl:Argument ;
>                    rdfs:comment "The rare type" ;
>                    spl:predicate heuristics:rareType ;
>                    spl:valueType xsd:anyURI ] ;
>  spin:text """CONSTRUCT {
>               [] a heuristics:HeuristicsViolation ;
>                  spin:violationRoot ?this ;
>                  spin:violationPath ?predicate ;
>                  rdfs:label ?label .
>             } WHERE {
>                 BIND((spl:objectCount(rdf:type, ?commonType)) AS ?commonCount)
>                 BIND((spl:objectCount(rdf:type, ?rareType)) AS ?rareCount)
>                 FILTER((?rareCount / ?commonCount) > 0.05)
>                 BIND(CONCAT("The type ", str(?rareType), " is more than 5% of ", str(?commonType)) AS ?label)
>             }""" .
>
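
As a rough illustration of what the template's label describes (the rare type being more than 5% of the common type), the same check can be sketched outside SPIN. This is a plain Python sketch, not a SPIN implementation; the function name very_few_have, its parameters, and the idea of passing in a flat list of type names are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, Optional


def very_few_have(types: Iterable[str], common_type: str, rare_type: str,
                  threshold: float = 0.05) -> Optional[str]:
    """Return a violation label if rare_type is too frequent, else None."""
    counts = Counter(types)
    common, rare = counts[common_type], counts[rare_type]
    if common == 0:
        # No instances of the common type: the ratio is undefined,
        # so this sketch reports nothing.
        return None
    if rare / common > threshold:
        return (f"The type {rare_type} is more than "
                f"{threshold:.0%} of {common_type}")
    return None
```

Called with the rdf:type values found in a dataset, it returns None for the expected distribution and a label string when the rare case is suspiciously frequent.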
>
> This heuristics ontology/template library of concepts/things for validation
> can of course be implemented using technologies other than SPIN. And while
> these templates should be standardized, they are not part of the "UI"
> for simple documentation and validation purposes.
>
> In conclusion, SPIN, in combination with its templates and reusing the
> existing OWL standard, is at least as user friendly as ShEx, and it has very
> good potential to document not just constraints but expectations. This shows
> that we can have both simplicity and expressiveness in one standard.
>
> Sincere regards,
> Jerven Bolleman
> -------------------------------------------------------------------
> Jerven Bolleman                        Jerven.Bolleman@isb-sib.ch
> SIB Swiss Institute of Bioinformatics      Tel: +41 (0)22 379 58 85
> CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
> 1211 Geneve 4,
> Switzerland     www.isb-sib.ch - www.uniprot.org
> Follow us at https://twitter.com/#!/uniprot
> -------------------------------------------------------------------
>
>
>
>


-- 
Dimitris Kontokostas
Department of Computer Science, University of Leipzig
Research Group: http://aksw.org
Homepage:http://aksw.org/DimitrisKontokostas
Received on Thursday, 24 July 2014 09:00:41 UTC