Re: Validation and documentation using SPIN templates examples and an extension for heuristics/expectations from Holger Knublauch on 2014-07-24 (public-rdf-shapes@w3.org from July 2014)

From: Holger Knublauch <holger@topquadrant.com>
Date: Thu, 24 Jul 2014 18:45:45 +1000
To: public-rdf-shapes@w3.org
Message-ID: <53D0C7B9.1090103@topquadrant.com>
Olivier,

not sure where your hostility comes from. Jerven has made clear that he 
has first-hand experience in teaching semantic technology to novices. 
That's a valid target audience, as are advanced ontologists like himself.

Regards,
Holger


On 7/24/14, 6:37 PM, Olivier Rossel wrote:
> Please please, do not decide by yourself that one option or the other 
> is user-friendly or readable or solves a real-life problem. Please 
> survey that with data definition people from outside our community 
> (i.e., our target audience).
> IMHO, the point is to have a good idea of the tediousness vs 
> capabilities of all the available options.
>
>
>
> On Thu, Jul 24, 2014 at 10:15 AM, Jerven Bolleman 
> <jerven.bolleman@isb-sib.ch> wrote:
>
>     Dear All,
>
>     I now see that there are two main desires from the community for
>     the outcome of this WG process.
>     The first is documenting what the data should look like,
>     the second is validating that the data is correct.
>
>     My first messages were about the validation of data being
>     correct; this one is about what the data should look like.
>     Some people have expressed the opinion that organizations already
>     have a large infrastructure for validation but that
>     they need better documentation today.
>
>     My opinion, formed in large part by my experience in teaching
>     RDF/SPARQL and OWL reasoning to interested novices, is as follows.
>
>     SPIN as it was presented is not nice for the first (documentation)
>     but is really great for the second (validation).
>     ICV is OK for the first and good for the second.
>     ShEx just makes me sad... The readability of regular expressions
>     with the verbosity of RDF is not a pleasant combination.
>     Resource Shapes I have only glanced at.
>
>     With a few examples I am going to try to explain the goals I
>     currently think the WG should investigate (and that investigation
>     should be part of the Charter) and how SPIN with templates
>     can achieve these goals. These examples are just for discussion
>     and illustration purposes; they are not a complete proposal and do
>     not have an implementation.
>
>     A problem with ShEx and ICV as they stand is that they can only
>     express hard constraints and make documenting the why of these
>     constraints hard. SPIN can describe both hard constraints and soft
>     ones (heuristics). For example, let's say we have some data about
>     Formula 1 cars. We want to say that all cars have 1 driver and 4
>     or 6 wheels. This is a hard constraint, as shown below in
>     SPIN/template and ShEx syntax.
>
>
>     @prefix owl: <http://www.w3.org/2002/07/owl#> .
>     @prefix sp: <http://spinrdf.org/sp#> .
>     @prefix spin: <http://spinrdf.org/spin#> .
>     @prefix spl: <http://spinrdf.org/spl#> .
>     @prefix formula: <http://example.org/example_ontology_about_formula_one#> .
>
>     # Every car has exactly 1 driver
>     formula:Car a owl:Class ;
>      spin:constraint [ a spl:Attribute ;
>                        spl:predicate formula:driver ;
>                        spl:valueType formula:Driver ;
>                        spl:count 1 ] ;
>     # ... and either 4 or 6 wheels
>      spin:constraint [ spl:union [ a spl:Attribute ;
>                                    spl:predicate formula:wheels ;
>                                    spl:valueType formula:Wheel ;
>                                    spl:count 4 ],
>                                  [ a spl:Attribute ;
>                                    spl:predicate formula:wheels ;
>                                    spl:valueType formula:Wheel ;
>                                    spl:count 6 ] ] .
>     So far straightforward and nothing unusual here.
>     With some fine tuning this could be improved, e.g. by removing a few
>     redundant triples.
>     But it is quite consistent: one driver, 4 or 6 wheels. Here I try
>     to do the same in ShEx.
>
>     <FormulaOneCarShape> { a formula:Car,
>                           formula:driver @<DriverShape> ,
>                           ( formula:wheels @<WheelShape>{4,4} |
>                             formula:wheels @<WheelShape>{6,6} ) }
>     <DriverShape> { a formula:Driver }
>     <WheelShape> { a formula:Wheel }
>
>     The difference between SPIN and ShEx here is 14 lines versus 9 or 6,
>     depending on layout.
>     SPIN is more explicit and does not need custom syntax,
>     i.e. it's plain RDF. ShEx is more compact but is not compatible in
>     any way with existing tools.
>     spl:union is not yet an existing SPIN template but I think it can
>     be done.
>
>     However, this example is rather minimal and only deals with
>     hard constraints.
>     I suggest we extend this with soft constraints (heuristics) that
>     look like this.
>
>     formula:Car
>      spin:constraint [  a heuristics:veryFewHave ;
>                         heuristics:commonType :4WheelCar ;
>                         heuristics:rareType :6WheelCar ;
>                         rdfs:comment "The Tyrrell P34 had 4 front wheels and raced in 1976 and 1977, but it is the only known example" ;
>                         rdfs:seeAlso <http://en.wikipedia.org/wiki/Tyrrell_P34> ] .
>
>     :4WheelCar rdfs:subClassOf formula:Car ;
>      rdfs:subClassOf [ a owl:Restriction ;
>                        owl:onProperty formula:wheels ;
>                        owl:cardinality 4 ] .
>
>     :6WheelCar rdfs:subClassOf formula:Car ;
>      rdfs:subClassOf [ a owl:Restriction ;
>                        owl:onProperty formula:wheels ;
>                        owl:cardinality 6 ] .
>
>     The idea here is that it allows us to identify the common case and
>     the exceptional one, and to document both. A side benefit is that
>     heuristics for data quality control can be triggered for them, as
>     well as optimizations if e.g. Java code is generated from these
>     expectations. In the example, while Formula One cars can have four
>     or six wheels, the six-wheel case is very rare, and if you ever have
>     a database/message filled with six-wheel Formula One cars you
>     should probably investigate.
>
>     You can see that I use OWL here instead of more shapes as OWL is a
>     great existing technology to determine the type of an instance
>     given knowledge about its properties. OWL anonymous classes will
>     also solve the issue of "typeless" constraints, which I expect
>     will be very rare. So for most users knowing OWL would not be a
>     requirement.
>
>     One can imagine an extension to Manchester Syntax that can
>     encode this as well as the examples given here.
>     But to be honest I would prefer the RDF syntax to be clean and
>     straightforward for most uses. When I teach RDF, I always say
>     everything can be expressed as triples; sometimes it's verbose and
>     awkward, but it always works. Every single time I introduce a new
>     syntax I put up a barrier to adoption and understanding. This is
>     why I personally do not like OWL Manchester Syntax: it puts
>     in place an artificial barrier between data and ontologies and
>     divides a community that should be united. In a two-day course I
>     spend the first day explaining RDF
>     and SPARQL, and the second day reasoning and OWL. On the second day I
>     waste a lot of time when using Manchester Syntax and undermine my
>     first day, which is why I use TopBraid Composer (free) and its
>     RDF/Turtle views to explain OWL restrictions instead of Protégé.
>
>     I think all the heuristics constraints for expressing expected
>     data distributions can be spin:templates
>     e.g. something like this (please excuse syntax/logic errors and typos)
>
>     heuristics:veryFewHave rdfs:subClassOf spin:Template ;
>      spin:constraint [ a spl:Argument ;
>                        rdfs:comment "The common super type" ;
>                        spl:predicate heuristics:commonType ;
>                        spl:valueType xsd:anyURI ] ;
>      spin:constraint [ a spl:Argument ;
>                        rdfs:comment "The rare type" ;
>                        spl:predicate heuristics:rareType ;
>                        spl:valueType xsd:anyURI ] ;
>      spin:text """CONSTRUCT {
>                   [] a heuristics:HeuristicsViolation ;
>                      spin:violationRoot ?this ;
>                      spin:violationPath ?predicate ;
>                      rdfs:label ?label .
>                 } WHERE {
>                     BIND((spl:objectCount(rdf:type, ?commonType)) AS ?commonCount)
>                     BIND((spl:objectCount(rdf:type, ?rareType)) AS ?rareCount)
>                     # Flag the data when the rare type exceeds 5% of the common type
>                     FILTER((?rareCount / ?commonCount) > 0.05)
>                     BIND(CONCAT("The type ", str(?rareType), " is more than 5% of ", str(?commonType)) AS ?label)
>                 }""" .
>
>
>     This heuristics ontology/template library of concepts/things for
>     validation can of course be implemented using technologies other
>     than SPIN. And while these templates should be standardized, they
>     are not part of the "UI" that users need for simple documentation
>     and validation.
>
>     In conclusion, SPIN, in combination with its templates and
>     reusing the existing OWL standard, is at least as user friendly as
>     ShEx, and it has very good potential to document not just
>     constraints but expectations. This shows that we can have both
>     simplicity and expressiveness with one standard.
>
>     Sincere regards,
>     Jerven Bolleman
>     -------------------------------------------------------------------
>     Jerven Bolleman                        Jerven.Bolleman@isb-sib.ch
>     SIB Swiss Institute of Bioinformatics  Tel: +41 (0)22 379 58 85
>     CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
>     1211 Geneve 4, Switzerland             www.isb-sib.ch - www.uniprot.org
>     Follow us at https://twitter.com/#!/uniprot
>     -------------------------------------------------------------------
>
>
>
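
For readers who want to try the "very few have" heuristic from the quoted message without a SPIN engine, here is a minimal sketch in plain Python. It is not part of the original proposal. The names classify and very_few_have, the dictionary input and the wheel-count stand-in for the OWL cardinality restrictions are all assumptions made for illustration. Only the 5% threshold and the label text come from the quoted template.

```python
from collections import Counter


def classify(wheel_count):
    """Hypothetical stand-in for the OWL restrictions: map a wheel count to a type name."""
    if wheel_count == 4:
        return "4WheelCar"
    if wheel_count == 6:
        return "6WheelCar"
    return None


def very_few_have(wheel_counts, common_type="4WheelCar",
                  rare_type="6WheelCar", threshold=0.05):
    """Return a warning label if rare_type exceeds threshold of common_type, else None."""
    counts = Counter(classify(n) for n in wheel_counts.values())
    common = counts[common_type]
    rare = counts[rare_type]
    if common == 0:
        # Without any common-type instances there is no baseline to compare against.
        return None
    if rare / common > threshold:
        return f"The type {rare_type} is more than {threshold:.0%} of {common_type}"
    return None


# 40 four-wheel cars and a single six-wheel car: 2.5%, so no warning.
cars = {f"car{i}": 4 for i in range(40)}
cars["p34"] = 6
print(very_few_have(cars))

# Two more six-wheel cars push the ratio to 7.5%, which triggers the warning.
cars["x1"] = 6
cars["x2"] = 6
print(very_few_have(cars))
```

The result is a soft signal rather than a validation failure: data with many six-wheel cars is still accepted, it just deserves a second look.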
Received on Thursday, 24 July 2014 08:46:20 UTC