Re: Validation and documentation using SPIN templates examples and an extension for heuristics/expectations from Jerven Tjalling Bolleman on 2014-07-24 (public-rdf-shapes@w3.org from July 2014)

From: Jerven Tjalling Bolleman <jerven.bolleman@isb-sib.ch>
Date: Thu, 24 Jul 2014 13:16:07 +0200
To: Olivier Rossel <olivier.rossel@gmail.com>
CC: "public-rdf-shapes@w3.org" <public-rdf-shapes@w3.org>
Message-ID: <53D0EAF7.6030101@isb-sib.ch>
Hi Olivier,
On 24/07/14 10:37, Olivier Rossel wrote:
> Please please, do not decide by yourself that one option or the other is
> user-friendly or readable or solves a real-life problem. Please survey
> that with data definition people from outside our community (i.e. our
> target audience).
> IMHO, the point is to have a good idea of the tediousness vs capabilities
> of all the available options.

I actually think that we are very much in agreement on this. The
reason I got involved on the mailing list at all is that I felt a new
option was being designed from scratch instead of existing solutions
being looked at closely.

I do apologize to all if I seem to say "SPIN or the highway". What I
am really trying to say is: are you sure SPIN isn't solving your problems
already? Because it is solving mine already, and what I am seeing in the
other proposals currently won't.

I also had the feeling that this pre-workgroup was heading towards "ShEx
or the highway", which is why I reacted so strongly. That feeling is not as
unreasonable as it seems: until this month almost all mails on this list
have been about ShEx, not about the charter, requirements, or
alternatives. The charter currently names only ShEx as an option, and
closes off closed-world OWL as a non-recommendation-track item.

And don't worry, my decisions are not binding on anyone else ;)

Regards,
Jerven
>
>
>
> On Thu, Jul 24, 2014 at 10:15 AM, Jerven Bolleman
> <jerven.bolleman@isb-sib.ch> wrote:
>
>     Dear All,
>
>     I now see that there are two main desires from the community for the
>     outcome of this WG process.
>     The first is documenting what the data should look like,
>     the second is validating that the data is correct.
>
>     My first messages were about validating that the data is correct;
>     this one is about what the data should look like.
>     Some people have expressed the opinion that organizations already
>     have a large infrastructure for validation but that
>     they need better documentation today.
>
>     My opinion, formed in large part by my experience teaching
>     RDF/SPARQL and OWL reasoning to interested novices, is as follows.
>
>     SPIN as it was presented is not nice for the first but is really
>     great for the second.
>     ICV is ok for the first and is good for the second.
>     ShEx, just makes me sad... The readability of regular expressions
>     with the verbosity of RDF is not a pleasant combination.
>     Resource shapes, I have only glanced at.
>
>     With a few examples I am going to try to explain the goals I
>     currently think the WG should investigate (with that investigation
>     being part of the Charter) and how SPIN with templates can achieve
>     these goals. These examples are just for discussion and illustration
>     purposes; they are not a complete proposal and do not have an
>     implementation.
>
>     A problem with ShEx and ICV as they stand is that they can only
>     express hard constraints, and they make documenting the why of these
>     constraints hard. SPIN can describe both hard constraints and soft
>     ones (heuristics). For example, let's say we have some data about
>     Formula 1 cars. We want to say that all cars have 1 driver and 4 or
>     6 wheels. This is a hard constraint, as shown below in SPIN/template
>     and ShEx syntax.
>
>
>     @prefix owl: <http://www.w3.org/2002/07/owl#> .
>     @prefix sp: <http://spinrdf.org/sp#> .
>     @prefix spin: <http://spinrdf.org/spin#> .
>     @prefix spl: <http://spinrdf.org/spl#> .
>     @prefix formula:
>     <http://example.org/example_ontology_about_formula_one#> .
>
>     formula:Car a owl:Class ;
>       spin:constraint [ a spl:Attribute ;
>                         spl:predicate formula:driver ;
>                         spl:valueType formula:Driver ;
>                         spl:count 1 ] ;
>       spin:constraint [ spl:union [ a spl:Attribute ;
>                                     spl:predicate formula:wheels ;
>                                     spl:valueType formula:Wheel ;
>                                     spl:count 4 ],
>                                   [ a spl:Attribute ;
>                                     spl:predicate formula:wheels ;
>                                     spl:valueType formula:Wheel ;
>                                     spl:count 6 ] ] .
>     So far this is straightforward, with nothing unusual.
>     With some fine-tuning it could be improved, e.g. by removing a few
>     redundant triples.
>     But it is quite consistent: one driver, 4 or 6 wheels. Here I try to
>     do the same in ShEx.
>
>     <FormulaOneCarShape> { a formula:Car,
>                            formula:driver @<DriverShape> ,
>                            ( formula:wheels @<WheelShape>{4,4} |
>                              formula:wheels @<WheelShape>{6,6} ) }
>     <DriverShape> { a formula:Driver }
>     <WheelShape> { a formula:Wheel }
>
>     The difference between SPIN and ShEx here is 14 lines versus 9 or 6,
>     depending on layout.
>     SPIN is more explicit and does not need custom syntax,
>     i.e. it's plain RDF. ShEx is more compact but is not compatible in
>     any way with existing tools.
>     spl:union is not yet an existing SPIN template, but I think it can be
>     done.
>
>     However, this example is rather minimal and only deals with constraints.
>     I suggest we extend this with soft constraints (heuristics) that look
>     like this.
>
>     formula:Car
>       spin:constraint [  a heuristics:veryFewHave ;
>                          heuristics:commonType :4WheelCar ;
>                          heuristics:rareType :6WheelCar ;
>                          rdfs:comment "The Tyrrell P34 had 4 front wheels
>     and raced in 1976 and 1977, but it is the only known example" ;
>                          rdfs:seeAlso
>     <http://en.wikipedia.org/wiki/Tyrrell_P34> ] .
>
>     :4WheelCar rdfs:subClassOf formula:Car ;
>       rdfs:subClassOf [ a owl:Restriction ;
>                         owl:onProperty formula:wheels ;
>                         owl:cardinality "4"^^xsd:nonNegativeInteger ] .
>
>     :6WheelCar rdfs:subClassOf formula:Car ;
>       rdfs:subClassOf [ a owl:Restriction ;
>                         owl:onProperty formula:wheels ;
>                         owl:cardinality "6"^^xsd:nonNegativeInteger ] .
>
>     The idea here is that it allows us to identify the common case and
>     the exceptional one, and to document both. As side benefits,
>     heuristics for data quality control can be triggered for them, as
>     can optimizations if e.g. Java code is generated from these
>     expectations. In the example, while Formula One cars can have four or
>     six wheels, the six-wheel case is very rare, and if you ever have a
>     database/message filled with six-wheel Formula One cars you should
>     probably investigate.
>
>     You can see that I use OWL here instead of more shapes as OWL is a
>     great existing technology to determine the type of an instance given
>     knowledge about its properties. OWL anonymous classes will also
>     solve the issue of "typeless" constraints, which I expect will be
>     very rare. So for most users knowing OWL would not be a requirement.
>
>     One can imagine an extension to Manchester Syntax that can encode
>     this as well as the examples given here.
>     But to be honest I would prefer the RDF syntax to be clean and
>     straightforward for most uses. When I teach RDF, I always say
>     everything can be expressed as triples; sometimes it's verbose and
>     awkward, but it always works. Every single time I introduce a new
>     syntax I put up a barrier to adoption and understanding. This is
>     why I personally do not like OWL Manchester Syntax: it puts
>     in place an artificial barrier between data and ontologies and
>     divides a community that should be united. In a two-day course I
>     spend the first day explaining RDF and SPARQL, and the second day
>     reasoning and OWL. On the second day I waste a lot of time when
>     using Manchester Syntax and undermine my first day, which is why I
>     use TopBraid Composer (free) and its RDF/Turtle views to explain
>     OWL restrictions instead of Protégé.
>
>     I think all the heuristics constraints for expressing expected data
>     distributions can be spin:templates
>     e.g. something like this (please excuse syntax/logic errors and typos)
>
>     heuristics:veryFewHave rdfs:subClassOf spin:Template ;
>       spin:constraint [ a spl:Argument ;
>                         rdfs:comment "The common super type" ;
>                         spl:predicate heuristics:commonType ;
>                         spl:valueType xsd:anyURI ] ;
>       spin:constraint [ a spl:Argument ;
>                         rdfs:comment "The rare type" ;
>                         spl:predicate heuristics:rareType ;
>                         spl:valueType xsd:anyURI ] ;
>       spin:text """CONSTRUCT {
>                    [] a heuristics:HeuristicsViolation ;
>                       spin:violationRoot ?this ;
>                       spin:violationPath ?predicate ;
>                       rdfs:label ?label .
>                  } WHERE {
>                      BIND((spl:objectCount(rdf:type, ?commonType)) AS
>     ?commonCount)
>                      BIND((spl:objectCount(rdf:type, ?rareType)) AS
>     ?rareCount)
>                      FILTER((?rareCount/?commonCount) > 0.05)
>                      BIND(CONCAT("The type ", str(?rareType),
>                                  " is more than 5% of ",
>                                  str(?commonType)) AS ?label)
>                  }""" .
>
>
>     This heuristics ontology/template library of concepts for
>     validation can of course be implemented using other technologies
>     than SPIN. And while these templates should be standardized, they are
>     not part of the "UI" for simple documentation and validation
>     purposes.
>
>     In conclusion, SPIN, in combination with its templates and reusing
>     the existing OWL standard, is at least as user-friendly as ShEx, and
>     it has very good potential to document not just constraints but
>     expectations. This shows that we can have both simplicity and
>     expressiveness with one standard.
>
>     Sincere regards,
>     Jerven Bolleman
>     -------------------------------------------------------------------
>     Jerven Bolleman Jerven.Bolleman@isb-sib.ch
>     SIB Swiss Institute of Bioinformatics      Tel: +41 (0)22 379 58 85
>     CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
>     1211 Geneve 4,
>     Switzerland www.isb-sib.ch - www.uniprot.org
>     Follow us at https://twitter.com/#!/uniprot
>     -------------------------------------------------------------------
>
>
>
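The two kinds of check discussed in the quoted message, the hard constraint (exactly 1 driver, 4 or 6 wheels) and the "veryFewHave" heuristic (the rare type should not exceed 5% of the common type), can be sketched outside any RDF toolkit. What follows is only a minimal illustration in Python, assuming triples are plain (subject, predicate, object) string tuples. The helper names make_car, hard_violations and very_few_have are invented for this example and are not part of SPIN, ShEx or any library, and the sketch ignores the value-type checks on drivers and wheels.

```python
from collections import Counter

RDF_TYPE = "rdf:type"
CAR = "formula:Car"
DRIVER = "formula:driver"
WHEELS = "formula:wheels"


def make_car(name, wheels, drivers=1):
    """Build toy triples for one car (hypothetical helper for this sketch)."""
    triples = [(name, RDF_TYPE, CAR)]
    triples += [(name, DRIVER, f"{name}_driver{i}") for i in range(drivers)]
    triples += [(name, WHEELS, f"{name}_wheel{i}") for i in range(wheels)]
    return triples


def property_counts(triples):
    """Count distinct values per (subject, predicate) pair."""
    return Counter((s, p) for s, p, _ in set(triples))


def cars_in(triples):
    """All subjects typed as formula:Car, in a stable order."""
    return sorted({s for s, p, o in triples if p == RDF_TYPE and o == CAR})


def hard_violations(triples):
    """Hard constraint: exactly 1 driver and 4 or 6 wheels per car."""
    counts = property_counts(triples)
    problems = []
    for car in cars_in(triples):
        if counts[(car, DRIVER)] != 1:
            problems.append((car, DRIVER))
        if counts[(car, WHEELS)] not in (4, 6):
            problems.append((car, WHEELS))
    return problems


def very_few_have(triples, threshold=0.05):
    """Soft check: warn if six-wheel cars exceed threshold * four-wheel cars."""
    counts = property_counts(triples)
    wheel_counts = [counts[(car, WHEELS)] for car in cars_in(triples)]
    common, rare = wheel_counts.count(4), wheel_counts.count(6)
    # Comparing rare against threshold * common avoids dividing by zero.
    if rare > threshold * common:
        return (f"{rare} six-wheel cars vs {common} four-wheel cars: "
                f"more than {threshold:.0%}")
    return None


# Toy data: 40 ordinary cars, one six-wheeler, one invalid car.
grid = [t for i in range(40) for t in make_car(f"ex:car{i}", wheels=4)]
grid += make_car("ex:p34", wheels=6)
grid += make_car("ex:broken", wheels=5, drivers=2)

print(hard_violations(grid))
print(very_few_have(grid))
```

The hard check reports individual offending cars, while the soft check only looks at the overall distribution and returns a warning message or nothing. That split mirrors the distinction between constraints and expectations drawn in the message.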
Received on Thursday, 24 July 2014 11:17:10 UTC