Re: Validation and documentation using SPIN templates examples and an extension for heuristics/expectations from Jose Emilio Labra Gayo on 2014-07-24 (public-rdf-shapes@w3.org from July 2014)

From: Jose Emilio Labra Gayo <jelabra@gmail.com>
Date: Thu, 24 Jul 2014 11:25:04 +0200
To: Paul <paul@proxml.be>
Cc: Holger Knublauch <holger@topquadrant.com>, "public-rdf-shapes@w3.org" <public-rdf-shapes@w3.org>
Message-ID: <CAJadXX+JwJxmY-iEuzrwzJXkj1w0d4M0QGgSTQ_mCrzUQAgJxA@mail.gmail.com>
I also think that what Olivier said is important, and we have to take care
with some of the claims about usability/readability.

Saying: "ShEx is more compact but is not compatible in any way with
existing tools." is false as we have already shown that ShEx expressions
can be translated to SPARQL queries and there are 2 syntaxes (one compact
and another in RDF).
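
To make the translation idea concrete, here is a minimal sketch in Python. It is
not the actual ShEx-to-SPARQL translation: it assumes a shape reduced to a list
of (predicate, min, max) cardinality rules, and the names ArcRule and
violations_query are hypothetical. It only builds the query text; prefix
declarations and execution against an endpoint are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArcRule:
    """One simplified shape rule: a predicate with a cardinality range."""
    predicate: str  # prefixed name or <IRI>, written as it should appear in SPARQL
    min_count: int
    max_count: int


def violations_query(target_class: str, rules: List[ArcRule]) -> str:
    """Build a SPARQL query selecting nodes of target_class that break any rule."""
    branches = []
    for i, rule in enumerate(rules):
        # One grouped subquery per rule; OPTIONAL keeps nodes with zero values.
        branches.append(
            "  {\n"
            "    SELECT ?node WHERE {\n"
            f"      ?node a {target_class} .\n"
            f"      OPTIONAL {{ ?node {rule.predicate} ?v{i} }}\n"
            "    } GROUP BY ?node\n"
            f"    HAVING (COUNT(?v{i}) < {rule.min_count} || COUNT(?v{i}) > {rule.max_count})\n"
            "  }"
        )
    body = "\n  UNION\n".join(branches)
    return "SELECT DISTINCT ?node WHERE {\n" + body + "\n}"


# Hypothetical example: every observation must have exactly one ex:value.
query = violations_query("qb:Observation", [ArcRule("ex:value", 1, 1)])
print(query)
```

A real translation also has to deal with value shapes, alternatives and closed
shapes; the sketch covers cardinalities only.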

Furthermore, I have already said that ShEx expressions can play a very
interesting role complemented with other semantic web technologies like RDF
Schema, OWL, etc. when one wants to publish/consume linked data portals.

In my opinion, it can be seen as a DSL for RDF data shapes description and
validation...no more and no less...I even think that ShEx can be compatible
with SPIN/SPARQL in a similar way as in the world of XML, I have seen
solutions combining Schematron with RelaxNG.

I think ShEx expressions are very easy to understand by the intended
audience, which are people who want to consume/publish linked data portals.

I have created the technical documentation of 2 data portals using Shape
Expressions and the team of developers had no problem understanding them.
The team was made up of people who were not experts in RDF and they had to
generate RDF following the shapes described using ShEx. They could do the
job without any problem.

The documentation of the portals is here:

http://weso.github.io/landportalDoc/data/
http://weso.github.io/wiDoc/

Also, as you can see in the documentation, both portals are using a similar
model based on RDF Data Cube where the main entities are observations of
type qb:Observation.

However, the shapes of the different resources are different. For example,
in the LandPortal we use the time ontology while in the WebIndex we just
used years encoded as integers. This can be seen as an example of why I
think it is better to separate types (which work at a more semantic level)
from shapes (which are more intended as interfaces between linked data
portals).
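
A small sketch of that separation, in Python. The property names ex:time and
ex:year are hypothetical stand-ins, not the properties the two portals really
use, and nodes are plain dictionaries rather than RDF graphs. Both observations
pass the same type check, but each fits only its own portal's shape.

```python
def has_type(node, rdf_type):
    """Type check: works at the semantic level, same for both portals."""
    return rdf_type in node.get("rdf:type", [])


def fits_time_resource_shape(node):
    """Shape assumed for a portal that points to a time resource (an IRI string)."""
    values = node.get("ex:time", [])
    return len(values) == 1 and isinstance(values[0], str)


def fits_integer_year_shape(node):
    """Shape assumed for a portal that stores the year as a plain integer."""
    values = node.get("ex:year", [])
    return len(values) == 1 and isinstance(values[0], int)


# Two observations of the same type with different shapes (hypothetical data).
land_obs = {"rdf:type": ["qb:Observation"], "ex:time": ["ex:interval-2012"]}
web_obs = {"rdf:type": ["qb:Observation"], "ex:year": [2012]}

for obs in (land_obs, web_obs):
    print(has_type(obs, "qb:Observation"),
          fits_time_resource_shape(obs),
          fits_integer_year_shape(obs))
```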

Best regards, Jose Labra


On Thu, Jul 24, 2014 at 10:55 AM, Paul <paul@proxml.be> wrote:

> Didn't find it hostile.
>
> He just warned that we must avoid jumping to conclusions about what
> potential 'users' will prefer.
>
>
>
> Paul
>
> On 24 Jul 2014, at 10:45, Holger Knublauch <holger@topquadrant.com> wrote:
>
>  Olivier,
>
> not sure where your hostility comes from. Jerven has made clear that he
> has first-hand experience in teaching semantic technology to novices.
> That's a valid target audience, as are advanced ontologists like himself.
>
> Regards,
> Holger
>
>
> On 7/24/14, 6:37 PM, Olivier Rossel wrote:
>
> Please please, do not decide by yourself that one option or the other is
> user-friendly or readable or solves a real-life problem. Please survey that
> with data definition people from outside our community (i.e. our target
> audience).
> IMHO, the point is to have a good idea of the tediousness vs capabilities
> of all the available options.
>
>
>
> On Thu, Jul 24, 2014 at 10:15 AM, Jerven Bolleman <jerven.bolleman@isb-sib.ch> wrote:
>
>> Dear All,
>>
>> I now see that there are two main desires from the community for the
>> outcome of this WG process.
>> The first is documenting what the data should look like,
>> the second is validating that the data is correct.
>>
>> My first messages were about the validation of data being correct; this
>> one is about what the data should look like.
>> Some people have expressed the opinion that organizations already have a
>> large infrastructure for validation but that
>> they need better documentation today.
>>
>> My opinion is formed in large part by my experience in
>> teaching RDF/SPARQL and OWL reasoning to interested novices.
>>
>> SPIN as it was presented is not nice for the first but is really great
>> for the second.
>> ICV is ok for the first and is good for the second.
>> ShEx, just makes me sad... The readability of regular expressions with
>> the verbosity of RDF is not a pleasant combination.
>> Resource shapes, I have only glanced at.
>>
>> With a few examples I am going to try to explain the goals I currently
>> think the WG should investigate (and have that investigation goal be
>> part of the Charter) and how SPIN with templates
>> can achieve these goals. These examples are just for discussion and
>> illustration purposes; they are not a complete proposal and do not have an
>> implementation.
>>
>> A problem with ShEx and ICV as they stand is that they can only express hard
>> constraints, which makes documenting the why of these constraints hard.
>> SPIN can describe hard constraints and soft/heuristics. For example, let's
>> say we have some data about Formula 1 cars. We want to say that all cars
>> have 1 driver and 4 or 6 wheels. This is a hard constraint, as shown below
>> in SPIN/template and ShEx syntax.
>>
>>
>> prefix sp: <http://spinrdf.org/sp#>
>> prefix spin: <http://spinrdf.org/spin#>
>> prefix spl: <http://spinrdf.org/spl#>
>> prefix owl: <http://www.w3.org/2002/07/owl#>
>> prefix formula: <http://example.org/example_ontology_about_formula_one#>
>>
>> formula:Car a owl:Class ;
>>  spin:constraint [ a spl:Attribute ;
>>                    spl:predicate formula:driver ;
>>                    spl:valueType formula:Driver ;
>>                    spl:count 1 ] ;
>>  spin:constraint [ spl:union [ a spl:Attribute ;
>>                                spl:predicate formula:wheels ;
>>                                spl:valueType formula:Wheel ;
>>                                spl:count 4 ],
>>                              [ a spl:Attribute ;
>>                                spl:predicate formula:wheels ;
>>                                spl:valueType formula:Wheel ;
>>                                spl:count 6 ] ] .
>> So far straightforward and nothing unusual here.
>> With some fine tuning this could be improved, e.g. by removing a few
>> redundant triples.
>> But it is quite consistent: one driver, 4 or 6 wheels. Here I try to do
>> the same in ShEx.
>>
>> <FormulaOneCarShape> { a formula:Car,
>>                       formula:driver @<DriverShape> ,
>>                       ( formula:wheels @<WheelShape>{4,4} |
>>                         formula:wheels @<WheelShape>{6,6} ) }
>> <DriverShape> { a formula:Driver }
>> <WheelShape> { a formula:Wheel }
>>
>> The difference between SPIN and ShEx here is 14 lines versus 9 or 6,
>> depending on layout.
>> SPIN is more explicit and does not need custom syntax,
>> i.e. it's plain RDF. ShEx is more compact but is not compatible in
>> any way with existing tools.
>> spl:union is not yet an existing SPIN template but I think it can be done.
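
Independently of either syntax, the hard constraint itself (exactly one driver
and 4 or 6 wheels) can be sketched as a plain check. This is a minimal Python
illustration, assuming the graph is a set of (subject, predicate, object)
tuples; the function name check_car and the ex: node names are hypothetical,
and value types (formula:Driver, formula:Wheel) are not checked.

```python
from collections import Counter


def check_car(triples, car):
    """Return violation messages for one car node; an empty list means valid."""
    # A set of triples holds no duplicates, so this counts distinct values.
    counts = Counter(p for s, p, o in triples if s == car)
    problems = []
    drivers = counts["formula:driver"]
    wheels = counts["formula:wheels"]
    if drivers != 1:
        problems.append(f"{car}: expected 1 driver, found {drivers}")
    if wheels not in (4, 6):
        problems.append(f"{car}: expected 4 or 6 wheels, found {wheels}")
    return problems


# Hypothetical data: car1 has four wheels, car2 has five.
graph = {("ex:car1", "formula:driver", "ex:driverA"),
         ("ex:car2", "formula:driver", "ex:driverB")}
graph |= {("ex:car1", "formula:wheels", f"ex:c1w{i}") for i in range(4)}
graph |= {("ex:car2", "formula:wheels", f"ex:c2w{i}") for i in range(5)}

print(check_car(graph, "ex:car1"))
print(check_car(graph, "ex:car2"))
```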
>>
>> However, this example is rather minimal and only deals with constraints.
>> I suggest we extend this with soft/heuristics that look like this.
>>
>> formula:Car
>>  spin:constraint [  a heuristics:veryFewHave ;
>>                     heuristics:commonType :4WheelCar ;
>>                     heuristics:rareType :6WheelCar ;
>>                     rdfs:comment """The Tyrrell P34 had 4 front wheels and
>> raced in 1976 and 1977, but it is the only known example""" ;
>>                     rdfs:seeAlso <http://en.wikipedia.org/wiki/Tyrrell_P34> ] .
>>
>> :4WheelCar rdfs:subClassOf formula:Car ;
>>  rdfs:subClassOf [ a owl:Restriction ;
>>                    owl:onProperty formula:wheels ;
>>                    owl:cardinality "4"^^xsd:nonNegativeInteger ] .
>>
>> :6WheelCar rdfs:subClassOf formula:Car ;
>>  rdfs:subClassOf [ a owl:Restriction ;
>>                    owl:onProperty formula:wheels ;
>>                    owl:cardinality "6"^^xsd:nonNegativeInteger ] .
>>
>> The idea here is that it allows us to identify the common case and the
>> exceptional one, and to document both. As side benefits, heuristics for
>> data quality control can be triggered for them, and optimizations become
>> possible if e.g. Java code is generated from these expectations. In the
>> example, while Formula One cars can have four or six wheels, the six-wheel
>> case is very rare, and if you ever have a database/message filled with
>> six-wheel Formula One cars you should probably investigate.
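
The expectation described here boils down to a ratio test. A minimal Python
sketch, assuming instances are given as a mapping from node to its set of
types; the function name very_few_have, the 5% default threshold and the
example data are illustrative assumptions, not part of any SPIN library.

```python
def very_few_have(instance_types, common_type, rare_type, threshold=0.05):
    """Return a warning if the rare type is more frequent than expected, else None."""
    common = sum(1 for types in instance_types.values() if common_type in types)
    rare = sum(1 for types in instance_types.values() if rare_type in types)
    if rare == 0:
        return None
    # With no common instances at all, any rare instance is suspicious.
    if common == 0 or rare / common > threshold:
        return (f"{rare_type} occurs {rare} times against {common} for "
                f"{common_type}; expected at most {threshold:.0%}")
    return None


# Hypothetical data set: 100 four-wheel cars and a single six-wheel car.
cars = {f"ex:car{i}": {"formula:Car", ":4WheelCar"} for i in range(100)}
cars["ex:p34"] = {"formula:Car", ":6WheelCar"}

print(very_few_have(cars, ":4WheelCar", ":6WheelCar"))
```

A soft check like this reports a warning for investigation instead of
rejecting the data, which is the difference from a hard constraint.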
>>
>> You can see that I use OWL here instead of more shapes as OWL is a great
>> existing technology to determine the type of an instance given knowledge
>> about its properties. OWL anonymous classes will also solve the issue of
>> "typeless" constraints, which I expect will be very rare. So for most users
>> knowing OWL would not be a requirement.
>>
>> One can imagine an extension to Manchester Syntax that can encode this
>> as well as the examples given here.
>> But to be honest I would prefer the RDF syntax to be clean and
>> straightforward for most uses. When I teach RDF, I always say everything can be
>> expressed as triples; sometimes it's verbose and awkward but it always works.
>> Every single time I introduce a new syntax I put up a barrier for adoption
>> and understanding. This is why I personally do not like OWL Manchester
>> Syntax: it puts in place an artificial barrier between data and
>> ontologies and divides a community that should be united. In a two day
>> course I spend the first day explaining RDF
>> and SPARQL, and the second day Reasoning and OWL. The second day I waste
>> a lot of time when using Manchester Syntax and undermine my first day,
>> which is why I use TopBraid Composer (free) and its RDF/Turtle views to
>> explain owl:Restrictions instead of Protégé.
>>
>> I think all the heuristics constraints for expressing expected data
>> distributions can be spin:templates
>> e.g. something like this (please excuse syntax/logic errors and typos)
>>
>> heuristics:veryFewHave rdfs:subClassOf spin:Template ;
>>  spin:constraint [ a spl:Argument ;
>>                    rdfs:comment "The common super type" ;
>>                    spl:predicate heuristics:commonType ;
>>                    spl:valueType xsd:anyURI ] ;
>>  spin:constraint [ a spl:Argument ;
>>                    rdfs:comment "The rare type" ;
>>                    spl:predicate heuristics:rareType ;
>>                    spl:valueType xsd:anyURI ] ;
>>  spin:text """CONSTRUCT {
>>               [] a heuristics:HeuristicsViolation ;
>>                  spin:violationRoot ?this ;
>>                  spin:violationPath ?predicate ;
>>                  rdfs:label ?label .
>>             } WHERE {
>>                 BIND((spl:objectCount(rdf:type, ?commonType)) AS ?commonCount)
>>                 BIND((spl:objectCount(rdf:type, ?rareType)) AS ?rareCount)
>>                 FILTER((?rareCount / ?commonCount) > 0.05)
>>                 BIND(CONCAT("The type ", str(?rareType), " is more than 5% of ", str(?commonType)) AS ?label)
>>             }""" .
>>
>>
>>
>>
>> This heuristics ontology/template library of concepts/things for
>> validation can of course be implemented using other technologies than SPIN.
>> And while these templates should be standardized they are not part of
>> the "UI" for simple documentation and validation reasons.
>>
>> In conclusion, SPIN, in combination with its templates and reusing the
>> existing OWL standard, is at least as user friendly as ShEx and it has very
>> good potential to document not just constraints but expectations. This shows
>> that we can have both simplicity and expressiveness with one standard.
>>
>> Sincere regards,
>> Jerven Bolleman
>> -------------------------------------------------------------------
>> Jerven Bolleman                        Jerven.Bolleman@isb-sib.ch
>> SIB Swiss Institute of Bioinformatics      Tel: +41 (0)22 379 58 85
>> CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
>> 1211 Geneve 4,
>> Switzerland     www.isb-sib.ch - www.uniprot.org
>> Follow us at https://twitter.com/#!/uniprot
>> -------------------------------------------------------------------
>>
>>
>>
>
>
>
> Kind Regards,
> Paul Hermans
>
> -------------------------
> ProXML bvba
> Linked Data services
> (w) www.proxml.be
> (e) paul@proxml.be
> (tw) @PaulZH
> (t) +32 15 23 00 76
> (m) +32 473 66 03 20
>
>
>
>
>
>


-- 
Saludos, Labra
Received on Thursday, 24 July 2014 09:25:52 UTC