Re: Shapes/ShEx or the worrying issue of yet another syntax and lack of validated vision. from Dimitris Kontokostas on 2014-07-18 (public-rdf-shapes@w3.org from July 2014)

From: Dimitris Kontokostas <kontokostas@informatik.uni-leipzig.de>
Date: Fri, 18 Jul 2014 10:58:35 +0300
To: Jose Emilio Labra Gayo <jelabra@gmail.com>
Cc: Jerven Bolleman <jerven.bolleman@isb-sib.ch>, "Dam, Jesse van" <jesse.vandam@wur.nl>, "public-rdf-shapes@w3.org" <public-rdf-shapes@w3.org>
Message-ID: <CA+u4+a2MpO7kxKABR5ffBvChRNUUBh-d_oaAS20yR7PUafSaWA@mail.gmail.com>
On Thu, Jul 17, 2014 at 6:08 AM, Jose Emilio Labra Gayo <jelabra@gmail.com>
wrote:

> Looking at the XML world, this discussion reminds me of the difference
> between Schematron and RelaxNG. Shape Expressions play a role similar to
> RelaxNG for RDF, while SPIN/SPARQL plays a role similar to
> Schematron/XPath.
>
> In the XML world, both technologies have their uses and are complementary.
> I think something similar can happen in the RDF world.
>
> I would also separate RDF validation in general from RDF shape checking.
> RDF shape checking can be one of the steps of RDF validation, but it is not
> intended to be the only step that one has to take to validate a whole
> RDF-based solution.
>

I think the overall goal should be to create an RDF validation language
that captures the most common use cases and allows the use of e.g. SPARQL for
complex scenarios (as Arthur also mentioned).
In the same way that we have a single SPARQL standard and different
implementations, we should focus on making ShEx as generally acceptable as
possible, so that SPIN, ICV or anyone else can build their products with
ShEx (or whatever name comes out of this) as a front end. Otherwise, this
will lead to market segmentation and this effort will not have enough impact.

So instead of saying we want ShEx to focus only on X, we should gather all
use-cases / requirements (there was a mail about this) and see what will be
the standard rules, and what will have bindings to SPARQL.
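
To make the split between standard rules and SPARQL bindings concrete, here
is a minimal sketch in Python. It is only an illustration, not a proposal
for ShEx syntax or semantics: `PropertyRule`, `Shape`, `validate` and
`bsn_check` are hypothetical names, triples are plain string tuples, and the
extension hook is an ordinary Python function standing in for a binding to
SPARQL. The extension reuses Jerven's BSN example quoted below; it assumes the
commonly described "eleven test" for the check digit and exactly nine digits
in the IRI.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings


@dataclass
class PropertyRule:
    """A 'standard rule': a cardinality bound on one predicate."""
    predicate: str
    min_count: int = 0
    max_count: Optional[int] = None  # None means unbounded


# An extension takes (node, triples) and returns a list of error messages.
Extension = Callable[[str, List[Triple]], List[str]]


@dataclass
class Shape:
    """Hypothetical shape: declarative rules plus optional custom checks."""
    rules: List[PropertyRule]
    extensions: List[Extension] = field(default_factory=list)


def validate(node: str, triples: Iterable[Triple], shape: Shape) -> List[str]:
    """Return human-readable errors for `node`; an empty list means valid."""
    data = list(triples)
    errors: List[str] = []
    for rule in shape.rules:
        count = sum(1 for s, p, _ in data if s == node and p == rule.predicate)
        if count < rule.min_count:
            errors.append(f"{node}: needs at least {rule.min_count} {rule.predicate}, found {count}")
        if rule.max_count is not None and count > rule.max_count:
            errors.append(f"{node}: allows at most {rule.max_count} {rule.predicate}, found {count}")
    # Anything the declarative rules cannot express is delegated to extensions.
    for extension in shape.extensions:
        errors.extend(extension(node, data))
    return errors


# Assumed IRI layout, taken from the example IRI in the quoted message.
BSN_IRI = re.compile(r"^http://rijksoverheid\.nl/burger/(\d{9})$")


def bsn_check(node: str, triples: List[Triple]) -> List[str]:
    """Extension: check the BSN embedded in the IRI with the 'eleven test'."""
    match = BSN_IRI.match(node)
    if match is None:
        return [f"{node}: IRI does not contain a nine-digit BSN"]
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    total = sum(w * int(d) for w, d in zip(weights, match.group(1)))
    if total % 11 != 0:
        return [f"{node}: BSN check digit is wrong"]
    return []


shape = Shape(rules=[PropertyRule("ex:name", min_count=1, max_count=1)],
              extensions=[bsn_check])
good = "http://rijksoverheid.nl/burger/111222333"
print(validate(good, [(good, "ex:name", "Jan")], shape))  # prints []
```

The cardinality rule is the kind of thing a standard could fix once, while
the check-digit logic stays outside the core language and is plugged in where
it is needed. The error messages are returned as a list rather than a single
boolean, in line with Jerven's point that "INVALID" alone is not enough.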

My approach would be to cover common published use cases from e.g. SKOS and
PROV (or Karen's GLAM cases) and focus on representing their constraints in
ShEx.
Giving vocabulary / ontology maintainers a standard way of representing
their constraints should be a first-priority goal and should provide enough
impact for this effort.
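
As one example of such a published constraint: SKOS states that a resource
has no more than one skos:prefLabel per language tag. The sketch below
checks that condition in plain Python. It is an illustration under
simplifying assumptions, not a ShEx rendering: `duplicate_pref_labels` and
the four-field tuple layout are hypothetical, and literals are reduced to a
text plus a language tag.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (subject, predicate, label text, language tag): a simplified stand-in
# for RDF triples with language-tagged literals.
LabelTriple = Tuple[str, str, str, str]

PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"


def duplicate_pref_labels(triples: Iterable[LabelTriple]) -> List[Tuple[str, str]]:
    """Return (resource, language) pairs that carry more than one skos:prefLabel."""
    counts = Counter((s, lang) for s, p, _, lang in triples if p == PREF_LABEL)
    return sorted(pair for pair, n in counts.items() if n > 1)


data = [
    ("ex:c1", PREF_LABEL, "Cat", "en"),
    ("ex:c1", PREF_LABEL, "Feline", "en"),  # second English prefLabel: violation
    ("ex:c1", PREF_LABEL, "Katze", "de"),
    ("ex:c2", PREF_LABEL, "Dog", "en"),
]
print(duplicate_pref_labels(data))  # [('ex:c1', 'en')]
```

A vocabulary maintainer would want to state this once, declaratively, and
have every conforming validator report the same violations.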

Side note: one of the great things about RDF and SPARQL is that the
standards are well preserved and all vendors respect them; this is what
makes our RDF-based applications vendor-agnostic. Let's try to do the same
with validation.

Best,
Dimitris


>
> If you look at our presentation for the RDF validation workshop [1], you
> will see that we described a SPARQL based validation system that was able
> to check even statistical computations. That was not intended to be part of
> any validation standard, but more as an example of the SPARQL
> expressiveness. An RDF shape validation language will never validate
> statistical computations expressed in RDF by itself, but it can be
> complemented with other tools to help in the whole validation process.
>
> In any case, I think it is very helpful to identify common use cases like
> the ones that you described so a working group can decide which ones can be
> covered and which ones cannot.
>
> Best regards, Jose Labra
>
> [1] Validating statistical index data represented in RDF using SPARQL
> queries, Jose Labra and Jose María Álvarez Rodríguez, W3C Validation
> Workshop, Boston, 2013
>
> http://www.w3.org/2001/sw/wiki/images/d/d4/ValidatingStatisticalIndexData.pdf
>
> On Wed, Jul 16, 2014 at 10:40 PM, Jerven Bolleman <
> jerven.bolleman@isb-sib.ch> wrote:
>
>> Dear Jesse, All,
>>
>> I will try to formulate what I think is needed for validation in RDF,
>> while ignoring some other voiced requirements.
>> The issue of documenting what data is present in an RDF file or
>> SPARQL endpoint is
>> nicely covered by VoID, an underused ontology (see also the HCLS note on
>> describing datasets [1]), and I would love to see more UI work done on large
>> VoID files.
>>
>> 1. Validation must be considered together with its results: validation is
>> rarely really a boolean value, and almost always more actions are required
>> after getting invalid data.
>> 2. Validation of information must also be considered at the level of an
>> information system, not only at the document level.
>> 3. Validation is an area where the details matter. We need to express
>> acceptable ranges for literals, and IRI patterns in relation to anything.
>>
>> Need 1.
>>
>> Take my earlier example: a citizen service number (BSN) is unique and
>> should only occur once in the database.
>> Validation fails because a non-unique citizen service number is added.
>> Either value could be the correct one, the first or the second; no one
>> knows.
>> In practice this means that my validator needs to infer that a
>> “Citizen Service Number Consolidation Action” is required.
>> The same is true for simpler use cases, for example a simple data upload
>> with a missing field.
>> It's not enough to say “INVALID”; you need to say “User needs to fix
>> missing field”.
>>
>> Need 2.
>>
>> Again taking the BSN use case, the validation procedure needs to talk
>> to a central system for real validation to occur.
>> For example, the intake screen has one field for the BSN, which always
>> gets filled in.
>> However, we need to check that names and addresses match what is in the
>> other system to see whether the record is really valid, as
>> opposed to merely checking that all the boxes are ticked.
>>
>> Need 3.
>>
>> The BSN does not just exist, it has a valid string form, including a
>> checksum. This BSN would be inside an IRI in RDF,
>> e.g. <http://rijksoverheid.nl/burger/111222333>. We need to check that it
>> has a certain check digit.
>>
>> Consequence 1.
>>
>> Something like 'SPIN rule' is almost more important than 'SPIN
>> constraint' for dealing with results of a validation.
>>
>> Consequence 2.
>>
>> Validation should be about RDF in federated data systems. Something
>> like SPARQL SERVICE is a very real need.
>>
>> Consequence 3.
>>
>> Being able to actively investigate inside literals and IRIs is very
>> important. A standard set of functions that can be used to
>> express things like “check_digits” would be nice to have. The possibility
>> to add functions or magic properties to software as needed,
>> or JS actions in ShEx, is a real need for developers.
>>
>> I would like to note that SPIN can meet all these needs today. And having
>> written more than 300 SPIN constraints and
>> having nearly 1700 SPIN rules, I think I know where it's a pain and where
>> it excels.
>>
>> I am not opposed to a compact syntax, but I would rather see a choice for
>> an “extended” syntax.
>> Add to Turtle what is needed to meet the mission; don’t start from
>> scratch. (JSON-LD would be fine as well, or heck, RDF/XML with sugar.)
>>
>> Separate design thought.
>>
>> I think the charter should keep in mind the following. Make it easy to
>> ignore extra triples, and make it hard to
>> disallow such extra triples. The XML schema world closed off its
>> eXtensibility with too many rigid schemas, e.g.
>> everyone should have a given and family name, and if they ever write down
>> their nickname we are going to refuse all information,
>> instead of just taking what we needed.
>>
>> Regards,
>> Jerven
>>
>> PS. Jesse, I will pass on your compliment about the schema document at
>> http://beta.sparql.uniprot.org/taxonomy to Leyla and Sebastien, who
>> spent a lot of effort on getting these right.
>>
>> [1]
>> http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html
>>
>> On 16 Jul 2014, at 15:16, Dam, Jesse van <jesse.vandam@wur.nl> wrote:
>>
>> > Hi,
>> >
>> > This is my opinion, with comments based on some of the concerns raised
>> > by Jerven Bolleman and Holger Knublauch.
>> >
>> > In my opinion a schema language for graph databases is missing. We need
>> > a schema language that can do for graph databases what the other schema
>> > languages do for XML.
>> > A list of useful applications of these schema languages can be found on
>> > the net, and it is clear that the solution created for XML files, called
>> > RELAX-NG, is used and successful. To me it's clear that such a solution is
>> > missing in the world of graph databases and, if created, will be used a lot.
>> >
>> > I do, however, agree that the list of intended usages on the SHEX wiki is
>> > incomplete and that the statement "validate RDF documents" is too broad, as
>> > validation can be done for different kinds of purposes. SHEX will most
>> > likely not cover all these purposes, but SHEX + SPIN probably will.
>> >
>> > The reasons stated above are why SHEX is heavily inspired by RELAX-NG.
>> > However, when translating this to the semantic web we have to take into
>> > consideration other existing standards in the semantic web, which include
>> > SPIN, the OSLC ResourceShapes project, RDFS, OWL and SADI.
>> >
>> > SHEX is inspired by regular expressions and RELAX-NG, which was itself
>> > inspired by regular expressions. That is why a lot of time has already
>> > been spent trying to synchronize SHEX as much as possible with these
>> > sources. This effort has already resulted in 3 different validation
>> > methods/semantics, of which the last one is directly inspired by RELAX-NG
>> > regular expression derivatives (with thanks to Jose Emilio Labra Gayo).
>> > Efforts are being made to synchronize as much as possible with regular
>> > expressions, so please let us know if you think there is an option to
>> > synchronize it better. As you noted, it is not yet.
>> >> Did you notice that your use of the question mark is not consistent
>> >> with any other commonly used syntax, e.g. regex, globs, ternary logic
>> >> etc.? That is sure to lead to a lot of confusion.
>> > I recommend you take a look at this page:
>> > http://www.w3.org/2013/ShEx/EvaluationLogic.html.
>> >
>> > Now looking at SPIN. I do agree we should align as much as possible
>> > with SPIN and use SPIN where possible. Note that we are not building a
>> > validation language that can do everything, but one that captures what
>> > RELAX-NG can do for XML, as SPIN already exists and can do the complex
>> > things.
>> > That is why semantic actions are included, which can be defined in
>> > SPARQL; if everything is converted to RDF, they will automatically
>> > become SPIN (no mention of that yet, as SHEX is still under
>> > development).
>> > However, the structure descriptions expressed by SHEX can be
>> > defined/'programmed' (if possible) in many different ways in SPIN. So if
>> > we can define SHEX in SPIN, we still need a standard, which will then be
>> > a set of SPIN rules/templates.
>> >
>> > I furthermore agree with the concern (from Holger) that SHEX could become
>> > a language that is hard-coded against a certain collection of patterns
>> > only, and limited to those patterns.
>> > However, it might not be possible to define the regular expression
>> > derivatives in SPIN. So it should be tested how well we can 'program'
>> > SHEX with SPIN and, if we can do it, how it compares to the other 3
>> > defined validation methods/semantics. The result of this test would allow
>> > us to make a decision in the future.
>> > In my opinion it would be really beautiful if we could define SHEX in
>> > SPIN, as it would allow for easy extension and would not need extra code
>> > to validate it. Note, however, that as SHEX is recursive it cannot be done
>> > with SPARQL only, and SPIN functions are needed.
>> >
>> > If we look at RDFS and OWL, I think there are good reasons not to
>> > include them in SHEX. RDFS and especially OWL are well-designed standards
>> > for reasoning; however, they were in no way ever intended as languages
>> > to describe a database structure or to be used as schema languages for
>> > validation. It is a pity that many people have misused these standards
>> > for this purpose or similar ones. For further reasons I would advise you
>> > to read the conclusion of this paper (
>> > http://arxiv.org/pdf/1404.1270v1.pdf).
>> > Another thing you cannot do in OWL is to define the following:
>> > Type A -> ex:samepred1 -> Type B
>> > Type C -> ex:samepred1 -> Type D
>> > If you define this in OWL you will also get the following, because OWL
>> > is property-oriented and a property can only define a range and a domain:
>> > Type A -> ex:samepred1 -> Type D
>> > Type C -> ex:samepred1 -> Type B
>> > Which is something we do not want.
>> > An OWL file does not tell me, or let me understand, the structure of an
>> > RDF database, nor does it validate that structure, whereas an XML schema
>> > file does this very successfully for an XML file.
>> > Furthermore, I think that SHEX and OWL are nicely complementary to each
>> > other and can link to each other via rdf:Property, each standard doing
>> > something different: SHEX describing the structure, SPIN for more complex
>> > validation, and OWL for extra semantic description (especially
>> > rdfs:subPropertyOf is useful) and reasoning.
>> >
>> > SADI and the OSLC ResourceShapes project are both related to (web)
>> > services; this part is not included in SHEX, though it is related. If we
>> > compare with WSDL, we see that it captures the service part but references
>> > the XML schema standards. A similar thing should happen in the future with
>> > SHEX; however, both SADI and OSLC ResourceShapes were created before SHEX
>> > came to be, so there is no reference here.
>> > OSLC ResourceShapes has defined something similar to, but simpler than,
>> > SHEX. SHEX should be synchronized with this effort, which is being taken
>> > into consideration; however, both standards are under active development.
>> > SADI contains a description of the input and output format; however,
>> > these are used for the nice feature of discovering services. To do so
>> > they use reasoning (OWL DL) to find semantically related services, which
>> > is fine. However, it is not possible to define a schema as in WSDL (and
>> > hence XML schema), as that is something still missing in the semantic web,
>> > although they could have used SPIN.
>> >
>> > About custom syntaxes: I think they are a good thing, as they make things
>> > more readable, and writing a parser these days is rather easy. Many
>> > different parser (generator) tools exist. A custom syntax makes it easy for
>> > users to understand and start using SHEX. Of course it is important that a
>> > good RDF representation exists, so that it can be used by tools and it is
>> > possible to analyze the SHEX definition.
>> > I do agree the syntax should be a 'user interface' to the standard
>> > (SHEX/RDF) used for sharing, but that does not mean that this syntax should
>> > not be standardized as well.
>> > I agree it is good to look at the techniques used in the SPIN standard
>> > and try to apply them when defining the RDF representation of SHEX.
>> > Note that the definition of the RDF representation of SHEX is still work in
>> > progress. I do not agree that it is difficult to represent SHEX as RDF.
>> > In my opinion SPIN functions/templates are missing a nice syntax to
>> > write them in, forcing a user either to write in N3 or to use the TopBraid
>> > user interface, neither of which is handy (at least for me) if you really
>> > want to use SPIN as a programming language. I do really like SPIN because
>> > it is elegant and powerful.
>> >
>> > As for the concern about service calls, that is something that, like many
>> > other topics, still has to be researched and defined in the SHEX language;
>> > SHEX is still under active development.
>> >
>> > Finally, I would like to bring forward my most important use case for
>> > SHEX. In the past I have been training users to use SPARQL; however, the
>> > number 1 problem I encounter is that there is no way for users to
>> > understand the structure of the database so that they can create their
>> > own queries. Only the following 2 solutions exist: (1) The data publisher
>> > explains the structure using a UML diagram or something like Visio
>> > (example: http://beta.sparql.uniprot.org/taxonomy). (2) The user uses
>> > SPARQL queries to explore and browse the database, which often starts with
>> > the following query.
>> > SELECT DISTINCT ?pred
>> > WHERE
>> > {
>> >  ?subj ?pred ?obj
>> > }
>> > The OWL definition, if present at all, is of limited use for
>> > understanding the structure of a database and cannot do what an
>> > XML schema does for an XML file. That is why I need something like SHEX.
>> >
>> > Writing a SHEX definition will also be much easier to understand
>> > and to do than writing a good and complete OWL definition.
>> >
>> > Secondly, as noted on the wiki, I will be using it to generate user
>> > interface forms and interface code, which can be done with neither OWL
>> > nor SPIN, but can be done with OSLC ResourceShapes.
>> >
>> > Conclusion of my opinion:
>> > *We need an equivalent of RELAX-NG in the semantic web, now represented
>> > by SHEX.
>> > *We definitely should try to integrate it (if possible) with SPIN as
>> > much as possible.
>> > *I disagree with not having a separate, easy-to-read STANDARDIZED syntax.
>> > **Of course we need a good SHEX/RDF format, which should be used when
>> > publishing the SHEX file.
>> >
>> > Jerven, could you please tell or retell what you think and expect
>> > SHEX should do, so everybody knows what you are missing?
>> >
>> > Greetz,
>> > Jesse van Dam
>> >
>>
>> -------------------------------------------------------------------
>> Jerven Bolleman                        Jerven.Bolleman@isb-sib.ch
>> SIB Swiss Institute of Bioinformatics      Tel: +41 (0)22 379 58 85
>> CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
>> 1211 Geneve 4,
>> Switzerland     www.isb-sib.ch - www.uniprot.org
>> Follow us at https://twitter.com/#!/uniprot
>> -------------------------------------------------------------------
>>
>>
>>
>
>
> --
> Saludos, Labra
>



-- 
Dimitris Kontokostas
Department of Computer Science, University of Leipzig
Research Group: http://aksw.org
Homepage:http://aksw.org/DimitrisKontokostas
Received on Friday, 18 July 2014 07:59:32 UTC