RE: Shapes/ShEx or the worrying issue of yet another syntax and lack of validated vision. from Dam, Jesse van on 2014-07-16 (public-rdf-shapes@w3.org from July 2014)

From: Dam, Jesse van <jesse.vandam@wur.nl>
Date: Wed, 16 Jul 2014 13:16:08 +0000
To: "public-rdf-shapes@w3.org" <public-rdf-shapes@w3.org>
Message-ID: <63CF398D7F09744BA51193F17F5252AB1641CC1F@SCOMP0936.wurnet.nl>

Hi,

These are my opinions and comments on some of the concerns raised by Jerven Bolleman and Holger Knublauch.

In my opinion, a schema language for graph databases is missing. We need a schema language that does for graph data what the existing schema languages do for XML.
Lists of useful applications of these schema languages can be found on the net, and it is clear that RELAX-NG, the solution created for XML files, is used and successful. To me it is clear that such a solution is missing in the world of graph databases and that, if created, it will be used a lot.

I do, however, agree that the list of intended usages on the SHEX wiki is incomplete and that the statement "validate RDF documents" is too broad, as validation can be done for different kinds of purposes. SHEX will most likely not cover all of these purposes, but SHEX + SPIN probably will.

The reasons stated above are why SHEX is heavily inspired by RELAX-NG. However, when translating this to the semantic web we have to take into consideration the other existing standards in the semantic web, which include SPIN, the OSLC ResourceShapes project, RDFS, OWL and SADI.

SHEX is inspired by regular expressions and by RELAX-NG, which was itself inspired by regular expressions. That is why a lot of time has already been spent trying to align SHEX as closely as possible with these sources. This effort has already resulted in 3 different validation methods/semantics, of which the last one is directly inspired by the RELAX-NG regular expression derivatives (with thanks to Jose Emilio Labra Gayo).
Efforts are being made to align with regular expressions as much as possible, so please let us know if you see an option to align it better. As you noted, it is not fully aligned yet:
>Did you notice that your use of the question mark is not consistent with any other commonly used syntax e.g. regex, globs, trinary logic etc.. For sure leading to a lot of confusion.
I recommend that you take a look at this page: http://www.w3.org/2013/ShEx/EvaluationLogic.html.

Now looking at SPIN. I do agree that we should align with SPIN as much as possible and use SPIN where possible. Note that we are not building a validation language that can do everything, but one that captures what RELAX-NG can do for XML, since SPIN already exists and can do the complex things.
That is why semantic actions are included, which can be defined in SPARQL; if everything is converted to RDF, they will automatically become SPIN (this is not mentioned yet, as SHEX is still under development).
However, the structure descriptions covered by SHEX can be defined/'programmed' (if possible) in many different ways in SPIN. So even if we can define SHEX in SPIN, we still need a standard, which would then be a set of SPIN rules/templates.

Furthermore, I do agree with the concern (from Holger) that SHEX could become a language that is hard-coded against a certain collection of patterns only, and limited to those patterns.
However, it might not be possible to define the regular expression derivatives in SPIN. So it should be tested how well we can 'program' SHEX with SPIN and, if we can do it, how that compares to the other 3 defined validation methods/semantics. The result of this test would allow us to make a decision in the future.
In my opinion it would be really beautiful if we could define SHEX in SPIN, as it would allow for easy extension and would not need extra code to validate it. Note, however, that because SHEX is recursive this cannot be done with SPARQL only, and SPIN functions are needed.
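
To illustrate why recursion matters here, below is a minimal Python sketch of checking a shape that refers to itself. It is not one of the 3 SHEX semantics and not SPIN; the schema layout, the data and the function name `conforms` are made up for illustration, and the sketch assumes that every listed predicate is required.

```python
# Hypothetical mini-schema: each shape maps a predicate to the shape its
# objects must satisfy, or to None when any value is accepted.
shapes = {
    "Person": {"ex:name": None, "ex:knows": "Person"},
}

# Made-up data: node -> {predicate: [objects]}.
graph = {
    "ex:alice": {"ex:name": ["Alice"], "ex:knows": ["ex:bob"]},
    "ex:bob": {"ex:name": ["Bob"], "ex:knows": ["ex:alice"]},
    "ex:carol": {"ex:knows": ["ex:alice"]},  # has no ex:name
}

def conforms(node, shape_name, assumed=frozenset()):
    """Return True when node matches the shape, following shape references.

    A (node, shape) pair that is already being checked is assumed to hold,
    which keeps cyclic data such as alice <-> bob from looping forever.
    """
    key = (node, shape_name)
    if key in assumed:
        return True
    assumed = assumed | {key}
    properties = graph.get(node, {})
    for predicate, target_shape in shapes[shape_name].items():
        values = properties.get(predicate, [])
        if not values:
            return False  # assumption: every listed predicate is required
        if target_shape is not None:
            if not all(conforms(v, target_shape, assumed) for v in values):
                return False
    return True

print(conforms("ex:alice", "Person"))
print(conforms("ex:carol", "Person"))
```

The check for ex:alice has to descend into ex:bob and back again, which is the kind of unbounded traversal a single fixed SPARQL query cannot express for arbitrary schemas.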

If we look at RDFS and OWL, I think there are good reasons not to include them in SHEX. RDFS and especially OWL are well-designed standards for reasoning; however, they were never intended as languages to describe a database structure or to be used as schema languages for validation. It is a pity that many people have misused these standards for this or similar purposes. For further reasons I would advise you to read the conclusion of this paper (http://arxiv.org/pdf/1404.1270v1.pdf).
Another thing you can not do in OWL is to define the following:
Type A -> ex:samepred1 -> Type B
Type C -> ex:samepred1 -> Type D
If you define this in OWL you will also get the following, because OWL is property oriented and a property can only define a range and a domain:
Type A -> ex:samepred1 -> Type D
Type C -> ex:samepred1 -> Type B
Which is something we do not want.
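
To make the difference concrete, here is a minimal Python sketch. It does not use any OWL or SHEX tooling, and the function names are made up for illustration. It assumes the domain and range of ex:samepred1 are declared as unions (A or C, B or D), which is about the closest a property-oriented description gets to the two statements above.

```python
from itertools import product

# Intended structure: which subject type links to which object type
# through ex:samepred1 (taken from the example above).
intended = {("A", "B"), ("C", "D")}

def property_level_pairs(pairs):
    """Pairs allowed when the property has one domain and one range.

    Assumption: the domain is the union of all subject types and the
    range is the union of all object types.
    """
    domain = {s for s, _ in pairs}
    rng = {o for _, o in pairs}
    return set(product(domain, rng))

def shape_level_pairs(pairs):
    """Pairs allowed when each subject type carries its own constraint."""
    return set(pairs)

unwanted = property_level_pairs(intended) - shape_level_pairs(intended)
print(sorted(unwanted))  # [('A', 'D'), ('C', 'B')]
```

The two extra pairs are exactly the unwanted combinations listed above.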
An OWL file does not let me understand the structure of an RDF database or validate that structure, whereas an XML schema file does this very successfully for an XML file.
Furthermore, I think that SHEX and OWL complement each other nicely and can link to each other via rdf:Property, each standard doing something different: SHEX describes the structure, SPIN handles more complex validation, and OWL provides extra semantic description (rdfs:subPropertyOf is especially useful) and reasoning.

If we look at SADI and the OSLC ResourceShapes project, both are related to (web) services. This part is not included in SHEX, but it is related. If we compare to WSDL, we see that it captures the service part but references the XML schema standards. A similar thing should happen with SHEX in the future; however, both SADI and OSLC ResourceShapes were created before SHEX came to be, so they contain no reference to it.
OSLC ResourceShapes has defined something similar to but simpler than SHEX. SHEX should be aligned with this effort, which is being taken into consideration; however, both standards are under active development.
SADI contains a description of the input and output format; however, these are used for the nice feature of discovering services. To do so they use reasoning (OWL DL) to find semantically related services, which is fine. However, it is not possible to define a schema as in WSDL (and hence XML schema), as that is something still missing in the semantic web. They could have used SPIN, though.

About custom syntaxes: I think they are a good thing, as they make the language more readable, and writing a parser these days is rather easy. Many different parser (generator) tools exist. A custom syntax makes it easy for users to understand and start using SHEX. Of course it is important that a good RDF representation exists, so that it can be used by the tools and it is possible to analyze the SHEX definition.
I do agree that the syntax should be a 'user interface' to the standard (SHEX/RDF) used for sharing, but that does not mean that this syntax should not be standardized as well.
I agree it is good to look at the techniques used in the SPIN standard and try to apply them when defining the RDF representation of SHEX. Note that the definition of the RDF representation of SHEX is still work in progress. I do not agree that it is difficult to represent SHEX as RDF.
In my opinion, SPIN functions/templates are missing a nice syntax to write them in, forcing a user either to write N3 or to use the TopBraid user interface, neither of which is handy (at least for me) if you really want to use SPIN as a programming language. I do really like SPIN because it is elegant and powerful.

As for the concern about service calls: like many other topics, that is something that still has to be researched and defined in the SHEX language, as SHEX is still under active development.

Finally, I would like to bring forward my most important use case for SHEX. In the past I have been training users to use SPARQL; the number 1 problem I encounter there is that there is no method for users to understand the structure of the database so that they can create their own queries. Only the following 2 solutions exist: (1) The data publisher explains the structure using a UML diagram or something like VISIO (example: http://beta.sparql.uniprot.org/taxonomy). (2) The user uses SPARQL queries to explore and browse the database, which often starts with the following query.
SELECT DISTINCT ?pred
WHERE
{
?subj ?pred ?obj
}
The OWL definition, if present at all, is of limited use for understanding the structure of a database, and it cannot do what an XML schema does for an XML file. That is why I need something like SHEX.
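
As a rough illustration of the gap between the exploratory query and a structural description, here is a small Python sketch over an in-memory list of triples. The data, the prefixed names and both function names are made up for illustration; it does not use a SPARQL endpoint or any SHEX implementation.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

# Tiny made-up dataset as (subject, predicate, object) triples.
triples = [
    ("ex:p1", RDF_TYPE, "ex:Protein"),
    ("ex:p1", "ex:organism", "ex:t1"),
    ("ex:p1", "ex:name", '"Insulin"'),
    ("ex:t1", RDF_TYPE, "ex:Taxon"),
    ("ex:t1", "ex:name", '"Homo sapiens"'),
]

def distinct_predicates(data):
    """Counterpart of the exploratory query: all distinct predicates."""
    return sorted({p for _, p, _ in data})

def structure_summary(data):
    """Map each subject type to the (predicate, object type) pairs seen."""
    types = defaultdict(set)
    for s, p, o in data:
        if p == RDF_TYPE:
            types[s].add(o)
    summary = defaultdict(set)
    for s, p, o in data:
        if p == RDF_TYPE:
            continue
        # Objects without a known type are literals or untyped resources.
        object_types = types.get(o) or {"literal-or-untyped"}
        for subject_type in types.get(s, {"untyped"}):
            for object_type in object_types:
                summary[subject_type].add((p, object_type))
    return {t: sorted(pairs) for t, pairs in summary.items()}

print(distinct_predicates(triples))
print(structure_summary(triples))
```

The first function only returns a flat list of predicates, while the second shows which predicates belong to which type and what they point to, which is closer to what a user needs before writing a query.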

Writing a SHEX definition will also be much easier to understand and do than writing a good and complete OWL definition.

Secondly, as noted on the wiki, I will be using it to generate user interface forms and interface code, which can be done with neither OWL nor SPIN, but can be done with OSLC ResourceShapes.

Conclusion of my opinion:
*We need an equivalent of RELAX-NG in the semantic web, now represented by SHEX.
*We definitely should try to integrate it (if possible) with SPIN as much as possible.
*I disagree with not having a separate, easy to read STANDARDIZED syntax.
**Of course we need a good SHEX/RDF format, which should be used when publishing the SHEX file.

Jerven, could you please tell or retell what you think and expect SHEX should do, so that everybody knows what you are missing?

Greetz,
Jesse van Dam

Received on Wednesday, 16 July 2014 13:16:39 UTC