Re: Shapes/ShEx or the worrying issue of yet another syntax and lack of validated vision. from Jerven Bolleman on 2014-07-16 (public-rdf-shapes@w3.org from July 2014)

From: Jerven Bolleman <jerven.bolleman@isb-sib.ch>
Date: Wed, 16 Jul 2014 22:40:28 +0200
To: "Dam, Jesse van" <jesse.vandam@wur.nl>
Cc: "public-rdf-shapes@w3.org" <public-rdf-shapes@w3.org>
Message-Id: <4BFA4009-FA00-4157-8209-012E3D47EBE6@isb-sib.ch>
Dear Jesse, All,

I will try to formulate what I think is needed for validation in RDF, and I am ignoring some of the other requirements that have been voiced.
The issue of documenting what data is present in an RDF file or SPARQL endpoint is
nicely covered by VoID, an underused ontology (see also the HCLS note on describing datasets [1]), and I would love to see more UI work done on large VoID files.

1. Validation must be considered together with its results. Validation is rarely a boolean value; almost always more actions are required after getting invalid data.
2. Validation of information must also be considered at the level of an information system, not only at the document level.
3. Validation is an area where the details matter. We need to express acceptable ranges for literals, and IRI patterns in relation to anything.

Need 1.

Take my earlier example: a citizen service number (BSN) is unique and should occur only once in the database.
Validation fails because a non-unique citizen service number is added.
Either value could be the correct one, the first or the second; no one knows.
In practice this means that my validator needs to infer that a “Citizen Service Number Consolidation Action” is required.
The same is true for simpler use cases, for example a simple data upload with a missing field.
It is not enough to say “INVALID”; you need to say “User needs to fix missing field”.
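
A minimal sketch of this idea in Python, not tied to SPIN or ShEx: the validator returns findings that each name a follow-up action instead of a single boolean. The record layout (dicts with "id" and "bsn" keys) and the `Finding` class are assumptions made for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    subject: str  # identifier of the record the finding is about
    problem: str  # what is wrong
    action: str   # what has to happen next


def validate(records):
    """Return a list of findings; an empty list means nothing to do.

    `records` is a hypothetical list of dicts with an "id" key and an
    optional "bsn" key.
    """
    findings = []
    ids_by_bsn = defaultdict(list)
    for rec in records:
        bsn = rec.get("bsn")
        if not bsn:
            findings.append(Finding(rec["id"], "missing BSN",
                                    "User needs to fix missing field"))
        else:
            ids_by_bsn[bsn].append(rec["id"])
    # A BSN used by more than one record cannot be resolved automatically,
    # so every record involved gets a consolidation action.
    for bsn, ids in ids_by_bsn.items():
        if len(ids) > 1:
            for rec_id in ids:
                findings.append(Finding(rec_id, f"BSN {bsn} is not unique",
                                        "Citizen Service Number Consolidation Action"))
    return findings
```

The caller can then route each finding by its `action`, which is the part a plain valid/invalid answer cannot express.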

Need 2.

Again taking the BSN use case, the validation procedure needed to talk to a central system for real validation to occur.
For example, the intake screen has one field for the BSN, which always gets filled in.
However, we needed to check that names and addresses matched what was in the other system, to see whether the data was really valid,
as opposed to just seeing whether all the check boxes were ticked.
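
As a rough sketch of such a cross-system check in Python: the form is compared with the record the central system holds for the same BSN. The `lookup_central` callable is a hypothetical stand-in for the call to the central system, and the field names are assumptions.

```python
def check_intake(form, lookup_central):
    """Compare an intake form with the central record for the same BSN.

    `lookup_central` is a hypothetical callable that takes a BSN string and
    returns a dict with "name" and "address" keys, or None if unknown.
    Returns a list of problems; an empty list means the form matches.
    """
    central = lookup_central(form["bsn"])
    if central is None:
        return ["BSN unknown in central system"]
    return [f"{field} does not match central system"
            for field in ("name", "address")
            if form.get(field) != central.get(field)]
```

For a quick trial the central system can be faked with a dict, e.g. `check_intake(form, registry.get)`. In an RDF setting the lookup would more likely be a federated query.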

Need 3.

The BSN does not just exist; it has a valid string form, including a checksum. This BSN would be inside an IRI in RDF,
e.g. <http://rijksoverheid.nl/burger/111222333>. We need to check that it has the correct check digit.
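
A sketch of what such a check could look like in Python. It assumes the IRI pattern of the example above and the usual description of the BSN "elfproef" (11-test): nine digits, weighted 9, 8, 7, 6, 5, 4, 3, 2 and -1, with a sum divisible by 11. The function names are illustrative, not part of any standard.

```python
import re

# Assumed IRI layout, taken from the example IRI above.
BSN_IRI = re.compile(r"^http://rijksoverheid\.nl/burger/([0-9]{9})$")

# Weights of the 11-test as commonly described for BSNs (an assumption here).
WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)


def bsn_from_iri(iri):
    """Return the nine-digit BSN embedded in the IRI, or None if the IRI
    does not follow the assumed pattern."""
    match = BSN_IRI.match(iri)
    return match.group(1) if match else None


def has_valid_check_digit(bsn):
    """True if `bsn` is nine ASCII digits whose weighted sum is divisible by 11."""
    if not re.fullmatch(r"[0-9]{9}", bsn):
        return False
    total = sum(weight * int(digit) for weight, digit in zip(WEIGHTS, bsn))
    return total % 11 == 0
```

For the example IRI the weighted sum is 66, so `has_valid_check_digit(bsn_from_iri("http://rijksoverheid.nl/burger/111222333"))` is true under these assumptions.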

Consequence 1.

Something like 'SPIN rule' is almost more important than 'SPIN constraint' for dealing with results of a validation.

Consequence 2.

Validation should be about RDF in federated data systems. Something like SPARQL SERVICE is a very real need.

Consequence 3.

Being able to actively investigate inside literals and IRIs is very important. A standard set of functions that can be used to
express things like “check_digits” would be nice to have. The possibility to add functions or magic properties to software as needed,
or JS actions in ShEx, is a real need for developers.

I would like to note that SPIN can meet all these needs today. Having written more than 300 SPIN constraints and
nearly 1700 SPIN rules, I think I know where it is a pain and where it excels.

I am not opposed to a compact syntax, but I would rather see a choice for an “extended” syntax.
Add to Turtle what is needed to meet the mission; don’t start from scratch. (JSON-LD would be fine as well, or heck, RDF/XML with sugar.)

Separate design thought.

I think the charter should keep the following in mind: make it easy to ignore extra triples, and make it hard to
disallow such extra triples. The XML Schema world closed off its eXtensibility with too many rigid schemas, e.g.
everyone should have a given and family name, and if they ever write down their nickname we are going to refuse all the information,
instead of just taking what we need.
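
To illustrate the difference, here is a small Python sketch in which ignoring extra triples is the default and rejecting them has to be asked for explicitly. Triples are plain (subject, predicate, object) tuples, and the FOAF properties are only an illustrative choice for given name, family name and nickname.

```python
GIVEN = "http://xmlns.com/foaf/0.1/givenName"
FAMILY = "http://xmlns.com/foaf/0.1/familyName"
NICK = "http://xmlns.com/foaf/0.1/nick"


def check_person(triples, subject, required=(GIVEN, FAMILY), closed=False):
    """Return a list of problems for `subject`; an empty list means it passes.

    By default (open) predicates outside `required` are ignored.
    Only with closed=True are they reported as problems.
    """
    predicates = [p for s, p, _ in triples if s == subject]
    problems = [f"missing {p}" for p in required if p not in predicates]
    if closed:
        problems += [f"unexpected {p}" for p in predicates if p not in required]
    return problems
```

With the open default, a person who also has a nickname still passes and the given and family name can simply be taken; only the closed variant refuses the data.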

Regards,
Jerven

PS. Jesse, I will pass on your compliment about the schema document at http://beta.sparql.uniprot.org/taxonomy to Leyla and Sebastien, who
spent a lot of effort on getting these right.

[1] http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html

On 16 Jul 2014, at 15:16, Dam, Jesse van <jesse.vandam@wur.nl> wrote:

> Hi,
> 
> This is my opinion and comments based on some of the concerns raised by Jerven Bolleman and Holger Knublauch.
> 
> In my opinion a schema language for graph databases is missing. We need a schema language that can do the same thing as the other schema languages do for XML.
> A list of useful usages of these schema languages can be found on the net, and it is clear that the solution created for XML files called RELAX-NG is used and successful. To me it is clear that such a solution is missing in the world of graph databases and, if created, will be used a lot.
> 
> I, however, agree that the list of intended usages on the SHEX wiki is incomplete and that the statement "validate RDF documents" is too broad, as validation can be done for different kinds of purposes. SHEX will most likely not cover all these purposes, but SHEX + SPIN probably will.
> 
> The reasons stated above are why SHEX is heavily inspired by RELAX-NG. However, when translating this to the semantic web we have to take into consideration other existing standards in the semantic web, which include SPIN, the OSLC ResourceShapes project, RDFS, OWL and SADI.
> 
> SHEX is inspired by regular expressions and RELAX-NG, which was also inspired by regular expressions. That is why a lot of time has already been spent trying to synchronize SHEX as much as possible with these sources. This effort has already resulted in 3 different validation methods/semantics, of which the last one is directly inspired by RELAX-NG regular expression derivatives (with thanks to Jose Emilio Labra Gayo).
> Efforts are being made to synchronize as much as possible with regular expressions, so please let us know if you think there is an option to synchronize it better. As you noted it is not.
>> Did you notice that your use of the question mark is not consistent with any other commonly used syntax, e.g. regex, globs, ternary logic etc.? For sure leading to a lot of confusion.
> I recommend you to take a look at this page http://www.w3.org/2013/ShEx/EvaluationLogic.html. 
> 
> Now looking at SPIN. I do agree we should align as much as possible with SPIN and use SPIN where possible. Note that we are not building a validation language that can do everything, but one that captures what RELAX-NG can do for XML, as SPIN already exists and can do the complex things.
> That is why semantic actions are included, which can be defined in SPARQL, and if everything is converted to RDF it will automatically become SPIN (no mention of that yet, as SHEX is still under development).
> However, the structure descriptions spanned by SHEX can be defined/'programmed' (if possible) in many different ways in SPIN. So if we can define SHEX in SPIN, we still need a standard, which will then be a set of SPIN rules/templates.
> 
> Furthermore, I do agree with the concern (from Holger) that SHEX becomes a language that would be hard-coded against a certain collection of patterns only, and limited to those patterns.
> However, it might not be possible to define the regular expression derivatives in SPIN. So it should be tested how well we can 'program' SHEX with SPIN and, if we can do it, how it compares to the other 3 defined validation methods/semantics. The result of this test would allow us to make a decision in the future.
> In my opinion it would be really beautiful if we could define SHEX in SPIN, as it would allow for easy extension and would not need extra code to validate it. Note, however, that as SHEX is recursive it cannot be done with SPARQL only, and SPIN functions are needed.
> 
> If we look at RDFS and OWL, I think there are good reasons not to include them in SHEX. RDFS and especially OWL are well-designed standards for doing reasoning; however, they were in no way ever intended as a language to describe a database structure or to be used as a schema language for validation. It is a pity that many people misused these standards for these or similar purposes. For further reasons I would advise you to read the conclusion of this paper (http://arxiv.org/pdf/1404.1270v1.pdf).
> Another thing you cannot do in OWL is to define the following:
> Type A -> ex:samepred1 -> Type B 
> Type C -> ex:samepred1 -> Type D
> If you define this in OWL you will also get the following, because OWL is property-oriented and a property can only define a range and a domain:
> Type A -> ex:samepred1 -> Type D
> Type C -> ex:samepred1 -> Type B
> Which is something we do not want.
> An OWL file does not tell me or let me understand the structure of an RDF database or validate the structure, whereas an XML schema file does so very successfully for an XML file.
> Furthermore, I think that SHEX and OWL are nicely complementary to each other and can link to each other via the rdf:Property, each standard doing something different: SHEX describing the structure, SPIN for more complex validation, and OWL for extra semantic description (especially rdfs:subPropertyOf is useful) and reasoning.
> 
> If we look at SADI and the OSLC ResourceShapes project, both are related to (web) services; this part is not included in SHEX, although it is related. If we compare with WSDL, we see that it captures the service part but references the XML schema standards. A similar thing should happen to SHEX in the future; however, both SADI and OSLC ResourceShapes were created before SHEX came to be, so there is no reference here.
> OSLC ResourceShapes has defined something similar to but more simplistic than SHEX. SHEX should be synchronized with this effort, which is being taken into consideration; however, both standards are under active development.
> SADI contains a description of the input and output format; however, these are used for the nice feature of discovering services. To do so they used reasoning (OWL DL) to find semantically related services, which is fine. However, it is not possible to define a schema as in WSDL, hence XML schema, as this is something still missing in the semantic web. Although they could have used SPIN.
> 
> About custom syntaxes: I think they are a good thing, as they make things more readable, and writing a parser these days is rather easy. Many different parser (generator) tools exist. It makes it easy for users to understand and start using the SHEX syntax. Of course it is important that a good RDF representation exists, so that it can be used by the tools and it is possible to analyze the SHEX definition.
> I do agree the syntax should be a 'user interface' to the standard (SHEX/RDF) used for sharing, but that does not mean that this syntax should not be standardized as well.
> I agree it is good to look at the techniques used in the SPIN standard and try to apply them to defining the RDF representation of SHEX. Note that defining the RDF representation of SHEX is still work in progress. I do not agree that it is difficult to represent SHEX as RDF.
> In my opinion SPIN functions/templates are missing a nice syntax to write them in, forcing a user either to write in N3 or to use the TopBraid user interface, neither of which is handy (at least for me) if you really want to use SPIN as a programming language. I do really like SPIN because it is elegant and powerful.
> 
> As for the concern about service calls, that is something that, like many other topics, still has to be researched and defined in the SHEX language; SHEX is still under active development.
> 
> Finally, I would like to bring forward my most important use case for SHEX. In the past I have been training users to use SPARQL; however, the number 1 problem I encounter here is that there is no method for the users to understand the structure of the database so that they can create their own queries. Only the following 2 solutions exist: (1) The data publisher explains the structure using a UML diagram or something like VISIO (example: http://beta.sparql.uniprot.org/taxonomy). (2) The user uses SPARQL queries to explore and browse the database, which often starts with the following query.
> SELECT DISTINCT ?pred
> WHERE
> {
>  ?subj ?pred ?obj
> }
> The OWL definition, if present at all, has limited use for understanding the structure of a database, and it cannot do what an XML schema does for an XML file. That is why I need something like SHEX.
> 
> Writing a SHEX definition will also be much easier to understand and do than writing a good and complete OWL definition.
> 
> Secondly, as noted on the wiki, I will be using it to generate user interface forms and interface code, which cannot be done with either OWL or SPIN, but can be done with OSLC ResourceShapes.
> 
> Conclusion to my opinion:
> *We need an equivalent of RELAX-NG in the semantic web, now represented by SHEX.
> *We definitely should try to integrate it with SPIN as much as possible
> *I disagree with not having a separate easy to read STANDARDIZED syntax.
> **Of course we need a good SHEX/RDF format, which should be used when publishing the SHEX file
> 
> Jerven, could you please tell or retell what you think and expect SHEX should do, so everybody can know what you miss.
> 
> Greetz,
> Jesse van Dam
> 

-------------------------------------------------------------------
Jerven Bolleman                        Jerven.Bolleman@isb-sib.ch
SIB Swiss Institute of Bioinformatics      Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
1211 Geneve 4,
Switzerland     www.isb-sib.ch - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------
Received on Wednesday, 16 July 2014 20:41:02 UTC