Re: Thoughts on validation requirements

Hi Dimitris

Although I do not have any industry experience in this field, I have the following to note from my related research.

If we want RDF to become mainstream we shouldn't expect people to learn OWL, logics & Manchester syntax in order to formulate or understand a simple constraint.
They should exist somehow but should be moved as many levels up as possible. Similarly for SPARQL.

Regarding ShEx:

- I am also uncomfortable with untyped validation, but I also see the need to support it, unless of course RDF somewhere specifies that every resource MUST have an rdf:type. This, however, should not be the primary focus of ShEx since it is not the common case.


- Shapes related to types (as described in Resource Shapes) should be specified more explicitly and promoted. In general, these rules are easier to validate, since you can define the selectivity based on the type, and they are more common in practice.


- I also agree with Antoine Isaac that some more emphasis should be given to OWL.

- Further modularization of the syntax is needed. In almost all cases a foaf:name has the same range (and the same domain) within a single document/graph. Stating these rules separately makes the rule execution more efficient.
e.g. I can check the range (and domain) of foaf:name independently, and inside the shape I only check its existence (if specified).
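
As a rough sketch of this separation (plain Python over in-memory triples, not ShEx or RDFUnit syntax): one graph-wide rule checks the range of foaf:name once, and the shape-level rule only checks that the property exists. The IRI class, the two function names and the example.org resources are assumptions made up for the illustration.

```python
from typing import NamedTuple


class IRI(NamedTuple):
    """Hypothetical wrapper so IRIs can be told apart from string literals."""
    value: str


FOAF_NAME = IRI("http://xmlns.com/foaf/0.1/name")
FOAF_PERSON = IRI("http://xmlns.com/foaf/0.1/Person")
RDF_TYPE = IRI("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")


def check_range_is_literal(graph, predicate):
    """Graph-wide rule: every object of `predicate` must be a string literal."""
    return [(s, p, o) for (s, p, o) in graph
            if p == predicate and not isinstance(o, str)]


def check_shape_has_property(graph, rdf_class, predicate):
    """Shape-level rule: every instance of `rdf_class` has `predicate` at least once.

    The range is deliberately NOT re-checked here; that is the job of the
    graph-wide rule above.
    """
    instances = {s for (s, p, o) in graph if p == RDF_TYPE and o == rdf_class}
    with_property = {s for (s, p, o) in graph if p == predicate}
    return sorted(instances - with_property)


# Illustrative data only: triples are (subject, predicate, object) tuples.
alice = IRI("http://example.org/alice")
bob = IRI("http://example.org/bob")
carol = IRI("http://example.org/carol")

graph = [
    (alice, RDF_TYPE, FOAF_PERSON),
    (alice, FOAF_NAME, "Alice"),
    (bob, RDF_TYPE, FOAF_PERSON),  # typed, but has no foaf:name
    (carol, FOAF_NAME, IRI("http://example.org/not-a-literal")),  # wrong range
]

range_errors = check_range_is_literal(graph, FOAF_NAME)
shape_errors = check_shape_has_property(graph, FOAF_PERSON, FOAF_NAME)
```

With this split the range rule runs once per graph regardless of how many shapes mention foaf:name, and the untyped resource (carol) is still caught even though no shape selects it.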


General requirements from a validation solution

- Rule severity level. Not all errors are equal and we need some way to distinguish them. RDFUnit uses rlog [RLOG] but anything related (e.g. part of RFC 2119) could do. (see [LEVEL])


- Annotations: There should be a (standard) way for people to define annotations on top of rules. These annotations could serve many purposes, from error classification to commands on how to process the errors.


- Descriptions: Every rule should attach an error message for the end user. Some messages can be generated automatically but some cannot, so the language must provide this facility.


- Results & execution level. There should be different execution models with different result serializations. e.g. I want only a success / fail, only the error count per rule, all the individual erroneous resources, or the error instances with annotations. (I know that we need to fix the validation language first.)


- Earlier I also mentioned OWL reuse for automatic rule generation and rules attached to vocabularies [REUSE], as well as type inference [INFERENCE].
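
To make the severity and result-level requirements above a bit more concrete, here is a minimal sketch in plain Python. It is not RDFUnit's data model: the Severity names, the Violation fields and the four report level names ("status", "counts", "resources", "full") are assumptions invented for the illustration, and the severity names are not taken from rlog or RFC 2119.

```python
from collections import Counter
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; higher means more serious."""
    INFO = 1
    WARN = 2
    ERROR = 3


@dataclass(frozen=True)
class Violation:
    rule_id: str        # which rule was violated
    resource: str       # the offending resource
    severity: Severity  # how serious the violation is
    message: str        # human-readable description for the end user


def report(violations, level):
    """Return the same validation result at different levels of detail."""
    if level == "status":
        # Success / fail only: fail if anything is at ERROR severity.
        return not any(v.severity >= Severity.ERROR for v in violations)
    if level == "counts":
        # Error count per rule.
        return dict(Counter(v.rule_id for v in violations))
    if level == "resources":
        # All individual erroneous resources.
        return sorted({v.resource for v in violations})
    if level == "full":
        # Every violation instance with its annotations.
        return list(violations)
    raise ValueError(f"unknown report level: {level}")


# Illustrative data only.
violations = [
    Violation("name-range", "ex:carol", Severity.ERROR,
              "foaf:name must be a literal"),
    Violation("name-missing", "ex:bob", Severity.WARN,
              "foaf:Person should have a foaf:name"),
    Violation("name-missing", "ex:dave", Severity.WARN,
              "foaf:Person should have a foaf:name"),
]

status = report(violations, "status")
counts = report(violations, "counts")
resources = report(violations, "resources")
```

The point of the sketch is only that the validation run is the same in every case; what changes is how much of the result is serialized.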

RDFUnit in the middle too

I try to tackle all these issues in my implementation, but I had to develop my own RDF model, and it is quite hard to write RDF & SPARQL manually.
We support OWL (partially), so I used it when possible, but that is not so straightforward either.
If OSLC Resource Shapes had been submitted earlier, I might have used that instead for the common cases (although it can be further extended).
Off the top of my head, implementing OSLC would be as easy as providing a configuration file such as this [OWL-CONFIG] to cover the (typed) spec.
SPIN was also limiting for our approach, not only because of the aforementioned requirements, but also for reasons described in [RDFUNIT section 7]. However, RDFUnit could easily export everything to SPIN as well. My point is that all three existing solutions are more or less interoperable in terms of verifying constraints.

RDFUnit is a one-year R&D project and of course I do not dare to compare it to full-stack enterprise solutions like SPIN & ICV. We reused concepts from both approaches, but I think neither of them is perfect as is. What I miss is an easy & compact syntax for writing validation rules, and it looks like ShEx has good potential to provide that.
(Also note that this refers to writing/reading rules in a text editor; behind a rich user interface everything looks nice & easy.)


Dimitris Kontokostas
Department of Computer Science, University of Leipzig
Research Group:

Received on Sunday, 27 July 2014 09:36:35 UTC