- From: Holger Knublauch <holger@topquadrant.com>
- Date: Mon, 13 Apr 2015 10:01:21 +1000
- To: public-data-shapes-wg <public-data-shapes-wg@w3.org>
- Message-ID: <552B0751.2030600@topquadrant.com>
One of the main selling points of RDF technology has always been the fact that instance and schema are represented uniformly. RDF Schema and OWL class definitions are themselves instances (of metaclasses). This means that such data can not only be stored and shared together, but also be queried uniformly. In general, SPARQL queries can freely walk between meta-levels. Many other formalisms, such as XML and SQL databases, have a stricter separation between those levels. If we agree on a similarly strict separation by making it impossible to query the shapes graph from the instances graph (and vice versa), then we may throw away a unique advantage of RDF technology.

I am generally not in favor of selecting the lowest common denominator for all use cases only because certain cases may not have the best performance. I understand that we need to maintain good performance, including the ability to use native query optimizations at the database level where possible. There are also cases where the shapes model is truly separate from the database. Yet I believe there are also cases where being able to access the shape definitions at runtime is beneficial.

In this discussion, I believe we should distinguish between what we use in the SPARQL queries of the specification and what optimized implementations may do. It should be doable to assume that - in the context of the spec - the shapes graph can be in the same dataset as the actual data. So by default we would have a single dataset, and validation gets two parameters:

- the URI of the "instances" data graph (default graph)
- the URI of the shapes graph

An example of how this would work, with a single query, is the body of sh:allowedValues:

    SELECT ?this (?this AS ?subject) ?predicate ?object
    WHERE {
        ?this ?predicate ?object .
        FILTER NOT EXISTS {
            GRAPH ?shapesGraph {
                ?allowedValues (rdf:rest*)/rdf:first ?object .
            }
        }
    }

If the instances graph is in fact a remote database, then there are two ways to access it:

a) via a proxy graph API (as Jena would do it by default)
b) generate queries and send them to the endpoint directly

In case b), queries could no longer access the shapes graph, so they would need to include enough information to be self-contained. For all built-in core elements, this should be easy: just replace the GRAPH ?shapesGraph above with a FILTER NOT IN ..., and for sh:shape create a large nested query; same for OrConstraint and closed shapes (if we support these). But these things could be regarded as optimizations that any engine can implement itself, just like most engines may optimize certain recurring patterns and hard-code them instead of relying on the provided official SPARQL queries of their templates. Any other custom constraint that needs to access the shapes graph can be executed via mechanism a). This may mean that its performance is not ideal, yet at least we have a simpler job of writing the spec while maintaining improved flexibility for those (many) users that have shapes and data graphs in the same database.

Summary: generally allow use of ?shapesGraph at runtime, while making sure that optimizations remain possible for the majority of use cases.

Dimitris, would this help as an approximation? I can elaborate if you like, or we could talk off-list (I am easy to find on Skype).

Regards,
Holger

On 4/10/2015 17:20, Dimitris Kontokostas wrote:
>
>
> On Fri, Apr 10, 2015 at 10:01 AM, Holger Knublauch
> <holger@topquadrant.com <mailto:holger@topquadrant.com>> wrote:
>
> BTW another example of a constraint where the WHERE clause would
> benefit from querying the shapes graph itself is Closed Shapes.
> These could be modeled using
>
> ex:MyShape
>     sh:property [
>         ...
>     ] ;
>     sh:constraint [
>         a sh:ClosedShapeConstraint .
>     ]
>
> where sh:ClosedShapeConstraint would walk the definition of
> sh:MyShape (and possibly its super-shapes) to collect all
> sh:predicates that are used, then check that the instance has no
> property that is not among those predicates.
>
>
> Again, this is an implementation optimization. The engine could
> prebuild an additional query based on the shape definition in advance.
> Of course this also depends on the semantics of the closed shapes;
> see an example in
> https://lists.w3.org/Archives/Public/public-data-shapes-wg/2015Mar/0080.html
>
> I believe the opportunities here are great and we shouldn't limit
> such scenarios from emerging, one way or another. With a generic
> solution anyone could define variations of things like Closed
> Shapes themselves in their own macro library.
>
>
> For me it is fine to have a generic solution as long as this solution
> works in all cases.
> Revised proposed resolution: Shapes and data are expected to exist in
> different graphs unless specified otherwise, and access from
> the shapes graph to the data graph and vice versa is not required.
>
> Would anyone object to this?
>
> Best,
> Dimitris
>
>
>
> Holger
>
>
> On 4/10/15 4:35 PM, Dimitris Kontokostas wrote:
>>
>>
>> On Fri, Apr 10, 2015 at 8:19 AM, Holger Knublauch
>> <holger@topquadrant.com <mailto:holger@topquadrant.com>> wrote:
>>
>> On 4/10/2015 15:12, Dimitris Kontokostas wrote:
>>
>>
>> I think you are referring to sh:valueShape and the
>> sh:hasShape(?shape) function, right? I don't see any other
>> case that could be problematic.
>>
>>
>> Also sh:OrConstraint (or any similar template that we or
>> users may want to add, such as negation and intersection).
>>
>>
>> Why can't we move these into the validation engine? e.g. (SPARQL
>> Q1) or/xor/...
(SPARQL Q2)
>>
>> And sh:allowedValues (which takes a list or set of values, and
>> those must reside somewhere; I guess they should reside with
>> the shapes) - more generally, any template that takes rdf:List
>> arguments that need to be walked at runtime.
>>
>>
>> These should indeed reside in the shapes graph(s).
>> Implementations could either pre-build the queries or build them
>> at run-time.
>> When we are working on immutable datasets (i.e. endpoints),
>> pre-building the values into the queries would be the only option.
>> Implementations with other use cases could optimize this.
>>
>>
>> In this case, I was waiting for some clear definition of
>> recursion in order to make a proposal, but I think we have
>> many options to go with.
>> For example: if the data and the constraints are in the
>> same graph, we can use the sh:hasShape() function you
>> propose; otherwise use algorithm X to execute the ShEx
>> validation in multiple steps, or algorithm Y to convert
>> the ShEx shape into a (giant) SPARQL query similar to the
>> ShEx 2 SPARQL [1].
>>
>>
>> I don't think we should limit ourselves to the hard-coded
>> built-ins of "ShEx" here - this should work with any
>> user-defined template/macro too.
>>
>> If recursion is forbidden, things get much simpler, and
>> maybe - I need to work on this first to say for sure -
>> ShEx shapes could be just treated as class shapes with an
>> extra SPARQL filter.
>>
>> We need to have a clear definition of the ShEx shapes to
>> see our options, and we shouldn't limit the language
>> design in advance.
>>
>> Proposed resolution: Shapes and data are expected to exist
>> in different graphs unless specified otherwise
>>
>>
>> Agreed. In some cases the graph called the shapes graph could
>> be identical with the data graph, though - it would just be
>> accessed via a magic named graph name or GRAPH ?variable.
>>
>>
>> Indeed, the user could specify that they are identical in many
>> cases, and implementations can optimize execution in these cases.
>> But I think 'GRAPH ?variable' is an implementation detail; the
>> spec should assume that the data graph cannot access the shapes
>> graph - or provide alternative(s).
>>
>>
>>
>> Holger
>>
>>
>>
>>
>>
>> --
>> Dimitris Kontokostas
>> Department of Computer Science, University of Leipzig & DBpedia Association
>> Projects: http://dbpedia.org, http://aligned-project.eu
>> Homepage: http://aksw.org/DimitrisKontokostas
>> Research Group: http://aksw.org
>>
>
>
>
>
> --
> Dimitris Kontokostas
> Department of Computer Science, University of Leipzig & DBpedia Association
> Projects: http://dbpedia.org <http://dbpedia.org>,
> http://aligned-project.eu <http://aligned-project.eu>
> Homepage: http://aksw.org/DimitrisKontokostas
> <http://aksw.org/DimitrisKontokostas>
> Research Group: http://aksw.org <http://aksw.org>
>
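[Editorial sketches, not part of the original thread.] To make the two mechanisms discussed above concrete, here are two illustrative SPARQL sketches. The ex: names and the concrete value list (ex:Red, ex:Green, ex:Blue) are hypothetical and do not appear in the thread; they merely stand in for whatever a shape's definition provides.

A sketch of case b) - the pre-built, self-contained variant of the sh:allowedValues query, where the rdf:List walk over the shapes graph has been replaced by an inlined FILTER ... NOT IN so the query can be sent to a remote endpoint as-is:

```sparql
PREFIX ex: <http://example.org/ns#>

# Flags every value of ?this that is outside the (hypothetical)
# allowed list inlined at query-build time from the shapes graph.
SELECT ?this (?this AS ?subject) ?predicate ?object
WHERE {
    ?this ?predicate ?object .
    FILTER (?object NOT IN (ex:Red, ex:Green, ex:Blue))
}
```

And a sketch of mechanism a) for the closed-shapes idea - a query that walks the shapes graph at runtime via GRAPH ?shapesGraph, reporting any triple whose predicate is not declared through sh:property/sh:predicate on the shape (system properties such as rdf:type would need special-casing in practice, and the sh: namespace used here reflects the draft vocabulary discussed in the thread):

```sparql
PREFIX sh: <http://www.w3.org/ns/shacl#>

# Flags properties of ?this that the (hypothetical) shape ?shape
# does not declare; ?shapesGraph is bound by the validation engine.
SELECT ?this (?this AS ?subject) ?predicate ?object
WHERE {
    ?this ?predicate ?object .
    FILTER NOT EXISTS {
        GRAPH ?shapesGraph {
            ?shape sh:property/sh:predicate ?predicate .
        }
    }
}
```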
Received on Monday, 13 April 2015 00:02:52 UTC