Re: Reopening the discussion on sh:targetShape

On 7/07/2020 16:40, Håvard Ottestad wrote:
> Hi Holger,
>
> It’s not slower than the equivalent SPARQL target.
I never claimed the opposite.
>
> The optimized and combined query does not need to visit all subjects 
> and objects in the graph since it can reorder the join based on 
> statistics.
You would need to prove this for all possible combinations of SHACL Core 
and user-defined constraint types, not just for selected toy examples.
>
> I would say it’s actually better than SPARQL targets because it allows 
> for optimizations when validating transactions.
>
> What was the discussion around performance when SPARQL targets were 
> introduced?

I don't know what you mean. With SPARQL targets it is under the control 
of the shape author to select exactly the nodes and write an efficient 
query. It's an advanced feature. A core feature sh:targetShape would 
invite more people to run into slow cases. It puts the burden on 
implementers to write completely new algorithms (such as reordering on 
statistics) only to support that one feature.

I suggest we wait for feedback from other implementers before continuing 
this. This is the SHACL Community Group here, not just a dialog between 
the two of us. If you want something that works for your platform just 
do what you please.

Holger


>
> Cheers,
> Håvard
>
>> On 7 Jul 2020, at 00:58, Holger Knublauch <holger@topquadrant.com> wrote:
>>
>> 
>>
>>
>> On 7/07/2020 00:05, Håvard Ottestad wrote:
>>> Hi Holger,
>>>
>>> I've typed up some queries I believe will validate examples 2, 3 and 
>>> 4 at least as efficiently as a SPARQL target.
>>
>> Sure, but that is not the question. The problem with sh:targetShape 
>> is that it requires to visit all subjects and objects in the whole 
>> graph. It doesn't matter then whether you execute, say, the 
>> sh:pattern as SPARQL regex or natively implemented. It is still a 
>> horribly slow loop that won't scale. The only cases where it will 
>> scale are the few constraint types that avoid the iteration (e.g. 
>> hasValue) and of course filtering out nodes if you already have a 
>> small set of candidates, e.g. from a change set in a transaction.
>>
>> This is a show stopper for the whole generalized concept of target 
>> shapes which is why I suggest to limit the use cases that this can 
>> cover to hasValue scenarios that are efficient either way. As soon as 
>> we offer a too-powerful but slow feature, users will run into 
>> performance issues that will backfire on SHACL in general. Let's 
>> better be conservative with what we put into the language.
>>
>> Holger
>>
>>
>>>
>>> If you agree that these queries are correct (or close enough to 
>>> being correct), then we know that all those examples at least can be 
>>> implemented at least as performant as SPARQL targets.
>>>
>>> If there is a way to evaluate all target shapes that is as fast or 
>>> faster than using SPARQL targets then I think that sh:targetShape 
>>> should be considered on the same terms as SPARQL targets performance 
>>> wise, regardless of the poor performance you are currently seeing.
>>>
>>> I can think of one consideration that would either not perform well, 
>>> or at least be very difficult to implement, and that is recursive 
>>> SHACL since SPARQL does not support recursion. I believe that this 
>>> should not hold us back from considering sh:targetShape.
>>>
>>> Here is the list of queries.
>>>
>>> ###############################################################################
>>> # 2. Instances in namespace "company" must have appropriate class 
>>> and dc:type #
>>> ###############################################################################
>>> ex:CompanyShape a sh:NodeShape;
>>>   sh:targetShape [
>>>     sh:nodeKind sh:IRI;
>>>     sh:pattern "^https://company-graph.example.com/resource/company/";
>>>   ];
>>>   sh:class ex:Company;
>>>   sh:property [sh:path dc:type; sh:in ("conglomerate" "collective" 
>>> "enterprise")];
>>> .
>>>
>>> ### Target query ###
>>> select ?target where {
>>> {
>>> ?target ?b ?c.
>>> FILTER(isIRI(?target) && regex(str(?target), 
>>> "^https://company-graph.example.com/resource/company/") )
>>> } union {
>>> ?a ?b ?target.
>>> FILTER(isIRI(?target) && regex(str(?target), 
>>> "^https://company-graph.example.com/resource/company/") )
>>> }
>>>
>>> }
>>>
>>> ### Combined query ###
>>> select ?target ?value where {
>>> {
>>> ?target ?b ?c.
>>> ?target dc:type ?value.
>>> FILTER(NOT IN ("conglomerate" "collective" "enterprise"))
>>> FILTER(isIRI(?target) && regex(str(?target), 
>>> "^https://company-graph.example.com/resource/company/") )
>>> } union {
>>> ?a ?b ?target.
>>> ?target dc:type ?value.
>>> FILTER(NOT IN ("conglomerate" "collective" "enterprise"))
>>> FILTER(isIRI(?target) && regex(str(?target), 
>>> "^https://company-graph.example.com/resource/company/") )
>>> }
>>>
>>> }
>>>
>>> ### Optimized combined query for property shapes###
>>> select ?target ?value where {
>>> ?target dc:type ?value.
>>> FILTER(NOT IN ("conglomerate" "collective" "enterprise"))
>>> FILTER(isIRI(?target) && regex(str(?target), 
>>> "^https://company-graph.example.com/resource/company/") )
>>>
>>> }
>>>
>>>
>>> #####################################################################
>>> # 3. All langStrings must have one of a predefined set of languages #
>>> #####################################################################
>>> ex:langStringShape a sh:NodeShape;
>>>   sh:targetShape [sh:datatype rdf:langString];
>>>   sh:languageIn ("en" "bg");
>>> .
>>>
>>>
>>> ### Target query ###
>>> select ?target where {
>>> ?a ?b ?target.
>>> FILTER(DATATYPE(?target) = rdf:langString)
>>> }
>>>
>>> ### Combined query ###
>>> select ?target where {
>>> ?a ?b ?target.
>>> FILTER(DATATYPE(?target) = rdf:langString)
>>> FILTER(!langMatches(lang(?title), "en") && 
>>> !langMatches(lang(?title), "bg"))
>>> }
>>>
>>>
>>> #########################################################################################
>>> # 4. Steve is very popular, so everyone who knows at least three 
>>> people must know Steve #
>>> #########################################################################################
>>> ex:Personshape a sh:NodeShape;
>>>   sh:targetShape [sh:path foaf:knows; sh:minCount 3];
>>>   sh:property [sh:path foaf:knows; sh:hasValue ex:Steve];
>>> .
>>>
>>>
>>> ### Target query ###
>>> select ?target where {
>>> ?target foaf:knows ?count_0, ?count_1, ?count_2.
>>> FILTER(?count_0 != ?count_1)
>>> FILTER(?count_1 != ?count_2)
>>> FILTER(?count_2 != ?count_0)
>>> }
>>>
>>> ### Combined query ###
>>> select ?target ?value where {
>>> ?target foaf:knows ?count_0, ?count_1, ?count_2.
>>> FILTER(?count_0 != ?count_1)
>>> FILTER(?count_1 != ?count_2)
>>> FILTER(?count_2 != ?count_0)
>>>
>>> ?target foaf:knows ?value.
>>> FILTER NOT EXISTS {?target foaf:knows ex:Steve}
>>> }
>>>
>>>
>>> Cheers,
>>> Håvard
>>>
>>> PS. I believe that example 2 is actually a bit wrong, I'll comment 
>>> on the PR instead of in this email.
>>>
>>> On Mon, Jul 6, 2020 at 1:49 PM Irene Polikoff <irene@topquadrant.com 
>>> <mailto:irene@topquadrant.com>> wrote:
>>>
>>>     But if there is no agreement, then I am concerned about using
>>>     sh: namespace for this new construct. This does not seem right.
>>>
>>>     TQ has been adding some custom constructs into dash: namespace,
>>>     for example. Other namespaces for custom extensions could be
>>>     used, as well.
>>>
>>>     I believe sh: namespace should be reserved for things that a
>>>     majority of implementers have reached consensus on. I recognize
>>>     that currently we only have 2 implementers in this discussion.
>>>     It would be better to broaden the circle of people looking at
>>>     this topic.
>>>
>>>     If we have a strong disagreement, then the process to follow is
>>>     normally:
>>>
>>>     1. Have a call (or calls) between the concerned parties to see
>>>     if they can reach an agreement to come to some compromise.
>>>     2. If not and the objection is strong ( as in “will not
>>>     implement”), then typically, the feature would not make it to
>>>     the spec. Implementers can still do it as a custom extension.
>>>
>>>     In the past, we had some hypothetical and/or philosophical
>>>     objections that slowed progress by many many months. It was
>>>     frustrating. An objection based on the implementation experience
>>>     is, however, different. I believe it needs a serious
>>>     consideration and a resolution - even if it slows progress.
>>>     Unfortunately, there is no way to avoid some sluggishness when
>>>     multiple parties need to come to consensus - this is a downside
>>>     of standards development.
>>>
>>>     Irene
>>>
>>>     > On Jul 6, 2020, at 6:06 AM, Holger Knublauch
>>>     <holger@topquadrant.com <mailto:holger@topquadrant.com>> wrote:
>>>     >
>>>     >> On 6/07/2020 16:53, Håvard Ottestad wrote:
>>>     >>
>>>     >> Hi Holger,
>>>     >>
>>>     >> Could you share the shape that has particularly bad
>>>     performance so I can see if I can think of an optimal solution?
>>>     > For examples 2,3,4 from https://github.com/w3c/shacl/pull/3
>>>     >>
>>>     >> My plan has essentially been to convert the targetShape into
>>>     a sparql query. This would put the performance in the same realm
>>>     as sparql targets.
>>>     >>
>>>     >> The benefits of targetShape over sparql targets is that it’s
>>>     possible to validate the changes to a database efficiently, we
>>>     are seeing O(c) performance where c is the effective size of the
>>>     change instead of O(n) which is what we were seeing with sparql
>>>     targets (where n is the size of the database).
>>>     >
>>>     > The same algorithms can be applied to the dash:HasValueTarget
>>>     - it's just as declarative as the other shapes.
>>>     >
>>>     > I think we (I at least) had discussed this sufficiently. I
>>>     think I was clear on the performance issues. I think we can
>>>     agree to disagree and move on. I was hoping that other
>>>     implementers may have additional input.
>>>     >
>>>     > Holger
>>>     >
>>>     >
>>>     >>
>>>     >> Håvard
>>>     >>
>>>     >>> On 6 Jul 2020, at 03:02, Holger Knublauch
>>>     <holger@topquadrant.com <mailto:holger@topquadrant.com>> wrote:
>>>     >>>
>>>     >>> There have been various discussions around SHACL target
>>>     extensions, and there is an open Pull Request
>>>     https://github.com/w3c/shacl/pull/3 to add sh:targetShape as a
>>>     new target type to SHACL-AF. I have meanwhile attempted to
>>>     implement that feature in our code base and have concluded that
>>>     the feature is not a good idea for SHACL-AF (or even SHACL
>>>     Core). The main argument is still about performance:
>>>     >>>
>>>     >>> - I stated that the worst-case performance of this general
>>>     feature is *catastrophic* as it needs to perform validation on
>>>     all subjects and objects only to determine which nodes it then
>>>     needs to validate for real. This means that sh:targetShape is
>>>     very different from the other 4 built-in target types
>>>     (sh:targetClass, sh:targetNode, sh:targetSubjectsOf,
>>>     sh:targetObjectsOf) in that it requires validation before
>>>     validation (which by itself causes implementation complexity).
>>>     >>>
>>>     >>> - Håvard stated that the alternative, SPARQL-based targets
>>>     has bad performance for his implementation.
>>>     >>>
>>>     >>> We do have similar use cases to yours, esp around
>>>     dependencies across multiple properties. For example:
>>>     >>>
>>>     >>> IF ex:country=USA THEN ex:state sh:in [ "AZ", "CA", "FL" ... ]
>>>     >>> IF ex:country=AU THEN ex:state sh:in [ "NSW", "VIC", "QLD" ... ]
>>>     >>>
>>>     >>> We also want a declarative solution that can be used by
>>>     input forms, so that if the user changes the country then the
>>>     states drop down list also changes. So relying on SPARQL queries
>>>     or so wouldn't solve our use cases either.
>>>     >>>
>>>     >>> The current proposal, based on the new keyword
>>>     sh:targetShape was
>>>     >>>
>>>     >>> ex:USAStateShape
>>>     >>>     sh:targetShape [
>>>     >>>         a sh:PropertyShape ;
>>>     >>>         sh:path ex:country ;
>>>     >>>         sh:hasValue ex:USA ;
>>>     >>>     ] ;
>>>     >>>     sh:property [
>>>     >>>         sh:path sh:state ;
>>>     >>>         sh:in [ "AZ" "CA" "FL" ... ]
>>>     >>>     ] .
>>>     >>>
>>>     >>> I believe the following is better overall:
>>>     >>>
>>>     >>> ex:USAStateShape
>>>     >>>     sh:target [
>>>     >>>         a dash:HasValueTarget ;
>>>     >>>         dash:predicate ex:country ;
>>>     >>>         dash:object ex:USA ;
>>>     >>>     ] ;
>>>     >>>     sh:property [
>>>     >>>         sh:path sh:state ;
>>>     >>>         sh:in [ "AZ" "CA" "FL" ... ]
>>>     >>>     ] .
>>>     >>>
>>>     >>> Where dash:HasValueTarget is a SPARQL-based Target Type
>>>     https://w3c.github.io/shacl/shacl-af/#SPARQLTargetType
>>>     >>>
>>>     >>> Implementations of SHACL-AF already will do the right thing
>>>     and will be able to do so efficiently. If you cannot use SPARQL
>>>     efficiently, your platform can simply hard-code this pattern,
>>>     just like you currently hard-code the common scenarios of the
>>>     proposed sh:targetShape property to avoid the bad default
>>>     performance. I expect the difference in your implementation
>>>     would be marginal, but we neither need to change the spec nor
>>>     open up SHACL to a feature that is very complex to implement
>>>     efficiently.
>>>     >>>
>>>     >>> The downside of using something like dash:HasValueTarget is
>>>     that it doesn't "cover" all possible use cases. Instead of
>>>     allowing arbitrary sh:targetShapes we limit this to hasValue
>>>     patterns. But those hasValue patterns were the main use cases
>>>     before we brainstormed that "it would be nice" to also support
>>>     various other shape types (sh:filterShape etc). hasValue
>>>     patterns are trivial to look up. If anyone needs additional
>>>     patterns, such as the one from the PR then they can be covered
>>>     by custom targets which may also get hard-coded by those that
>>>     cannot use SPARQL.
>>>     >>>
>>>     >>> BTW the case of
>>>     http://datashapes.org/constraints.html#HasValueInConstraintComponent
>>>     can be covered by having multiple dash:HasValueTargets with
>>>     different dash:objects. A bit more verbose but can reuse the
>>>     same machinery. If you have long lists, introduce your own
>>>     dash-like extension backed by a SPARQL query and hard-code
>>>     against that if performance isn't good.
>>>     >>>
>>>     >>> Sorry for moving back and forth on this topic, but getting
>>>     hands-on experience with an implementation revealed to me just
>>>     how bad the sh:targetShape solution would become. And I couldn't
>>>     schedule time for such an implementation earlier due to other
>>>     commitments.
>>>     >>>
>>>     >>> It would be useful to have input from other SHACL
>>>     implementers (there are about a dozen SHACL engines out there,
>>>     and counting). We really don't want to rush something through
>>>     which then becomes a burden for others.
>>>     >>>
>>>     >>> Holger
>>>     >>>
>>>     >>>
>>>     >>>
>>>     >
>>>

Received on Tuesday, 7 July 2020 07:09:37 UTC