Re: SPARQL EXISTS

good morning;

it is apparent from the discussion last week that the working group feels that addressing problems with exists is within its charter.
i have thought for years about exists and lateral joins, just never in what i thought was the context of a recommendation.
if that is now the situation, the working group should allow adequate opportunity to consider the following:

- proposal 1 (issue #156 - improved substitution) describes a mechanism which, while it is not quadratic, is linear in the size of the dominant solution field times a constant corresponding to the execution effort of the filter select form. where the filter clause involves just comparisons, the factor is small enough that its effect is negligible.
where the filter involves exists, the situation changes.
in the sparql recommendation text passages and in examples appearing in discussions of corrections to the exists operator definitions, the argument to exists is frequently just a single statement pattern.
the intent, however, is to permit an arbitrary subselect (see https://www.w3.org/TR/2013/REC-sparql11-query-20130321/#rExistsFunc).
where the clause is just a single statement pattern, the factor will be large enough to be notable with large dominant solution field cardinalities.
where the clause is a subselect, the execution effort is likely to be large enough that the combination of the filter with the dominant solution field will exhibit abysmal performance. 
dydra has implemented exists based on dynamic variable binding since well before any discussion of changes to the definition.
the revised substitution mechanism is not unrelated in its execution complexity.
we have received enquiries from customers as to why queries were timing out.
when we investigated, we found that the dominant solution field cardinality was several million.
the consequences were evident.
we would expect the substitution approach to exists to exhibit similar behaviour.
this argues against it.

- proposal 2 (issue #156 - semi-join/anti-join) should be considered more thoroughly in the context of extending sparql to include lateral joins.
while some form of join operation will be necessary to effect the correlation between the dominant and dependent solution fields with adequate performance, the mechanism described as proposal 2 reads as if it would not accomplish this in a performant manner.
some form of lateral join will be necessary to accomplish this.
that will require adequate time to explore its alternatives.

- the substitution mechanism from proposal 1 should be considered more thoroughly in the context of query parameterisation.

- the absence of a lateral join form is a much more serious language deficiency than inconsistencies in the exists results among implementations.
considering how rarely exists appears in practice, it is a question whether it would be better to first define a lateral join mechanism and then implement exists in terms of that.
the prevalence of exists in queries in our service over the past years is less than one in a thousand.
in contrast, from the comments on the sparql-dev issue concerning correlated subqueries (https://github.com/w3c/sparql-dev/issues/100), running from 2019 to this year, it appears that at least five implementations have incorporated this operation in some way, despite there being no recommended definition.
unfortunately, the only notion of conformance is limited to the existence of test suites from jena and oxigraph.

- rdf4j (https://github.com/eclipse-rdf4j/rdf4j/issues/4315)
- oxigraph (https://github.com/oxigraph/oxigraph/issues/267)
- jena (https://jena.apache.org/documentation/query/service_enhancer.html, https://github.com/apache/jena/issues/1615)
- stardog (https://docs.stardog.com/query-stardog/stored-query-service#correlated-subqueries)
- dydra (https://observablehq.com/@datagenous/correlated-subqueries-in-sparql)

that could be construed to indicate a much more significant market demand for a lateral operator than for a corrected definition for exists - especially if the latter would follow from the former.
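
as a rough illustration of the cost argument made under proposal 1, here is a minimal python sketch. the toy triples, the predicate names and both function names are hypothetical, and neither function is the mechanism defined in issue #156. the first re-evaluates the dependent pattern once per dominant solution; the second evaluates it once and probes a hash set, which is valid only for this simple single-variable correlation.

```python
from collections import defaultdict

# toy data: (subject, predicate, object); all values are hypothetical
TRIPLES = [
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("c", "name", "carol"),
    ("a", "name", "alice"),
]

def match(predicate, s=None):
    """scan the triples for a predicate, optionally with a bound subject."""
    return [(ts, to) for (ts, tp, to) in TRIPLES
            if tp == predicate and (s is None or ts == s)]

def exists_by_substitution(outer, counter):
    """re-evaluate the dependent pattern once per dominant solution."""
    kept = []
    for sol in outer:
        counter["inner_evaluations"] += 1
        if match("name", s=sol["s"]):
            kept.append(sol)
    return kept

def exists_by_semijoin(outer, counter):
    """evaluate the dependent pattern once, then probe a hash set."""
    counter["inner_evaluations"] += 1
    subjects_with_name = {s for (s, _o) in match("name")}
    return [sol for sol in outer if sol["s"] in subjects_with_name]

# dominant solution field: one solution per "knows" statement
outer = [{"s": s} for (s, _o) in match("knows")]

c1, c2 = defaultdict(int), defaultdict(int)
r1 = exists_by_substitution(outer, c1)
r2 = exists_by_semijoin(outer, c2)
```

both functions keep the same solutions, but the first evaluates the dependent pattern once per dominant solution, while the second evaluates it once regardless of that cardinality. with several million dominant solutions and a subselect as the dependent pattern, that difference is the one which matters.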

---
james anderson | james@dydra.com | https://dydra.com

Received on Tuesday, 8 October 2024 01:33:26 UTC