- From: Peter F. Patel-Schneider <pfpschneider@gmail.com>
- Date: Mon, 19 Aug 2024 18:02:17 -0400
- To: public-rdf-star-wg@w3.org
On 8/19/24 17:44, James Anderson wrote:

> good evening;
>
>> On 19. Aug 2024, at 18:06, Peter F. Patel-Schneider <pfpschneider@gmail.com> wrote:
>>
>> It is indeed true that many SPARQL implementations do a poor job of optimizing queries that use the ontology facilities of RDFS. It should be possible to run the query you provide at essentially the same speed as the version of the query that does not include subproperties on RDF graphs that have no relevant subproperty statements, and with not much loss in speed on a graph that has only a few relevant subproperty statements (compared to running the simpler query on an RDF graph that has materialized the consequences of the subproperty statements).
>
> i am curious why one would make a broad claim of this order.
> sparql formulations of the sort which combine those sorts of patterns freely in large queries targeting graphs with large subject and object cardinalities would appear to constitute a significant challenge to an optimizer.
> this, even given the expressed content restrictions.
>
> do you have any references to discussions about how one might in general optimize such a query type, and/or benchmarks which demonstrate results on that topic?
>
> best regards, from berlin

I don't view this as a broad claim. It is essentially just a claim that keeping statistics is a good idea and that there is a way to exploit these statistics in the above situations. More detail follows.

As far as I know, it is generally useful for a query optimizer to keep statistics on each property. These statistics are useful for a number of optimizations, in particular for working on intermediate results that are likely to be small before working on those that are likely to be large.

If these statistics show that determining the subproperties of a property is likely to be very cheap, a good query optimizer would generally do that part of the query first. If the results of these subqueries show that there are no non-trivial subproperties, the properties can be substituted directly into the rest of the query. So the overhead is the overhead of keeping the statistics (a good idea in general), of consulting the statistics (quick), of running the subproperty query (very quick in this case), and of doing the substitution (extremely quick here). So there is no significant overhead.

If the results of the query show only one or two subproperties, then the overhead is the difference between running a query for a small number of properties and combining the results, versus running a single query for one property with the correspondingly larger result.

So it is possible to construct a query optimizer that handles these sorts of queries without much overhead.

peter
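
As a rough sketch of the two-step strategy described above: evaluate the cheap rdfs:subPropertyOf subquery first, then substitute the resulting properties into the main pattern. The function names, the example namespace, and the use of Python/rdflib below are illustrative assumptions for the sake of a self-contained example, not part of any particular implementation or of the discussion above.

    from rdflib import Graph, URIRef

    def subproperties(graph, prop):
        # Step 1: the subproperty lookup.  On a graph with no (or few)
        # rdfs:subPropertyOf triples this is very cheap, which a real
        # optimizer would already know from its per-property statistics.
        q = """
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT DISTINCT ?p WHERE { ?p rdfs:subPropertyOf* <%s> }
        """ % prop
        return [row.p for row in graph.query(q)]

    def query_with_subproperties(graph, prop):
        # Step 2: substitute the concrete properties into the main query.
        # A single property gives back the plain triple pattern; a few
        # properties become a small VALUES enumeration (a small union).
        props = subproperties(graph, prop)
        main = """
            SELECT ?s ?o WHERE { VALUES ?p { %s } ?s ?p ?o }
        """ % " ".join("<%s>" % p for p in props)
        return list(graph.query(main))

    if __name__ == "__main__":
        g = Graph()
        g.parse(format="turtle", data="""
            @prefix ex: <http://example.org/> .
            @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
            ex:parentOf rdfs:subPropertyOf ex:relatedTo .
            ex:alice ex:parentOf ex:bob .
            ex:carol ex:relatedTo ex:dave .
        """)
        for s, o in query_with_subproperties(g, URIRef("http://example.org/relatedTo")):
            print(s, o)

On this toy data the second query enumerates both ex:relatedTo and its one subproperty ex:parentOf, so it returns the pair asserted directly with ex:relatedTo as well as the pair implied by the subproperty statement, without materializing anything in the graph.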
Received on Monday, 19 August 2024 22:02:24 UTC