Re: PG mode and SA mode from Richard Cyganiak on 2019-09-23 (public-rdf-star@w3.org from September 2019)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Mon, 23 Sep 2019 22:34:51 +0100
To: Steve sarsfield <steve.sarsfield@cambridgesemantics.com>
Cc: public-rdf-star@w3.org
Message-Id: <38D3A5E9-AAFA-42C0-A2A9-BB33F2DC8A10@cyganiak.de>

Thanks, Steve. In summary, since RDF Reification uses several triples to represent a single annotation, it requires more join operations and complex join patterns.
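
To make that summary concrete, here is a minimal Python sketch, not taken from any particular store. It holds one annotated statement both ways and looks the annotation up again. The resource names (ex:alice, ex:knows, ex:bob, ex:certainty), the blank node _:s1 and both helper functions are assumptions made up for illustration.

```python
# Illustrative data only: all ex:* names and the node _:s1 are hypothetical.

# Standard RDF Reification: four triples describe the statement,
# and a fifth triple carries the annotation.
reified = [
    ("_:s1", "rdf:type", "rdf:Statement"),
    ("_:s1", "rdf:subject", "ex:alice"),
    ("_:s1", "rdf:predicate", "ex:knows"),
    ("_:s1", "rdf:object", "ex:bob"),
    ("_:s1", "ex:certainty", "0.9"),
]

# RDF*: the annotated triple is nested as the subject of a single triple.
rdf_star = [
    (("ex:alice", "ex:knows", "ex:bob"), "ex:certainty", "0.9"),
]


def certainty_reified(triples, s, p, o):
    """Look up the annotation via reification: three patterns are joined
    on the statement node before the annotation pattern can be matched."""
    def subjects(pred, obj):
        return {subj for (subj, pr, ob) in triples if pr == pred and ob == obj}

    statements = (
        subjects("rdf:subject", s)
        & subjects("rdf:predicate", p)
        & subjects("rdf:object", o)
    )
    return [ob for (subj, pr, ob) in triples
            if subj in statements and pr == "ex:certainty"]


def certainty_star(triples, s, p, o):
    """Look up the annotation in the RDF* form: one pattern, no join."""
    return [ob for (subj, pr, ob) in triples
            if subj == (s, p, o) and pr == "ex:certainty"]
```

Note that in the reified form ex:alice appears only in object position, so connecting the annotation to ex:alice's other properties is a subject/object join rather than a same-subject join.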

Thinking out loud -- not a question to you specifically...

It is interesting to note that in the early days of RDF, some databases had optimisations for RDF Reification. For example, the venerable Jena ModelRDB used a separate table to store reified statements. These designs fell out of fashion when SPARQL appeared on the scene. Named graphs addressed many (but not all) of the use cases for RDF Reification in a better way. And there was little appetite among vendors for supporting both SPARQL's Named Graphs and RDF Reification. This still seems to be the case today.

I wonder whether there is much difference in implementation cost between (i) dedicated optimisations for RDF Reification, and (ii) support for RDF*/SA. Both probably require a separate index for the reified/nested triples. The difference is mainly whether the triple ID is internal to the database (RDF*/SA) or can be accessed/managed by the application (RDF Reification).
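
A rough Python sketch of what I mean by a separate index, under the assumption that both designs can share one side table from statement ID to (s, p, o). The class and method names are hypothetical and do not correspond to Jena ModelRDB or any RDF* implementation.

```python
class NestedTripleIndex:
    """Hypothetical side index mapping statement IDs to (s, p, o) triples."""

    def __init__(self):
        self.triple_by_id = {}
        self.ids_by_triple = {}

    def add(self, statement_id, triple):
        """RDF Reification style: the application supplies and manages the
        ID (for example a blank node label), and several IDs may describe
        the same triple."""
        self.triple_by_id[statement_id] = triple
        self.ids_by_triple.setdefault(triple, []).append(statement_id)

    def intern(self, triple):
        """RDF*/SA style (as assumed here): the database assigns an internal
        ID, and a given triple always maps to the same ID."""
        ids = self.ids_by_triple.get(triple)
        if ids:
            return ids[0]
        statement_id = len(self.triple_by_id)
        self.add(statement_id, triple)
        return statement_id


triple = ("ex:alice", "ex:knows", "ex:bob")

# Internal IDs: repeated use of the nested triple yields one entry.
star_index = NestedTripleIndex()
first_id = star_index.intern(triple)
second_id = star_index.intern(triple)

# Application-managed IDs: two reifications of the same triple coexist.
reification_index = NestedTripleIndex()
reification_index.add("_:s1", triple)
reification_index.add("_:s2", triple)
```

The storage layout is the same in both cases; what differs is who chooses the key and whether one triple can have more than one.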

I don't doubt that an efficient RDF*/PG implementation is much simpler.

Richard

> On 23 Sep 2019, at 18:30, Steve sarsfield <steve.sarsfield@cambridgesemantics.com> wrote:
> 
> >>Could you expand a bit on this? What is it about reification that creates these performance issues?
> >>Is it something that is inherent to the design of RDF Reification, or is it something about the way it is generally implemented?
> 
> Distributed RDF graph databases shard their data so that all of the triples associated with a given subject reside on the same node. In LPG terms, this means that "whole vertexes" are stored on nodes. This is done to minimize network communication during typical query processing, because network interconnects are nearly two orders of magnitude slower than main memory, let alone cached memory. This gap in throughput and latency between the interconnect and local memory is a primary driver in cost-based query planners: they choose traversal/join operations that minimize traffic over the interconnect.
> 
> Analytic queries that reference multiple properties of the same subject ("same-subject joins") are thus optimized, and only subject/object (or object/object) joins/traversals move significant data over the interconnect. With reified data, these same-subject joins become network-intensive subject/object (and/or object/object) joins. Reification also uses more triples than a single RDF* triple to model the same information. This has both storage and performance implications, since extra joins are required to get at the information.
> 
> Thanks,
> 
> Steve
>

Received on Monday, 23 September 2019 21:35:42 UTC