RE: plural vs singular properties (a proposal)

On 2008-01-07, tim.glover@bt.com wrote:

> A *triple store* is already a perfectly good, general purpose 
> relational model. The key is the whole triple. It represents entities 
> which can have multiple properties each with multiple attributes. All 
> triple stores are valid relational models.

As somebody who comes from a relational background, I'd want to add that 
this is not quite the whole story. In relational circles, this sort of 
design is called an entity-attribute-value, or EAV, model, and it's a 
perennially contentious design choice. The reason it has been chosen 
for RDF and why it regularly comes up in relational schemata is that 
it's completely general, so that it enables us to handle semi-structured 
data whose precise structure we do not know beforehand. This is, for 
example, what enables RDF graphs to be merged. But the design also 
exacts a price in performance, integrity and semantic precision.

The performance cost is seen when lots of related data is lifted from an 
RDBMS in the EAV form -- usually we end up with hugely complicated 
self-joins which clearly underperform the properly normalized 
equivalent. The integrity cost is easily seen in the fact that any 
relational schema, be it normalized or not, can be cast in EAV, and in 
that properly defining integrity constraints for all of the different 
entity types encapsulated in the generic EAV representation is all but 
impossible to get right. As for semantic precision, if you really want 
to be able to handle any data whatsoever, obviously you cannot depend on 
any particularities but only on the highest level, common structure. 
That doesn't really enable anything more than merging, which is not a 
full-blown application in itself.
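
To make the self-join point concrete, here is a minimal sketch using 
Python's bundled sqlite3 module. The person table, its columns and the 
sample rows are all made up for illustration; they are not from any 
real schema. The same two rows are stored once in normalized form and 
once as EAV, and then read back from both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized form: one row per person, one typed column per attribute.
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT, age INTEGER);
    INSERT INTO person VALUES (1, 'Ann', 'Helsinki', 34), (2, 'Bob', 'Oslo', 41);

    -- EAV form: one row per (entity, attribute, value); every value is text.
    CREATE TABLE eav (entity INTEGER, attribute TEXT, value TEXT);
    INSERT INTO eav VALUES
        (1, 'name', 'Ann'), (1, 'city', 'Helsinki'), (1, 'age', '34'),
        (2, 'name', 'Bob'), (2, 'city', 'Oslo'),     (2, 'age', '41');
""")

# Reading the normalized table is a plain scan.
normalized = conn.execute(
    "SELECT name, city, age FROM person ORDER BY id"
).fetchall()

# Rebuilding the same rows from EAV needs one extra self-join per
# attribute, plus a cast to recover the type the value column lost.
from_eav = conn.execute("""
    SELECT n.value, c.value, CAST(a.value AS INTEGER)
    FROM eav AS n
    JOIN eav AS c ON c.entity = n.entity AND c.attribute = 'city'
    JOIN eav AS a ON a.entity = n.entity AND a.attribute = 'age'
    WHERE n.attribute = 'name'
    ORDER BY n.entity
""").fetchall()
```

Both queries return the same rows, but the EAV one grows by a join for 
every attribute we want back, and nothing in the eav table stops an 
'age' row from holding a value that is not a number.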

When we deal only with binary relations with no extra constraints, the 
representation is still generally sound. As you say, the problem comes 
when we try to represent dependencies like uniqueness, single-valuedness 
or the obligatory presence of an attribute, but also in the 
representation of higher arity relations. To me the last part seems like 
the nastiest one. Suppose we have a ternary relation that cannot be 
non-loss decomposed, like the prototypical lecture one used in all of 
the normalization tutorials. It is keyed by room, lecturer and time and 
might contain additional attributes pertaining to the lecture as a 
whole, like its subject. In RDF you'd in effect be forced to represent 
the relation as a number of binary ones, perhaps utilizing blank nodes 
as surrogate keys standing for a single lecture, and adding quite a lot 
of extra semantics to the model to capture the implications of what lies 
beneath. Worse yet, the representation would not really look quite the 
same as the "native" one available to binary relations; extra code would 
clearly be needed to handle the higher arity case. Relational practice 
frowns upon that sort of thing, preferring to leave the relation as it 
is and using relational algebra to handle all of the data uniformly.
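
As a rough sketch of what that extra code looks like, take the lecture 
relation in plain Python. The room, lecturer, time and subject values, 
the "_:l1" style surrogate names and the unreify helper are all 
hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Native form: one ternary relation keyed by (room, lecturer, time),
# with the subject as a dependent attribute.
lectures = {
    ("R101", "Smith", "Mon 10:00"): "Algebra",
    ("R102", "Jones", "Mon 10:00"): "Logic",
}

# Binary-only form: each lecture gets a blank-node-like surrogate and
# four triples hanging off it.
triples = [
    ("_:l1", "room", "R101"),
    ("_:l1", "lecturer", "Smith"),
    ("_:l1", "time", "Mon 10:00"),
    ("_:l1", "subject", "Algebra"),
    ("_:l2", "room", "R102"),
    ("_:l2", "lecturer", "Jones"),
    ("_:l2", "time", "Mon 10:00"),
    ("_:l2", "subject", "Logic"),
]


def unreify(triples):
    """Group triples by surrogate node and rebuild the keyed relation."""
    by_node = defaultdict(dict)
    for subject, predicate, obj in triples:
        by_node[subject][predicate] = obj
    return {
        (node["room"], node["lecturer"], node["time"]): node["subject"]
        for node in by_node.values()
    }


rebuilt = unreify(triples)
```

The two rows turn into eight triples, and the grouping step in unreify 
is exactly the sort of special handling a plain binary property never 
needs. Note also that nothing in the triple list enforces the key on 
(room, lecturer, time); that would take still more code.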

In a sense that suggests to me that EAV is a means of reifying data. 
When we use it for higher arity relations, we're stepping once up the 
ladder and using triples to talk about relations representing data, not 
the underlying data itself. That means that we have to augment our 
applications to unreify the data on the fly, which then causes the 
performance and integrity implications. That sort of reasoning then 
suggests that RDF stored in a relational database ought to translate 
from RDF semantics to the relational one and preferably aim at a 
well-designed, conventional, relational schema, instead of a pure triple 
store. All of the data needed to do so is also present as soon as we 
have some concrete application in mind, instead of just the generalized 
description base RDF offers.

> Eg an entity with unique values for all its properties can be 
> represented by a single relation with one row per entity and one 
> attribute per property.

Another thing is that the distinction between entities and relationships 
is not really part of the relational model. The above lecture example 
already shows how: usually under RM, you wouldn't try to explicitly 
represent lectures as entities even though they have their own, 
independent attributes. Under RM, everything is just a relation between 
some atomic values and what really governs schema design are data 
dependencies, not intuitions related to entities, classes or similar 
semantic constructs.

> The semantics of these models are lost in translation to triple 
> stores, because the uniqueness constraints are lost.

That is correct, though.
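
A small sketch of that loss, again with Python's sqlite3 module and 
made-up table and property names: a single-valued "born" property is 
enforced by the key of a conventional table, but not by a triple table 
whose key is the whole triple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conventional relation: the key on id makes 'born' single-valued.
    CREATE TABLE person (id TEXT PRIMARY KEY, born TEXT NOT NULL);
    -- Triple store: the key is the whole triple.
    CREATE TABLE triple (s TEXT, p TEXT, o TEXT, PRIMARY KEY (s, p, o));
""")

conn.execute("INSERT INTO person VALUES ('ann', '1970-01-01')")
try:
    # A second birth date for the same person violates the key.
    conn.execute("INSERT INTO person VALUES ('ann', '1980-02-02')")
    relational_rejected = False
except sqlite3.IntegrityError:
    relational_rejected = True

# The triple table accepts both, since the two triples differ in o.
conn.execute("INSERT INTO triple VALUES ('ann', 'born', '1970-01-01')")
conn.execute("INSERT INTO triple VALUES ('ann', 'born', '1980-02-02')")
triple_count = conn.execute(
    "SELECT COUNT(*) FROM triple WHERE s = 'ann' AND p = 'born'"
).fetchone()[0]
```

The person table refuses the second birth date, while the triple table 
ends up holding two of them for the same subject.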

> *RDF* on the other hand has additional semantics for specific 
> properties, rdf:type, rdf:subPropertyOf etc, which are not part of the 
> relational model.

Under RM, those would be expressed as data dependencies, leading to 
integrity constraints. Most often they take the form of inclusion 
dependencies, which are then implemented as foreign key constraints. 
Existing RDBMSes tend to be a bit limited in how far their constraint 
mechanisms carry for this sort of thing, but that's already separate 
from theory.
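
As one possible illustration, a sub-property statement (the RDF Schema 
term is rdfs:subPropertyOf) can be read as an inclusion dependency and 
declared as a composite foreign key. This is only a sketch of the 
constraint reading: the has_parent and has_mother tables are invented 
for the example, and SQLite needs foreign key enforcement switched on 
per connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign keys unenforced unless this is set.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE has_parent (
        child  TEXT,
        parent TEXT,
        PRIMARY KEY (child, parent)
    );
    -- Inclusion dependency: every (child, mother) pair must also
    -- appear in has_parent, i.e. has_mother is a sub-property.
    CREATE TABLE has_mother (
        child  TEXT,
        mother TEXT,
        PRIMARY KEY (child, mother),
        FOREIGN KEY (child, mother) REFERENCES has_parent (child, parent)
    );
""")

conn.execute("INSERT INTO has_parent VALUES ('bob', 'ann')")
conn.execute("INSERT INTO has_mother VALUES ('bob', 'ann')")  # accepted

try:
    # 'eve' is not recorded as a parent of 'bob', so this is refused.
    conn.execute("INSERT INTO has_mother VALUES ('bob', 'eve')")
    violation_caught = False
except sqlite3.IntegrityError:
    violation_caught = True

mother_rows = conn.execute("SELECT COUNT(*) FROM has_mother").fetchone()[0]
```

Here the database rejects a has_mother row with no matching has_parent 
row instead of inferring the missing one, which is the relational way 
of looking at the same dependency.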
-- 
Sampo Syreeni, aka decoy - mailto:decoy@iki.fi, tel:+358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2

Received on Monday, 7 January 2008 16:20:36 UTC