- From: Sampo Syreeni <decoy@iki.fi>
- Date: Mon, 7 Jan 2008 18:20:23 +0200 (EET)
- To: tim.glover@bt.com
- cc: garret@globalmentor.com, andrewfnewman@gmail.com, fmanola@acm.org, semantic-web@w3.org
On 2008-01-07, tim.glover@bt.com wrote:

> A *triple store* is already a perfectly good, general purpose
> relational model. The key is the whole triple. It represents entities
> which can have multiple properties each with multiple attributes. All
> triple stores are valid relational models.

As somebody who comes from a relational background, I'd want to add that this is not quite the whole story. In relational circles, this sort of design is called an entity-attribute-value, or EAV, model, and it's a uniformly contentious design choice. The reason why it has been chosen for RDF and why it regularly comes up in relational schemata is that it's completely general, so that it enables us to handle semi-structured data whose precise structure we do not know beforehand. This is for example what enables RDF to be merged.

But the design also exacts a price in performance, integrity and semantic precision. The first part is seen when lots of related data is lifted from an RDBMS in the EAV form -- usually we end up with hugely complicated self-joins which clearly underperform the properly normalized equivalent. The second part is easily seen in the fact that any relational schema, be it normalized or not, can be cast in EAV, and in that properly defining integrity constraints for all of the different entity types encapsulated in the generic EAV representation is all but impossible to get right. As for the final part, if you really want to be able to handle any data whatsoever, obviously you cannot depend on any particularities but only the highest level, common structure. That doesn't really enable anything more than merging, not really a full-blown application in itself.

When we deal with binary relations with no extra constraints only, the representation is still generally sound.
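
To make the self-join and integrity points above a bit more concrete, here is a minimal sketch using Python's standard sqlite3 module. The `person` and `eav` tables, their columns and the sample rows are all invented for illustration; nothing here comes from a real schema or a real triple store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conventional, normalized form: one row per entity.
    CREATE TABLE person (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        city TEXT NOT NULL,
        age  INTEGER NOT NULL
    );
    -- Generic EAV / triple form: one row per (entity, attribute, value).
    CREATE TABLE eav (
        entity    INTEGER NOT NULL,
        attribute TEXT NOT NULL,
        value     TEXT NOT NULL,
        PRIMARY KEY (entity, attribute, value)  -- the key is the whole triple
    );
""")

# Hypothetical sample data, loaded into both representations.
people = [(1, "Alice", "Helsinki", 34), (2, "Bob", "Espoo", 27)]
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)", people)
for pid, name, city, age in people:
    conn.executemany(
        "INSERT INTO eav VALUES (?, ?, ?)",
        [(pid, "name", name), (pid, "city", city), (pid, "age", str(age))],
    )

# Normalized: a single table, no joins.
normalized_rows = conn.execute(
    "SELECT name, city, age FROM person WHERE city = 'Helsinki'"
).fetchall()

# EAV: one extra self-join per attribute we want back in the same row.
eav_rows = conn.execute("""
    SELECT n.value, c.value, CAST(a.value AS INTEGER)
    FROM eav AS c
    JOIN eav AS n ON n.entity = c.entity AND n.attribute = 'name'
    JOIN eav AS a ON a.entity = c.entity AND a.attribute = 'age'
    WHERE c.attribute = 'city' AND c.value = 'Helsinki'
""").fetchall()

# Integrity: the EAV table happily accepts a second, conflicting age,
# because its only key is the whole triple.
conn.execute("INSERT INTO eav VALUES (1, 'age', '35')")
ages_for_alice = conn.execute(
    "SELECT COUNT(*) FROM eav WHERE entity = 1 AND attribute = 'age'"
).fetchone()[0]
```

Both queries return the same row, but the EAV one grows by a join for every additional attribute, and single-valuedness of `age` can only be declared in the normalized table.
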
As you say, the problem comes when we try to represent dependencies like uniqueness, single-valuedness or the obligatory presence of an attribute, but also in the representation of higher arity relations. To me the last part seems like the nastiest one.

Suppose we have a ternary relation that cannot be non-loss decomposed, like the prototypical lecture one used in all of the normalization tutorials. It is keyed by room, lecturer and time and might contain additional attributes pertaining to the lecture as a whole, like its subject. In RDF you'd in effect be forced to represent the relation as a number of binary ones, perhaps utilizing blank nodes as surrogate keys standing for a single lecture, and adding quite a lot of extra semantics to the model to capture the implications of what lies beneath. Worse yet, the representation would not really look quite the same as the "native" one available to binary relations; extra code would clearly be needed to handle the higher arity case. Relational practice frowns upon that sort of thing, preferring to leave the relation as it is and using relational algebra to handle all of the data uniformly.

In a sense that suggests to me that EAV is a means of reifying data. When we use it for higher arity relations, we're stepping once up the ladder and using triples to talk about relations representing data, not the underlying data itself. That means that we have to augment our applications to unreify the data on the fly, which then causes the performance and integrity implications.

That sort of reasoning then suggests that RDF stored in a relational database ought to translate from RDF semantics to the relational one and preferably aim at a well-designed, conventional, relational schema, instead of a pure triple store. All of the data needed to do so is also present as soon as we have some concrete application in mind, instead of just the generalized description base RDF offers.
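
The lecture example can be sketched roughly as follows, again in Python with sqlite3. The blank node labels (`_:lec1`, `_:lec2`), the property names, the `unreify` helper and the sample values are all assumptions made up for this illustration; a real RDF vocabulary would look different.

```python
import sqlite3
from collections import defaultdict

# Triple form: each blank node stands in for one lecture (a surrogate key).
triples = [
    ("_:lec1", "room", "A111"),
    ("_:lec1", "lecturer", "Smith"),
    ("_:lec1", "time", "Mon 10:00"),
    ("_:lec1", "subject", "Databases"),
    ("_:lec2", "room", "A111"),
    ("_:lec2", "lecturer", "Smith"),
    ("_:lec2", "time", "Mon 10:00"),
    ("_:lec2", "subject", "Logic"),  # same room/lecturer/time: triples don't object
]

def unreify(triples):
    """Regroup triples by blank node into (room, lecturer, time, subject) rows."""
    by_node = defaultdict(dict)
    for subject_node, prop, value in triples:
        by_node[subject_node][prop] = value
    return [
        (d["room"], d["lecturer"], d["time"], d["subject"])
        for _, d in sorted(by_node.items())
    ]

lecture_rows = unreify(triples)

# Relational form: the relation is left as it is, keyed by room, lecturer and time.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE lecture (
        room     TEXT NOT NULL,
        lecturer TEXT NOT NULL,
        time     TEXT NOT NULL,
        subject  TEXT NOT NULL,
        PRIMARY KEY (room, lecturer, time)
    )
""")

accepted, rejected = [], []
for row in lecture_rows:
    try:
        conn.execute("INSERT INTO lecture VALUES (?, ?, ?, ?)", row)
        accepted.append(row)
    except sqlite3.IntegrityError:
        # The key constraint catches what the triple form let through.
        rejected.append(row)
```

The extra `unreify` step is the kind of application code the higher arity case needs on top of plain triples, and the key violation only surfaces once the data is back in a conventional relation.
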

> Eg an entity with unique values for all its properties can be
> represented by a single relation with one row per entity and one
> attribute per property.

Another thing is that the distinction between entities and relationships is not really part of the relational model. The above lecture example already shows how: usually under RM, you wouldn't try to explicitly represent lectures as entities even though they have their own, independent attributes. Under RM, everything is just a relation between some atomic values and what really governs schema design are data dependencies, not intuitions related to entities, classes or similar semantic constructs.

> The semantics of these models are lost in translation to triple
> stores, because the uniqueness constraints are lost.

That is correct, though.

> *RDF* on the other hand has additional semantics for specific
> properties, rdf:type, rdf:subPropertyOf etc, which are not part of the
> relational model.

Under RM, those would be expressed as data dependencies, leading to integrity constraints. Most often they take the form of inclusion dependencies, which are then implemented as foreign key constraints. Existing RDBMSes tend to be a bit limited in how far their constraint mechanisms carry for this sort of thing, but that's already separate from theory.

--
Sampo Syreeni, aka decoy - mailto:decoy@iki.fi, tel:+358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
Received on Monday, 7 January 2008 16:20:36 UTC