Re: relational data as a bona fide member of the SM from Sampo Syreeni on 2011-11-25 (semantic-web@w3.org from November 2011)

From: Sampo Syreeni <decoy@iki.fi>
Date: Sat, 26 Nov 2011 01:45:10 +0200 (EET)
To: Frank Manola <fmanola@acm.org>
cc: Alexandre Riazanov <alexandre.riazanov@gmail.com>, Semantic Web List <semantic-web@w3.org>
Message-ID: <alpine.DEB.2.00.1111252350490.18419@lakka.kapsi.fi>
On 2011-11-03, Frank Manola wrote:

> From my point of view, a major reason for focusing on unary and binary 
> predicates (the logical forms that underlie RDF triples) is that it's 
> easier to deal with the problems of integrating heterogeneous data (a 
> key issue in the semantic web) if the data is in (or is mapped to 
> being in) that form, as opposed to data in arbitrary arity relations 
> [...]

(I'm pretty sure RDF doesn't much mind about unaries.)

From my angle n-ary relations are not much of a problem. It's just that 
it's easier to do propositional and predicate logic with (implicitly) 
variable arity relations. (The arity then comes from the number of 
preceding quantifiers.) With no type theory in between, you don't really 
see the difference between subject-predicate-object relations and the 
ones where you have an n-ary relation between a (sometimes composite) 
key and a number of functionally related attributes (i.e. a potential 
ton of functionally dependent ones). That's precisely what happens in 
RDF as well. The only real difference is that RDF insists upon having a 
single field key, which is then owl:equal to the composite key. And when 
that key is a blank node, then hey, it's suddenly a surrogate key in RM 
terms.

> (for example, with n-aries you need a schema to interpret any tuples 
> you encounter "in the wild", otherwise you don't know what the 
> "columns" mean).

True. RM in its basic form can handle neither incomplete knowledge nor 
de novo, unexpected schema. Because of that, it tried to handle both of 
those in its later versions, in its own way. Incomplete knowledge by way 
of nulls (which have been reinterpreted successfully as "relation 
marks"), and unexpected schema even earlier by exposing the schema as 
well as relations, so as to make relational manipulation of the schema 
possible as well.

Obviously no RDBMS makes it natural to implement all of the data models 
RDF admits. Even the underlying relational model does not, I grant you 
that. But then on the other hand, RM was designed from the start to be 
something like description logics: tractable at the cost of not being 
fully universal. In the case of DL the governing principle was logical 
consistency for reasoning systems. In the case of RM it was practical 
realizability of 1970's transactional production databases, coder 
productivity on top of them as against the CODASYL mess, and in all a 
limitation to "something we can already solve". RM started out as a very 
pragmatically minded business, even if it took its while to come into 
fruition, you know.

> one of the major approaches to developing the mappings between the 
> various relational schemas was by interpreting the various local 
> schemas in terms of unary and binary relations for just this reason 
> [...]

Been there, done that. And you know, at the time, I was rather amazed to 
discover that Ora Lassila, as one of my own firm, had been influential 
in getting the first RDF draft through. That was in fact before I had 
really delved into relational theory myself.

> (compound keys had to be dealt with in this way too, because the same 
> combinations of columns didn't necessarily constitute the keys in 
> otherwise corresponding relations in the different local schemas).

My thinking about both single and compound keys is that, actually, you 
need object identifiers to make them come together. This is full-blown 
heresy for a relational guy like me. But still, hear me out.

A key is a key, no? It's supposed to be used as a unit, always and 
everywhere. Thus, the domain of keys in all is fully isomorphic to the 
domain of object identifiers from the ORM side of the picture. It's just 
an identity which can be used wherever and whenever to refer to an 
object. The relational model then has its counterpart to this kind of a 
thing: the surrogate key.

RM says you're never supposed to relinquish the internal value of a 
surrogate key to the user of a database. Just as you wouldn't divulge 
the physical address of a blank RDF node to anybody, ever, over any 
interface. At most you'd let them operate on them using a query 
language, and by extension, you'd make *darn* sure that even if you 
broke the separation of concerns/protocol layers by passing on something 
lower-minded, you'd never ever take responsibility for what happens if 
people started relying on their embedded semantics. And so on.

I hope you can start to see how similar RM and RDF actually are. The 
only real differences happen because of a) fixed versus variable arity 
(just make each RDF attribute an object of the key as subject, with 
predicate naming the column), b) the local versus global naming of the 
objects (this is where RM falls short and RDF reigns supreme; so why not 
name things within the RM by URIs as well?), and c) RDF can talk 
anything about anything.
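
Point (a) can be sketched in a few lines of Python. This is only an 
illustration under assumptions that are not part of the discussion 
above: the function name row_to_triples, the example.org base URI, and 
the employee table with its columns are all made up, and folding the 
composite key into one subject URI this way is just one possible 
convention.

```python
# Hypothetical sketch: names, URIs and data are invented for illustration.
def row_to_triples(base_uri, table, key_cols, row):
    """Turn one relational row (a dict) into (subject, predicate, object) triples."""
    # The (possibly composite) key is collapsed into a single-field subject.
    key = "/".join(str(row[c]) for c in key_cols)
    subject = f"{base_uri}/{table}/{key}"
    # Every column yields one triple; the predicate names the column.
    return [(subject, f"{base_uri}/{table}#{col}", val)
            for col, val in row.items()]

row = {"dept": "rd", "emp_no": 7, "name": "Ada"}
triples = row_to_triples("http://example.org", "employee",
                         ["dept", "emp_no"], row)
for t in triples:
    print(t)
```

The key columns are emitted as ordinary triples too, so nothing from the 
row is lost when the subject URI is treated as opaque.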

The last part is a bit tricky. In my mind it tells me that even within 
RM, every single "row" of a relation ought to have a synonym of sorts, 
which attaches to one or more of its keys. Sort of like an OID, or a 
surrogate, synthetic key, which maps 1-1 to all of the relation's inborn 
keys.

I mean, in that way, it's not too difficult to see how URIs can 
variously be attached or not to every single relation one can imagine. 
In full accordance with RDF, on the other hand. That's why we can have a full 
isomorphism between the triple/RDF/EAV model and the relational model, 
on two different hands.
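
A rough way to see the claimed round trip is to pivot triples into rows 
and back. The sketch below is an assumption-laden toy, not a proof: the 
"ex:" identifiers are invented, and it only round-trips when each 
(subject, predicate) pair has at most one object, i.e. when every 
predicate is functionally dependent on the subject, as a column is on a 
key.

```python
from collections import defaultdict

# Hypothetical sketch; assumes at most one object per (subject, predicate).
def triples_to_rows(triples):
    """Group triples by subject: each subject becomes one row keyed by predicate."""
    rows = defaultdict(dict)
    for s, p, o in triples:
        rows[s][p] = o
    return dict(rows)

def rows_to_triples(rows):
    """Inverse direction: flatten each row back into triples."""
    return [(s, p, o) for s, attrs in rows.items() for p, o in attrs.items()]

triples = [
    ("ex:emp/7", "ex:name", "Ada"),
    ("ex:emp/7", "ex:dept", "rd"),
    ("ex:emp/8", "ex:name", "Bob"),
]
rows = triples_to_rows(triples)
round_trip_ok = sorted(rows_to_triples(rows)) == sorted(triples)
print(rows)
print(round_trip_ok)
```

Here the subject plays the role of the row's synonym: a single 
identifier standing in 1-1 for whatever keys the relation natively has. 
A multi-valued predicate would overwrite itself in the pivot, which is 
exactly where the simple correspondence stops being lossless.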

Do give me a counter-example or minimal pair, since I'm prolly making 
myself too clear right now.

> Mind you, if you're NOT worried about integrating heterogeneous data, 
> RDF introduces extra pain of its own (figuring out all those 
> identifiers, for one thing), but if you ARE worried about integrating 
> heterogenous data, I think you want those identifiers around.

To amplify, we *definitely* want those identifiers around. My above 
analysis was mostly about how RM and RDF could be fit together in the 
most natural way. It didn't say anything about the value of publicly 
shared keys, which is what I think RDF's URIs are. And what RM sadly 
lacks.

> I don't quite understand your argument. Indeed, interoperability is 
> the target. Syntactic interoperability is not a problem as long as you 
> use the same or convertible syntaxes.

I'm something of a pragmatist myself. I wouldn't want to have to parse 
RDF/XML myself, for example. Even NTriples. I'd rather like a parser 
which doesn't have to deal with whitespace, even. Something that could 
be proven to be correct, even, which is very, very hard if there are any 
alternatives in the syntax of the parsed language/protocol.

> Semantic interoperability requires shared understanding of the 
> identifiers being used, which has nothing to do with arity.

There we're in full mutual understanding. Thus I'll refrain from 
commenting on that point from now on.

> Reinterpreting legacy relational schemas is a related, but separate 
> issue. Binary predicates are often handy to represent attributes, but 
> it does not mean n-ary predicates cannot be helpful in the same 
> (although I could not recall a real example) and other KR tasks.

At the same time, I've painted the triple/EAV/RDF representation of 
n-ary relations as a sort of a reification already. From the 
mathematical logic point of view, that is far from trivial. And I 
sort of think many of the W3C standards around RDF have become much too 
difficult for common consumption by programmers, precisely because of 
this divide.

That's not a happy circumstance to me, because I'd like the Semantic Web 
to "just fucking work". So to speak.

> The original question (I thought) was why there weren't relational 
> approaches applied in Semantic-Web-like contexts (where, as you say, 
> interoperability is the target).

Yes, I think so. Though don't ever think it's the only point. (I don't 
think your point would be any more simplistic.)

> I cited the integration of heterogeneous relational databases to argue 
> that, in this case, where relations were already being used by all 
> parties, and interoperability was the target, those doing the 
> integration found that using unaries and binaries helped [...]

Actually that is then one case where relational theory long preceded 
RDF, despite its lack of shared identifiers. (Which, mind you, would have 
made the effort so much easier.) There is an entire literature of 
relational model mapping, which eventually landed at restricted second 
order logic as the model for that sort of thing which eventually 
closes upon itself.

(And no, most of the dependency or query theories don't close as neatly. 
Template dependencies seem to, but then there's no efficient realization 
of them. Unlike indices for inclusion dependencies, in foreign keys...

My eventual point is that RM theory is well-advanced, as is that of 
description logics and the like. I'd like to see some cross-fertilization 
instead of mutual bickering, for a change.)

> (I agree that shared understanding of the identifiers is necessary 
> for semantic interoperability, but in RDF+OWL, at least the 
> identifiers are *there*; those putting the data on the Web had to 
> create them).

Fully agreed. That is one of the novel innovations of the distributed 
semantic web. TimBL would prolly agree, as per his early writings on the 
topic.

> All that RDF is doing is starting from the unaries and binaries.

Where's the unary, by the way? I'm only seeing ternaries, and not even 
binaries. What binaries there are, are named ones, and thus actually 
ternaries.

For example, it's very difficult to express in RDF one of the basic 
ideas presented in Codd's RM/T. That is, the unary statement that "we 
have this thing called <x>". Without at the same time saying anything 
more about it, like "<x> is also <class-wise==y> related to the class 
<z>".

If you look at RM/T, or even RM/V2, you can see definite connections 
towards object-relational mapping. Which I've adopted wholesale in my own 
relational modelling discipline. And then some. (I've actually 
constructed counter-examples in working relational databases towards 
Codd's original vision.)

But even when I can work with RDF at the same time, many of the natural 
constructions I use in that sort of work don't seem to translate 
naturally into the ternaries which RDF relies upon.

> Nor is it an argument that you can't do semantic integration using 
> n-ary relations.

I again repeat: certainly not. On the contrary I think there is a 
natural isomorphism between the two models, there. It's just that, 
somehow, people don't see it too clearly, and could then perhaps benefit 
from reading more closely both the underlying theory of RDF/EAV/DL, and 
RL/FOPL as well.

> I think it's *easier* to do that integration with the RDF approach, 
> [...]

It absolutely is. But then it's also much less efficient and clean to do 
something with the integrated data. Unless you turn it back into RM. If 
you do, we're in perfect harmony. If you don't, I'll bet you or the 
fellow who inherits your integration framework will be in a world of 
hurt. :)

For the most part I've used EAV/CR as an example of a middle-ground. 
Because one of the papers having to do with it seemed to say: "keep the 
funky stuff as EAV, put the rest in proper relational tables, and then 
use metadata and a middleware solution to distinguish between the two". 
Now I'm not so sure whether they go with that sort of a sensible 
solution anymore.

> There have certainly been attempts to provide more general KRs 
> (allowing n-ary predicates) for data/knowledge exchange; [...]

N-ary is hard. So then let's do what mathematicians do so well: let's 
define a full isomorphism between n-ary and what we already know how to 
handle in DL, so as to reduce (at least part of) the problem to what we 
already know. That will reveal many aspects of RM as well to be at a 
reified level which escapes the formalism, as well. So be it. But let's 
at least do that, no?

> Perhaps someone with more experience with those languages can chip in 
> here (Pat?) and cite their experiences in using them to integrate 
> large amounts of data, [...]

Absolutely. Though now I have three separate Pats in mind. One whom 
I've bumped heads against in the past. The one you prolly refer to. And 
then the sadomasochist, gender changing guru who I've truly never had 
the privilege of meeting even online. ;)

> This isn't a pragmatic vs. theoretical issue, it's a question of what 
> problem you're trying to solve. RDF is based on the open world 
> assumption because it's designed with the Web in mind, and the Web, 
> unlike a relational database, is open.

My real problem is that it could be based on both. Via explicit 
metadata. It could be just a syntax for expressing, among other things, 
that certain things are to be taken with an open and other ones with a 
closed syntax. Yet it's fixed as being open, no matter what.
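
To make the "explicit metadata" idea concrete, here is a toy sketch in 
Python. Nothing in it is an existing RDF mechanism: the holds function, 
the closed_predicates set and the "ex:" identifiers are all 
hypothetical, and a missing fact is modelled as None (unknown) under the 
open reading and as False under the closed one.

```python
# Hypothetical sketch: per-predicate choice between open and closed world.
def holds(facts, closed_predicates, triple):
    """Return True, False, or None (unknown) for a queried triple."""
    if triple in facts:
        return True
    # Absent fact: false if its predicate is declared closed, else unknown.
    return False if triple[1] in closed_predicates else None

facts = {("ex:emp/7", "ex:worksIn", "ex:dept/rd")}
closed_predicates = {"ex:worksIn"}  # metadata: this predicate is complete

q_closed = ("ex:emp/7", "ex:worksIn", "ex:dept/sales")
q_open = ("ex:emp/7", "ex:knows", "ex:emp/8")

print(holds(facts, closed_predicates, q_closed))
print(holds(facts, closed_predicates, q_open))
```

Whether a consumer should believe the declaration in closed_predicates 
is a separate question of trust, which this sketch does not address.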

That then means that certain kinds of relational databases (my 
favourite; my job) can't be expressed in it at all. Or if I'm mistaken, 
how do you express the kinds of closed world semantics I usually work 
with, in RDF?

> I don't have a problem with the OWA in general. The problem is the OWA 
> is there even when you don't want it, specifically when you want to be 
> able to specify a piece of data completely and unambiguously. With 
> OWA, you cannot compute the length of a list because somebody else can 
> redefine the list somewhere.

Yes. Though, perhaps it's just suitable that your assertion of a closed 
world within the syntax might not be believed. Or might, then. There 
perhaps the biggest problem is that the trust portion of the layer pie 
hasn't developed as rapidly as it could/should have.

In sum, I think we're on the same tracks. On more than one topic. Me 
likes.
-- 
Sampo Syreeni, aka decoy - decoy@iki.fi, http://decoy.iki.fi/front
+358-50-5756111, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
Received on Friday, 25 November 2011 23:45:42 UTC