RE: Sync'ing triplestores from Joshua Allen on 2005-02-08 (semantic-web@w3.org from February 2005)

From: Joshua Allen <joshuaa@microsoft.com>
Date: Tue, 8 Feb 2005 12:18:39 -0800
To: "Danny Ayers" <danny.ayers@gmail.com>
Cc: "Giovanni Tummarello" <giovanni@wup.it>, <semantic-web@w3.org>
Message-ID: <0E36FD96D96FCA4AA8E8F2D199320E5204353032@RED-MSG-43.redmond.corp.microsoft.com>

> btw, one thing I had considered was cheating and using RDBMS-backed
> (same toolkit) stores, and keeping them synchronized underneath. It
> seems MySQL 5+ will support the kind of multi-master replication I'm
> looking for, but right now there only seems to be
> master-(readonly)slave. I suspect a fairly naive algorithm on top of
> SQL might give acceptable performance, but I doubt whether it would be

Yes, exactly.  Note that people did multi-master replication for a long
time before it was supported out of the box, and the code that ships
with the DBMS is just repackaging the scripts people used to write.
Implementing it is easy; the hard part is deciding what you want the
rules for merging to be (define primary keys; multivalue vs. single
value; and conflict resolution rules).

With an RDBMS you can make assumptions based on the schema -- if a row
with an existing PK is added, you replace it.  If a row under a PK-->FK
relationship is added (with a new key on the foreign side), you add a
new row.  And that's about as complicated as it gets (except for
conflict resolution).
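
As a rough illustration of those two rules, here is a minimal Python
sketch.  The dictionaries standing in for tables, the sample rows, and
the helper name merge_rows are all made up for the example; conflict
resolution is deliberately left out (the incoming row simply wins).

```python
def merge_rows(local, incoming):
    """Merge two keyed row sets: an existing key is replaced, a new key is added."""
    merged = dict(local)
    merged.update(incoming)  # incoming wins; no real conflict resolution here
    return merged

# Parent table keyed by PK (hypothetical data).
people_a = {1: {"name": "Ann", "email": "ann@old.example"}}
people_b = {1: {"name": "Ann", "email": "ann@new.example"}}

# Child table keyed by (parent PK, child key), i.e. the PK-->FK side.
phones_a = {(1, "home"): "555-0100"}
phones_b = {(1, "work"): "555-0199"}

people = merge_rows(people_a, people_b)  # same PK: the row is replaced
phones = merge_rows(phones_a, phones_b)  # new key on the foreign side: a row is added
```

After the merge there is still one row for person 1, carrying the new
e-mail address, and two phone rows under it.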

The difference with RDF is that there is no such thing as a primary key
for a triple (other than treating the contents of the whole thing as a
key).  It's the same problem as doing merge replication on an RDBMS with
no primary keys -- you have to use the whole tuple as a key, and then
merge replication works, but the results are very likely not what
anyone would find desirable.  You just get a mess.
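
To make the "mess" concrete, here is a small Python sketch that treats
each whole triple as its own key, so merging two stores is just set
union.  The subject, predicates and values are invented for the
example.

```python
# Triples as (subject, predicate, object) tuples; all names are hypothetical.
store_a = {
    ("ex:ann", "foaf:mbox", "mailto:ann@old.example"),
    ("ex:ann", "ex:postalAddress", "1 Old Road"),
}
store_b = {
    ("ex:ann", "foaf:mbox", "mailto:ann@new.example"),
    ("ex:ann", "ex:postalAddress", "1 Old Road"),
}

# With the whole triple as the key, identical triples collapse and
# everything else is simply added: merge is set union.
merged = store_a | store_b

mailboxes = sorted(o for (s, p, o) in merged
                   if s == "ex:ann" and p == "foaf:mbox")
```

The merge "works", but the result holds both the old and the new
mailbox for the same node, with nothing to say which one is current.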

So the real problem of merging RDF stores is being able to uniquely
identify chunks of RDF independent of their full content.  The options
here seem very limited, short of some crazy "merge definition
language" in RDFS or OWL.  Even if you have a simple RDFS/OWL property
which tells you a combination of child tuples that together uniquely
identify the graph, you still have the problem that an update replaces
the entire graph, when you probably want it to merge only the
properties that have changed (if I update only the e-mail address, and
send a graph that has the old postal address, I do not want my update
to replace the current postal address).  To accomplish this, you need a
delta encoding syntax with change tracking (send a statement: "update
the following triple on the node identified by this key, and ignore
everything else under that node").  Basically a DML for RDF.  To expect
all stores to support change tracking and a standardized DML is pretty
crazy.  We don't even do that in SQL land.
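
A minimal sketch of what such a delta statement could look like, in
Python.  The delta format (a dict with "subject" and "set" entries) and
the function apply_delta are assumptions invented for illustration, not
an existing syntax or API; the node's URI stands in for "the key".

```python
def apply_delta(store, delta):
    """Replace only the predicates named in the delta; leave the rest of the node alone."""
    subject = delta["subject"]
    touched = set(delta["set"])
    # Drop the old values of the touched predicates on this node only.
    kept = {(s, p, o) for (s, p, o) in store
            if not (s == subject and p in touched)}
    # Add the new values carried by the delta.
    added = {(subject, p, o) for p, o in delta["set"].items()}
    return kept | added

store = {
    ("ex:ann", "foaf:mbox", "mailto:ann@old.example"),
    ("ex:ann", "ex:postalAddress", "2 Current Street"),
}

# "Update the mailbox on ex:ann, and ignore everything else under that node."
delta = {"subject": "ex:ann", "set": {"foaf:mbox": "mailto:ann@new.example"}}

updated = apply_delta(store, delta)
```

Because the delta names only the mailbox, the current postal address
survives the update, which a whole-graph replacement would not
guarantee.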

Received on Tuesday, 8 February 2005 20:19:12 UTC