RE: Sync'ing triplestores from Joshua Allen on 2005-04-04 (semantic-web@w3.org from April 2005)

From: Joshua Allen <joshuaa@microsoft.com>
Date: Mon, 4 Apr 2005 11:03:28 -0700
To: Bill de hÓra <bill.dehora@propylon.com>, <semantic-web@w3.org>
Cc: "Danny Ayers" <danny.ayers@gmail.com>
Message-ID: <0E36FD96D96FCA4AA8E8F2D199320E5204B3EA74@RED-MSG-43.redmond.corp.microsoft.com>

 
> to do this seems to be for stores to expose a triples feed. 
> That is, a store would publish all new deletes, updates and 
> inserts as a data stream. That way, any other store's agent 
> can subscribe to the feed.

> maybe for ever). It might lack the precision those coming 
> from the enterprise database background would expect or 
> insist upon, but there is a history of failure in regard to 

Yes, I agree 100% with the whole message.

Now, as for choosing the right level of richness to support in the feed...  Actually, coming from an enterprise database background, I would point out that the idea of a feed with insert/delete (an update is just a delete followed by an insert) is a traditional workhorse for fault tolerance and load balancing.  This is how replication and log shipping work. And as you pointed out, it's similar to how NNTP and RSS work, although deletes are rare in NNTP and impossible in RSS.
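
To make the insert/delete idea concrete, here is a minimal sketch in Python. The entry shape `(op, triple)`, the function names `apply_feed` and `update`, and the example triples are all assumptions for illustration; nothing here is a proposed wire format.

```python
# Hypothetical feed entry: (op, triple), where op is "insert" or "delete"
# and triple is a (subject, predicate, object) tuple.

def apply_feed(store, feed):
    """Replay feed entries, in order, against a set of triples."""
    for op, triple in feed:
        if op == "insert":
            store.add(triple)
        elif op == "delete":
            store.discard(triple)  # deleting an absent triple is a no-op
        else:
            raise ValueError(f"unknown op: {op!r}")
    return store


def update(old, new):
    """An update expressed as a delete followed by an insert."""
    return [("delete", old), ("insert", new)]


replica = {("ex:post1", "dc:title", "Draft")}
feed = update(("ex:post1", "dc:title", "Draft"),
              ("ex:post1", "dc:title", "Final"))
apply_feed(replica, feed)
```

Because inserts and deletes on a set are no-ops when they are already satisfied, replaying the same entries a second time leaves the replica unchanged.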

Such a design could quickly become too complicated for easy implementation, but I think we can avoid that.  Some design points I'm thinking of (using the terminology "server" for the source replica and "client" for the destination replica):
1) How far back to keep changes?  If clients could ask for changelist "since a certain timestamp", the servers would have to save the changelist for that amount of time.  This, IMO, is too complicated.  Servers are unlikely to follow consistent rules on retention, and forcing the client to maintain client-side last-version timestamp and pass it in the querystring is a high bar.  So I would propose that retention be left up to each server, a number of days or a number of entries, just as with RSS.  I would also propose that we don't bother supporting "changes since", or at least make it optional.  Clients would just get the dump of "all changes in the last n days", even if their copy is relatively fresh.
2) Need a feed with full dump.  For some data like blog entries, comments, and newsgroup entries, there is no compelling need to make sure you have a full replica at all times.  But for other data it could become important.  The easiest way to support this is to require two feeds: one with a full replica, and one with "all changes in the last n days", as described above.  Then, when new clients want to get fully up-to-date, or when old clients want to flatten and start over (perhaps because they missed some entries), they could use the full dump to bootstrap.
3) No batching, transactions, etc.  There are good reasons to support such extra functionality, since it facilitates better data integrity and some additional scenarios.  But it would kill adoption.
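
The three points above could be sketched roughly as follows, in Python. The class names `FeedServer` and `FeedClient`, the `max_entries` retention setting, and the in-memory method calls standing in for HTTP fetches are all assumptions made for illustration; a real server would pick its own retention rule (days or entries) and serve both feeds as documents.

```python
from collections import deque


class FeedServer:
    """Hypothetical source replica exposing two feeds."""

    def __init__(self, max_entries=1000):
        self.triples = set()
        # Retention is the server's own choice; here, a fixed number of entries.
        self.changes = deque(maxlen=max_entries)

    def insert(self, triple):
        self.triples.add(triple)
        self.changes.append(("insert", triple))

    def delete(self, triple):
        self.triples.discard(triple)
        self.changes.append(("delete", triple))

    def full_dump(self):
        """Feed 1: the complete replica, used to bootstrap."""
        return set(self.triples)

    def recent_changes(self):
        """Feed 2: every retained change, with no 'changes since' parameter."""
        return list(self.changes)


class FeedClient:
    """Hypothetical destination replica; keeps no last-version timestamp."""

    def __init__(self):
        self.triples = set()

    def bootstrap(self, server):
        self.triples = server.full_dump()

    def poll(self, server):
        # Replay the whole retained window, one entry at a time:
        # no batching, no transactions.
        for op, triple in server.recent_changes():
            if op == "insert":
                self.triples.add(triple)
            else:
                self.triples.discard(triple)


server = FeedServer(max_entries=2)
a, b, c = ("ex:a", "ex:p", "1"), ("ex:b", "ex:p", "2"), ("ex:c", "ex:p", "3")
server.insert(a)
server.insert(b)
server.insert(c)
server.delete(a)  # the window now holds only the last two changes

client = FeedClient()
client.poll(server)
missed_entries = client.triples != server.triples  # the insert of b aged out

client.bootstrap(server)  # flatten and start over from the full dump
client.poll(server)       # replaying the window again is harmless
```

A client that has fallen behind the retention window cannot detect or repair the gap from the change feed alone, which is why the full dump is needed as a second feed.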

As for identifying the triples, I would just let each quad (context+triple) identify the triple.  The context would be the URI for the source feed, perhaps.
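
A small Python sketch of that identification scheme, assuming the context is the source feed's URI. The `Quad` type, the example.org feed URIs, and the example triple are made up for illustration.

```python
from typing import NamedTuple


class Quad(NamedTuple):
    """A triple plus its context; the whole tuple is the identity."""
    context: str    # assumed here to be the URI of the source feed
    subject: str
    predicate: str
    obj: str


feed_a = "http://example.org/feed-a"  # hypothetical feed URIs
feed_b = "http://example.org/feed-b"
triple = ("ex:post1", "dc:creator", "ex:alice")

store = set()
store.add(Quad(feed_a, *triple))
store.add(Quad(feed_b, *triple))      # same triple, different source

# A delete arriving on feed A removes only feed A's assertion.
store.discard(Quad(feed_a, *triple))
```

With the context as part of the key, a client subscribed to several feeds can apply each feed's deletes without touching what the other feeds asserted.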

Received on Monday, 4 April 2005 18:30:33 UTC