Additional info on proposal for cwm module to load RDBMS data from Jones, David H on 2004-10-29 (public-cwm-talk@w3.org from October to December 2004)

From: Jones, David H <david.h.jones@boeing.com>
Date: Fri, 29 Oct 2004 11:35:27 -0700
To: <public-cwm-talk@w3.org>
Cc: "naudts guido" <naudts_vannoten@yahoo.com>
Message-ID: <80A0914E4AFDE145B3888E6A107CF65A044DC81F@xch-nw-05.nw.nos.boeing.com>

This email gives additional background and information on a proposal for an interface between cwm and RDBMSs. In addition, I will compare this proposal with ideas contributed by Guido Naudts, which I believe suggest a tighter integration.

Motivation:

The motivation for the proposal for a cwm module to load RDBMS data is an interoperability scenario where there are many heterogeneous data sources storing related information. This information is used in a variety of business processes that need to combine data from different sources, perform some reasoning/calculations, make some decision, and possibly update one or more of the data sources.

The goal of the RDB proposal is to support the loading of rdb records into the cwm triple store. Once loaded into the store, various things could be done:
- Save as n3/rdf for publishing purposes
- Translate portions of the store to conform to one or more external ontologies.
- Do general reasoning to support semi-automated task execution
- Explicitly update the database by doing SQL insert/update operations.

The data loaded in cwm is essentially a snapshot of the data source, and there is no effort to synchronize data between loads.

There are obvious limitations in the size of the data that could be loaded into an in-memory triple store. It is assumed that it is the user's responsibility to load data within the constraints of their computer.

In the next section I try to contrast what Guido and I are envisaging:

- I am assuming that a person using this builtin would want to see the rdb data as instances of one or more classes. The user provides the class name to handle cases where the query has a join.

- My proposal creates property names by concatenating class name and column name. This handles collisions where two tables may have the same column name (a rather common occurrence). It is also possible to have identical table name/column name pairs in different schemas of the same database. This could be handled in 2 ways:
  - Prepend the schema name to the class and column name
  - Create a different connection with a different base URI.
  Since this duplication in different schemas would be the exception rather than the rule, I would suggest the 2nd choice.

- My proposal is intended to support loading of the current triple store from rdb sources and (possibly) explicitly updating rdb sources from the triple store. I believe Guido is suggesting having an alternative rdb implementation of the RDFStore, similar to Jena.
- In my proposal a URI is generated for each instance, based on the PK for the query. This approach is somewhat restrictive, but produces stable URIs which can be used for graph superposition and classEquivalence statements. I believe Guido is suggesting creating anonymous triples when triples are loaded from the rdb.

- I am proposing using rdf/rdfs constructs to make results processible by a wider range of tools, and because owl constructs don't seem to be required. Guido is proposing to use owl constructs.
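The naming scheme described in the points above can be sketched roughly as follows. This is only an illustration in Python, not the code of the proposed builtin: the function names, the "_" separator between class name and column name, the "class/key" layout of the instance URI, and the example.org base URI are all assumptions made for the example.

```python
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def property_uri(base_uri, class_name, column):
    """Property name = class name + column name (the "_" separator is an
    assumption), so same-named columns in different tables stay distinct."""
    return f"{base_uri}{class_name}_{column}"


def instance_uri(base_uri, class_name, row, pk_columns):
    """Stable instance URI built from the primary-key values of the row.
    The "class/key" layout is hypothetical."""
    key = "_".join(quote(str(row[c]), safe="") for c in pk_columns)
    return f"{base_uri}{class_name}/{key}"


def row_to_triples(base_uri, class_name, row, pk_columns):
    """Turn one result row (a dict of column -> value) into triples
    describing an instance of the user-supplied class."""
    subject = instance_uri(base_uri, class_name, row, pk_columns)
    triples = [(subject, RDF_TYPE, base_uri + class_name)]
    for column, value in row.items():
        if value is not None:  # a SQL NULL produces no triple
            triples.append(
                (subject, property_uri(base_uri, class_name, column), value)
            )
    return triples


# Hypothetical row from a query whose primary key is the "id" column.
row = {"id": 7, "name": "Ann", "dept": None}
for triple in row_to_triples("http://example.org/hr/", "Employee", row, ["id"]):
    print(triple)
```

Because the subject depends only on the base URI, the class name, and the key values, reloading the same record yields the same URI. Passing a different base URI for a second connection keeps identically named tables in different schemas apart.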

I actually am not sure if database update is a reasonable goal. Ideally this could be done with transaction management, so consistency could be guaranteed when updating multiple databases. This seems like an unnecessarily complex feature in an experimental tool like cwm. As an alternative, we could consider an update with no transaction management, or simply defer implementation of any update until a more compelling case is made for it.

In summary, my proposal has a limited scope with rather specific (and limited) use cases. I am assuming that no changes would be necessary to the internals of cwm. Guido's proposal would implement an rdb triple store and support reasoning across triple stores. This would be a fairly tight integration of cwm and RDBMS. It is unclear to me whether his proposal includes dynamic queries to a separate database.

-----------------------------------------------------------------------------
Example (with slight modification from previous email):
Command line:
cwm rdb.n3 rdb-test.n3 --think > rdb-results.n3

<<rdb.n3>> <<rdb-test.n3>> <<rdb-results.n3>>

Regards,

David H. Jones
Boeing Phantom Works,
Mathematics & Computing Technology
425-865-6924
425-865-2964 (FAX)
david.h.jones@boeing.com

Attachments

application/octet-stream attachment: rdb.n3
application/octet-stream attachment: rdb-test.n3
application/octet-stream attachment: rdb-results.n3

Received on Friday, 29 October 2004 18:36:01 UTC