- From: Eric Jain <Eric.Jain@isb-sib.ch>
- Date: Wed, 31 Mar 2004 14:47:03 +0200
- To: "rdf-interest" <www-rdf-interest@w3.org>
Don't know whether the list of strategies for storing RDF data in a relational database at http://www-db.stanford.edu/~melnik/rdf/db.html is still updated? Anyways, here's yet another (simplistic) approach: CREATE TABLE statements ( id BIGSERIAL PRIMARY KEY, model_ns SMALLINT NOT NULL, model TEXT NOT NULL, statement_ns SMALLINT, statement TEXT, subject_ns SMALLINT, subject TEXT NOT NULL, predicate_ns SMALLINT NOT NULL, predicate TEXT NOT NULL, object_ns SMALLINT, object TEXT, object_string TEXT, object_number DOUBLE PRECISION, object_boolean BOOLEAN, generated BOOLEAN ) WITHOUT OIDS ; CREATE TABLE namespaces ( id SERIAL PRIMARY KEY, name TEXT UNIQUE NOT NULL ); CREATE INDEX statements_model_idx ON statements(model_ns, model); -- Optional, for speeding up retrieval of models from specific data sets: CREATE INDEX statements_foo_idx ON statements(model_ns, model) WHERE model_ns = '1'; -- The following indexes are only required for running queries: CREATE INDEX statements_statement_idx ON statements(statement_ns, statement); CREATE INDEX statements_subject_idx ON statements(subject_ns, subject); CREATE INDEX statements_predicate_idx ON statements(predicate_ns, predicate); CREATE INDEX statements_object_idx ON statements(object_ns, object); CREATE INDEX statements_object_string_idx ON statements(substr(object_string, 1, 128)); CREATE INDEX statements_object_number_idx ON statements(object_number); Notes: - Works with Postgres. Can easily be ported to MySQL if merge tables are used instead of partial indexes. Doesn't work with databases that do not allow arbitrary sized text fields (Oracle et al). - The 'id' column allows the original order of statements to be preserved. While this isn't strictly required by the RDF data model, users tend to find it unacceptable to have their data randomly reshuffled every time they access it (even if there is no real meaning in the ordering). - model_ns, model identify collections of related statements, or subgraphs. The namespace of a model corresponds to a database, whereas the name is a database entry. While queries are usually run over the complete data set, most of the storing and retrieving is done at the level of individual entries or databases. - statement_ns, statement: Used for reification. Corresponds to an rdf:ID attribute on a property. - object_ns, object: Set if the object of a statement is a resource. - object_string: Set if the object of a statement is a literal. - object_number, object_boolean: Additional fields for querying literals. - implicit: Set to true if the statement was inserted based on information found in the schema. Example: rdf:type statements with superclasses, or equivalent classes. Statements with generated = true are usually ignored when retrieving data, but are useful for queries. (Currently I don't actually use this, until I have found a way of efficiently generating those implicit statements.) - I dare claim that this approach is about as fast as is possible for storing and retrieving RDF statements with a relational database (other schemas may be more efficient for querying, hard to say). It is possible to load over 120M statements in a matter of hours. (Full indexing for running arbitrary queries, however, may take over 20h). Models can be retrieved in milliseconds. The only way to achieve this kind of performance is by making use of the databases bulk loading facilities, which is easy when dealing with a single table. (Namespaces can be cached in memory.) Any comments are welcome!
Received on Wednesday, 31 March 2004 07:52:24 UTC