Storing RDF in a relational database

Don't know whether the list of strategies for storing RDF data in a
relational database at http://www-db.stanford.edu/~melnik/rdf/db.html is
still updated? Anyways, here's yet another (simplistic) approach:

CREATE TABLE statements (
  id BIGSERIAL PRIMARY KEY,
  model_ns SMALLINT NOT NULL,
  model TEXT NOT NULL,
  statement_ns SMALLINT,
  statement TEXT,
  subject_ns SMALLINT,
  subject TEXT NOT NULL,
  predicate_ns SMALLINT NOT NULL,
  predicate TEXT NOT NULL,
  object_ns SMALLINT,
  object TEXT,
  object_string TEXT,
  object_number DOUBLE PRECISION,
  object_boolean BOOLEAN,
  generated BOOLEAN
)
  WITHOUT OIDS
;

CREATE TABLE namespaces (
  id SERIAL PRIMARY KEY,
  name TEXT UNIQUE NOT NULL
);

CREATE INDEX statements_model_idx ON statements(model_ns, model);
-- Optional, for speeding up retrieval of models from specific data
sets:
CREATE INDEX statements_foo_idx ON statements(model_ns, model)
WHERE model_ns = '1';

-- The following indexes are only required for running queries:
CREATE INDEX statements_statement_idx ON statements(statement_ns,
statement);
CREATE INDEX statements_subject_idx ON statements(subject_ns, subject);
CREATE INDEX statements_predicate_idx ON statements(predicate_ns,
predicate);
CREATE INDEX statements_object_idx ON statements(object_ns, object);
CREATE INDEX statements_object_string_idx ON
statements(substr(object_string, 1, 128));
CREATE INDEX statements_object_number_idx ON statements(object_number);

Notes:

- Works with Postgres. Can easily be ported to MySQL if merge tables are
used instead of partial indexes. Doesn't work with databases that do not
allow arbitrary sized text fields (Oracle et al).

- The 'id' column allows the original order of statements to be
preserved. While this isn't strictly required by the RDF data model,
users tend to find it unacceptable to have their data randomly
reshuffled every time they access it (even if there is no real meaning
in the ordering).

- model_ns, model identify collections of related statements, or
subgraphs. The namespace of a model corresponds to a database, whereas
the name is a database entry. While queries are usually run over the
complete data set, most of the storing and retrieving is done at the
level of individual entries or databases.

- statement_ns, statement: Used for reification. Corresponds to an
rdf:ID attribute on a property.

- object_ns, object: Set if the object of a statement is a resource.

- object_string: Set if the object of a statement is a literal.

- object_number, object_boolean: Additional fields for querying
literals.

- implicit: Set to true if the statement was inserted based on
information found in the schema. Example: rdf:type statements with
superclasses, or equivalent classes. Statements with generated = true
are usually ignored when retrieving data, but are useful for queries.
(Currently I don't actually use this, until I have found a way of
efficiently generating those implicit statements.)

- I dare claim that this approach is about as fast as is possible for
storing and retrieving RDF statements with a relational database (other
schemas may be more efficient for querying, hard to say). It is possible
to load over 120M statements in a matter of hours. (Full indexing for
running arbitrary queries, however, may take over 20h). Models can be
retrieved in milliseconds. The only way to achieve this kind of
performance is by making use of the databases bulk loading facilities,
which is easy when dealing with a single table. (Namespaces can be
cached in memory.)

Any comments are welcome!

Received on Wednesday, 31 March 2004 07:52:24 UTC