
Re: Storing RDF in relational databases

From: Jonas Liljegren <jonas@paranormal.o.se>
Date: Mon, 15 Nov 1999 02:07:34 +0100
Message-ID: <382F5CD6.62E14E68@paranormal.o.se>
To: Sergey Melnik <melnik@DB.Stanford.EDU>, RDF Interest Group <www-rdf-interest@w3.org>
Sergey Melnik wrote:
> To have a common way to refer to different approaches to storing RDF in
> relational databases I compiled a brief summary of what is known to me
> and made it accessible under
> http://www-db.stanford.edu/~melnik/rdf/db.html

Hi! I held off on answering because I wanted to find time to prepare a
really good discussion of RDF/SQL. But I don't want to wait any
more. So here are my unstructured thoughts:


I adopted RDF Schema for the development of a semantic network for a
limited area of related topics. That would be a virtual community for
sharing ideas in the area, in a way that would build up the database
in a structured manner:

I am using the *model*. The XML representation is not important at
this stage. I am developing a tool for building and experimenting with
RDF schemas:

One of the first things I want to do is to represent all the members
of the virtual community.  I picked the structure for user data in P3P
as a base for this stage of my experiment. The actual RDF Schema is my
personal interpretation of a (much) earlier version of this:


I announced the "RDF Schema editor" project a couple of months ago.  I
had a public demo. But this is an experiment under development. I have
not yet put in controls to prevent recursive relations. A guest added
a recursive relation, and that was the end of the demo.

The program is very object-oriented. In my high-level API, I didn't
realize that a routine sort of objects called the database for
every pair comparison, making it extremely slow. This has now been
fixed with extensive caching of method return values.
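The fix can be sketched like this. (The editor itself is written in Perl; this is a Python illustration with invented names, and FakeDB stands in for the real DBI/PostgreSQL connection.)

```python
class FakeDB:
    """Stand-in for the real database connection; counts round-trips."""
    def __init__(self):
        self.calls = 0

    def fetch_label(self, rid):
        self.calls += 1           # each call is one database query
        return "label-%d" % rid

class Resource:
    """Hypothetical high-level wrapper around a database record."""
    def __init__(self, db, rid):
        self.db = db
        self.rid = rid
        self._cache = {}          # memoized method return values

    def label(self):
        # Before the fix, every comparison in a sort triggered a
        # database query; now only the first call per object does.
        if 'label' not in self._cache:
            self._cache['label'] = self.db.fetch_label(self.rid)
        return self._cache['label']

db = FakeDB()
resources = [Resource(db, i) for i in (3, 1, 2)]
ordered = sorted(resources, key=lambda r: r.label())
# A second sort adds no further queries:
ordered = sorted(ordered, key=lambda r: r.label())
```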

You CAN use the program, if the latest version is working at the time
you test it:

There is more TODO:

The Data model:



The data model is designed to save as much space as possible and to
implement every detail in the RDF and schema specifications. I am
planning to solve the speed problem with intelligent use of caching
and maybe even second-order databases, generated from the core.
In a big RDF database, there will be lots of indirect resources. These
resources don't have to be stored as separate records in the
resource table. If that is the case, they are taken to be local.

Every statement is in itself a resource. The statement is represented
as a resource in the way used with reified statements. I am saving a
lot of space (in fact: infinite space) by not storing the model of the
statement. I am letting every statement have a URI based on the
statement ID and the local URI. (I now understand that I must also
store the full URI base for the statements imported from foreign
sources.)

An effect of using the statement table for representing both reified
statements and normal statements is that I must distinguish between
statements that are only reified and those that are asserted "facts".
That is the role of the FACT field.

Everything has a URI. Statements have URIs. My conclusion is that
even the internal literals can be thought of as resources. The RDF
spec makes a very clear distinction between literals and
resources. But it makes for much more symmetry if you can store the
URI content in the literal. And you don't need an extra field
in the STATEMENT table to tell whether the object is a literal or a
resource. And more: both types can have a language tag.  In the
database, the object is taken to be a literal if and only if the VALUE
field is defined. Discussion here:
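In SQL terms the layout might look like this. (An SQLite sketch with column names I have made up, not the actual PostgreSQL schema; it shows the FACT flag and the "literal iff VALUE is defined" rule.)

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE statements (
  id      INTEGER PRIMARY KEY,  -- drawn from the same sequence as resources
  subject INTEGER NOT NULL,
  pred    INTEGER NOT NULL,
  object  INTEGER,              -- resource ID when the object is a resource
  value   TEXT,                 -- literal content; non-NULL means "literal"
  lang    TEXT,                 -- language tag, usable for both kinds
  fact    INTEGER NOT NULL      -- 1 = asserted, 0 = only reified
);
""")

# One asserted statement with a literal object, and one merely
# reified statement with a resource object:
db.execute("INSERT INTO statements VALUES (10, 1, 2, NULL, 'Jonas', 'sv', 1)")
db.execute("INSERT INTO statements VALUES (11, 1, 3, 4, NULL, NULL, 0)")

# The object is a literal if and only if VALUE is defined:
literals = db.execute(
    "SELECT id FROM statements WHERE value IS NOT NULL").fetchall()
facts = db.execute(
    "SELECT id FROM statements WHERE fact = 1").fetchall()
```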

I was of course very tempted to break out the URI into a separate
table for namespaces. It would save much space. But I can't see how to
do it if I want to conform to aboutEachPrefix:

aboutEachPrefix (AEP) resources are marked in the RESOURCES table. The
PREFIX table is (going to be) used as a cache for mapping from a
resource to an AEP.  If I want to know all attributes of a resource, I
would have to examine the STATEMENTS table for SUBJECTS with that
ID. I would also have to get a list of resource IDs from the PREFIX
table (one for each AEP that includes the resource) and check the
STATEMENTS table for each of the collected IDs.  The PREFIX table
would have to be updated for every added resource, by getting all AEPs
and, for each of them, seeing if the added resource has that prefix. A
total reconstruction can be made by doing an SQL search for every AEP,
utilizing the SQL "like 'namespace%'" construct. That type of search
is a MUST for solving the AEP. And THAT is not possible if a separate
namespace table is used INSTEAD of a full URI. It would be nice if I
didn't have to do this. But you can't just know what part of a URI
is supposed to be its namespace.
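The prefix search is nothing more exotic than a LIKE over the full URI column. (Again an SQLite sketch with invented names; the point is that it only works because the whole URI is stored in one field.)

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, uri TEXT)")
db.executemany("INSERT INTO resources VALUES (?, ?)", [
    (1, "http://example.org/schema#Person"),
    (2, "http://example.org/schema#name"),
    (3, "http://other.org/thing"),
])

# Find every resource covered by an aboutEachPrefix declaration.
# A split namespace table could not answer this with a single LIKE,
# because you can't know in advance where a URI's namespace ends.
prefix = "http://example.org/schema#"
covered = db.execute(
    "SELECT id FROM resources WHERE uri LIKE ? ORDER BY id",
    (prefix + "%",)).fetchall()
```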

Since the STATEMENTS table and the RESOURCES table both represent
resources, every ID is unique. No statement has the same ID as a
record in the resource table. They use the same sequence.


The RDF Schema specification describes some objects and properties.
These are represented in my "RDF Schema editor" in the same way as
all other data: with triples. But the properties of those objects
will not change. They can therefore be cached indefinitely.

The main content of the database will represent the changing and
growing knowledge of the community. What is to be considered 'fact'
will therefore change with time. Objects will get new properties,
etc. This means that this data can't be cached indefinitely.

The schema editor will often request the same info many times in a
single request. (Especially in a sorting function. ;-)  Caching within
a request could therefore be done.

The best thing would be to tie the cache to namespaces. Every
namespace would have a change date and a cache. The date would be
checked at the start of every request, and the whole cache would be
flushed if the namespace had been updated.  This handling of the cache
would have to solve the same problem as that of AEP.

You don't HAVE to have the latest version of the data. Even if the
data changes, the cache could be kept for some time, depending on your
tolerance for getting slightly old data. If a person adds new data, he
would certainly want to see his addition take effect. So for this
person, the cache would be flushed for the namespace involved.  I also
think that the user would like to be able to flush the cache at will.
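That per-namespace scheme could be sketched like this. (A Python illustration; the class and the version counter standing in for change dates are my own invention, not the editor's code.)

```python
import itertools

class NamespaceCache:
    """Cache keyed by namespace, flushed when the namespace changes."""
    _clock = itertools.count(1)   # version counter, stands in for change dates

    def __init__(self):
        self.caches = {}    # namespace -> {key: cached value}
        self.changed = {}   # namespace -> version at last data change
        self.loaded = {}    # namespace -> version when its cache was built

    def get(self, ns, key, compute):
        # Checked at the start of every request: flush the whole
        # namespace cache if the data changed since it was built.
        if self.changed.get(ns, 0) > self.loaded.get(ns, -1):
            self.caches[ns] = {}
            self.loaded[ns] = next(self._clock)
        cache = self.caches[ns]
        if key not in cache:
            cache[key] = compute()
        return cache[key]

    def touch(self, ns):
        """Record that data in this namespace was updated."""
        self.changed[ns] = next(self._clock)

    def flush(self, ns):
        """Manual flush, e.g. for the user who just added data."""
        self.touch(ns)

nc = NamespaceCache()
calls = []
def expensive():
    calls.append(1)
    return "value"

first = nc.get("http://example.org/ns#", "k", expensive)   # computed
second = nc.get("http://example.org/ns#", "k", expensive)  # cached
nc.touch("http://example.org/ns#")                         # data changed
third = nc.get("http://example.org/ns#", "k", expensive)   # recomputed
```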

I have a couple of hundred objects in my database. The most complex
page is an autogenerated form of every property you can add to an
object, and all the resources you can choose from (the range) for each
property. You can add a bag or multiple values for a property, or a
combination of both. (The goal is to allow EVERYTHING that RDF
allows.) Example:

I am using mod_perl (and Apache::Registry) for caching the compiled
classes, and a persistent connection to the PostgreSQL database (using
Perl DBI).

The above page took over 30 seconds to generate. A large part
of this time was used to sort the values of the drop-down boxes. And
every list involves recursive functions to get all the resources for a
property.

By caching some of the method return values, I could cut the time to
under 10 seconds. But this cache is generated from scratch for every
single request. And it is still inefficient to sort resources on
properties that have to be looked up from either the database or the
cache. (That should be optimized.)

Of course, I haven't yet constructed any indices. But if I had them,
maybe I wouldn't give as much thought to the caching and other speed
gains. I could also use limited char fields. But how do I know how
long a URI is? Very old browsers have a URL limit of 255 chars. But
that is not enough. The PostgreSQL "text" datatype can be of any
length.

The page above represents an extreme circumstance.  A normal use of
the RDF database would be to extract some connections for a specific
object, not enumerate every single object and property multiple
times. This would be a more normal (dynamic) form:
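For comparison, that cheap case is a single lookup by subject. (SQLite sketch, invented column names matching nothing official.)

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE statements (
    id INTEGER PRIMARY KEY, subject INTEGER, pred INTEGER,
    object INTEGER, value TEXT)""")
db.executemany("INSERT INTO statements VALUES (?, ?, ?, ?, ?)", [
    (10, 1, 2, None, "Jonas"),   # literal object
    (11, 1, 3, 4, None),         # resource object
    (12, 5, 2, None, "Sergey"),  # different subject
])

# All connections for one specific object (subject ID 1):
rows = db.execute(
    "SELECT pred, object, value FROM statements "
    "WHERE subject = ? ORDER BY id", (1,)).fetchall()
```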

Ok. That's enough for today. Just under 200 lines. ;)

/ Jonas  -  http://paranormal.o.se/myself/cv/index.html
Received on Sunday, 14 November 1999 20:03:41 UTC
