- From: Jonas Liljegren <jonas@paranormal.o.se>
- Date: Mon, 15 Nov 1999 02:07:34 +0100
- To: Sergey Melnik <melnik@DB.Stanford.EDU>, RDF Interest Group <www-rdf-interest@w3.org>
Sergey Melnik wrote:
> To have a common way to refer to different approaches to storing RDF
> in relational databases I compiled a brief summary of what is known
> to me and made it accessible under
>
> http://www-db.stanford.edu/~melnik/rdf/db.html

Hi!

I held off on answering because I wanted time to prepare a really good discussion of RDF/SQL. But I don't want to wait any longer, so here are my unstructured thoughts.

Background:

I adopted RDF Schema for the development of a semantic network for a limited area of related topics. It is to be a virtual community for sharing ideas in that area, in a way that builds up the database in a structured manner:

http://paranormal.o.se/perl/proj/rdf/schema_editor/letters/my_project.txt

I am using the *model*. The XML representation is not important at this stage.

I am developing a tool for building and experimenting with RDF schemas:

http://paranormal.o.se/perl/proj/rdf/schema_editor/

One of the first things I want to do is represent all the members of the virtual community. I picked the structure for user data in P3P as a base for this stage of my experiment. The actual RDF Schema is my personal interpretation of a (much) earlier version of this:

http://www.w3.org/TR/P3P#Data_Schemas

History:

I announced the "RDF Schema editor" project a couple of months ago and held a public demo. But this is an experiment under development, and I had not yet put in controls to prevent recursive relations. A guest added a recursive relation, and that was the end of the demo.

The program is very object oriented. In my high-level API, I had not realized that a routine sort of objects queried the database for every pair comparison, making it extremely slow. This has now been fixed with extensive caching of method return values.
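The fix can be sketched roughly like this. This is only an illustration in Python, not the editor's actual Perl code; the Resource class, the store dict, and db_fetch are all hypothetical stand-ins:

```python
import functools

QUERIES = 0

def db_fetch(store, res_id):
    """Hypothetical stand-in for one database query."""
    global QUERIES
    QUERIES += 1
    return store[res_id]

class Resource:
    """Hypothetical stand-in for one object in the schema editor."""

    def __init__(self, store, res_id):
        self.store = store
        self.res_id = res_id
        self._cache = {}            # cached method return values

    def label(self):
        # First call hits the "database"; later calls reuse the cache.
        if 'label' not in self._cache:
            self._cache['label'] = db_fetch(self.store, self.res_id)
        return self._cache['label']

store = {1: 'Address', 2: 'Name', 3: 'Phone'}
resources = [Resource(store, i) for i in (3, 1, 2)]

# A pairwise-comparison sort, as in Perl's sort BLOCK: the comparator
# runs O(N log N) times, but thanks to the cache the database is hit
# only once per object, not once per comparison.
resources.sort(key=functools.cmp_to_key(
    lambda a, b: (a.label() > b.label()) - (a.label() < b.label())))

print([r.res_id for r in resources], "queries:", QUERIES)
```

Without the `_cache` dict, every comparison would issue two fresh queries, which is exactly the slowdown described above.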
You CAN use the program, if the latest version is working at the time you test it:

http://paranormal.o.se/perl/proj/rdf/schema_editor/devel/latest/rdf.cgi

There is more TODO:

http://paranormal.o.se/perl/proj/rdf/schema_editor/devel/latest/doc/TODO

The data model:

http://paranormal.o.se/perl/proj/rdf/schema_editor/devel/latest/doc/rdf.sql

Comments:

The data model is designed to save as much space as possible and to implement every detail of the RDF and RDF Schema specifications. I am planning to solve the speed problem with intelligent use of caching and maybe even second-order databases generated from the core database.

In a big RDF database there will be lots of indirect resources. These resources don't have to be stored as separate records in the resource table; if they aren't, they are taken to be local resources.

Every statement is in itself a resource. The statement is represented as a resource in the way used with reified statements. I save a lot of space (in fact: infinite space) by not storing the model of the statement. I let every statement have a URI based on the statement ID and the local URI base. (I now understand that I must also store the full URI base for statements imported from foreign bases.)

An effect of using the statement table to represent both reified statements and normal statements is that I must distinguish between statements that are merely reified and statements that are "facts". That is the role of the FACT field.

Everything has a URI. Statements have URIs. My conclusion is that even the internal literals can be thought of as resources. The RDF spec makes a very clear distinction between literals and resources, but it makes for much more symmetry if you can store the URI content in the literal. Then you don't need an extra field in the STATEMENT table to tell whether the object is a literal or a resource. And more: both types can have a language tag.
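In outline, the layout described above could look like the following. This is only a sketch reconstructed from the description, not the contents of rdf.sql: sqlite3 stands in for PostgreSQL, and the column names are my illustrations:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE resources (
    id  INTEGER PRIMARY KEY,   -- shares one ID sequence with statements
    uri TEXT NOT NULL          -- the full URI, kept in one piece
);
CREATE TABLE statements (
    id    INTEGER PRIMARY KEY, -- a statement is itself a resource
    subj  INTEGER NOT NULL,
    pred  INTEGER NOT NULL,
    obj   INTEGER,             -- object as resource ID, when VALUE is NULL
    value TEXT,                -- object as literal, when non-NULL
    lang  TEXT,                -- language tag (literals and resources alike)
    fact  INTEGER NOT NULL DEFAULT 1   -- 0: only reified, not a "fact"
);
""")
con.execute("INSERT INTO resources VALUES (1, 'http://example.org/#Person')")
con.execute("INSERT INTO resources VALUES (2, 'http://example.org/#name')")
# Statement 3: a "fact" whose object is the literal 'Jonas'.
con.execute("INSERT INTO statements VALUES (3, 1, 2, NULL, 'Jonas', 'sv', 1)")

# The object is a literal if and only if VALUE is defined.
row = con.execute(
    "SELECT value IS NOT NULL FROM statements WHERE id = 3").fetchone()
print("literal object:", bool(row[0]))
```

Because the statement row carries its own ID from the shared sequence, reifying it costs nothing extra: the statement's URI can be derived from that ID and the URI base.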
In the database, the object is taken to be a literal if and only if the VALUE field is defined. Discussion here:

http://paranormal.o.se/perl/proj/rdf/schema_editor/letters/literals.txt

I was of course very tempted to break out the URI into a separate table of namespaces. It would save much space. But I can't see how to do it if I want to conform to aboutEachPrefix:

http://paranormal.o.se/perl/proj/rdf/schema_editor/letters/namespaces.txt
http://paranormal.o.se/perl/proj/rdf/schema_editor/letters/aboutEachPrefix.txt

aboutEachPrefix (AEP) resources are marked in the RESOURCES table. The PREFIX table is (going to be) used as a cache for mapping from a resource to an AEP. If I want to know all attributes of a resource, I have to examine the STATEMENTS table for subjects with that ID. I also have to get a list of resource IDs from the PREFIX table (one for each AEP that includes the resource) and check the STATEMENTS table for each of the collected IDs.

The PREFIX table has to be updated for every added resource, by getting all AEPs and, for each of them, checking whether the added resource has that prefix. A total reconstruction can be made by doing, for every AEP, an SQL search utilizing the SQL "like 'namespace%'" construct. That type of search is a MUST for solving AEP, and THAT is not possible if a separate namespace table is used INSTEAD of the full URI. It would be nice if I didn't have to do this, but you can't just know what part of a URI is supposed to be its namespace.

Since the STATEMENTS table and the RESOURCES table both represent resources, every ID is unique. There is no statement that has the same ID as a record in the resource table; they use the same sequence.

Caching:

The RDF Schema specification describes some objects and properties. These are represented in my "RDF Schema editor" in the same way as all other data: with triples. But the properties of those objects will not change, so they can be cached indefinitely.
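The total reconstruction of the PREFIX table described a few paragraphs back could be sketched like this (again only an illustration: sqlite3 standing in for PostgreSQL, table and column names assumed, the AEP flag folded into the resources table for brevity):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE resources (
    id  INTEGER PRIMARY KEY,
    uri TEXT NOT NULL,
    aep INTEGER NOT NULL DEFAULT 0   -- 1: an aboutEachPrefix resource
)""")
con.executemany("INSERT INTO resources VALUES (?, ?, ?)", [
    (1, "http://example.org/docs/", 1),        # an AEP resource
    (2, "http://example.org/docs/a.html", 0),  # covered by the AEP
    (3, "http://example.org/other/b.html", 0), # not covered
])

# For every AEP, find all resources whose URI starts with its prefix.
# This only works because the full URI sits in one column; with the
# URI split into a namespace table, LIKE could not see the whole URI.
aeps = con.execute("SELECT id, uri FROM resources WHERE aep = 1").fetchall()
prefix = []
for aep_id, aep_uri in aeps:
    covered = con.execute(
        "SELECT id FROM resources WHERE uri LIKE ? AND id <> ?",
        (aep_uri + "%", aep_id)).fetchall()
    prefix += [(aep_id, rid) for (rid,) in covered]

print(prefix)   # each pair maps an AEP to a resource it covers
```

The same LIKE scan, restricted to the one new URI, is what the per-resource update has to do on every insert.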
The main content of the database will represent the changing and growing knowledge of the community. What is considered 'fact' will therefore change with time; objects will get new properties, and so on. This means that those data can't be cached indefinitely.

The schema editor will often request the same info many times in a single request. (Especially in a sorting function. ;-) Caching within a request can therefore be done.

The best thing would be to tie the cache to namespaces. Every namespace would have a change date and a cache. The date would be checked at the start of every request, and the whole cache would be flushed if the namespace had been updated. This handling of the cache would have to solve the same problem as that of AEP.

You don't HAVE to have the latest version of the data. Even if data changes, the cache could be kept for some time, depending on your tolerance for getting slightly old data. If a person adds new data, he will certainly want to see his addition take effect, so for that person the cache would be flushed for the namespace involved. I also think that the user would like to be able to flush the cache at will.

I have a couple of hundred objects in my database. The most complex page is an autogenerated form of every property you can add to an object, and all the resources you can choose from (the range) for each property. You can add a bag or multiple values for a property, or a combination of both. (The goal is to allow EVERYTHING that RDF allows.) Example:

http://paranormal.o.se/perl/proj/rdf/schema_editor/devel/latest/rdf.cgi?state=edit_form&uri=%3AName

I am using mod_perl (and Apache::Registry) to cache the compiled classes, and I keep a persistent connection to the PostgreSQL database (using Perl DBI). The page above used to take over 30 seconds to generate. A large part of this time was spent sorting the values of the drop-down boxes. But every list involves recursive functions to get all the resources for a range.
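The namespace-tied cache with a change date, as described above, might look like the following. This is a design sketch under my own assumptions, not existing code; the class and its method names are invented for the illustration:

```python
import time

class NamespaceCache:
    """Sketch: one cache per namespace, flushed when its change date moves."""

    def __init__(self):
        self.changed = {}   # namespace -> time of last data modification
        self.caches = {}    # namespace -> {key: cached value}
        self.loaded = {}    # namespace -> time the cache was (re)filled

    def touch(self, ns):
        # Called whenever data under this namespace is updated.
        self.changed[ns] = time.time()

    def get_cache(self, ns):
        # At the start of a request: flush if the namespace changed
        # after the cache was filled; otherwise keep serving old data.
        if self.changed.get(ns, 0) > self.loaded.get(ns, 0):
            self.caches[ns] = {}
            self.loaded[ns] = time.time()
        return self.caches.setdefault(ns, {})

cache = NamespaceCache()
cache.get_cache("http://example.org/ns#")["label"] = "Person"
cache.touch("http://example.org/ns#")          # someone updated the namespace
print(cache.get_cache("http://example.org/ns#"))   # flushed: {}
```

A user-visible "flush now" button would simply call touch() on the namespace, and a staleness tolerance could be added by comparing against (loaded + tolerance) instead of loaded.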
By caching some of the method return values, I could cut the time to under 10 seconds. But this cache is generated from scratch for every single request, and it is still inefficient to sort resources on properties that have to be looked up from either the database or the cache. (That should be optimized.)

Of course, I haven't yet constructed any indices. But if I had them, maybe I wouldn't give as much thought to caching and other speed gains. I could also use limited char fields, but how do I know how long a URI is? Very old browsers have a URL limit of 255 chars, and that is not enough. The PostgreSQL "text" datatype can be of any length.

The page above represents an extreme circumstance. A normal use of the RDF database would be to extract some connections for a specific object, not to enumerate every single object and property multiple times. This would be a more normal (dynamic) form:

http://paranormal.o.se/perl/proj/rdf/schema_editor/devel/latest/rdf.cgi?state=edit_form&uri=%3A8

Ok. That's enough for today. Just under 200 lines. ;)

--
/ Jonas  -  http://paranormal.o.se/myself/cv/index.html
Received on Sunday, 14 November 1999 20:03:41 UTC