Re: One comment on RDF mapping [related to ISSUE 67 and ISSUE 81]

On 11 Jun 2008, at 17:28, Alan Wu wrote:

>
> Bijan,
>
> I am attaching some relevant email discussions among Boris, Peter  
> and me. I sorted all the comments and
> responses. Hope it is readable.
>>
>> This presumes a specific and rather suboptimal implementation.  
>> There's nothing stopping an implementation from generating s p o  
>> or, indeed, *never* having the explicit subject, predicate, object  
>> triples (instead they could use a 4 place predicate with a  
>> "context" slot that indicated whether the triple was reified; many  
>> current triplestore implementations are, in fact, quad stores).
>>
>> Or one could keep reified triples in different tables, etc. etc. etc.
>>
>> I think we shouldn't conflate issues in the serialization with  
>> issues in the implementation.
>>
>> It is clear that using reification will stress that part of  
>> implementations. That is, implementations will have to be a bit  
>> more clever about it.
>>
>> Cheers,
>> Bijan.
>>
> 1) Question/suggestion from Boris:
>> The problem can be solved when you import an RDF ontology into
>> your database: when you see the reified axioms, you
>> can disregard them and add the original one.
>
>
> 2) My response to 1)
>
>> The solution works fine for small ontologies. However, say you  
>> need to
>> deal with a big ontology (the corresponding serialized RDF graph has
>> more than
>> 100 million triples). Triples come in random order. You cannot buffer
>> all of them in memory because they simply don't fit.

You don't have to buffer them, certainly not all of them. There are a  
variety of ways to allow paging and indexing of the partial triples.

>> So you have to put them
>> in secondary storage. And you need to build an index on the triples
>> because
>> you cannot afford to do a full scan of all the data to find stuff.

And?

>> In this scenario, a serialization with the original axiom triple
>> differs greatly from a serialization without it.

This is true, but it isn't impossible or even prohibitive to produce  
a reasonably optimized structure either way.

>> If the original triples are available, then they can be used
>> directly.

Sure, but you still have to manage the reification.

>> In the latter case where the original triples are not available, they
>> have to be constructed. And the way
>> to reconstruct them is to perform *joins*, which are costly.

It's pretty clear, I think, that you can do this at parse time with  
relatively minimal effort.

> 3) Peter's response to  2)
>
>> But, again, this is no different from what has to be done for the OWL
>> constructs that require multiple triples.
>> In any case, you don't have to do a join.  There are reasonable  
>> nearly
>> O(n) algorithms for gathering together the triples of an OWL  
>> construct,
>> even if that construct is a reified axiom.  For ontologies of size  
>> 100
>> million triples this is even very easy - just index the triples by  
>> their
>> first element and keep them in main memory.
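Indeed. Peter's nearly-O(n) gathering is just one bucketing pass over the triples, indexed by their first element. A minimal sketch (the data and names are purely illustrative, not tied to any particular store):

```python
from collections import defaultdict

def gather_by_subject(triples):
    """Bucket triples by their first element (subject) in one pass.

    One dict insertion per triple, so the pass is O(n) expected time;
    each bucket then holds all the triples of one OWL construct
    (e.g. one reified axiom) ready for reassembly.
    """
    buckets = defaultdict(list)
    for s, p, o in triples:
        buckets[s].append((p, o))
    return buckets

triples = [
    ("_:ax1", "rdf:type", "rdf:Statement"),
    ("_:ax1", "rdf:subject", ":Mary"),
    ("_:ax1", "rdf:predicate", ":knows"),
    ("_:ax1", "rdf:object", ":John"),
    (":Mary", "rdf:type", ":Person"),
]
buckets = gather_by_subject(triples)
```

No join is needed: the bucket keyed on the reifying node already contains the four reification triples.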
>
>
> 4) My response to 3)
>
>> It is different. We are making things worse by adding a new layer of
>> redirection. I am not so sure about the assumption that
>> everything can be kept in
>> memory. In my opinion, this reconstruction can be implemented using
>> the following SQL (pseudo). Assume a very straightforward table
>> structure
>> (three columns: subject, predicate, and object) for TRIPLES,

Yeah, but this three-column structure is, I believe, a nonstarter to
begin with: 4 columns minimally.

Plus you are presuming that the data arrives in random order. That is
the worst case. Most of the time reified triples are bundled pretty
tightly, especially in RDF/XML, which has a special construct for it
(i.e., nodeID on a property element).
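For concreteness, here's a minimal sketch of such a 4-column (quad) table, using a context column to record whether a triple is asserted directly or only appears reified. The schema and flag values are purely illustrative, not any particular store's design:

```python
import sqlite3

# A minimal quad table: the fourth column records the triple's context,
# e.g. whether it is asserted directly or only appears reified.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE quads (s TEXT, p TEXT, o TEXT, ctx TEXT)")
con.execute("INSERT INTO quads VALUES (':Mary', ':knows', ':John', 'asserted')")
con.execute("INSERT INTO quads VALUES (':Mary', ':knows', ':John', 'reified')")

# Asserted triples can be selected directly, without any join.
rows = con.execute(
    "SELECT s, p, o FROM quads WHERE ctx = 'asserted'"
).fetchall()
```

With such a schema the store never needs to materialize explicit subject/predicate/object triples for reified axioms at all.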

[snip]
>> You are right, there are OWL structures that require multiple RDF
>> triples to represent. Handling them requires processing.
>> That is a bit unfortunate. However, not having the original axiom
>> triple makes things worse (requires more processing), right?
>> Plus, we don't truly expect many OWL restrictions even in a big
>> ontology. However, annotations can happen at a much larger scale.
>> I read that paper before. It is very interesting. The per-predicate
>> approach has its limitations. For example, if you happen to have
>> many predicates, then you end up with many tables, which is itself
>> a problem. Now, if a user asks a query like (<:Mary> ?p ?o)
>> to find out all information about Mary, the query
>> implementation gets tricky.
>> Yes, building a special-purpose structure for reifications could
>> help. However, you have to *identify those reifications first* and
>> then put them into those structures, right? The identification
>> itself is very costly if you have *many* annotated axioms.
[snip]

I dispute this. Reifications come with a very specific, obvious
structure (e.g., a type statement and special predicates). The only
tricky bit is assembling the triples.

Let's, for simplicity, presume that every reified triple is complete
(since we generated these from annotations or negations, etc.).

Then, for each reified triple, there are (at least) 4 triples with a  
common subject:

A: SUB rdf:type rdf:Statement.
B: SUB rdf:subject S.
C: SUB rdf:predicate P.
D: SUB rdf:object O.

(Plus annotation triples starting with SUB)

Let there be n such SUBs in your ontology of m triples. Also let us  
suppose we have a special table called SUBS which has four columns,  
ID, S, P and O,  which is initially empty and indexed on ID. Now,  
suppose a streaming RDF parser sends you a triple. If it is not of  
one of the forms A-D, then add it to your store in the normal way. If
it is of form A, add the row (SUBi, null, null, null) to SUBS. If it
is of one of the other forms, retrieve the row with the relevant ID
and update the corresponding column. (If the retrieval comes back
empty, add a new row with the ID and the corresponding column filled.)

(Note, it would be wise to keep this in memory or to at least have a  
cache or a buffer for cases where the triples come close together in  
the right way.)

Now, what's the worst case for this? Once a row is fully filled in you
don't have to read it again (you can just record which SUBis are
"complete") and can page it out. So, the worst-case input order would
be n (SUB null null null)s, then n (SUB s null null)s, then n (SUB s p
null)s, then n (SUB s p o)s. In the best case, there would be at most
4 triples in memory and no reads (because of a buffer). You wouldn't
even necessarily need a distinct table SUBS if you made your main
table quad-based.

It'd probably be more efficient to collect the 4 triples (B-D really)
in separate structures. Unless there's a syntax error, every
structure/table will be the same size and, when sorted on ID, aligned
so that you can combine them by iterating over them directly. Either
way, this is not infeasible.
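That combination step can be sketched as a parallel iteration over three ID-sorted (id, value) lists, one per reification column (assuming complete, error-free reifications as above; all names are illustrative):

```python
def merge_columns(subjects, predicates, objects):
    """Combine three ID-sorted (id, value) lists into full triples.

    With no syntax errors the lists have equal length and identical ID
    sequences, so one parallel pass suffices -- no join is performed.
    """
    out = []
    for (i1, s), (i2, p), (i3, o) in zip(subjects, predicates, objects):
        assert i1 == i2 == i3, "misaligned reification"
        out.append((i1, (s, p, o)))
    return out

subjects   = [("_:a", ":Mary"), ("_:b", ":Bob")]
predicates = [("_:a", ":knows"), ("_:b", ":age")]
objects    = [("_:a", ":John"), ("_:b", "42")]
merged = merge_columns(subjects, predicates, objects)
```

The pass is linear in the number of reified triples, which is the point: the cost is sorting plus one scan, not a three-way relational join.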

Cheers,
Bijan.

Received on Wednesday, 11 June 2008 17:29:53 UTC