Re: One comment on RDF mapping [related to ISSUE 67 and ISSUE 81] from Alan Wu on 2008-06-11 (public-owl-wg@w3.org from June 2008)

From: Alan Wu <alan.wu@oracle.com>
Date: Wed, 11 Jun 2008 17:02:12 -0400
To: Bijan Parsia <bparsia@cs.man.ac.uk>
CC: public-owl-wg@w3.org
Message-ID: <48503D54.2060009@oracle.com>
Bijan,

>>> In this scenario, a serialization with the original axiom triple 
>>> differs
>>> much from another serialization without the original axiom triple.
>
> This is true, but it isn't impossible or even prohibitive to produce a 
> reasonably optimized structure either way.
>
The key point is not whether it is possible to build an optimized 
structure. Rather, it takes more time when the original axiom triple 
is absent, so the system gets slower.
And I fail to see the necessity to force this additional processing on 
vendors.

>>> If the original triples are available, then they can be used directly.
>
> Sure, but you still have to manage the reification.
Why? If the axiom triples are in the RDF graph, I can choose to ignore 
the reification triples for performance reasons.
>
>>> In the latter case where the original triples are not available, they
>>> have to be constructed. And the way
>>> to reconstruct them is to perform *joins*, which are costly.
>
> It's pretty clear, I think, that you can do this at parse time with 
> relatively minimal effort.
>
Minimal effort?
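
For illustration, the join-based reconstruction discussed above could 
look roughly like the following. This is only a sketch under stated 
assumptions: it uses an in-memory SQLite table named triples with the 
three columns (subject, predicate, object) described in this thread, 
and the prefixed names rdf:type, rdf:Statement, rdf:subject, 
rdf:predicate and rdf:object are assumed spellings of the reification 
vocabulary. It is not the SQL from the original exchange and says 
nothing about how any particular product stores triples.

```python
import sqlite3

# Assumed spellings of the reification vocabulary (illustrative only).
TYPE, STATEMENT = "rdf:type", "rdf:Statement"
SUBJ, PRED, OBJ = "rdf:subject", "rdf:predicate", "rdf:object"


def reconstruct_by_join(triples):
    """Rebuild the (s, p, o) axiom triple of every reified statement.

    Each reified statement costs three self-joins on the triples table,
    one per reification property.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)"
    )
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    rows = conn.execute(
        """
        SELECT s.object, p.object, o.object
        FROM triples t
        JOIN triples s ON s.subject = t.subject AND s.predicate = ?
        JOIN triples p ON p.subject = t.subject AND p.predicate = ?
        JOIN triples o ON o.subject = t.subject AND o.predicate = ?
        WHERE t.predicate = ? AND t.object = ?
        ORDER BY t.subject
        """,
        (SUBJ, PRED, OBJ, TYPE, STATEMENT),
    ).fetchall()
    conn.close()
    return rows
```

When the axiom triple is already present in the graph, none of these 
joins is needed to obtain it.
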
>> 3) Peter's response to  2)
>>
>>> But, again, this is no different from what has to be done for the OWL
>>> constructs that require multiple triples.
>>> In any case, you don't have to do a join.  There are reasonable nearly
>>> O(n) algorithms for gathering together the triples of an OWL construct,
>>> even if that construct is a reified axiom.  For ontologies of size 100
>>> million triples this is even very easy - just index the triples by 
>>> their
>>> first element and keep them in main memory.
>>
>>
>> 4) My response to 3)
>>
>>> It is different. We are making things worse by adding a new layer of
>>> re-direction. I am not so sure about this assumption that everything 
>>> can be kept in
>>> memory.  In my opinion, this problem can be implemented using the
>>> following SQL (pseudo).  Assume a very straightforward table structure
>>> (three columns, subject, predicate, and object) for TRIPLES,
>
> Yeah, but this is, I believe, a nonstarter to begin with. 4 columns 
> minimally.
>
> Plus you are making some presumptions about the likelihood of random 
> data. That is a worst case. Most of the time reified triples are 
> bundled pretty nicely, esp. in RDF/XML which has a special construct 
> for it (i.e. nodeID on a property element).
>
I don't disagree. But in Oracle's product I simply cannot assume that 
all reified triples are adjacent.
>
>
> I dispute this. Reifications come with a very specific, obvious 
> structure (e.g., type statement and special predicates). The only 
> tricky bit is assembling the triples.
>
> Let's, for simplicity, presume that every reified triple is complete 
> (since we generated these from annotations or negations, etc.)
>
> Then, for each reified triple, there are (at least) 4 triples with a 
> common subject:
>
> A: SUB rdf:type Statement.
> B: SUB subject S.
> C: SUB predicate P.
> D: SUB object O.
>
> (Plus annotation triples starting with SUB)
>
> Let there be n such SUBs in your ontology of m triples. Also let us 
> suppose we have a special table called SUBS which has four columns, 
> ID, S, P and O,  which is initially empty and indexed on ID. Now, 
> suppose a streaming RDF parser sends you a triple. If it is not of the 
> form A-D, then add it to your store in the normal way. If it is of 
> form A, add SUBi, null, null, null to SUBS. If it is of one of the 
> other forms, you retrieve the relevant ID and update the corresponding 
> column. (If the retrieval is empty, you add a new tuple with the ID 
> and corresponding column filled.)
>
I want to remind you that UPDATE is much slower than INSERT. Say 
you have 100 million annotated axioms: your scheme
will involve many, many updates. It is not going to perform well at all, 
even with an Oracle database.

Plus, 100 million incremental inserts into an indexed table are also 
going to slow performance down quite a bit.
Index tree maintenance is time-consuming.
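
To make the cost concrete, here is a minimal in-memory sketch of the 
streaming scheme quoted above, with a Python dict standing in for the 
SUBS table indexed on ID. The function name, the counters and the 
prefixed property names are hypothetical, chosen only for 
illustration. The counters show how many row creations and column 
updates the scheme performs. In a real database each of those would 
be an INSERT or an UPDATE against an indexed table.

```python
# Assumed spellings of the reification vocabulary (illustrative only).
TYPE, STATEMENT = "rdf:type", "rdf:Statement"
COLUMN = {"rdf:subject": 0, "rdf:predicate": 1, "rdf:object": 2}


def load_stream(triples):
    """Route triples from a streaming parser as in the quoted scheme.

    Returns (store, subs, inserts, updates): the ordinary triples, the
    SUBS rows keyed by ID, and counts of SUBS row creations and column
    updates.
    """
    store = []   # triples not of form A-D, stored in the normal way
    subs = {}    # ID -> [S, P, O]; stands in for the indexed SUBS table
    inserts = updates = 0
    for s, p, o in triples:
        is_type = p == TYPE and o == STATEMENT      # form A
        if not is_type and p not in COLUMN:
            store.append((s, p, o))
            continue
        if s not in subs:
            subs[s] = [None, None, None]            # new row: an INSERT
            inserts += 1
        elif not is_type:
            updates += 1                            # existing row: an UPDATE
        if p in COLUMN:                             # forms B-D fill a column
            subs[s][COLUMN[p]] = o
    return store, subs, inserts, updates
```

Whichever of the four triples arrives first creates the row, and each 
later subject, predicate or object triple for the same ID becomes an 
update.
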
> (Note, it would be wise to keep this in memory or to at least have a 
> cache or a buffer for cases where the triples come close together in 
> the right way.)
>
> Now, what's the worst case for this? If a row is non null you don't 
> have to read it again (you can just record what SUBis are "complete") 
> and can page it out. So, the worst case would be n (SUB null null 
> null)s. Then n (SUB s null null). Then n (SUB s p null). Then n (SUB s 
> p o). In the best case, there would be at most 4 triples in memory and 
> no reads (because of a buffer). You wouldn't even necessarily need a 
> distinct table SUBS if you made your main table quad based.
Again, I cannot assume the triples arrive in best-case order.
>
> It'd probably be more efficient to collect the 4 triples (B-D really) 
> in separate structures. Unless there's a syntax error every 
> structure/table will be the same size and, when sorted on ID, such 
> that you can combine them by iterating over them directly. Either way, 
> this is not infeasible.
>
It is feasible to get axiom triples out of reification. No one doubts 
that. But why do we want to slow things down when we don't have to?
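
For reference, the "separate structures" variant quoted above can be 
sketched as follows, assuming the subject, predicate and object 
triples have each been collected as a list of (ID, value) pairs. The 
function name and the error handling are hypothetical. The sort on ID 
is the extra work that is unnecessary when the axiom triple is already 
in the graph.

```python
def merge_sorted(subjects, predicates, objects):
    """Combine three (ID, value) lists into {ID: (s, p, o)}.

    Each list is sorted on ID, then the three are walked in step.
    A size or ID mismatch means some reification was incomplete.
    """
    subjects = sorted(subjects)
    predicates = sorted(predicates)
    objects = sorted(objects)
    if not (len(subjects) == len(predicates) == len(objects)):
        raise ValueError("incomplete reification: structures differ in size")
    axioms = {}
    for (id_s, s), (id_p, p), (id_o, o) in zip(subjects, predicates, objects):
        if not (id_s == id_p == id_o):
            raise ValueError("incomplete reification: IDs do not line up")
        axioms[id_s] = (s, p, o)
    return axioms
```
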

Thanks,

Zhe
Received on Wednesday, 11 June 2008 21:03:50 UTC