Re: One comment on RDF mapping [related to ISSUE 67 and ISSUE 81] from Bijan Parsia on 2008-06-11 (public-owl-wg@w3.org from June 2008)

From: Bijan Parsia <bparsia@cs.man.ac.uk>
Date: Wed, 11 Jun 2008 22:30:23 +0100
To: Alan Wu <alan.wu@oracle.com>
Cc: public-owl-wg@w3.org
Message-Id: <7FEBDF57-ED3B-47AB-BC6D-DFFFDF7BB9DA@cs.man.ac.uk>

On Jun 11, 2008, at 10:02 PM, Alan Wu wrote:

>
> Bijan,
>
>>>> In this scenario, a serialization with the original axiom triple  
>>>> differs
>>>> much from another serialization without the original axiom triple.
>>
>> This is true, but it isn't impossible or even prohibitive to  
>> produce a reasonably optimized structure either way.
>>
> The key thing is not that it is not impossible to build optimized  
> structure. Rather, it takes more time
> when original axiom triple is absent.

Hence my claim that it's not prohibitive.

> System efficiency will get slower.

Well the question is how badly.

> And I fail to see the necessity to force this additional processing  
> on vendors.

I can't argue that right now since I'm still a bit confused by the  
overall discussion. I'm just disputing some of your claims about the  
overall cost.

>>>> If the original triples are available, then they can be used  
>>>> directly.
>>
>> Sure, but you still have to manage the reification.
> Why?  If axiom triples are in the RDF graph, I can choose to ignore  
> other reification for performance reasons.

If you are in the application scenario of having 100 million triples  
where a large fraction of them are annotated, it's pretty safe to  
assume that the annotations matter to the application (e.g., managing  
provenance of the triples). Thus, you have to face the possibility  
that there will be many user queries touching the annotations. It's  
*much* worse, of course, to hit user queries than to slow down  
loading. Loading tends to be an infrequent and offline operation and  
thus less time sensitive. Not so with user queries.

Furthermore, if you are dumping and loading 100 million triples to  
files, I have to say, it's a completely reasonable criterion that  
they not be randomized. This holds for *any* data.

>>>> In the latter case where the original triples are not available,  
>>>> they
>>>> have to be constructed. And the way
>>>> to reconstruct them is to perform *joins*, which are costly.
>>
>> It's pretty clear, I think, that you can do this at parse time  
>> with relatively minimal effort.
> Minimal effort?

Relatively, yes. So I believe. Remember that adding triples isn't  
free, esp. at this scale. If a significant portion of your data is  
annotated then the cost of 5x blow up in the number of triples  
(including the explicit one) may dominate.
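
To make the size argument concrete, here is a rough back-of-the-envelope sketch. It assumes exactly four reification triples per annotated axiom (type, subject, predicate, object) and ignores the annotation triples themselves; the function name and the 100 million figure are illustrative only.

```python
# Hypothetical helper: count triples needed to serialize annotated axioms,
# with or without the explicit (original) axiom triple.
REIFICATION_TRIPLES = 4  # rdf:type, rdf:subject, rdf:predicate, rdf:object


def serialized_triples(annotated_axioms: int, include_explicit: bool) -> int:
    """Triples emitted for the axioms alone (annotation triples excluded)."""
    per_axiom = REIFICATION_TRIPLES + (1 if include_explicit else 0)
    return annotated_axioms * per_axiom


n = 100_000_000
without_explicit = serialized_triples(n, include_explicit=False)
with_explicit = serialized_triples(n, include_explicit=True)
extra = with_explicit - without_explicit  # the added explicit triples
```

Under these assumptions each annotated axiom costs five triples instead of one, and including the explicit triple adds one triple per annotated axiom to the file.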
[snip]
>> Plus you are making some presumptions about the likelihood of  
>> random data. That is a worst case. Most of the time reified  
>> triples are bundled pretty nicely, esp. in RDF/XML which has a  
>> special construct for it (i.e. nodeID on a property element).
>>
> I don't disagree. But I simply cannot make this assumption that all  
> reified triples are adjacent in Oracle's product.

I don't think you have to. But it is worth distinguishing worst case  
from realistic cases.

>> I dispute this. Reifications come with a very specific, obvious  
>> structure (e.g., type statement and special predicates). The only  
>> tricky bit is assembling the triples.
>>
>> Let's, for simplicity, presume that every reified triple is  
>> complete (since we generated these from annotations or negations,  
>> etc.)
>>
>> Then, for each reified triple, there are (at least) 4 triples with  
>> a common subject:
>>
>> A: SUB rdf:type Statement.
>> B: SUB subject S.
>> C: SUB predicate P.
>> D: SUB object O.
>>
>> (Plus annotation triples starting with SUB)
>>
>> Let there be n such SUBs in your ontology of m triples. Also let  
>> us suppose we have a special table called SUBS which has four  
>> columns, ID, S, P and O,  which is initially empty and indexed on  
>> ID. Now, suppose a streaming RDF parser sends you a triple. If it  
>> is not of the form A-D, then add it to your store in the normal  
>> way. If it is of form A, add SUBi, null, null, null to SUBS. If it  
>> is of one of the other forms, you retrieve the relevant ID and  
>> update the corresponding column. (If the retrieval is empty, you  
>> add a new tuple with the ID and corresponding column filled.)
>>
> I want to remind you that UPDATE is much slower compared to insert.

Perhaps I was unclear that I was deliberately trying a rather naive  
approach in detail so we could have a clear common basis for discussion.
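
For reference, the naive scheme quoted above can be sketched as follows. This is only an in-memory illustration, not a claim about how any real store would do it: a dict stands in for the SUBS table indexed on ID, a list stands in for the main store, and the names `load` and `axiom_triples` are made up for the example.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_TYPE = RDF + "type"
RDF_STATEMENT = RDF + "Statement"
# Which SUBS column (S, P, O) each reification predicate fills.
COLUMN_FOR = {RDF + "subject": 0, RDF + "predicate": 1, RDF + "object": 2}


def load(stream: Iterable[Triple]):
    """Route each incoming triple to the main store or to the SUBS table."""
    store: List[Triple] = []                    # stand-in for the normal store
    subs: Dict[str, List[Optional[str]]] = {}   # stand-in for SUBS, keyed on ID
    for s, p, o in stream:
        if p == RDF_TYPE and o == RDF_STATEMENT:
            # Form A: make sure a row (SUB, null, null, null) exists.
            subs.setdefault(s, [None, None, None])
        elif p in COLUMN_FOR:
            # Forms B-D: fetch (or create) the row and fill one column.
            row = subs.setdefault(s, [None, None, None])
            row[COLUMN_FOR[p]] = o
        else:
            # Anything else, including annotations on SUB, is stored normally.
            store.append((s, p, o))
    return store, subs


def axiom_triples(subs: Dict[str, List[Optional[str]]]) -> Iterator[Triple]:
    """Yield the reconstructed axiom triple for every complete SUBS row."""
    for s, p, o in subs.values():
        if s is not None and p is not None and o is not None:
            yield (s, p, o)
```

Note that the order in which forms A-D arrive does not matter for correctness here; it only affects how many rows are touched more than once, which is where the insert versus update cost discussed below comes in.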

> Say you have 100 million annotated axioms, your scheme
> will involve many, many updates.

It very much depends on the arrangement in the file, of course, the  
nature of your cache and buffer, etc. etc.

> It is not going to perform well at all, even with Oracle database.

It involves a trade off of load time for query time and database size.

[snip]
>> Now, what's the worst case for this? If a row is non null you  
>> don't have to read it again (you can just record what SUBis are  
>> "complete") and can page it out. So, the worst case would be n  
>> (SUB null null null)s. Then n (SUB s null null). Then n (SUB s p  
>> null). Then n (SUB s p o). In the best case, there would be at  
>> most 4 triples in memory and no reads (because of a buffer). You  
>> wouldn't even necessarily need a distinct table SUBS if you made  
>> your main table quad based.
> Again, I cannot assume the triples are in the form of best case.

I don't think I've argued that you have to. I started this paragraph  
with the worst case. My point is the best case is very good and  
actually fairly easy to achieve.
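
The gap between the two cases can be illustrated with a small sketch that counts how many incomplete SUB rows are open at once for a given arrival order. The event encoding (`"type"`, `"s"`, `"p"`, `"o"`) and the function name are assumptions made for this example; the `done` set stands in for the record of which SUBs are already complete.

```python
from typing import Dict, Iterable, Set, Tuple

Event = Tuple[str, str]  # (SUB id, kind), kind in {"type", "s", "p", "o"}


def peak_incomplete_rows(events: Iterable[Event]) -> int:
    """Return the largest number of incomplete SUB rows held at any moment."""
    open_rows: Dict[str, Set[str]] = {}  # SUB id -> columns filled so far
    done: Set[str] = set()               # SUBs already complete (paged out)
    peak = 0
    for sub_id, kind in events:
        if sub_id in done:
            continue  # a complete row never needs to be read again
        filled = open_rows.setdefault(sub_id, set())
        if kind in ("s", "p", "o"):
            filled.add(kind)
        peak = max(peak, len(open_rows))
        if len(filled) == 3:
            del open_rows[sub_id]
            done.add(sub_id)
    return peak


ids = ["r%d" % i for i in range(1000)]
kinds = ["type", "s", "p", "o"]
# Bundled order: all four triples of a SUB arrive together.
bundled = [(i, k) for i in ids for k in kinds]
# Worst-case order: all type triples, then all subjects, and so on.
scattered = [(i, k) for k in kinds for i in ids]
```

With bundled input only one row is ever open, while the scattered ordering keeps all n rows open until the final pass.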

>> It'd probably be more efficient to collect the 4 triples (B-D  
>> really) in separate structures. Unless there's a syntax error,  
>> every structure/table will be the same size, so once they are  
>> sorted on ID you can combine them by iterating over them  
>> directly. Either way, this is not infeasible.
> It is feasible to get axiom triples out of reification. No one  
> doubts that.

Sorry, you seemed to suggest that it wasn't and made some assertions  
about the necessity of joins. I've shown that joins (at user time)  
are not necessary. So I don't understand the argument at all now :)
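
As a sketch of the "separate structures" variant quoted above: collect the B, C and D triples in three lists of (ID, value) pairs, sort each on ID, and walk them in lockstep. This assumes every reification is complete and each ID occurs once per list; the function name is made up for the example, and a real store would presumably sort on disk rather than in memory.

```python
from operator import itemgetter
from typing import Iterator, List, Tuple

Pair = Tuple[str, str]  # (SUB id, value)


def merge_on_id(subjects: List[Pair],
                predicates: List[Pair],
                objects: List[Pair]) -> Iterator[Tuple[str, str, str]]:
    """Combine three ID-keyed lists into axiom triples by parallel iteration."""
    if not (len(subjects) == len(predicates) == len(objects)):
        raise ValueError("incomplete reification: lists differ in size")
    by_id = itemgetter(0)
    rows = zip(sorted(subjects, key=by_id),
               sorted(predicates, key=by_id),
               sorted(objects, key=by_id))
    for (id_s, s), (id_p, p), (id_o, o) in rows:
        if not (id_s == id_p == id_o):
            raise ValueError("mismatched reification IDs")
        yield (s, p, o)
```

All of this happens once at load time, so no join is left for user queries to pay for.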

> But why do we want to slow things down when we don't have to?

Well, I'm unconvinced still that it'll slow things down all that  
much, and, more importantly, that it's avoidable in order to have  
decent query performance over annotations. You seem to presume a  
scenario where 1) there are a lot of annotations, 2) the file is  
randomized, and 3) the user doesn't want to use the annotations. This  
doesn't seem to be a case worth optimizing for, esp. at the potential  
cost of adding 100 million triples to a file. (Esp. when, in RDF/XML,  
you can *significantly* optimize your serialization.)

I believe I've shown that 1) adding the triple has a cost, 2) that  
extracting the triple from reification can be done at load time, and  
3) that it's not prohibitive to do so. Now, it may be that that's  
insufficient, but the question really seems open. I know there are  
people who want, for a variety of reasons, to also add the triple.  
It's not clear to me which way to go on it, but I think your  
performance argument is not (yet to me) conclusive, or even very  
strong (when one considers all the factors). I could be wrong, of  
course.

Cheers,
Bijan.
Received on Wednesday, 11 June 2008 21:31:01 UTC