Re: One comment on RDF mapping [related to ISSUE 67 and ISSUE 81] from Alan Wu on 2008-06-12 (public-owl-wg@w3.org from June 2008)

From: Alan Wu <alan.wu@oracle.com>
Date: Thu, 12 Jun 2008 12:48:56 -0400
To: Bijan Parsia <bparsia@cs.man.ac.uk>
CC: public-owl-wg@w3.org
Message-ID: <48515378.4050702@oracle.com>
Bijan,

Instead of arguing about something we don't agree on, let me try to see 
if we can agree on the following:

1) There is no significant benefit in excluding the axiom triple from the 
RDF serialization when there is an annotation.
2) Extracting a large number of axiom triples from an un-ordered set of 
reified triples is time consuming.

Thanks,

Zhe

Bijan Parsia wrote:
>
> On Jun 11, 2008, at 10:02 PM, Alan Wu wrote:
>
>>
>> Bijan,
>>
>>>>> In this scenario, a serialization with the original axiom triple 
>>>>> differs
>>>>> much from another serialization without the original axiom triple.
>>>
>>> This is true, but it isn't impossible or even prohibitive to produce 
>>> a reasonably optimized structure either way.
>>>
>> The key thing is not whether it is possible to build an optimized 
>> structure. Rather, it takes more time
>> when the original axiom triple is absent.
>
> Hence my claim that it's not prohibitive.
>
>> System efficiency will get slower.
>
> Well the question is how badly.
>
>> And I fail to see the necessity to force this additional processing 
>> on vendors.
>
> I can't argue that right now since I'm still a bit confused by the 
> overall discussion. I'm just disputing some of your claims about the 
> overall cost.
>
>>>>> If the original triples are available, then they can be used 
>>>>> directly.
>>>
>>> Sure, but you still have to manage the reification.
>> Why?  If axiom triples are in the RDF graph, I can choose to ignore 
>> other reification for performance reasons.
>
> If you are in the application scenario of having 100 million triples 
> where a large fraction of them are annotated, it's pretty safe to 
> assume that the annotations matter to the application (e.g., managing 
> provenance of the triples). Thus, you have to face the possibility 
> that there will be many user queries touching the annotations. It's 
> *much* worse, of course, to hit user queries than to slow down 
> loading. Loading tends to be an infrequent and offline operation and 
> thus less time sensitive. Not so with user queries.
>
> Furthermore, if you are dumping and loading 100 million triples to 
> files, I have to say, it's a completely reasonable criterion that they 
> not be randomized. This holds for *any* data.
>
>>>>> In the latter case where the original triples are not available, they
>>>>> have to be constructed. And the way
>>>>> to reconstruct them is to perform *joins*, which are costly.
>>>
>>> It's pretty clear, I think, that you can do this at parse time with 
>>> relatively minimal effort.
>> Minimal effort?
>
> Relatively, yes. So I believe. Remember that adding triples isn't 
> free, esp. at this scale. If a significant portion of your data is 
> annotated then the cost of 5x blow up in the number of triples 
> (including the explicit one) may dominate.
> [snip]
>>> Plus you are making some presumptions about the likelihood of random 
>>> data. That is a worst case. Most of the time reified triples are 
>>> bundled pretty nicely, esp. in RDF/XML which has a special construct 
>>> for it (i.e. nodeID on a property element).
>>>
>> I don't disagree. But I simply cannot make this assumption that all 
>> reified triples are adjacent in Oracle's product.
>
> I don't think you have to. But it is worth distinguishing worst case 
> from realistic cases.
>
>>> I dispute this. Reifications come with a very specific, obvious 
>>> structure (e.g., type statement and special predicates). The only 
>>> tricky bit is assembling the triples.
>>>
>>> Let's, for simplicity, presume that every reified triple is complete 
>>> (since we generated these from annotations or negations, etc.)
>>>
>>> Then, for each reified triple, there are (at least) 4 triples with a 
>>> common subject:
>>>
>>> A: SUB rdf:type Statement.
>>> B: SUB subject S.
>>> C: SUB predicate P.
>>> D: SUB object O.
>>>
>>> (Plus annotation triples starting with SUB)
>>>
>>> Let there be n such SUBs in your ontology of m triples. Also let us 
>>> suppose we have a special table called SUBS which has four columns, 
>>> ID, S, P and O,  which is initially empty and indexed on ID. Now, 
>>> suppose a streaming RDF parser sends you a triple. If it is not of 
>>> the form A-D, then add it to your store in the normal way. If it is 
>>> of form A, add SUBi, null, null, null to SUBS. If it is of one of 
>>> the other forms, you retrieve the relevant ID and update the 
>>> corresponding column. (If the retrieval is empty, you add a new 
>>> tuple with the ID and corresponding column filled.)
>>>
>> I want to remind you that UPDATE is much slower compared to insert.
>
> Perhaps I was unclear that I was deliberately trying a rather naive 
> approach in detail so we could have a clear common basis for discussion.
>
>> Say you have 100 million annotated axioms, your scheme
>> will involve many, many updates.
>
> It very much depends on the arrangement in the file, of course, the 
> nature of your cache and buffer, etc. etc.
>
>> It is not going to perform well at all, even with Oracle database.
>
> It involves a trade off of load time for query time and database size.
>
> [snip]
>>> Now, what's the worst case for this? If a row is non null you don't 
>>> have to read it again (you can just record what SUBis are 
>>> "complete") and can page it out. So, the worst case would be n (SUB 
>>> null null null)s. Then n (SUB s null null). Then n (SUB s p null). 
>>> Then n (SUB s p o). In the best case, there would be at most 4 
>>> triples in memory and no reads (because of a buffer). You wouldn't 
>>> even necessarily need a distinct table SUBS if you made your main 
>>> table quad based.
>> Again, I cannot assume the triples are in the form of best case.
>
> I don't think I've argued that you have to. I started this paragraph 
> with the worst case. My point is the best case is very good and 
> actually fairly easy to achieve.
>
>>> It'd probably be more efficient to collect the 4 triples (B-D 
>>> really) in separate structures. Unless there's a syntax error every 
>>> structure/table will be the same size and, when sorted on ID, such 
>>> that you can combine them by iterating over them directly. Either 
>>> way, this is not infeasible.
>> It is feasible to get axiom triples out of reification. No one doubts 
>> that.
>
> Sorry, you seemed to suggest that it was and made some assertions 
> about the necessity of joins. I've shown that joins (at user time) are 
> not necessary. So I don't understand the argument at all now :)
>
>> But why do we want to slow things down when we don't have to?
>
> Well, I'm unconvinced still that it'll slow things down all that much, 
> and, more importantly, that it's avoidable in order to have decent 
> query performance over annotations. You seem to presume a scenario 
> where 1) there are a lot of annotations, 2) the file is randomized, 
> and 3) the user doesn't want to use the annotations. This doesn't seem 
> to be a case worth optimizing for, esp. at the potential cost of 
> adding 100 million triples to a file. (Esp. when, in RDF/XML, you can 
> *significantly* optimize your serialization.)
>
> I believe I've shown that 1) adding the triple has a cost, 2) 
> extracting the triple from reification can be done at load time and 3) 
> it's not prohibitive to do so. Now, it may be that that's 
> insufficient, but the question really seems open. I know there are 
> people who want, for a variety of reasons, to also add the triple. 
> It's not clear to me which way to go on it, but I think your 
> performance argument is not (yet to me) conclusive, or even very 
> strong (when one considers all the factors). I could be wrong, of course.
>
> Cheers,
> Bijan.
>
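
The load-time scheme described in the quoted message (a SUBS table keyed 
on ID, filled in as triples of forms A-D arrive from a streaming parser) 
can be sketched roughly as below. This is only an illustration under 
stated assumptions, not anyone's actual implementation: the function name 
`load`, the in-memory dict standing in for the SUBS table, and the 
`rdf:`-prefixed spellings of the reification vocabulary are all 
hypothetical choices made for this sketch. A real store would use an 
indexed table and would have to weigh the insert vs. UPDATE cost raised 
in the thread.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Assumed spellings of the reification vocabulary (forms A-D in the message).
RDF_TYPE = "rdf:type"
RDF_STATEMENT = "rdf:Statement"
# Which SUBS column each reification predicate fills: S, P or O.
COLUMN = {"rdf:subject": 0, "rdf:predicate": 1, "rdf:object": 2}


def load(stream: Iterable[Triple]):
    """Consume triples in arbitrary order; return (store, axioms).

    store  : triples that are not of form A-D, kept "in the normal way"
             (this includes annotation triples whose subject is a SUB).
    axioms : ID -> (S, P, O) for every reified triple that is complete.
    """
    store: List[Triple] = []
    subs: Dict[str, List[Optional[str]]] = {}  # stand-in for the SUBS table

    for s, p, o in stream:
        if p == RDF_TYPE and o == RDF_STATEMENT:
            # Form A: make sure a row exists for this ID.
            subs.setdefault(s, [None, None, None])
        elif p in COLUMN:
            # Forms B-D: fetch (or create) the row and fill one column.
            row = subs.setdefault(s, [None, None, None])
            row[COLUMN[p]] = o
        else:
            store.append((s, p, o))

    # Only rows with all three columns filled yield an axiom triple.
    axioms = {sid: (row[0], row[1], row[2])
              for sid, row in subs.items() if None not in row}
    return store, axioms
```

Because rows are keyed on ID, the order in which the four reification 
triples arrive does not matter. No join is performed at query time; the 
cost is paid once during loading.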
Received on Thursday, 12 June 2008 16:50:58 UTC