Re: One comment on RDF mapping [related to ISSUE 67 and ISSUE 81] from Bijan Parsia on 2008-06-13 (public-owl-wg@w3.org from June 2008)

From: Bijan Parsia <bparsia@cs.man.ac.uk>
Date: Fri, 13 Jun 2008 16:30:40 +0100
To: Alan Wu <alan.wu@oracle.com>
Cc: OWL Working Group WG <public-owl-wg@w3.org>
Message-Id: <0680AC08-91EA-431A-B478-AC929CA7AE68@cs.man.ac.uk>

On 13 Jun 2008, at 15:49, Alan Wu wrote:

>
> Bijan,
>
>>>> It needs to be balanced by other considerations.
>>> That is fair. BTW, I forgot to mention that adding the axiom  
>>> triple won't cause a huge expansion of the ontology. Do we
>>> truly worry about, say 20%, size increase?
>>
>> Sometimes. Do we really worry about a 20% increase at load time in  
>> the very extreme and unlikely worst case? How about 50%?
>>
>> You still haven't answered the question: if we have lots of  
>> annotations, thus they are significant, and we have queries over  
>> those annotations, as seems likely, aren't you going to have to do  
>> something special with reification and annotations *anyway*?
> Why do I need to do something special?  Say I take a very dumb  
> three column design and I get a query asking
> for annotation (or reification) information about a particular  
> subject (or a few matching subjects). The query is translated to a  
> multi-way join.

Hmm. Even a query for all triples authored by Bijan Parsia? I author  
a *lot* of triples.

Or if I ask for all the annotations about "knows".

> However, it will be efficient because it is very *selective*. We
> don't expect there to be a million annotations for one subject.
> The SQL optimizer in this case will simply perform a few index
> lookups (range scans) to get the job done.  In the extreme case
> that there are a million annotations for one subject, well, bad luck.

You pick your bad luck :)

> This query is much different, complexity wise, from the query that  
> uses multi-way join to
> find out *all* axiom triples in the KB.

But what query would do that? Surely you aren't going to construct  
the entire asserted triple store as an intermediate table. Let's  
consider a single triple query with two unground variables. The  
likely worst case would be ?s p ?o, but otherwise, presumably, values  
for ?s and ?o are pretty selective. If the end result is relatively  
few triples, then the additional joins may not be so very bad.
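
To make the join cost concrete, here is a minimal sketch of the two kinds of query under discussion, run against a "dumb" three-column table that holds only reified axioms. Everything specific in it is an assumption for illustration: the table name `triples`, the `ex:`, `foaf:` and `dc:` terms, the sample data, and the use of the plain rdf:subject / rdf:predicate / rdf:object vocabulary as a stand-in for whatever the mapping ends up using.

```python
import sqlite3

# Hypothetical three-column store; names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.execute("CREATE INDEX idx_pos ON triples (p, o, s)")
conn.execute("CREATE INDEX idx_spo ON triples (s, p, o)")


def reify(node, s, p, o, annotations):
    """Return the rows for one reified axiom plus its annotations.

    The asserted triple (s, p, o) itself is deliberately NOT stored.
    """
    rows = [
        (node, "rdf:subject", s),
        (node, "rdf:predicate", p),
        (node, "rdf:object", o),
    ]
    rows += [(node, ap, av) for ap, av in annotations]
    return rows


rows = []
rows += reify("_:a1", "ex:alice", "foaf:knows", "ex:bob",
              [("dc:creator", "Bijan Parsia")])
rows += reify("_:a2", "ex:bob", "foaf:knows", "ex:carol",
              [("dc:creator", "Alan Wu")])
rows += reify("_:a3", "ex:alice", "ex:worksFor", "ex:acme",
              [("dc:creator", "Bijan Parsia")])
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)


def axioms_matching(predicate):
    """Answer the pattern ?s p ?o from reified data: a three-way self-join."""
    sql = """
        SELECT ts.o, tp.o, tobj.o
        FROM triples tp
        JOIN triples ts   ON ts.s = tp.s   AND ts.p = 'rdf:subject'
        JOIN triples tobj ON tobj.s = tp.s AND tobj.p = 'rdf:object'
        WHERE tp.p = 'rdf:predicate' AND tp.o = ?
        ORDER BY ts.o
    """
    return conn.execute(sql, (predicate,)).fetchall()


def axioms_authored_by(author):
    """All axioms annotated with a given creator: a four-way self-join."""
    sql = """
        SELECT ts.o, tp.o, tobj.o
        FROM triples ann
        JOIN triples ts   ON ts.s = ann.s   AND ts.p = 'rdf:subject'
        JOIN triples tp   ON tp.s = ann.s   AND tp.p = 'rdf:predicate'
        JOIN triples tobj ON tobj.s = ann.s AND tobj.p = 'rdf:object'
        WHERE ann.p = 'dc:creator' AND ann.o = ?
        ORDER BY ts.o, tp.o
    """
    return conn.execute(sql, (author,)).fetchall()


knows = axioms_matching("foaf:knows")
by_bijan = axioms_authored_by("Bijan Parsia")
```

How selective the leading condition is (the predicate in the first query, the author in the second) decides how many rows feed the remaining joins, which is the point in dispute; the sketch says nothing about how a real optimizer would behave at 100 million triples.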

But be that as it may, one can still avoid it. If one has path
indexes it could work even better (but this is pretty close, in this
case, to assembling the triples themselves, and it still trades load
time for query performance).

>   This query of course can run slightly

Slightly? It should run as fast as the full asserted triple case. If
that's only slightly faster then I think your argument weakens :)

> faster if we take a quad,
> or five column, or six column, ... design, as you suggested. In
> the latter case, there are likely to be fewer index lookups.
> But the downside is that you need more processing
> at loading time.

Yep.

> It is a trade off.

Indeed, but so we've been arguing :) At this point, unfortunately,
it's basically a matter of our respective risk tolerance profiles.
You are worried about what I think is an uncommon case, and we
disagree on the relative hit of that case under each of our approaches.

>> But then your use case isn't really precise. You want to optimize
>> for the case where you have 100 million triples which are heavily
>> annotated but no one will use your tool to query the annotations,
>> so you can essentially throw them away, and they go out of their
>> way to make it hard to load the data.
>>
>> Do these people *hate* you, or something? :)
>>
>> Seriously, it seems like a pretty unlikely case. One where it  
>> would be perfectly reasonable to point them to a non-annotation  
>> triple extracting third party tool thingy. It doesn't seem a  
>> strong case to optimize for.
>>
> I hope my explanation helps a bit.

I see why you think that annotation queries under reification are no  
big deal, so yes, that does help me understand your position better.  
Thanks. I'll think some more. It'd be good to have some experiments  
or at least more detailed analysis.

For example, let's assume we have 100 million triples. Let's somewhat
conservatively assume about 50 characters (bytes) per term and do no
structure sharing (so copies of everything). 3 x 50 = 150 bytes per
triple, and 150 x 100,000,000 = 15,000,000,000 bytes; that's, what,
13 gigabytes.
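
The arithmetic above can be checked with a few lines. This is only a back-of-envelope sketch using the figures assumed in this message (100 million triples, three terms each, 50 bytes per term, no sharing); the variable names are made up for the example.

```python
# Naive size estimate: every triple stores three full copies of its terms.
TRIPLES = 100_000_000
TERMS_PER_TRIPLE = 3
BYTES_PER_TERM = 50  # assumed average term length from the message

total_bytes = TRIPLES * TERMS_PER_TRIPLE * BYTES_PER_TERM

# The same figure in binary gigabytes (2**30 bytes) and decimal gigabytes.
gib = total_bytes / 2**30
gb = total_bytes / 10**9

print(f"{total_bytes:,} bytes = {gib:.2f} GiB = {gb:.1f} GB")
```

In binary units the total comes out just under 14, so "13 gigs" is the truncated rather than the rounded figure.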

(and note you wouldn't need all 13 gigs to be in memory. You need
about 8.6 before you can start flushing triples.)

Last I checked, 8.6 isn't a ridiculous amount of memory. And this is  
for the very naive approach assuming almost no redundancy!

Adding the asserted version of the triples adds 2.6 or so gigs to the  
file. Not negligible.

Of course, the contrary case is when you can stream perfectly, where
the memory needed is whatever your buffers are. So, yeah, it makes a
difference. But a one time memory load situation, or slightly slower
memory load, or serializing sensibly, or... pick your poison.

(Again, I'm not arguing either way, per se. The open question to me  
still is the pain of non-optimized reification at query time. Mulling.)

Cheers,
Bijan.
Received on Friday, 13 June 2008 15:28:36 UTC