Re: One comment on RDF mapping [related to ISSUE 67 and ISSUE 81]

Bijan,
>> Why do I need to do something special?  Say I take a very dumb three 
>> column design and I get a query asking
>> for annotation (or reification) information about a particular 
>> subject (or a few matching subjects). The query is translated to a 
>> multi-way join.
>
> Hmm. Even a query for all triples authored by Bijan Parsia? I author a 
> *lot* of triples.
>
Hey, you are using an extreme case :)  I am not going to optimize 
something because Bijan authored a billion triples :)
I am now removing all triples containing the keyword Bijan to avoid future 
query performance problems.

> Or if I ask for all the annotations about "knows".
>
The thing is, if a query returns tons of results, the user can tolerate 
some latency. Some folks use triples per second to measure query speed. 
If the user cannot tolerate the latency, they can simply refine the 
query to be more selective. No system is perfect. My point is that if 
an annotation query does not match a lot of things, it will still be 
efficient, even assuming a dumb three-column design.
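
To make the selective case concrete, here is a minimal sketch, assuming 
an in-memory SQLite table as the "dumb" three-column store. The table 
name, the index choices, and the ex:author annotation property are all 
hypothetical; only the rdf:subject/rdf:predicate/rdf:object reification 
vocabulary comes from RDF itself.

```python
import sqlite3

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

conn = sqlite3.connect(":memory:")
# The naive design: one table, three columns, two illustrative indexes.
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.execute("CREATE INDEX idx_po ON triples (p, o)")
conn.execute("CREATE INDEX idx_sp ON triples (s, p)")


def reify(stmt_id, s, p, o, author):
    """Store one reified statement plus a (hypothetical) ex:author annotation."""
    rows = [
        (stmt_id, RDF + "type", RDF + "Statement"),
        (stmt_id, RDF + "subject", s),
        (stmt_id, RDF + "predicate", p),
        (stmt_id, RDF + "object", o),
        (stmt_id, "ex:author", author),
    ]
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)


reify("_:st1", "ex:alice", "ex:knows", "ex:bob", "ex:zhe")
reify("_:st2", "ex:carol", "ex:knows", "ex:dave", "ex:bijan")
reify("_:st3", "ex:alice", "ex:likes", "ex:tea", "ex:bijan")

# The annotation query becomes a multi-way self-join, but the ground
# subject in the WHERE clause keeps the starting set small.
ANNOTATIONS_ABOUT_SUBJECT = """
SELECT pr.o, ob.o, an.o
FROM triples AS su
JOIN triples AS pr ON pr.s = su.s AND pr.p = :rdf_predicate
JOIN triples AS ob ON ob.s = su.s AND ob.p = :rdf_object
JOIN triples AS an ON an.s = su.s AND an.p = :author
WHERE su.p = :rdf_subject AND su.o = :subject
ORDER BY pr.o
"""


def annotations_about(subject):
    """Return (predicate, object, author) for statements about `subject`."""
    params = {
        "rdf_subject": RDF + "subject",
        "rdf_predicate": RDF + "predicate",
        "rdf_object": RDF + "object",
        "author": "ex:author",
        "subject": subject,
    }
    return conn.execute(ANNOTATIONS_ABOUT_SUBJECT, params).fetchall()


for row in annotations_about("ex:alice"):
    print(row)
```

The join only has to fan out from the handful of rows matching the 
ground subject, so the number of joins matters much less than the 
selectivity of that first lookup.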

>> This query is much different, complexity wise, from the query that 
>> uses multi-way join to
>> find out *all* axiom triples in the KB.
>
> But what query would do that? Surely you aren't going to construct the 
> entire asserted triple store as an intermediate table. Let's consider 
> a single triple query with two unground variables. The likely worst 
> case would be ?s p ?o, but otherwise, presumably, values for ?s and ?o 
> are pretty selective. If the end result is relatively few triples, 
> then the additional joins may not be so very bad.
>
A query will not do that. But a forward chaining algorithm will do that. 
(Let us not argue about why some people choose to do forward chaining, 
at least not in this thread.)
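
As a rough illustration of why forward chaining is different, here is 
a minimal sketch, assuming the axioms are stored only in reified form. 
The helper names and the single rule (transitivity of rdfs:subClassOf) 
are hypothetical choices for illustration, not how any particular 
reasoner works.

```python
from collections import defaultdict

RDF_SUBJECT = "rdf:subject"
RDF_PREDICATE = "rdf:predicate"
RDF_OBJECT = "rdf:object"
SUBCLASS = "rdfs:subClassOf"


def reconstruct_all(reified):
    """Rebuild every (s, p, o) from reification triples.

    This is the unselective multi-way join: it has to look at every
    reified statement in the store, not just a few matching ones.
    """
    parts = defaultdict(dict)
    for stmt, prop, value in reified:
        if prop in (RDF_SUBJECT, RDF_PREDICATE, RDF_OBJECT):
            parts[stmt][prop] = value
    return {
        (d[RDF_SUBJECT], d[RDF_PREDICATE], d[RDF_OBJECT])
        for d in parts.values()
        if len(d) == 3  # skip statements missing one of the three parts
    }


def forward_chain_subclass(triples):
    """Naive fixpoint for one illustrative rule: subClassOf is transitive."""
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        pairs = [(s, o) for s, p, o in closure if p == SUBCLASS]
        for a, b in pairs:
            for c, d in pairs:
                if b == c and (a, SUBCLASS, d) not in closure:
                    closure.add((a, SUBCLASS, d))
                    changed = True
    return closure


reified = [
    ("_:st1", RDF_SUBJECT, "ex:A"),
    ("_:st1", RDF_PREDICATE, SUBCLASS),
    ("_:st1", RDF_OBJECT, "ex:B"),
    ("_:st2", RDF_SUBJECT, "ex:B"),
    ("_:st2", RDF_PREDICATE, SUBCLASS),
    ("_:st2", RDF_OBJECT, "ex:C"),
]

asserted = reconstruct_all(reified)
inferred = forward_chain_subclass(asserted) - asserted
print(sorted(inferred))
```

The rule application itself is ordinary; the cost I am worried about is 
that reconstruct_all has to run over the whole KB before any rule can 
fire.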
>> I hope my explanation helps a bit.
>
> I see why you think that annotation queries under reification are no 
> big deal, so yes, that does help me understand your position better. 
> Thanks. I'll think some more. It'd be good to have some experiments or 
> at least more detailed analysis.
>
> For example, let's assume we have 100 million triples. Let's somewhat 
> conservatively assume about 50 characters (bytes) per term and do no 
> structure sharing (so copies of everything). 3 x 50 = 150 x 
> 100,000,000 = 15,000,000,000 that's, what, 13 gigabytes.
>
> (and note you wouldn't need all 13 gigs to be in memory. You need 
> about 8.6 before you can start flushing triples.)
>
> Last I checked, 8.6 isn't a ridiculous amount of memory. And this is 
> for the very naive approach assuming almost no redundancy!
>
> Adding the asserted version of the triples adds 2.6 or so gigs to the 
> file. Not negligible.
>
> Of course, the contrary is when you can stream perfectly, which is 
> whatever your buffers are. So, yeah, it makes a difference. But a one 
> time memory load situation, or slightly slower memory load, or 
> serializing sensibly, or....pick your poison.
>
> (Again, I'm not arguing either way, per se. The open question to me 
> still is the pain of non-optimized reification at query time. Mulling.)
>
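
For reference, the back-of-envelope size estimate quoted above can be 
checked with a short sketch. It uses only the quoted assumptions (100 
million triples, three terms each, 50 bytes per term, no sharing); the 
function and constant names are made up for illustration, and it does 
not try to reproduce the 8.6 or 2.6 figures.

```python
TRIPLES = 100_000_000   # triples in the store (quoted assumption)
TERMS_PER_TRIPLE = 3    # subject, predicate, object
BYTES_PER_TERM = 50     # "somewhat conservative" size per term, no sharing


def naive_store_bytes(triples=TRIPLES, terms=TERMS_PER_TRIPLE,
                      term_bytes=BYTES_PER_TERM):
    """Size of the store if every term is written out in full."""
    return triples * terms * term_bytes


total = naive_store_bytes()
gib = total / 2**30  # binary gigabytes

print(f"{total:,} bytes = {total / 1e9:.1f} GB = {gib:.2f} GiB")
```

That comes to 15 GB in decimal units, or just under 14 GiB, which is in 
line with the "13 gigs" ballpark above.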
I will think more on my side too.  Thanks for all the discussions. They 
truly help.

Cheers,

Zhe

Received on Friday, 13 June 2008 16:17:29 UTC