Re: Querying multiple sources objective from Eric Prud'hommeaux on 2004-07-31 (public-rdf-dawg@w3.org from July to September 2004)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Sat, 31 Jul 2004 00:55:58 -0400
To: Jim Hendler <hendler@cs.umd.edu>
Cc: Jos De_Roo <jos.deroo@agfa.com>, "Seaborne, Andy" <andy.seaborne@hp.com>, public-rdf-dawg@w3.org
Message-ID: <20040731045558.GB13232@w3.org>
On Fri, Jul 30, 2004 at 05:31:49PM -0400, Jim Hendler wrote:
> Forgive me for jumping in late -- but I am catching up after a bunch 
> of travel -- I've looked at 4. and the new 4.5.1 and I must admit to 
> confusion -- 4.5.1 looks kind of cool, but strikes me as sort of 
> either amazingly difficult to implement or not terribly useful - so I 
> may be missing something...
> 
> That is:
> 
> 4.5.1 Querying Multiple Sources
> 
> It should be possible for a query to specify which of the available 
> RDF graphs it is to be executed against. If more than one RDF graph 
> is specified, the result is as if the query had been executed against 
> the merge of the specified RDF graphs. Query processors with a single 
> available RDF graph trivially satisfy this objective.
> 
> 
> now consider -- if we used the old 4.5 we simply sent the query to 
> each DB and aggregated the results.  In the new one, we have two 
> choices:  either we (i) handle it in a distributed way, or (ii) we 
> merge the graphs and then query them
> 
> (i) seems to me to be very difficult - in fact, I'm pretty sure this 
> is a hard research task I would give someone a PhD for -- that is, if 
> we assume the graph is distributed among many servers, and each only 
> has part of the query space, then suppose I'm querying for a set of 
> triples concerning variables A,B, and C.   If I send the whole query 
> to every DB, there is not likely to be any one which unifies with all 
> the variables since they may be distributed among the various stores. 
> If I have to analyze the query, know what is in the stores, and then 
> send only the appropriate pieces of queries to the appropriate 
> servers and then reassemble the results, well, that seems hard to 
> implement (in fact, doing this in DB space has been the subject of a 
> number of research projects and theses in the past few years - so I 
> am pretty sure this is non-trivial to say the least)
> 
> (ii) if we assume that to avoid the difficulty in (i) we first unify 
> the graphs and then query them, well heck that won't scale worth crap 
> -- supposing, for example, I'm playing with the results of several 
> FOAF scrapers -- each one has collected more than 1M people and my 
> query is to find any two people with the same email address (or any 
> other feature) -- if I have to merge the graphs, I'll need some huge 
> amount of memory to do this
> 
> In short, (i) has difficulties with distribution and (ii) has 
> problems with centralization -- is either of these actually 
> implemented/implementable?   Am I misunderstanding the objective??

(i) has an almost trivial solution when you allow the user to
select what part of the query goes where. This pretty accurately
reflects how people do research today, finding pages with one
sort of information and manually (mentally) merging that with
data with another sort of information. For instance, I believe
that the CDDB/IMDB example is a perfectly reasonable model of
the degree of expertise we can rely on from today's moderately
knowledgeable user.
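
A minimal Python sketch of option (i) with the user choosing where each
part of the query goes. Everything here is an assumption made for
illustration: the two in-memory sources stand in for CDDB-like and
IMDB-like stores, the ex: predicates and the helpers match() and join()
are hypothetical, and a real deployment would send each pattern to a
remote service rather than to a Python set.

```python
# Hypothetical sources: sets of (subject, predicate, object) triples.
music_db = {
    ("album:1", "ex:artist", "person:alice"),
    ("album:2", "ex:artist", "person:bob"),
}
film_db = {
    ("film:1", "ex:actor", "person:alice"),
    ("film:2", "ex:actor", "person:carol"),
}


def match(graph, pattern):
    """Return bindings for one triple pattern; terms starting with '?' are variables."""
    found = []
    for triple in graph:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            found.append(binding)
    return found


def join(left, right):
    """Combine bindings that agree on every variable they share."""
    joined = []
    for l in left:
        for r in right:
            if all(l[k] == r[k] for k in l.keys() & r.keys()):
                joined.append({**l, **r})
    return joined


# The user decides which pattern goes to which source, then joins on ?p.
musicians = match(music_db, ("?album", "ex:artist", "?p"))
actors = match(film_db, ("?film", "ex:actor", "?p"))
results = join(musicians, actors)
```

The only "distribution" logic is the user's choice of source for each
pattern; the join on the shared variable ?p is the manual merging step
described above.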

(ii) is how most of us do our banal little queries every day.
Rarely do I see people making the same RDF query over multiple
repositories. Instead they identify a couple of sources, merge
them, and do a query across the resulting graph. Most data that
I've seen seems to be organized such that extra repositories
complement the data with related data rather than supplementing
with additional data of the same form.

I think that (ii) represents a big part of what we want people
to be able to do with the semantic web. (iii) (Aggregate Query)
can be easily accomplished with SQL today without grounding your
terms in a global namespace that allows documents to merge. I
think that the cool thing *is* merging graphs. Yes, that's
expensive, but I don't think that the new problems that we want
to address with the semantic web get solved any other way.
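
The difference between (ii) and (iii) can be sketched in a few lines of
Python. This is an illustration under stated assumptions, not a
description of any implementation: the two tiny graphs, the example.org
data and the query() helper are all hypothetical, and plain set union is
only a valid merge here because the triples contain no blank nodes.

```python
# Two hypothetical sources holding complementary data about the same resource.
names = {("ex:alice", "foaf:name", "Alice")}
mailboxes = {("ex:alice", "foaf:mbox", "mailto:alice@example.org")}


def query(graph, patterns, binding=None):
    """Return every binding that satisfies all triple patterns against one graph."""
    binding = binding or {}
    if not patterns:
        return [binding]
    first, rest = patterns[0], patterns[1:]
    results = []
    for triple in graph:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Variables keep the value they were first bound to.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.extend(query(graph, rest, candidate))
    return results


patterns = [
    ("?p", "foaf:name", "?name"),
    ("?p", "foaf:mbox", "?mbox"),
]

# (iii) Aggregate query: run the whole query on each source, concatenate.
aggregated = query(names, patterns) + query(mailboxes, patterns)

# (ii) Merge first, then query the merged graph.
merged = query(names | mailboxes, patterns)
```

Neither source alone can satisfy both patterns, so the aggregate result is
empty, while the merged graph yields the one binding that links the name
to the mailbox through the shared term ex:alice.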
-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +1.857.222.5741 (does not work in Asia)

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Saturday, 31 July 2004 00:56:08 UTC