Re: Querying multiple sources objective from Jos De_Roo on 2004-07-30 (public-rdf-dawg@w3.org from July to September 2004)

From: Jos De_Roo <jos.deroo@agfa.com>
Date: Sat, 31 Jul 2004 00:49:54 +0200
To: Jim Hendler <hendler@cs.umd.edu>
Cc: "Seaborne, Andy" <andy.seaborne@hp.com>, eric@w3.org, public-rdf-dawg@w3.org
Message-ID: <OF283CBEA3.FC10BE11-ONC1256EE1.007AD8CB-C1256EE1.007D63B2@agfa.com>

Hi, Jim

Those are indeed very clear observations, and in any case we are
dealing with an objective here (not a requirement). I am however quite
optimistic in these matters :) Rather than the extremes of all SW
sources or just 1 source, an explicit set of sources could be given
to an engine, and an explicit set of queries can be answered in a
cascaded kind of Socratic complete dialog among different engines.
That's also why ":id q:select C; q:where P." is so useful as a
query rule, as it drives that dialog (no matter whether within an
engine or between engines, I would think...)

-- 
Jos De Roo, AGFA http://www.agfa.com/w3c/jdroo/




Jim Hendler <hendler@cs.umd.edu>
30/07/2004 23:31

 
        To:     Jos De_Roo/AMDUS/MOR/Agfa-NV/BE/BAYER@AGFA, eric@w3.org
        cc:     "Seaborne, Andy" <andy.seaborne@hp.com>, public-rdf-dawg@w3.org
        Subject:        Querying multiple sources objective



Forgive me for jumping in late -- but I am catching up after a bunch of 
travel -- I've looked at 4. and the new 4.5.1 and I must admit to 
confusion -- 4.5.1 looks kind of cool, but strikes me as sort of either 
amazingly difficult to implement or not terribly useful - so I may be 
missing something...

That is:

4.5.1 Querying Multiple Sources

It should be possible for a query to specify which of the available RDF 
graphs it is to be executed against. If more than one RDF graph is 
specified, the result is as if the query had been executed against the 
merge of the specified RDF graphs. Query processors with a single 
available RDF graph trivially satisfy this objective.


Now consider -- if we used the old 4.5 we simply sent the query to each DB 
and aggregated the results.  In the new one, we have two choices:  either 
(i) we handle it in a distributed way, or (ii) we merge the graphs and 
then query them.

(i) seems to me to be very difficult - in fact, I'm pretty sure this is a 
hard research task I would give someone a PhD for -- that is, if we assume 
the graph is distributed among many servers, and each only has part of the 
query space, then suppose I'm querying for a set of triples concerning 
variables A, B, and C.  If I send the whole query to every DB, there is 
not likely to be any one which unifies with all the variables since they 
may be distributed among the various stores.  If I have to analyze the 
query, know what is in the stores, and then send only the appropriate 
pieces of queries to the appropriate servers and then reassemble the 
results, well, that seems hard to implement (in fact, doing this in DB 
space has been the subject of a number of research projects and theses in 
the past few years - so I am pretty sure this is non-trivial to say the 
least)
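
To make the problem in (i) concrete, here is a minimal sketch in Python. It
is not taken from any DAWG draft or engine: the tiny pattern matcher, the
store names and the example triples are all assumptions made up for
illustration, and blank-node renaming (which a real RDF merge needs) is
ignored. It compares sending the whole query to each store with running it
against the merge.

```python
# Triples are (subject, predicate, object) tuples of strings.
# Terms starting with "?" are variables. All data below is hypothetical.

def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against its existing value.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result

def query(graph, patterns):
    """Return every variable binding that satisfies all patterns in `graph`."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in graph
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings

# Each store holds only part of the data the query needs.
store_1 = {("ex:alice", "foaf:knows", "ex:bob")}
store_2 = {("ex:bob", "foaf:mbox", "mailto:bob@example.org")}

patterns = [("?a", "foaf:knows", "?b"), ("?b", "foaf:mbox", "?c")]

# Old 4.5 style: send the whole query to each store, aggregate the answers.
per_store = [b for store in (store_1, store_2) for b in query(store, patterns)]

# 4.5.1 style: answer as if the query ran against the merge of the graphs.
merged = query(store_1 | store_2, patterns)

print(per_store)  # no store can answer the whole query on its own
print(merged)     # the merge yields one binding for ?a, ?b and ?c
```

Under these assumptions neither store answers the query alone, while the
merge does; getting the merged answer without building the merge is the
query-decomposition problem described above.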

(ii) if we assume that to avoid the difficulty in (i) we first unify the 
graphs and then query them, well heck that won't scale worth crap -- 
supposing, for example, I'm playing with the results of several FOAF 
scrapers -- each one has collected more than 1M people and my query is to 
find any two people with the same email address (or any other feature) -- 
if I have to merge the graphs, I'll need some huge amount of memory to do 
this
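
The FOAF case in (ii) can be sketched the same way. This is only an
illustration in Python under assumed names: `people_sharing_mbox`, the
scraper graphs and the triples are hypothetical, and blank nodes are again
ignored. The point is that the naive approach holds the full merge in
memory before it can answer.

```python
from collections import defaultdict

def people_sharing_mbox(graphs):
    """Find mailboxes claimed by more than one person across all graphs.

    Naive version: build the full merge first, so memory use grows with
    the combined size of every input graph.
    """
    merged = set().union(*graphs)  # the whole merge is held in memory
    people_by_mbox = defaultdict(set)
    for subject, predicate, obj in merged:
        if predicate == "foaf:mbox":
            people_by_mbox[obj].add(subject)
    return {
        mbox: people
        for mbox, people in people_by_mbox.items()
        if len(people) > 1
    }

# Hypothetical output of two FOAF scrapers (real ones would hold millions).
scraper_a = {
    ("a:p1", "foaf:mbox", "mailto:jim@example.org"),
    ("a:p1", "foaf:name", "Jim"),
}
scraper_b = {
    ("b:p9", "foaf:mbox", "mailto:jim@example.org"),
    ("b:p7", "foaf:mbox", "mailto:jos@example.org"),
}

shared = people_sharing_mbox([scraper_a, scraper_b])
print(shared)  # only the mailbox that appears in both scrapers is reported
```

The match here spans two scrapers, so querying each one separately would
not find it, and the merged set is what costs the memory.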

In short, (i) has difficulties with distribution and (ii) has problems 
with centralization -- is either of these actually 
implemented/implementable?   Am I misunderstanding the objective??
 
 thanks
 JH

-- 

Professor James Hendler                  
http://www.cs.umd.edu/users/hendler 
Director, Semantic Web and Agent Technologies       301-405-2696
Maryland Information and Network Dynamics Lab.      301-405-6707 (Fax)
Univ of Maryland, College Park, MD 20742

Received on Friday, 30 July 2004 18:50:48 UTC