
Querying multiple sources objective

From: Jim Hendler <hendler@cs.umd.edu>
Date: Fri, 30 Jul 2004 17:31:49 -0400
Message-Id: <p06110402bd306c63a783@[10.0.1.2]>
To: Jos De_Roo <jos.deroo@agfa.com>, eric@w3.org
Cc: "Seaborne, Andy" <andy.seaborne@hp.com>, public-rdf-dawg@w3.org
Forgive me for jumping in late -- I am catching up after a bunch of 
travel -- but I've looked at 4. and the new 4.5.1 and I must admit to 
confusion. 4.5.1 looks kind of cool, but strikes me as either 
amazingly difficult to implement or not terribly useful, so I may be 
missing something...

That is:

4.5.1 Querying Multiple Sources

It should be possible for a query to specify which of the available 
RDF graphs it is to be executed against. If more than one RDF graph 
is specified, the result is as if the query had been executed against 
the merge of the specified RDF graphs. Query processors with a single 
available RDF graph trivially satisfy this objective.
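The "as if executed against the merge" semantics quoted above can be sketched in a few lines. This is a toy model, not an implementation: triples are plain (subject, predicate, object) tuples, graphs are Python sets, and the blank-node renaming that a real RDF merge requires is ignored.

```python
# Toy model of the 4.5.1 semantics: the result of querying several
# sources is the result of querying the set union of their triples.
# (A real RDF merge must also standardize blank nodes apart.)

def merge(*graphs):
    """Merge RDF graphs, modeled here as the set union of triple tuples."""
    merged = set()
    for g in graphs:
        merged |= g
    return merged

def query(graph, predicate, obj=None):
    """Toy single-pattern query: match triples by predicate (and object)."""
    return {t for t in graph
            if t[1] == predicate and (obj is None or t[2] == obj)}

# Two hypothetical sources:
g1 = {("alice", "knows", "bob")}
g2 = {("bob", "knows", "carol")}

# Querying both sources is "as if" the query ran over their merge:
results = query(merge(g1, g2), "knows")
```

A processor with a single available graph satisfies this trivially, since the merge of one graph is itself.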


Now consider: if we used the old 4.5, we simply sent the query to 
each DB and aggregated the results. With the new one, we have two 
choices: either (i) we handle it in a distributed way, or (ii) we 
merge the graphs and then query the merge.

(i) seems to me to be very difficult -- in fact, I'm pretty sure this 
is a hard research task I would give someone a PhD for. That is, 
assume the graph is distributed among many servers, each holding only 
part of the query space, and suppose I'm querying for a set of 
triples concerning variables A, B, and C. If I send the whole query 
to every DB, there is not likely to be any one store that unifies 
with all the variables, since they may be distributed among the 
various stores. If instead I have to analyze the query, know what is 
in each store, send only the appropriate pieces of the query to the 
appropriate servers, and then reassemble the results, well, that 
seems hard to implement. (In fact, doing this in the database world 
has been the subject of a number of research projects and theses in 
the past few years, so I am pretty sure this is non-trivial, to say 
the least.)
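The failure mode of naively broadcasting the whole query can be shown with a toy conjunctive query. The data and store names below are hypothetical, just to make the point concrete: each store alone matches nothing, while the merged data matches.

```python
# Two stores each hold part of the data. The conjunctive query
#   ?a knows ?b . ?b knows ?c
# has no answer in either store alone, only over the merge.

store1 = {("alice", "knows", "bob")}
store2 = {("bob", "knows", "carol")}

def answer(graph):
    """Find all (a, b, c) with a-knows-b and b-knows-c, by a nested join."""
    return {(a, b, c)
            for (a, p1, b) in graph if p1 == "knows"
            for (b2, p2, c) in graph if p2 == "knows" and b2 == b}

# Sending the whole query to each store yields nothing:
per_store = [answer(store1), answer(store2)]

# The merged data yields the answer:
merged_answers = answer(store1 | store2)
```

Getting the merged answer without actually merging requires decomposing the query, shipping the right fragments to the right stores, and joining the partial bindings -- the distributed query-planning problem described above.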

(ii) If, to avoid the difficulty in (i), we instead merge the graphs 
and then query them, well heck, that won't scale worth crap. 
Suppose, for example, I'm playing with the results of several FOAF 
scrapers -- each one has collected more than 1M people, and my query 
is to find any two people with the same email address (or any other 
shared feature). If I have to merge the graphs first, I'll need some 
huge amount of memory to do it.
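To make the FOAF example concrete, here is a toy version of the duplicate-email query (the predicate name "mbox" and all data are hypothetical). Note that even this sketch builds an index over the union of all the scrapers' triples, so its memory footprint grows with the total merged data -- which is exactly the scaling worry.

```python
from collections import defaultdict

# Hypothetical FOAF-style data: (person, "mbox", email) triples
# collected by several scrapers. Finding two people with the same
# email amounts to grouping the merged triples by email address.

def shared_emails(*graphs):
    """Group people by email across all graphs; keep emails used twice+."""
    by_email = defaultdict(set)
    for g in graphs:
        for (person, pred, email) in g:
            if pred == "mbox":
                by_email[email].add(person)
    return {email: people for email, people in by_email.items()
            if len(people) > 1}

scraper1 = {("alice", "mbox", "a@example.org"),
            ("bob", "mbox", "b@example.org")}
scraper2 = {("robert", "mbox", "b@example.org")}

dupes = shared_emails(scraper1, scraper2)
```

With several scrapers at 1M+ people each, the by_email index holds every distinct address in memory at once, which is the cost the merge-then-query approach cannot avoid.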

In short, (i) has difficulties with distribution and (ii) has 
problems with centralization. Is either of these actually 
implemented/implementable? Am I misunderstanding the objective?

  thanks
  JH

-- 
Professor James Hendler			  http://www.cs.umd.edu/users/hendler 
Director, Semantic Web and Agent Technologies	  301-405-2696
Maryland Information and Network Dynamics Lab.	  301-405-6707 (Fax)
Univ of Maryland, College Park, MD 20742
Received on Friday, 30 July 2004 17:32:34 GMT
