- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Sat, 31 Jul 2004 00:55:58 -0400
- To: Jim Hendler <hendler@cs.umd.edu>
- Cc: Jos De_Roo <jos.deroo@agfa.com>, "Seaborne, Andy" <andy.seaborne@hp.com>, public-rdf-dawg@w3.org
- Message-ID: <20040731045558.GB13232@w3.org>
On Fri, Jul 30, 2004 at 05:31:49PM -0400, Jim Hendler wrote:
> Forgive me for jumping in late -- but I am catching up after a bunch
> of travel -- I've looked at 4. and the new 4.5.1 and I must admit to
> confusion -- 4.5.1 looks kind of cool, but strikes me as sort of
> either amazingly difficult to implement or not terribly useful - so I
> may be missing something...
>
> That is:
>
> 4.5.1 Querying Multiple Sources
>
> It should be possible for a query to specify which of the available
> RDF graphs it is to be executed against. If more than one RDF graph
> is specified, the result is as if the query had been executed against
> the merge of the specified RDF graphs. Query processors with a single
> available RDF graph trivially satisfy this objective.
>
>
> now consider -- if we used the old 4.5 we simply sent the query to
> each DB and aggregated the results. In the new one, we have two
> choices: either we (i) handle it in a distributed way, or (ii) we
> merge the graphs and then query them
>
> (i) seems to me to be very difficult - in fact, I'm pretty sure this
> is a hard research task I would give someone a PhD for -- that is, if
> we assume the graph is distributed among many servers, and each only
> has part of the query space, then suppose I'm querying for a set of
> triples concerning variables A,B, and C. If I send the whole query
> to every DB, there is not likely to be any one which unifies with all
> the variables since they may be distributed among the various stores.
> If I have to analyze the query, know what is in the stores, and then
> send only the appropriate pieces of queries to the appropriate
> servers and then reassemble the results, well, that seems hard to
> implement (in fact, doing this in DB space has been the subject of a
> number of research projects and theses in the past few years - so I
> am pretty sure this is non-trivial to say the least)
>
> (ii) if we assume that to avoid the difficulty in (i) we first unify
> the graphs and then query them, well heck that won't scale worth crap
> -- supposing, for example, I'm playing with the results of several
> FOAF scrapers -- each one has collected more than 1M people and my
> query is to find any two people with the same email address (or any
> other feature) -- if I have to merge the graphs, I'll need some huge
> amount of memory to do this
>
> In short, (i) has difficulties with distribution and (ii) has
> problems with centralization -- is either of these actually
> implemented/implementable? Am I misunderstanding the objective??

(i) has an almost trivial solution when you allow the user to select
what part of the query goes where. This pretty accurately reflects how
people do research today, finding pages with one sort of information
and manually (mentally) merging that with data with another sort of
information. For instance, I believe that the CDDB/IMDB example is a
perfectly reasonable model of the degree of expertise we can rely on
from today's moderately knowledgeable user.

(ii) is how most of us do our banal little queries every day. Rarely
do I see people making the same RDF query over multiple repositories.
Instead they identify a couple of sources, merge them, and do a query
across the resulting graph. Most data that I've seen seems to be
organized such that extra repositories complement the data with
related data rather than supplementing with additional data of the
same form.
I think that (ii) represents a big part of what we want people to be
able to do with the semantic web.

(iii) (Aggregate Query) can be easily accomplished with SQL today
without grounding your terms in a global namespace that allows
documents to merge. I think that the cool thing *is* merging graphs.
Yes, that's expensive, but I don't think that the new problems that we
want to address with the semantic web get solved any other way.

-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +1.857.222.5741 (does not work in Asia)

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
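
The difference between querying the merge of two graphs (ii) and
aggregating per-source results (iii) can be sketched in a few lines.
This is only an illustration, not anything proposed in the thread: the
graphs are plain Python sets of triples, the `ex:` identifiers and the
`match` and `query` helpers are hypothetical, and set union stands in
for an RDF merge, which is only valid here because the toy data
contains no blank nodes.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):              # "?name" marks a variable
            if new.setdefault(p, t) != t:  # conflicts with an earlier binding
                return None
        elif p != t:                       # constant term does not match
            return None
    return new


def query(graph, patterns):
    """Return every variable binding that satisfies all patterns in `graph`."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for b in bindings:
            for triple in graph:
                b2 = match(pattern, triple, b)
                if b2 is not None:
                    extended.append(b2)
        bindings = extended
    return bindings


# Hypothetical toy data: one music source and one film source that
# complement each other by sharing the identifier "ex:alice".
cddb = {("ex:album1", "ex:performer", "ex:alice")}
imdb = {("ex:film1", "ex:actor", "ex:alice")}

# Who both performed on an album and acted in a film?
patterns = [("?album", "ex:performer", "?who"),
            ("?film", "ex:actor", "?who")]

merged = query(cddb | imdb, patterns)                       # (ii) merge, then query
aggregated = query(cddb, patterns) + query(imdb, patterns)  # (iii) query each, then aggregate

print("merged:", merged)
print("aggregated:", aggregated)
```

In this sketch the merged graph yields one binding for `?who`, while
the aggregate of the per-source answers is empty, because neither
source alone satisfies both patterns. The join across sources only
works because both graphs use the same identifier for the shared
resource.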
Received on Saturday, 31 July 2004 00:56:08 UTC