Re: Ambiguity and 4.5 Aggregate Query (and screw case) from Eric Prud'hommeaux on 2004-07-01 (public-rdf-dawg@w3.org from April to June 2004)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Wed, 30 Jun 2004 22:14:27 -0400
To: Simon Raboczi <raboczi@tucanatech.com>
Cc: "Seaborne, Andy" <andy.seaborne@hp.com>, public-rdf-dawg@w3.org
Message-ID: <20040701021427.GG20261@w3.org>
On Wed, Jun 30, 2004 at 12:41:26PM -0400, Simon Raboczi wrote:
> 
> 
> On 30/06/2004, at 8:30, Seaborne, Andy wrote:
> 
> >
> >-------- Original Message --------
> >>From: Kendall Clark <>
> >>Date: 29 June 2004 14:16
> >>
> >>On Tue, Jun 29, 2004 at 07:03:24AM -0400, Eric Prud'hommeaux wrote:
> >>>[[
> >>>4.5 Aggregate Query
> >>
> >>We discussed this originally, as I recall. Aggregate, then query is
> >>distinct from query separately, aggregate results. I called what
> >>you're proposing "union query". Again, as I recall the discussion,
> >>there was more support for aggregate query than union query.
> >
> >There seems to me to be no need for explicit support for union query.  If
> >the union is valuable, then make the union an identifiable web resource and
> >query that.  In other words, the query names the union as the target and
> >there is no need to have anything in the QL or protocol.
> >
> >What this approach to union query does not permit is arbitrary, temporary
> >union.  As fetching a graph over the web is not trivial, having a server
> >which allowed client requests to cause many large GETs (if implemented by
> >merge locally and query) to happen seems OK for small experiments only.  An
> >implementation could be done which asked each triple pattern in turn (with
> >previous triple matching values substituted in - it's a search tree here not
> >a linear pass) to avoid GETting the whole models, but it would cause very
> >large numbers of requests from the request server to the model owners'
> >servers.
> >
> >This can't be done with aggregate result query - the system would need a way
> >to name the separate graphs if it isn't done by the client issuing requests
> >to each target and merging the results itself.  This is about the same
> >amount of data traffic if the results aren't having duplicates removed -
> >only extra copies of the query go out; it may be faster for the client to do
> >it as requests can be sent in parallel (network speed impacts this).
> 
> In earlier versions of Kowari, we supported both sorts of graph 
> aggregation in the iTQL "from" clause.  Expressing "from <modelA> or 
> <modelB>" would request aggregate then query (union query), whereas 
> "from <modelA> xor <modelB>" would request separate queries then 
> aggregation (aggregate result query).  These expressions being queried 
> are arbitrary and temporary, but there's a facility to create a named 
> "view" graph whose value is defined by one of these expressions.
> 
> We implemented the union query the way Andy suggested, by querying each 
> triple pattern in turn.  Much as he surmises, in the network-distributed 
> case this generates a great deal of network traffic (although 
> streamability helps by allowing some of the intermediate results to 
> occur at the same time).  It works, but it scales poorly.  However, 
> when used to aggregate graphs stored on the same server, performance 
> can be excellent.  Combining constraint results from different graphs 
> is really no different from combining results from different subjects 
> or predicates if your native store is based on quads.  As a result, 
> rather than being a query form only viable for small experiments, the 
> union query is the workhorse operator in every "from" clause, allowing 
> very large numbers of statements within a server to be manageably 
> organized into various named graphs.  The "xor" operator for aggregate 
> result query ended up relegated to the status of a performance hack, 
> used only for network distributed queries whose data were distributed 
> in such a way that independent servers could meaningfully satisfy all the 
> query constraints on their own.
> 
> Union query will be the most useful graph aggregation operation 
> whenever it's feasible, and it's definitely feasible in at least the 
> non-distributed case.

It seems that a query service that's asked to select from multiple
sources (that themselves can service queries) can:
  relay: send the entire query to each source.
  aggregate: perform the query on the union of the sources.
  federate: send pieces of the query to different sources according
	    to a priori knowledge or targets in the query.

Relaying is the most scalable. Aggregation is the most like the user
experience that the semantic web promises. Federation can compete with
both, but is more complicated to write into a query.
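
To make the difference between relaying and aggregating concrete, here
is a minimal Python sketch. Everything in it is hypothetical: the two
tiny in-memory sources, the triple layout, and the function names are
my own illustration, not any implementation's API. It only shows that
a query whose patterns span two sources returns nothing when relayed,
but succeeds against the merged graph.

```python
# Hypothetical sources: each is a set of (subject, predicate, object) triples.
SOURCE_A = {("film1", "title", "Alien")}
SOURCE_B = {("film1", "soundtrack", "disc9")}


def match(pattern, triple, binding):
    """Return binding extended so that pattern matches triple, else None."""
    binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variable: bind it, or check it agrees with an earlier binding.
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding


def query(graph, patterns):
    """Evaluate a conjunction of triple patterns against a single graph."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in graph
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return bindings


def relay(sources, patterns):
    """Send the entire query to each source and concatenate the answers."""
    return [b for graph in sources for b in query(graph, patterns)]


def aggregate(sources, patterns):
    """Perform the query on the union of the sources."""
    return query(set().union(*sources), patterns)


PATTERNS = [("?f", "title", "?t"), ("?f", "soundtrack", "?d")]
RELAYED = relay([SOURCE_A, SOURCE_B], PATTERNS)
AGGREGATED = aggregate([SOURCE_A, SOURCE_B], PATTERNS)
print(RELAYED)     # neither source can satisfy both patterns on its own
print(AGGREGATED)  # the join succeeds on the merged graph
```

This is the case Simon describes: relaying is only meaningful when each
source can satisfy all the query constraints on its own.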

I'm personally keen on providing the user with the ability to target
parts of the query because I believe that users will often know which
data comes from, say, IMDB, and which comes from CDDB [1].
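
A rough Python sketch of what targeting parts of a query might look
like. The source names "movies" and "discs", the data, and the
(source, pattern) pairing are all assumptions made up for this
example; no query language syntax is implied. Each triple pattern is
evaluated only against the source the user named for it, and the
bindings are joined as they go.

```python
# Hypothetical named sources, each a set of (subject, predicate, object) triples.
SOURCES = {
    "movies": {("film1", "title", "Alien"), ("film2", "title", "Brazil")},
    "discs": {("film1", "soundtrack", "disc9")},
}


def match(pattern, triple, binding):
    """Return binding extended so that pattern matches triple, else None."""
    binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variable: bind it, or check it agrees with an earlier binding.
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding


def federate(sources, targeted_patterns):
    """Evaluate each pattern against its named source, joining bindings."""
    bindings = [{}]
    for source_name, pattern in targeted_patterns:
        graph = sources[source_name]
        bindings = [
            extended
            for binding in bindings
            for triple in graph
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return bindings


# The user states which source should answer which part of the query.
TARGETED_QUERY = [
    ("movies", ("?f", "title", "?t")),
    ("discs", ("?f", "soundtrack", "?d")),
]
RESULT = federate(SOURCES, TARGETED_QUERY)
print(RESULT)
```

Only bindings that survive the first source are carried to the second,
so the whole graphs never need to be merged.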


[1] http://lists.w3.org/Archives/Public/public-rdf-dawg/2004JanMar/0131.html
-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +1.857.222.5741 (does not work in Asia)

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Wednesday, 30 June 2004 22:14:27 UTC