Re: federation use case from Eric Prud'hommeaux on 2004-07-29 (public-rdf-dawg@w3.org from July to September 2004)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Thu, 29 Jul 2004 11:26:27 -0400
To: "Seaborne, Andy" <andy.seaborne@hp.com>
Cc: public-rdf-dawg@w3.org
Message-ID: <20040729152627.GA3684@w3.org>
On Thu, Jul 29, 2004 at 03:36:07PM +0100, Seaborne, Andy wrote:
> 
> Eric wrote:
> > ===== A Practical Union Query =====
> > 
> > FOAF documents seldom have assertions of the form
> >   A foaf:knows B. B foaf:depiction C.
> > so graph merging is required to answer queries like:
> > 
> >   ?who foaf:depiction ?pic1.
> >   ?who foaf:knows ?whom.
> >   ?whom foaf:depiction ?pic2.
> > 
> > FOAF documents tend to be small so merging the graphs isn't too painful.
> 
> Merging two documents can be done purely at the RDf level but often it is
> useful to "smush" the graphs together: in FOAF, that enables joining
> together the information about one person from separate files (separate
> bNodes).
> 
> aggregation strategies <http://rdfweb.org/2001/01/design/smush.html>
> posted by DanC_tst at 2001-07-20 16:02 (+)
> in
> http://rdfig.xmlhack.com/2001/07/20/2001-07-20.html
> 
> > ===== A Practical Federate Query =====
> > 
> > To date, my implementations are more simple than the efficient. The
> > first query returns a set of results and the second query is
> > iteratively performed for each of those results, ala:
> 
> OK - I understand federation better - thanks for fleshing out the example.
> 
> Interestingly, that is how I do matching within a single query in BRQL/Jena.
> Bindings are passed from one part, say, the WHERE clause, to an OPTIONAL
> clause.  The bindings are substituted into a pattern before graph matching.
> The only difference is that all queries (conjunctive graph matches) go to
> the same source each time.

I suspect that keeping a running table of bindings is a common way to
solve this problem. The Algae spec even mandated such an approach and
used that to define the opperations like conjunction, disjunction,
not, optional. It may be handy for BRQL to do the same, but I'd like
to hear from the backward chaining community on this.

Jos, is it useful to define logical query operations in terms of the
affect on a "current" result set? Is that easy to map into forward
chaining terms?
I guess this is for the plain english part of the description as I bet
the algebra will be defined in terms sets.

> The original RDQL execution engine in Jena1 did much the same on a
> triple-by-triple basis.  In Jena2, this moved into the lower levels but also
> the SQL subsystem (due to Kevin et al) takes the whole expression and
> creates a single join (usually) expression.
> 
> > * I actually find it much safer than conventional parameterized
> > queries because you can use named parameters rather than simply
> > substituting the parameters to execute for the '?'s in the query
> > in the order that they appear.
> 
> +1 to named parameters.
> 
> 	Andy
> 
> 
> -------- Original Message --------
> > From: Eric Prud'hommeaux <mailto:eric@w3.org>
> > Date: 28 July 2004 16:14
> > 
> > On Wed, Jul 28, 2004 at 09:33:19AM -0400, Farrukh Najmi wrote:
> > > 
> > > Seaborne, Andy wrote:
> > > 
> > > > 
> > > > Could this be addressed with a general paramterised queries (see
> > > > also [1]) mechanism?  If CDDB and IMDB were on the same site it
> > > > would be nice to flow the paramters from the first query to the
> > > > second without a round trip back to the client.
> > > 
> > > I do not yet see the connection between federated queries and either
> > > parameterized queries or aggregate queries [1]. Parameterized queries
> > > (and Aggregate Queries) may be federated or not. I feel that
> > > federation is an orthogonal concept.
> > 
> > I brought up Aggregate Queries for contrast. By our current wording,
> > Aggregate Queries are useful if you wish to perform the same query in
> > multiple places and put the results together in some yet-undefined
> > way.
> > 
> > ===== A Practical Aggregate Query =====
> > 
> > Because FOAF data is pretty subject-centric, queries of the form
> > 
> >   ?w foaf:name ?name.
> >   ?w foaf:nick ?nick.
> >   ?w foaf:dataOfBirth ?dob.
> >   ?w foaf:depiction ?picture.
> > 
> > can frequently be answered by isolated FOAF documents. Merging a bunch
> > of FOAF documents and performing this query will frequently give you
> > no more results than querying the individual documents and merging
> > (does that mean striking obvious duplicates?) the results.
> > 
> > ===== A Practical Union Query =====
> > 
> > FOAF documents seldom have assertions of the form
> >   A foaf:knows B. B foaf:depiction C.
> > so graph merging is required to answer queries like:
> > 
> >   ?who foaf:depiction ?pic1.
> >   ?who foaf:knows ?whom.
> >   ?whom foaf:depiction ?pic2.
> > 
> > FOAF documents tend to be small so merging the graphs isn't too painful.
> > 
> > ===== A Practical Federate Query =====
> > 
> > The CDDB/IMDB example is fairly compelling because we have an
> > intuitive sense of what sort of data those sites serve and how it
> > would be organized. For consistency with the earlier FOAF examples,
> > imagine a couple of predicate-centric FOAF repositories. The first
> > stores data like:
> > 
> > KnowsDB
> >   A foaf:mbox EMAILA.
> >   A foaf:knows B.
> >   B foaf:mbox EMAILB.
> > 
> > and the other stores pictures:
> > 
> > PicDB
> >   A foaf:mbox EMAILA.
> >   A foaf:depiction PICA.
> > 
> > If we want the pictures of bob@example.com's friends
> > 
> >   ?who foaf:mbox <mailto:bob@example.com>.
> >   ?who foaf:knows ?whom.
> >   ?whom foaf:depiction ?pic.
> > 
> > an Aggregate Query will get zero results, a Union Query will require
> > the merge of two potentially huge data sources, and the properly
> > formed Federated Query
> > 
> > KnowsDB:
> >   ?who foaf:mbox <mailto:bob@example.com>.
> >   ?who foaf:knows ?whom.
> >   ?whom foaf:mbox ?mbox.
> > PicDB:
> >   ?w foaf:mbox ?mbox.
> >   ?w foaf:depiction ?pic.
> > 
> > will answer the question as efficiently as the protcol permits.
> > 
> > ===== Executing a Federated Query =====
> > 
> > To date, my implementations are more simple than the efficient. The
> > first query returns a set of results and the second query is
> > iteratively performed for each of those results, ala:
> > 
> > result of first query:
> >   +--------------------------+
> >   |                    ?mbox |
> >   +--------------------------+
> >   | <mailto:sue@example.com> |
> >   | <mailto:ann@example.com> |
> >   | <mailto:joe@example.com> |
> >   | <mailto:frd@example.com> |
> >   | <mailto:pat@example.com> |
> >   +--------------------------+
> > 
> > following queries:
> >   ?w foaf:mbox <mailto:sue@example.com>.
> >   ?w foaf:depiction ?pic.
> > ==>
> >   +-------------------------------+
> >   |                          ?pic |
> >   +-------------------------------+
> >   | <http://example.com/pics/sue> |
> >   +-------------------------------+
> > 
> >   ?w foaf:mbox <mailto:ann@example.com>.
> >   ?w foaf:depiction ?pic.
> > ==>
> >   +-------------------------------+
> >   |                          ?pic |
> >   +-------------------------------+
> >   | <http://example.com/pics/ann> |
> >   +-------------------------------+
> > ...
> > 
> > This costs (in this example, five) round trips but is still more
> > efficient than blindly scooping up all the triples in both the KnowsDB
> > and the PicDB.
> > 
> > ===== Parameterized Query =====
> > 
> > Because we had already bound ?mbox, we were able to transform
> >   ?w foaf:mbox ?mbox.
> >   ?w foaf:depiction ?pic.
> > to
> >   ?w foaf:mbox <mailto:sue@example.com>.
> >   ?w foaf:depiction ?pic.
> > ,
> >   ?w foaf:mbox <mailto:ann@example.com>.
> >   ?w foaf:depiction ?pic.
> > ...
> > 
> > This is analogous to conventional parameterized queries *:
> > 
> >   q = prepare("W foaf:mbox ?. W foaf:depiction PIC.")
> >   r1 = q.execute("<mailto:sue@example.com>")
> >   r2 = q.execute("<mailto:ann@example.com>")
> >   ...
> > 
> > * I actually find it much safer than conventional parameterized
> > queries because you can use named parameters rather than simply
> > substituting the parameters to execute for the '?'s in the query
> > in the order that they appear.
> > 
> > ===== Federated Query With Premise =====
> > 
> > If the protocol supported a premise, we could ship the earlier results
> > and have the picture query answered in one round trip:
> > 
> > PicDB:
> >   GIVEN:
> >   +--------------------------+
> >   |                    ?mbox |
> >   +--------------------------+
> >   | <mailto:sue@example.com> |
> >   | <mailto:ann@example.com> |
> >   | <mailto:joe@example.com> |
> >   | <mailto:frd@example.com> |
> >   | <mailto:pat@example.com> |
> >   +--------------------------+
> >   ?w foaf:mbox ?mbox.
> >   ?w foaf:depiction ?pic.
> > ==>
> >   +--------------------------+-------------------------------+
> >   |                    ?mbox |                          ?pic |
> >   +--------------------------+-------------------------------+
> >   | <mailto:sue@example.com> | <http://example.com/pics/sue> |
> >   | <mailto:ann@example.com> | <http://example.com/pics/ann> |
> >   | <mailto:joe@example.com> | <http://example.com/pics/joe> |
> >   | <mailto:frd@example.com> | <http://example.com/pics/frd> |
> >   | <mailto:pat@example.com> | <http://example.com/pics/pat> |
> >   +--------------------------+-------------------------------+
> > 
> > I can imagine providing a premise mechanism in V2 of DAWG-QL. We have
> > a slightly simpler language without the premise and we can answer the
> > same queries (less efficiently) if we limit our language to targeted
> > graph patterns and make the client do the iteration/unification.
> > 
> > But then, maybe it's not too much work to put premises in the
> > language, I dunno.
> > 
> > > As an aside, I wanted to observe that the ebXML Registry use case is
> > > also touching on both the federated query and parameterized query.
> > > 
> > > --
> > > Regards,
> > > Farrukh
> > > 
> > > 
> > > > 
> > > >    Andy
> > > > 
> > > > [1]
> > > >
> http://lists.w3.org/Archives/Public/public-rdf-dawg/2004JulSep/0090.html
> > > > 
> > > > Eric Prud'hommeaux wrote:
> > > > 
> > > > > Joe wants to see what top 10 movies of also had top-ten
> > > > > soundtracks. IMDB has information about movies and CDDB has info
> > > > > about music. Joe writes a query that gets the titles of all the
> > > > > top 10 movies. These are boudn to the variable ?t. He uses those
> > > > > bindings for ?t to then query IMDB to filter out the ones that
> > > > > did not have top 10 soundtracks. 
> > > > > 
> > > > > from CDDB:
> > > > > CONSTRUCT (?t foo bar)
> > > > > WHERE (?m tt:rank ?rc)
> > > > >      (?m cddb:soundtrack ?s)
> > > > >      (?s dc:title ?t)
> > > > > AND ?rc <= 10
> > > > > 
> > > > > from IMDB:
> > > > > SELECT ?t
> > > > > WHERE (?r tt:rank ?ri)
> > > > >      (?r dc:title ?t)
> > > > > AND ?ri <= 10
> > > > > 
> > > > > 
> > > > > This needs the ability to use variables bound in an earlier query
> > > > > to constrain later queries. It also requires some sort of query
> > > > > targeting. In algae, this looks like:
> > > > > 
> > > > > ns tt=<...> ns cddb=<...> ns dc=<...>
> > > > > attach <http://www.w3.org/...#remoteQuery> ?cddb (
> > > > >            server=<http://cddb.com/rq>)
> > > > > ask ?cddb ( ?m tt:rank ?rc {?rc <= 10} .
> > > > >        ?m cddb:soundtrack ?s .
> > > > >        ?s dc:title ?t )
> > > > > attach <http://www.w3.org/...#remoteQuery> ?imdb (
> > > > >            server=<http://imdb.com/querySrvc>)
> > > > > ask ?cddb ( ?r tt:rank ?ri {?ri <= 10} .
> > > > >        ?r dc:title ?t )
> > > > > collect (?t)
> > > > > 
> > > > > 
> > > > > This is very different from (more rigorous and expensive than) our
> > > > > current definition of aggregate query [1]. Such a query would, if
> > > > > the data is divided as the above queries suggest, return zero
> > > > > results. 
> > > > > 
> > > > > [1] http://www.w3.org/2001/sw/DataAccess/UseCases#d4.5
> > > > 
> > > > 
> > > 
> > > 
> > 
> > --
> > -eric
> > 
> > office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
> >                         Shonan Fujisawa Campus, Keio University,
> >                         5322 Endo, Fujisawa, Kanagawa 252-8520
> >                         JAPAN
> >         +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
> > cell:   +1.857.222.5741 (does not work in Asia)
> > 
> > (eric@w3.org)
> > Feel free to forward this message to any list for any purpose other than
> > email address distribution.

-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +1.857.222.5741 (does not work in Asia)

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Thursday, 29 July 2004 11:26:32 UTC