Re: federation use case from Eric Prud'hommeaux on 2004-07-28 (public-rdf-dawg@w3.org from July to September 2004)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Wed, 28 Jul 2004 11:14:17 -0400
To: Farrukh Najmi <Farrukh.Najmi@Sun.COM>
Cc: andy.seaborne@hp.com, public-rdf-dawg@w3.org
Message-ID: <20040728151417.GB30026@w3.org>
On Wed, Jul 28, 2004 at 09:33:19AM -0400, Farrukh Najmi wrote:
> 
> Seaborne, Andy wrote:
> 
> >
> >Could this be addressed with a general paramterised queries (see also 
> >[1]) mechanism?  If CDDB and IMDB were on the same site it would be 
> >nice to flow the paramters from the first query to the second without 
> >a round trip back to the client.
> 
> I do not yet see the connection between federated queries and either 
> parameterized queries or aggregate queries [1]. Parameterized queries 
> (and Aggregate Queries) may be federated or not. I feel that federation 
> is an orthogonal concept.

I brought up Aggregate Queries for contrast. By our current wording,
Aggregate Queries are useful if you wish to perform the same query in
multiple places and put the results together in some yet-undefined
way.

===== A Practical Aggregate Query =====

Because FOAF data is pretty subject-centric, queries of the form

  ?w foaf:name ?name.
  ?w foaf:nick ?nick.
  ?w foaf:dataOfBirth ?dob.
  ?w foaf:depiction ?picture.

can frequently be answered by isolated FOAF documents. Merging a bunch
of FOAF documents and performing this query will frequently give you
no more results than querying the individual documents and merging
(does that mean striking obvious duplicates?) the results.

===== A Practical Union Query =====

FOAF documents seldom have assertions of the form
  A foaf:knows B. B foaf:depiction C.
so graph merging is required to answer queries like:

  ?who foaf:depiction ?pic1.
  ?who foaf:knows ?whom.
  ?whom foaf:depiction ?pic2.

FOAF documents tend to be small so merging the graphs isn't too painful.

===== A Practical Federate Query =====

The CDDB/IMDB example is fairly compelling because we have an
intuitive sense of what sort of data those sites serve and how it
would be organized. For consistency with the earlier FOAF examples,
imagine a couple of predicate-centric FOAF repositories. The first
stores data like:

KnowsDB
  A foaf:mbox EMAILA.
  A foaf:knows B.
  B foaf:mbox EMAILB.

and the other stores pictures:

PicDB
  A foaf:mbox EMAILA.
  A foaf:depiction PICA.

If we want the pictures of bob@example.com's friends

  ?who foaf:mbox <mailto:bob@example.com>.
  ?who foaf:knows ?whom.
  ?whom foaf:depiction ?pic.

an Aggregate Query will get zero results, a Union Query will require
the merge of two potentially huge data sources, and the properly
formed Federated Query

KnowsDB:
  ?who foaf:mbox <mailto:bob@example.com>.
  ?who foaf:knows ?whom.
  ?whom foaf:mbox ?mbox.
PicDB:
  ?w foaf:mbox ?mbox.
  ?w foaf:depiction ?pic.

will answer the question as efficiently as the protcol permits.

===== Executing a Federated Query =====

To date, my implementations are more simple than the efficient. The
first query returns a set of results and the second query is
iteratively performed for each of those results, ala:

result of first query:
  +--------------------------+
  |                    ?mbox |
  +--------------------------+
  | <mailto:sue@example.com> |
  | <mailto:ann@example.com> |
  | <mailto:joe@example.com> |
  | <mailto:frd@example.com> |
  | <mailto:pat@example.com> |
  +--------------------------+

following queries:
  ?w foaf:mbox <mailto:sue@example.com>.
  ?w foaf:depiction ?pic.
==>
  +-------------------------------+
  |                          ?pic |
  +-------------------------------+
  | <http://example.com/pics/sue> |
  +-------------------------------+

  ?w foaf:mbox <mailto:ann@example.com>.
  ?w foaf:depiction ?pic.
==>
  +-------------------------------+
  |                          ?pic |
  +-------------------------------+
  | <http://example.com/pics/ann> |
  +-------------------------------+
...

This costs (in this example, five) round trips but is still more
efficient than blindly scooping up all the triples in both the KnowsDB
and the PicDB.

===== Parameterized Query =====

Because we had already bound ?mbox, we were able to transform
  ?w foaf:mbox ?mbox.
  ?w foaf:depiction ?pic.
to
  ?w foaf:mbox <mailto:sue@example.com>.
  ?w foaf:depiction ?pic.
,
  ?w foaf:mbox <mailto:ann@example.com>.
  ?w foaf:depiction ?pic.
...

This is analogous to conventional parameterized queries *:

  q = prepare("W foaf:mbox ?. W foaf:depiction PIC.")
  r1 = q.execute("<mailto:sue@example.com>")
  r2 = q.execute("<mailto:ann@example.com>")
  ...

* I actually find it much safer than conventional parameterized
queries because you can use named parameters rather than simply
substituting the parameters to execute for the '?'s in the query
in the order that they appear.

===== Federated Query With Premise =====

If the protocol supported a premise, we could ship the earlier results
and have the picture query answered in one round trip:

PicDB:
  GIVEN:
  +--------------------------+
  |                    ?mbox |
  +--------------------------+
  | <mailto:sue@example.com> |
  | <mailto:ann@example.com> |
  | <mailto:joe@example.com> |
  | <mailto:frd@example.com> |
  | <mailto:pat@example.com> |
  +--------------------------+
  ?w foaf:mbox ?mbox.
  ?w foaf:depiction ?pic.
==>
  +--------------------------+-------------------------------+
  |                    ?mbox |                          ?pic |
  +--------------------------+-------------------------------+
  | <mailto:sue@example.com> | <http://example.com/pics/sue> |
  | <mailto:ann@example.com> | <http://example.com/pics/ann> |
  | <mailto:joe@example.com> | <http://example.com/pics/joe> |
  | <mailto:frd@example.com> | <http://example.com/pics/frd> |
  | <mailto:pat@example.com> | <http://example.com/pics/pat> |
  +--------------------------+-------------------------------+

I can imagine providing a premise mechanism in V2 of DAWG-QL. We have
a slightly simpler language without the premise and we can answer the
same queries (less efficiently) if we limit our language to targeted
graph patterns and make the client do the iteration/unification.

But then, maybe it's not too much work to put premises in the
language, I dunno.

> As an aside, I wanted to observe that the ebXML Registry use case is 
> also touching on both the federated query and parameterized query.
> 
> -- 
> Regards,
> Farrukh
> 
> 
> >
> >    Andy
> >
> >[1] 
> >http://lists.w3.org/Archives/Public/public-rdf-dawg/2004JulSep/0090.html
> >
> >Eric Prud'hommeaux wrote:
> >
> >>Joe wants to see what top 10 movies of also had top-ten soundtracks.
> >>IMDB has information about movies and CDDB has info about music. Joe
> >>writes a query that gets the titles of all the top 10 movies. These
> >>are boudn to the variable ?t. He uses those bindings for ?t to then
> >>query IMDB to filter out the ones that did not have top 10
> >>soundtracks.
> >>
> >>from CDDB:
> >>CONSTRUCT (?t foo bar)
> >>WHERE (?m tt:rank ?rc)
> >>      (?m cddb:soundtrack ?s)
> >>      (?s dc:title ?t)
> >>AND ?rc <= 10
> >>
> >>from IMDB:
> >>SELECT ?t
> >>WHERE (?r tt:rank ?ri)
> >>      (?r dc:title ?t)
> >>AND ?ri <= 10
> >>
> >>
> >>This needs the ability to use variables bound in an earlier query to
> >>constrain later queries. It also requires some sort of query
> >>targeting. In algae, this looks like:
> >>
> >>ns tt=<...> ns cddb=<...> ns dc=<...>
> >>attach <http://www.w3.org/...#remoteQuery> ?cddb (
> >>            server=<http://cddb.com/rq>)
> >>ask ?cddb ( ?m tt:rank ?rc {?rc <= 10} .
> >>        ?m cddb:soundtrack ?s .
> >>        ?s dc:title ?t )
> >>attach <http://www.w3.org/...#remoteQuery> ?imdb (
> >>            server=<http://imdb.com/querySrvc>)
> >>ask ?cddb ( ?r tt:rank ?ri {?ri <= 10} .
> >>        ?r dc:title ?t )
> >>collect (?t)
> >>
> >>
> >>This is very different from (more rigorous and expensive than) our
> >>current definition of aggregate query [1]. Such a query would, if the
> >>data is divided as the above queries suggest, return zero results.
> >>
> >>[1] http://www.w3.org/2001/sw/DataAccess/UseCases#d4.5
> >
> >
> 
> 

-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +1.857.222.5741 (does not work in Asia)

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Wednesday, 28 July 2004 11:14:42 UTC