FROM vs. SOURCE (Was: RE: ACTION: discuss & promote union query (Was: ACTION: a replacement for 4.5 focussed on union query) ) from Alberto Reggiori on 2004-08-25 (public-rdf-dawg@w3.org from July to September 2004)

From: Alberto Reggiori <alberto@asemantics.com>
Date: Wed, 25 Aug 2004 03:48:20 -0700 (PDT)
To: "Seaborne, Andy" <andy.seaborne@hp.com>
cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-ID: <20040825034144.W51243@skutsje.san.webweaving.org>
On Tue, 24 Aug 2004, Seaborne, Andy wrote:

>
> -------- Original Message --------
> > From: Rob Shearer <>
> > Date: 24 August 2004 17:27
> >
> > > [[
> > > 4.5 Querying multiple sources
> > >
> > > It should be possible for a query to specify which of the
> > > available RDF
> > > graphs it is to be executed against.  If more than one RDF graph is
> > > specified, the result is as if the query had been executed
> > > against the
> > > merge[1] of the specified RDF graphs.
> >
> > I continue to feel that this feature, and beyond this objective the
> > specific approach taken by BRQL, do more harm than good. Unlike SQL
> > data, RDF is not segmented into tables and thus within a particular
> > server there is absolutely no need to target a particular "piece" of
> > RDF.
>
> I agree with the sentiment here.
>
> I am not actually sure what approach in BRQL is being referred to as there is
> confusion (in my mind at least) between FROM (merge some graphs then execute
> query over that) and SOURCE (quads and provenance on a per triple, maybe per
> graph pattern basis).

I have also been thinking quite a lot about what the relationship between
the FROM and SOURCE clauses in BRQL could be, and how they are used. I
still feel it is important to keep them separate. Both can point to valid
RDF resources, and the SOURCE/provenance resource in particular can also
be a bNode if necessary. As far as I can tell, these are the three major
use cases:

1) FROM: is used to select one or more sources on which to run the query.
The query has to be run on the union/merge graph as specified in
rdf-mt. If more than one source is specified, the graphs have to be
merged/smushed together and the DB interface has to provide a homogeneous
and unique view of the merged graphs. The case of a single FROM is
trivial. No FROM means the source is whatever the local/DBC API connect()
specified. It is up to the software and DB interface how to actually
provide the union/merge of the graphs. The FROM source can be either a
local or a remote resource, and different protocols can be used to
process/retrieve the union graph (e.g. HTTP, custom TCP/IP, JDBC or
others).

E.g. select the title, link, date and description of the rss:items created
by the friends of mailto:danbri@w3.org

SELECT
        ?title ?link ?description ?date
FROM
        <http://rdfweb.org/people/danbri/rdfweb/danbri-foaf.rdf>
        <http://planet.rdfhack.com/index.rdf>
WHERE
        (?person <rdf:type> <foaf:Person>)
        (?person <foaf:mbox> <mailto:danbri@w3.org>)
        (?person <foaf:knows> ?friend)
        (?friend <foaf:name> ?creator)
        (?item <rdf:type> <rss:item>)
        (?item <rss:title> ?title)
        (?item <rss:description> ?description)
        (?item <rss:link> ?link)
        (?item <dc:date> ?date)
        (?item <dc:creator> ?creator)
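
As a rough illustration of the merge semantics in case 1, here is a
minimal Python sketch. It is not BRQL and not any engine's actual
implementation: graphs are modelled as plain sets of string triples, and
the names merge() and match(), the "_:g<i>_" renaming scheme and the
sample data are all assumptions made for the example. The point is only
that blank nodes are kept apart per source before the union is queried.

```python
def merge(*graphs):
    """Union of the graphs, with blank nodes renamed apart per graph."""
    merged = set()
    for i, graph in enumerate(graphs):
        for triple in graph:
            merged.add(tuple(
                f"_:g{i}_{term[2:]}" if term.startswith("_:") else term
                for term in triple
            ))
    return merged


def match(graph, patterns, binding=None):
    """Yield bindings satisfying every pattern; '?x' terms are variables."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    head, rest = patterns[0], patterns[1:]
    for triple in graph:
        new = dict(binding)
        for p, t in zip(head, triple):
            if p.startswith("?"):
                if new.setdefault(p, t) != t:
                    break  # variable already bound to something else
            elif p != t:
                break      # constant term does not match
        else:
            yield from match(graph, rest, new)


# Hypothetical sample data; both graphs happen to reuse the label "_:a".
foaf = {
    ("_:a", "foaf:mbox", "mailto:danbri@w3.org"),
    ("_:a", "foaf:knows", "_:b"),
    ("_:b", "foaf:name", "Alice"),
}
rss = {
    ("_:a", "rss:title", "Hello"),
    ("_:a", "dc:creator", "Alice"),
}

merged = merge(foaf, rss)
patterns = [
    ("?person", "foaf:mbox", "mailto:danbri@w3.org"),
    ("?person", "foaf:knows", "?friend"),
    ("?friend", "foaf:name", "?creator"),
    ("?item", "dc:creator", "?creator"),
    ("?item", "rss:title", "?title"),
]
titles = sorted(b["?title"] for b in match(merged, patterns))
print(titles)  # ['Hello']
```

Without the renaming step the two "_:a" nodes would be conflated, which is
what the merge (as opposed to a naive union) avoids.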

2) SOURCE: can be used to pin-point / select specific properties of the
sources (and further select specific graph-patterns), or the source URI
directly [1]. The BRQL query still has to be run over the sources selected
by FROM and their resulting union/merged graph. The SOURCE can be a URI
spelled out, or a bNode referenced by description and then a variable. In
both cases the SOURCE variable can be returned in the bindings if
requested in the SELECT part. How the query software "associates" the
source/provenance SOURCE information with the actual FROM clause parts /
merged graph (or local DBC connection) is perhaps application/protocol
specific, and different layers of semantics can be put on top of this.
E.g. when the BRQL engine processes the query, it takes care of keeping
track of the provenance information of the different sources specified in
the FROM clause before the actual merge happens, and can then pin-point /
refer to specific sub-graphs of the merged graph using the SOURCE
information. Or, alternatively, the source/provenance information is
"inlined" into the RDF/XML or N-Triples (Quads) graph itself, and a
specific RDF parser/tool is able to extract and use such information
[2]. But there might not be a real "link" between the FROM clause parts
and the actual SOURCE(s).

E.g. given a big FOAF database, give me the names and mboxes of all
users harvested on 2004-05-27T04:34:00+01:00

SELECT
        ?name ?mbox
FROM
        <rdfstore://www.foo.com/a/very/big/foaf/database>
WHERE

        SOURCE ?PPD (?person <rdf:type> <foaf:Person>)
        SOURCE ?PPD (?person <foaf:mbox> ?mbox)
        SOURCE ?PPD (?person <foaf:name> ?name)
        (?PPD <rdf:type> <foaf:PersonalProfileDocument>)
        (?PPD <dcq:modified> "2004-05-27T04:34:00+01:00")
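
To make case 2 a little more concrete, here is a small Python sketch of
matching over quads, i.e. triples that carry their source. It is only an
illustration under assumptions: the (source, subject, predicate, object)
layout, the unify() and query() helpers, the use of None as "any source"
for patterns written without SOURCE, and the doc1/doc2/meta identifiers
are all made up for the example.

```python
def unify(pattern, quad, binding):
    """Extend binding so pattern matches quad, or return None.

    A None term in the pattern is a wildcard (pattern without SOURCE).
    """
    new = dict(binding)
    for p, q in zip(pattern, quad):
        if p is None:
            continue
        if p.startswith("?"):
            if new.setdefault(p, q) != q:
                return None
        elif p != q:
            return None
    return new


def query(quads, patterns):
    """Join all patterns over the quads, one pattern at a time."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            b2
            for b in bindings
            for quad in quads
            if (b2 := unify(pattern, quad, b)) is not None
        ]
    return bindings


# Hypothetical quads: (source, subject, predicate, object).
quads = [
    ("doc1", "_:p1", "foaf:name", "Alice"),
    ("doc1", "_:p1", "foaf:mbox", "mailto:alice@example.org"),
    ("doc2", "_:p2", "foaf:name", "Bob"),
    ("doc2", "_:p2", "foaf:mbox", "mailto:bob@example.org"),
    ("meta", "doc1", "dcq:modified", "2004-05-27T04:34:00+01:00"),
    ("meta", "doc2", "dcq:modified", "2004-06-01T00:00:00+01:00"),
]

patterns = [
    ("?PPD", "?person", "foaf:mbox", "?mbox"),   # SOURCE ?PPD (...)
    ("?PPD", "?person", "foaf:name", "?name"),   # SOURCE ?PPD (...)
    (None, "?PPD", "dcq:modified", "2004-05-27T04:34:00+01:00"),
]
results = [(b["?name"], b["?mbox"]) for b in query(quads, patterns)]
print(results)  # [('Alice', 'mailto:alice@example.org')]
```

The same variable ?PPD is used both in the source position and as the
subject of an ordinary pattern, which is what lets the query select
triples by properties of their source.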

3) SOURCE: use SOURCE to actually federate/distribute or chain the query
over several different RDF sources, some of which might be local and some
remote. Some might be specified in the FROM clause as in the cases
above, while others might be specified by attaching DAWG-specific RDF
properties (e.g. dc:source) and referring to such sources by description
in the graph-patterns. The actual query is then run on (possibly)
several graphs, one of which might well be the union / merged graph
specified in the FROM clause. In other words, the query spans several
interconnected graphs, which might reside on different machines/URLs. This
has a very *direct* network / Web effect on the query, and clearly allows
the user to really join different pieces of RDF sitting on different
servers. This is different from having them all specified in the FROM
clause, since that would require a full merge/union/smush of the
corresponding graphs. Instead, in this case the FROM clause might not even
be specified, and the actual sources on which to run the query are
specified in the graph-patterns. This does not require any real merge of
the graphs as such, but rather a simpler streaming chain over which the
query is run. It is in fact possible to fire each independent
graph-pattern at different RDF sources and join the results together,
or even to connect to each single source, iterate over the graph-patterns
and send each request to the right backend/source/DBC interface to get the
next match. Constraints might be either "global" per query or local per
graph-pattern, and then pushed down to the specific DB backend if needed.

E.g. we could run the query of the first example "select the title, link,
date and description of the rss:items created by the friends of
mailto:danbri@w3.org" using only SOURCE clauses (note: there is no FROM
clause, but one could be specified, and its union/merge graph would
contribute to the triple-base along with the other sources)

SELECT
        ?title ?link ?description ?date
WHERE
        (?FOAF <rdf:type> <foaf:PersonalProfileDocument>)
        (?FOAF <dc:source> <http://rdfweb.org/people/danbri/rdfweb/danbri-foaf.rdf>) // connect to the FOAF source (could be rdfs:seeAlso)
        SOURCE ?FOAF (?person <rdf:type> <foaf:Person>)
        SOURCE ?FOAF (?person <foaf:mbox> <mailto:danbri@w3.org>)
        SOURCE ?FOAF (?person <foaf:knows> ?friend)
        SOURCE ?FOAF (?friend <foaf:name> ?creator)
        (?RSS <rdf:type> <rss:channel>) // not relevant
        (?RSS <dc:source> <http://planet.rdfhack.com/index.rdf>) // connect to the RSS source
        SOURCE ?RSS   (?item <rdf:type> <rss:item>)
        SOURCE ?RSS   (?item <rss:title> ?title)
        SOURCE ?RSS   (?item <rss:description> ?description)
        SOURCE ?RSS   (?item <rss:link> ?link)
        SOURCE ?RSS   (?item <dc:date> ?date)
        SOURCE ?RSS   (?item <dc:creator> ?creator)

Each part of the distributed/federated query can of course be run
independently and joined at the API/DBC level, but that would require more
work and would not allow one to declare how to actually "join" the source
information. It is worth noting that this solution would allow the query
to be fully streamed over the different sources, and quite efficiently; it
just needs some simple API such as getnextResult() or similar.
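
The streaming chain described in case 3 could look roughly like the
following Python sketch, where generators play the role of a
getnextResult()-style API. This is an assumption-laden illustration, not
a proposal for the actual interface: sources are just in-memory lists of
triples, and matches(), federated() and the sample data are hypothetical
names introduced here.

```python
def matches(source, pattern, binding):
    """Lazily yield extended bindings for one pattern against one source."""
    for triple in source:
        new = dict(binding)
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                if new.setdefault(p, t) != t:
                    break  # conflicts with an earlier binding
            elif p != t:
                break      # constant term does not match
        else:
            yield new


def federated(steps, binding=None):
    """Chain (source, pattern) steps; nothing is merged up front."""
    binding = binding or {}
    if not steps:
        yield binding
        return
    (source, pattern), rest = steps[0], steps[1:]
    for b in matches(source, pattern, binding):
        yield from federated(rest, b)


# Two independent, hypothetical sources that are never merged.
foaf = [
    ("_:d", "foaf:mbox", "mailto:danbri@w3.org"),
    ("_:d", "foaf:knows", "_:f"),
    ("_:f", "foaf:name", "Alice"),
]
rss = [
    ("item1", "dc:creator", "Alice"),
    ("item1", "rss:title", "Hello"),
    ("item2", "dc:creator", "Carol"),
    ("item2", "rss:title", "Other"),
]

steps = [
    (foaf, ("?person", "foaf:mbox", "mailto:danbri@w3.org")),
    (foaf, ("?person", "foaf:knows", "?friend")),
    (foaf, ("?friend", "foaf:name", "?creator")),
    (rss, ("?item", "dc:creator", "?creator")),
    (rss, ("?item", "rss:title", "?title")),
]

results = federated(steps)
first = next(results)        # plays the role of getnextResult()
print(first["?title"])       # Hello
second = next(results, None)
print(second)                # None, no further matches
```

Each pattern is sent only to the source it names, and the join happens
through the shared variable ?creator as results are pulled one at a time.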

A more elaborate example might be the following:

E.g. give me the location, level and temperature of the merge of the meteo
and time-series databases

SELECT
          ?location ?level ?temperature
WHERE
      (?db1 <rdf:type> <rdfstore:Source>)
      (?db1 <dc:source> <rdfstore://dnmi.no/weather/temperature>) // connect to RDF db1
      SOURCE ?db1 (?item <rdf:type> <rss:item> )
      SOURCE ?db1 (?item <geo:name> ?location )
      SOURCE ?db1 (?item <meteo:temperature> ?temperature )
      (?db2 <rdf:type> <rdfstore:Sql_Source>)
      (?db2 <dc:source> <jdbc://statkraft.no/water/magazines>) // connect to db2 (SQL one) perhaps mapped with D2R or something
      SOURCE ?db2 (?magazine <rdf:type> <hydro:Time_Series> )
      SOURCE ?db2 (?magazine <hydro:location> ?location )
      SOURCE ?db2 (?magazine <hydro:water_level> ?level )
AND
       (?level > 50)
USING
 geo for <http://www.w3.org/2003/01/geo/wgs84_pos#>
 rss for <http://purl.org/rss/1.0/>
 rdf for <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
 meteo for <http://www.dbmi.no/meteorology#>
 hydro for <http://www.statkraft.no/hydrology#>
 dc for <http://purl.org/dc/elements/1.1/>

does it make sense?

Alberto

[1] http://lists.w3.org/Archives/Public/public-rdf-dawg/2004JulSep/0307.html
[2] http://lists.w3.org/Archives/Public/www-rdf-interest/2004Feb/0209.html


>
> That said, it is the semantic WEB - RDF graphs, as documents can live at
> locations on the web so there is some segmentation-like effect.  It does
> give rise to the problems in FROM/SOURCE about repeated triples from
> different sources, and inferencing over the combination.
>
Received on Wednesday, 25 August 2004 11:00:12 UTC