- From: Alberto Reggiori <alberto@asemantics.com>
- Date: Wed, 25 Aug 2004 03:48:20 -0700 (PDT)
- To: "Seaborne, Andy" <andy.seaborne@hp.com>
- cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
On Tue, 24 Aug 2004, Seaborne, Andy wrote:

>
> -------- Original Message --------
> > From: Rob Shearer <>
> > Date: 24 August 2004 17:27
> >
> > > [[
> > > 4.5 Querying multiple sources
> > >
> > > It should be possible for a query to specify which of the
> > > available RDF
> > > graphs it is to be executed against. If more than one RDF graph is
> > > specified, the result is as it the query had been executed
> > > against the
> > > merge[1] of the specified RDF graphs.
> >
> > I continue to feel that this feature, and beyond this objective the
> > specific approach taken by BRQL, do more harm than good. Unlike SQL
> > data, RDF is not segmented into tables and thus within a particular
> > server there is absolutely no need to target a particular "piece" of
> > RDF.
>
> I agree with the sentiment here.
>
> I not actually sure what approach in BRQL is being referred to as there is
> confusion (in my mind at least) between FROM (merge some graphs then execute
> query over that) and SOURCE (quads and provenance on a per triple, maybe per
> graph pattern basis).

I have also been thinking quite a lot about what the relationship between the FROM and SOURCE clauses in BRQL could be, and about how they are used. I still feel it is important to keep them separate. Both can point to valid RDF resources, and the SOURCE/provenance resource in particular can also be a bNode if necessary. As far as I can see, these are the three major use cases:

1) FROM: used to select one or more sources on which to run the query. The query has to be run on the union/merge graph as specified in rdf-mt. If more than one source is specified, the graphs have to be merged/smushed together, and the DB interface has to provide a single, homogeneous view of the merged graphs. The case of a single FROM is trivial. No FROM means the query runs against whatever the local/DBC API connect() call specified. It is up to the software and the DB interface how the union/merge of the graphs is actually provided.
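
To make the merge step concrete, here is a minimal Python sketch of what "merge the FROM graphs, then query" could look like. It is only an illustration under stated assumptions: graphs are plain sets of (subject, predicate, object) string tuples, blank nodes are written with a "_:" prefix, and merge_graphs() and the sample data are hypothetical names, not part of BRQL or of any RDF toolkit.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def is_bnode(term: str) -> bool:
    # Assumption of this sketch: blank nodes are strings starting with "_:".
    return term.startswith("_:")


def merge_graphs(graphs: Iterable[Set[Triple]]) -> Set[Triple]:
    """Union the graphs after giving each graph's bNodes a distinct prefix."""
    merged: Set[Triple] = set()
    for index, graph in enumerate(graphs):
        def rename(term: str) -> str:
            # bNodes are local to their graph, so keep them apart in the merge.
            return f"_:g{index}_{term[2:]}" if is_bnode(term) else term

        for s, p, o in graph:
            merged.add((rename(s), p, rename(o)))
    return merged


# Illustrative data only: both graphs happen to reuse the bNode label "_:a".
foaf = {("_:a", "foaf:name", "Alice"), ("_:a", "foaf:knows", "_:b")}
rss = {("_:a", "rss:title", "Some feed"), ("ex:item", "dc:creator", "Alice")}
merged = merge_graphs([foaf, rss])
```

The renaming matters because the two "_:a" nodes come from different documents and must not be conflated; a plain set union would silently join them.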
The FROM source can be either a local or a remote resource, and different protocols can be used to process/retrieve the union graph (e.g. HTTP, custom TCP/IP, JDBC or others).

E.g. select the title, link, date and description of the rss:items created by friends of mailto:danbri@w3.org

  SELECT ?title ?link ?description ?date
  FROM <http://rdfweb.org/people/danbri/rdfweb/danbri-foaf.rdf>
       <http://planet.rdfhack.com/index.rdf>
  WHERE
    (?person <rdf:type> <foaf:Person>)
    (?person <foaf:mbox> <mailto:danbri@w3.org>)
    (?person <foaf:knows> ?friend)
    (?friend <foaf:name> ?creator)
    (?item <rdf:type> <rss:item>)
    (?item <rss:title> ?title)
    (?item <rss:description> ?description)
    (?item <rss:link> ?link)
    (?item <dc:date> ?date)
    (?item <dc:creator> ?creator)

2) SOURCE: can be used to pin-point / select specific properties of the sources (and further select specific graph-patterns), or the source URI directly [1]. The BRQL query still has to be run over the sources selected by FROM and their resulting union/merged graph. The SOURCE can be a spelled-out URI, or a bNode referenced by description and then bound to a variable. In both cases the SOURCE variable can be returned in the bindings if requested in the SELECT part.

How the query software "associates" the source/provenance SOURCE information with the actual FROM clause parts / merged graph (or local DBC connection) is perhaps application/protocol specific, and different layers of semantics can be put on top of this. E.g. when the BRQL engine processes the query, it could keep track of the provenance information of the different sources specified in the FROM clause before the actual merge happens, and then be able to pin-point / refer to specific sub-graphs of the merged graph using the SOURCE information. Alternatively, the source/provenance information could be "inlined" into the RDF/XML or N-Triples (Quads) graph itself, with a specific RDF parser/tool able to extract and use such information [2].
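
One way to picture the first option (tracking provenance before the merge) is to store quads instead of triples. The following is a rough Python sketch under my own assumptions: load_with_provenance() and match() are hypothetical helpers, the example.org URLs are placeholders, and None is used as a wildcard in patterns.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Quad = Tuple[str, str, str, str]  # (source, subject, predicate, object)


def load_with_provenance(sources: Dict[str, Set[Triple]]) -> List[Quad]:
    """Tag every triple with the source it came from, before any merge."""
    return [(src, s, p, o) for src, graph in sources.items() for s, p, o in graph]


def match(
    quads: List[Quad],
    source: Optional[str] = None,
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Quad]:
    """Yield the quads that fit the pattern; None acts as a wildcard."""
    for quad in quads:
        if all(want is None or want == got for want, got in zip((source, s, p, o), quad)):
            yield quad


# Illustrative data: the same triple is asserted by two different documents.
sources = {
    "http://example.org/a.rdf": {("ex:alice", "foaf:name", "Alice")},
    "http://example.org/b.rdf": {
        ("ex:alice", "foaf:name", "Alice"),
        ("ex:bob", "foaf:name", "Bob"),
    },
}
quads = load_with_provenance(sources)

# The FROM view is the merged graph: drop the source column and de-duplicate.
merged_view = {quad[1:] for quad in quads}

# The SOURCE view still knows which documents asserted a given triple.
alice_sources = sorted(q[0] for q in match(quads, s="ex:alice"))
```

With this layout a repeated triple appears once in the merged view but keeps one quad per source, so a SOURCE variable can still be bound to each document that asserted it.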
But there might not be a real "link" between the FROM clause parts and the actual SOURCE(s).

E.g. given a big FOAF database, give me the names and mboxes of all users harvested on 2004-05-27T04:34:00+01:00

  SELECT ?name ?mbox
  FROM <rdfstore://www.foo.com/a/very/big/foaf/database>
  WHERE
    SOURCE ?PPD (?person <rdf:type> <foaf:Person>)
    SOURCE ?PPD (?person <foaf:mbox> ?mbox)
    SOURCE ?PPD (?person <foaf:name> ?name)
    (?PPD <rdf:type> <foaf:PersonalProfileDocument>)
    (?PPD <dcq:modified> "2004-05-27T04:34:00+01:00")

3) SOURCE: use SOURCE to actually federate/distribute or chain the query over several different RDF sources, some of which might be local and some remote. Some might be specified in the FROM clause as in the cases above, while others might be specified by attaching DAWG-specific RDF properties (e.g. dc:source) and referring to such sources by description in the graph-patterns. The actual query is then run on (possibly) several graphs, one of which might well be the union/merged graph specified in the FROM clause. In other words, the query spans several interconnected graphs, which might reside on different machines/URLs. This has a very *direct* network / Web effect on the query, and clearly allows the user to really join different pieces of RDF sitting on different servers.

This is different from having them all specified in the FROM clause, because that would require a full merge/union/smush of the corresponding graphs. Instead, in this case the FROM clause might not even be specified, and the actual sources on which to run the query are specified in the graph-patterns. This does not require any real merge of the graphs as such, only a simpler streaming chain over which the query is run.
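
The streaming chain can be sketched roughly as follows in Python. This is only an illustration under my own assumptions: each source is an in-memory set of triples standing in for a local or remote backend, variables are strings starting with "?", and match_pattern() and run() are hypothetical names rather than an existing API.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Bindings = Dict[str, str]
Pattern = Tuple[str, Triple]  # (source name, triple pattern)


def is_var(term: str) -> bool:
    return term.startswith("?")


def match_pattern(graph: Set[Triple], pattern: Triple, bindings: Bindings) -> Iterator[Bindings]:
    """Yield extended bindings for every triple of one source that fits."""
    for triple in graph:
        new = dict(bindings)
        for want, got in zip(pattern, triple):
            if is_var(want):
                # Bind the variable, or reject if it is already bound differently.
                if new.setdefault(want, got) != got:
                    break
            elif want != got:
                break
        else:
            yield new


def run(
    sources: Dict[str, Set[Triple]],
    patterns: List[Pattern],
    bindings: Optional[Bindings] = None,
) -> Iterator[Bindings]:
    """Lazily chain the patterns, sending each one to its own source."""
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    (source, pattern), rest = patterns[0], patterns[1:]
    for extended in match_pattern(sources[source], pattern, bindings):
        yield from run(sources, rest, extended)


# Illustrative data standing in for two separate backends.
sources = {
    "foaf": {("ex:alice", "foaf:knows", "ex:bob"), ("ex:bob", "foaf:name", "Bob")},
    "rss": {
        ("ex:item1", "dc:creator", "Bob"),
        ("ex:item1", "rss:title", "Hello"),
        ("ex:item2", "dc:creator", "Carol"),
        ("ex:item2", "rss:title", "Other"),
    },
}
patterns = [
    ("foaf", ("ex:alice", "foaf:knows", "?friend")),
    ("foaf", ("?friend", "foaf:name", "?creator")),
    ("rss", ("?item", "dc:creator", "?creator")),
    ("rss", ("?item", "rss:title", "?title")),
]
results = list(run(sources, patterns))
```

No merged graph is ever built here: each pattern is evaluated only against its own source, and the join happens through the shared variable bindings. Because run() is a generator, a caller can also pull matches one at a time instead of collecting them with list().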
It is in fact possible to fire each independent graph-pattern at a different RDF source and join the results together, or even to connect to each single source, iterate over the graph-patterns, and send each request to the right backend/source/DBC interface to get the next match. Constraints might be either "global" per query or local per graph-pattern, and then pushed down to the specific DB backend if needed.

E.g. we could run the query of the first example, "select the title, link, date and description of the rss:items created by mailto:danbri@w3.org friends", using only SOURCE clauses (note: there is no FROM clause, but one could be specified, and its union/merge graph would then contribute to the triple base along with the other sources)

  SELECT ?title ?link ?description ?date
  WHERE
    (?FOAF <rdf:type> <foaf:PersonalProfileDocument>)
    (?FOAF <dc:source> <http://rdfweb.org/people/danbri/rdfweb/danbri-foaf.rdf>) // connect to the FOAF source (could be rdfs:seeAlso)
    SOURCE ?FOAF (?person <rdf:type> <foaf:Person>)
    SOURCE ?FOAF (?person <foaf:mbox> <mailto:danbri@w3.org>)
    SOURCE ?FOAF (?person <foaf:knows> ?friend)
    SOURCE ?FOAF (?friend <foaf:name> ?creator)
    (?RSS <rdf:type> <rss:channel>) // not relevant
    (?RSS <dc:source> <http://planet.rdfhack.com/index.rdf>) // connect to the RSS source
    SOURCE ?RSS (?item <rdf:type> <rss:item>)
    SOURCE ?RSS (?item <rss:title> ?title)
    SOURCE ?RSS (?item <rss:description> ?description)
    SOURCE ?RSS (?item <rss:link> ?link)
    SOURCE ?RSS (?item <dc:date> ?date)
    SOURCE ?RSS (?item <dc:creator> ?creator)

Each part of the distributed/federated query can of course be run independently and joined at the API/DBC level, but that would require more work and would not allow one to declare how to actually "join" the source information. It is worth noting that this solution would allow the query to be fully streamed over the different sources, and quite efficiently; all it needs is some simple API such as getnextResult() or similar.

A more articulated example might be the following:

E.g.
give me the location, level and temperature from the merge of the meteo and time-series databases

  SELECT ?location ?level ?temperature
  WHERE
    (?db1 <rdf:type> <rdfstore:Source>)
    (?db1 <dc:source> <rdfstore://dnmi.no/weather/temperature>) // connect to RDF db1
    SOURCE ?db1 (?item <rdf:type> <rss:item>)
    SOURCE ?db1 (?item <geo:name> ?location)
    SOURCE ?db1 (?item <meteo:temperature> ?temperature)
    (?db2 <rdf:type> <rdfstore:Sql_Source>)
    (?db2 <dc:source> <jdbc://statkraft.no/water/magazines>) // connect to db2 (SQL one) perhaps mapped with D2R or something
    SOURCE ?db2 (?magazine <rdf:type> <hydro:Time_Series>)
    SOURCE ?db2 (?magazine <hydro:location> ?location)
    SOURCE ?db2 (?magazine <hydro:water_level> ?level)
  AND (?level > 50)
  USING
    geo for <http://www.w3.org/2003/01/geo/wgs84_pos#>
    rss for <http://purl.org/rss/1.0/>
    rdf for <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    meteo for <http://www.dbmi.no/meteorology#>
    hydro for <http://www.statkraft.no/hydrology#>
    dc for <http://purl.org/dc/1.1/>

Does it make sense?

Alberto

[1] http://lists.w3.org/Archives/Public/public-rdf-dawg/2004JulSep/0307.html
[2] http://lists.w3.org/Archives/Public/www-rdf-interest/2004Feb/0209.html

>
> That said, it is the semantic WEB - RDF graphs, as documents can live at
> locations on the web so there is some segmentation-like effect. It does
> give rise to the problems in FROM/SOURCE about repeated triples from
> different sources, and inferencing over the combination.
>
Received on Wednesday, 25 August 2004 11:00:12 UTC