Use case: AR-2: "Federated query"

== Use Case Name

Federated query

== Intent: Task & Roles

Actor/User Agent needs to seamlessly query/access/integrate related
chunks/pieces of data coming from a set of decentralized heterogeneous
sources, and be presented with a unified view over the whole
result-set/data-set.

== Key Benefits / Value

Most existing DBMS query systems are centralized or presuppose
some kind of central authority/control (and constraints) over their
whole database architecture. What is needed over the Web is a system
that allows fully federated queries over a set of distributed and
heterogeneous sources/services/tables. Each source must be fully
decoupled from the others, and must be able to retain its own workflow,
schema and control/authority over its data. Each source only needs to
be interfaced to the data federation through some kind of "proxy
service" which maps its native data format or query language
to a common query/data format, and maps results back and forth as
requested. In other words, with a single query statement, the user can
access and join tables located across multiple data sources without
needing to know the source location.
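
The "proxy service" idea above can be illustrated with a minimal Python
sketch. It is only an illustration under stated assumptions, not a
proposed design: the SourceProxy class, its field_map argument and the
two toy sources (with invented data) are all hypothetical. Each proxy
keeps its native schema private and exposes rows under common field
names.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


class SourceProxy:
    """Hypothetical proxy: exposes one native source under common field names."""

    def __init__(self, name: str, fetch: Callable[[], Iterable[Row]],
                 field_map: Dict[str, str]) -> None:
        self.name = name            # identifier of the source in the federation
        self.fetch = fetch          # source-specific access, opaque to the federation
        self.field_map = field_map  # common field name -> native field name

    def select(self, fields: List[str]) -> List[Row]:
        """Return the requested common fields, translated from native rows."""
        unknown = [f for f in fields if f not in self.field_map]
        if unknown:
            raise KeyError(f"{self.name} does not expose: {unknown}")
        return [{f: native[self.field_map[f]] for f in fields}
                for native in self.fetch()]


# Two toy sources with different native schemas (invented data).
def shop_rows() -> List[Row]:
    return [{"isbn_no": "111", "cost": 12.5}, {"isbn_no": "222", "cost": 30.0}]


def library_rows() -> List[Row]:
    return [{"ISBN": "111", "Title": "Weaving the Web"}]


shop = SourceProxy("shop", shop_rows, {"isbn": "isbn_no", "price": "cost"})
library = SourceProxy("library", library_rows, {"isbn": "ISBN", "title": "Title"})

print(shop.select(["isbn", "price"]))
print(library.select(["isbn", "title"]))
```

Both sources now answer in terms of the same "isbn" field, although
neither had to change its own schema or give up control over its data.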

== Description

The Web itself is a good example of a federated system, providing dynamic,
direct and easy access to many different and heterogeneous
information sources; search engines, image galleries, online travel
agencies, online newspapers, online shops (e.g. Amazon [1]) are
examples. Anybody can easily contribute to the Web by simply writing
a piece of HTML and publishing it at a specific URI. Links
between similar pages can be easily set up without requiring any kind
of centralized control, and with few "integrity constraints" other than
naming "things" in a specific way; images can likewise be inlined
inside pages by simply pointing to their URI. A Web browser
then takes care of aggregating and
assembling the hypertext into a unified view over a set of physically
decentralized pages and related images.

Similarly, most of the dynamic data held in DBMS systems is
available on the Web. Unfortunately, in the process most of the
semantics of the original database fields/tables is lost, and with it
most of the benefits of using a DBMS [2]. Generally only a
limited set of search operations is made available to the end user,
apart from plain free-text search. Web services try to overcome this
problem with a more general XML-based solution, providing the user with
ad-hoc designed APIs that go beyond simple human interpretation of HTML.
On the other hand, this technology has not yet proved general and
flexible enough to solve most of the database federation problems, and
the approach is suited, but limited, to closed/vertical application
domains.

In contrast, RDF provides a more general and powerful framework built
on the Web for the Web - it is expected that people will start to
annotate their pages/services with RDF descriptions, allowing a
third-party application to transparently query/aggregate Web resources.

Despite the large set of solutions available to the user today, what
is needed is a real federated query system which spans several virtual
database tables/resources/services.

The query system must provide a user-friendly syntax and a standard
API/protocol to express query statements over one or more distributed
data sources - data sources might be Web pages, XML documents, DBMS,
ad-hoc Web Services or any RDF metadata source. Each source might
interface to the query federation system in many different ways [3-12].
The query processing engine then has to split the input query into
several sub-queries, one to be run on each system, apply the
constraints, join the results and return them to the user. Each result
must retain its full provenance/source information to
allow the user to pose further queries later on. In the
simplest and most general case the query system will simply provide a
way to SELECT a certain number of fields/tables. Full DML functionality
is better tackled in the original sources using existing DBMS
tools. If any of the sub-queries cannot be run or fails to join in the
main query, an empty result set is returned to the user.
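
The processing steps described above (run one sub-query per source, join
the partial results, keep provenance, return an empty result set on any
failure) can be sketched in Python as follows. This is a simplified
illustration, not a proposed algorithm: the federated_select function,
the "values"/"provenance" result layout and the toy sub-queries are
hypothetical, the join is a naive equi-join on a single shared field,
and every row is assumed to contain that field.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
SubQuery = Callable[[], List[Row]]


def federated_select(sub_queries: Dict[str, SubQuery], join_key: str) -> List[dict]:
    """Run each sub-query, equi-join the results on join_key, keep provenance."""
    results: Optional[List[dict]] = None
    for source, run in sub_queries.items():
        try:
            rows = run()
        except Exception:
            return []  # a sub-query that cannot be run empties the whole result
        # Tag every field with the source it came from.
        tagged = [{"values": dict(r), "provenance": {f: source for f in r}}
                  for r in rows]
        if results is None:
            results = tagged
            continue
        merged = []
        for left in results:
            for right in tagged:
                if left["values"][join_key] == right["values"][join_key]:
                    merged.append({
                        "values": {**left["values"], **right["values"]},
                        # on shared fields the earlier source keeps the credit
                        "provenance": {**right["provenance"], **left["provenance"]},
                    })
        results = merged  # stays empty from here on if nothing joined
    return results or []


# Toy sub-queries standing in for remote sources (invented data).
def shop_query() -> List[Row]:
    return [{"isbn": "111", "price": 12.5}, {"isbn": "222", "price": 30.0}]


def library_query() -> List[Row]:
    return [{"isbn": "111", "title": "Weaving the Web"}]


def broken_query() -> List[Row]:
    raise ConnectionError("source unreachable")


rows = federated_select({"shop": shop_query, "library": library_query}, "isbn")
print(rows)
print(federated_select({"shop": shop_query, "broken": broken_query}, "isbn"))
```

Because each joined row records which source supplied each field, a
follow-up query can be addressed to that source directly.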

== Other

=== Notes

This use case presupposes some extensive/systematic query optimization,
caching and other important technical/technological aspects not
considered here, as well as the need to globally and uniquely
identify/name concepts/objects/relations and tables to make the model
truly federated (e.g. definition of a URI/URN scheme and resolution
protocol). In relation to the DAWG work we are mostly interested
in the data access/query syntax/protocol rather than the
technical/architectural choices which a system designer/implementor
would need to consider/adhere to.

=== Applicability/Scale

Real-time data, Legacy data/services, External services

=== Related systems/cases

RDF Access to Relational Databases -  
http://www.w3.org/2003/01/21-RDF-RDB-access/

== References

[1] http://www.amazon.com
[2] http://www.igd.fhg.de/archive/1995_www95/proceedings/papers/54/darm.html
[3] http://rdfweb.org/2002/02/java/squish2sql/intro.html
[4] http://www.wiwiss.fu-berlin.de/suhl/bizer/d2rmap/D2Rmap.htm
[5] http://kaon.semanticweb.org/alphaworld/reverse/view
[6] http://www.w3.org/2000/10/swap/dbork/dbview.py
[7] http://www.openlinksw.com/virtuoso/
[8] http://www.picdiary.com/triplequerying/
[9] http://iconocla.st/~sderle/squish.pl
[10] http://www.w3.org/2001/sw/Europe/reports/scalable_rdbms_mapping_report/
[11] http://www.w3.org/2002/02/21-WSDL-RDF-mapping/
[12] http://www.w3.org/DesignIssues/RDB-RDF.html

Received on Wednesday, 17 March 2004 15:42:02 UTC