RE: ACTION: discuss & promote union query (Was: ACTION: a replacement for 4.5 focussed on union query) from Rob Shearer on 2004-08-24 (public-rdf-dawg@w3.org from July to September 2004)

From: Rob Shearer <Rob.Shearer@networkinference.com>
Date: Tue, 24 Aug 2004 09:24:14 -0700
To: "Simon Raboczi" <raboczi@tucanatech.com>, "RDF Data Access Working Group" <public-rdf-dawg@w3.org>
Message-ID: <CFE388CECDDB1E43AB1F60136BEB497302812E@rome.ad.networkinference.com>
> [[
> 4.5 Querying multiple sources
> 
> It should be possible for a query to specify which of the available RDF
> graphs it is to be executed against.  If more than one RDF graph is
> specified, the result is as if the query had been executed against the
> merge[1] of the specified RDF graphs.

I continue to feel that this feature and, beyond the objective itself, the
specific approach taken by BRQL do more harm than good. Unlike SQL
data, RDF is not segmented into tables and thus within a particular
server there is absolutely no need to target a particular "piece" of
RDF. It makes sense to me for source selection to be performed at a
different level than the query language, such as the network protocol.

More importantly, the ability to aggregate graphs seems quite orthogonal
to the ability to query an RDF graph. RDF aggregators will undoubtedly
be important RDF applications, but in general I expect far more servers
to only support queries against their one and only RDF store. Adding the
feature can only hurt the development of truly general aggregation
functionality.

> Some services only offer to query one graph; they are considered to 
> trivially satisfy this objective.

This would be a rather absurd limit to the functionality of that
feature; what we're standardizing here is an optional feature which is
by default completely unimplemented (but conformant) and in the vast
majority of cases will only be partially supported (because it will only
allow certain things to be aggregated).

An aggregator is an application in its own right. Why are we
standardizing the functionality of aggregators? The whole point of RDF
is that you can merge two graphs and the result is a single graph. Let
RDF be RDF. Let aggregators be aggregators. Let a query language for RDF
query RDF, and not some more complex data structure which partitions
data in ways RDF does not.
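
The point that merging two graphs yields a single graph can be sketched
in a few lines. This is a minimal illustration, not an implementation of
any real store: triples are plain Python tuples, the ex: and foaf: names
are made-up sample data, and blank nodes (which a real RDF merge must
rename apart) are assumed absent.

```python
# Graphs are modeled as sets of (subject, predicate, object) tuples.
# Assumption: ground triples only, so the merge is a plain set union.
def merge(*graphs):
    """Return the merge of several graphs as one set of triples."""
    merged = set()
    for graph in graphs:
        merged |= set(graph)
    return merged


def match(graph, pattern):
    """Yield variable bindings for one triple pattern; variables start with '?'."""
    for triple in graph:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


# Hypothetical sample data held by two separate stores.
store_a = {("ex:alice", "foaf:knows", "ex:bob")}
store_b = {("ex:bob", "foaf:knows", "ex:carol")}

merged = merge(store_a, store_b)
results = sorted(b["?who"] for b in match(merged, ("?who", "foaf:knows", "?other")))
```

Note that match() takes a single graph and knows nothing about where its
triples came from; the aggregation happens entirely outside the matcher.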

> While a variety of use cases motivate this feature, it is not a 
> requirement because it is not clear whether this feature can be 
> implemented in a generally scalable fashion.
> ]]
> 
> Much as Requirement 3.1 "RDF Graph Pattern Matching -- Conjunction" and
> 3.13 "RDF Graph Pattern Matching -- Disjunction" each introduce a
> single operator (conjunction and disjunction respectively) into the
> WHERE clause, this proposal would introduce a set union/graph merge
> operator into the SOURCE/FROM clause.  (The current BRQL grammar[2] in
> fact already covers this -- the SOURCE/FROM clause can take a list of
> documents to be merged.)
> 
> The argument I'm about to make in favor of multiple sources is that 
> it's going to make the query model simpler rather than more 
> complicated.  This is because 4.5 has the power to satisfy several 
> other requirements and objectives simultaneously.

I disagree with this contention. The ability to aggregate graphs is
orthogonal to the ability to supplement a single graph with
query-processor-specific supplementary arcs.

>  The simplifying
> principle is that we should never need to deal with anything that isn't
> a graph.  When we do this, we have to add new grammar and query modeling
> to deal with these non-graph entities.   Rather, make everything the
> query language needs to deal with into a graph that the WHERE clause
> can deal with.
> 
> These are some of the other requirements and objectives we could 
> satisfy purely by defining graphs and querying merges of these graphs 
> with the base facts, rather than by adding grammar:
> 
> 
> * 3.3 "Extensible Value Testing"
> 
>    A monadic domain-specific function can be represented as a property
> taking its argument as the subject and returning its result as the
> object.  Graph patterns can then be used to evaluate the function or 
> its inverse.  For example, the graph pattern { ?angle trig:cosine 
> "0.5"^^xsd:double } could bind ?angle to "60"^^trig:degrees and 
> "300"^^trig:degrees.  Conceptually a trigonometry library is just a 
> graph containing an infinite number of triples (including { 
> "60"^^trig:degrees trig:cosine "0.5"^^xsd:double } and { 
> "300"^^trig:degrees trig:cosine "0.5"^^xsd:double }).  In practice, 
> constraints resolved against the "infinite" graph produce finite 
> variable bindings by algorithmic means rather than by consulting a 
> store.  Note that absolutely no special case grammatical support is
> required -- extensibility is just a matter of the graph that represents
> the extended function being made available to the query service.  The
> query processor knows which extensions are required by a query because
> the graph which implements the extension appears explicitly in the
> SOURCE/FROM clause.

But using SOURCE/FROM is almost certainly NOT what you'd want to do--you
don't want to aggregate your particular RDF graph with some infinite
graph you grab from somewhere. That infinite graph can never be
realized. You'd need to add special functionality to your query
processor to mimic its consequences.

Expressing such value tests as triples is certainly appealing, but its
appeal lies in its simplicity for the language's formal model and for
the syntax. The implementation doesn't get any easier and you don't get
this feature "for free" just by being able to aggregate graphs.
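
To make concrete what "special functionality" means here, the following
is a rough sketch of how a processor might resolve the quoted
{ ?angle trig:cosine "0.5"^^xsd:double } pattern by computation. The
function name, the use of plain floats for the typed literals, and the
restriction of angles to [0, 360) degrees are all assumptions made for
illustration.

```python
import math


def cosine_bindings(angle, cos_value):
    """Resolve { angle trig:cosine cos_value } without any stored triples.

    Each argument is a float, or None for an unbound variable.
    Returns a list of (angle_in_degrees, cosine) pairs.
    """
    if angle is not None:
        computed = math.cos(math.radians(angle))
        if cos_value is None:
            return [(angle, computed)]
        # Both sides bound: this is a test, so compare with a tolerance.
        matches = math.isclose(computed, cos_value, abs_tol=1e-9)
        return [(angle, cos_value)] if matches else []
    if cos_value is not None:
        if not -1.0 <= cos_value <= 1.0:
            return []
        base = math.degrees(math.acos(cos_value))
        # Assumption: only angles in [0, 360) are reported.
        candidates = sorted({round(base, 9), round((360.0 - base) % 360.0, 9)})
        return [(a, cos_value) for a in candidates]
    # Neither side is bound: infinitely many triples would match.
    raise ValueError("underconstrained: both ?angle and ?cos are unbound")


angles = [a for a, _ in cosine_bindings(None, 0.5)]
```

Every line of this is code that has to ship inside the query processor;
none of it falls out of being able to take the union of two graphs.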

>    One thing we do have to deal with once we introduce graphs of 
> infinite size is safety -- the possibility that a query might not be 
> constrained to a finite number of variable bindings.  For example, the
> constraint { ?angle trig:cosine ?cos } is unsafe and unable to be
> converted into a finite set of variable bindings.  What will normally
> happen during query resolution is that some of the variables in the
> unsafe constraint will become bound by other constraints, reducing the
> unsafe constraint to a safe form.  If this doesn't occur, I think it'd
> be quite acceptable for the query processor to simply tell the user 
> that the query is underconstrained.
> 
>    Dyadic and higher functions are admittedly less pleasant to deal
> with, although there are solutions (currying[4], or constructing topic
> map-style associations within the query, spring to mind as
> possibilities).

The problems you bring up are issues we've faced. Data values and
datatypes are fundamentally hard problems in RDF because RDF is so poor
in the specifics of these things. Either you allow too little (OWL makes
it hard to use simple ranges of numbers) or too much (full XML Schema
datatypes are arbitrarily difficult to reason about).

> * 3.7 "Limited Datatype Support"
> 
>    Datatype support can be almost entirely considered as a kind of
> extensible value testing.  Datatypes require the following functions to
> be defined[3]:
> 
>    - the membership of its lexical space
>    - the membership of its value space
>    - the lexical-to-value mapping
>    - domain-specific functions (e.g. signum, length)
> 
>    So our limited support for XSD could notionally be a graph asserting
> an infinite number of triples, including the following:
> 
>    xsd:double          x:lexicalMember  "3.14"              # lexical space
>    xsd:double          x:valueMember    "3.14"^^xsd:double  # value space
>    "3.14"^^xsd:double  x:lexicalForm    "3.14"              # L2V mapping
>    "3"^^xsd:integer    x:lessThan       "8"^^xsd:integer    # domain-specific
>    "3.14"^^xsd:double  x:signum         "1"^^xsd:integer    # domain-specific
> 
>    The separate AND clause with its own grammar in BRQL has always 
> bugged me.  Datatyping constraints make perfect sense as first-class 
> citizens in the WHERE clause -- the predicate ought to be enough to 
> distinguish whether a constraint needs to be resolved from the triple 
> store or the datatype processor.
> 
>    Note that to make this work, the graph has to permit literals as 
> subjects.  (Can someone explain to me why normal RDF graphs don't 
> permit this?  I've never seen an explanation of this restriction.)

Again, simple aggregation doesn't get you any of this. If you want a
spiffy "values as graphs" syntax, then you can do it without aggregation
and answer queries under the assumption that your original graph was
supplemented by all these assertions. Common graph syntax is attractive,
but I can't imagine that demanding a user write aggregation clauses
every time they want to test an integer can be viewed as a feature.
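
As a rough sketch of answering queries "under the assumption that your
original graph was supplemented": the processor can dispatch on the
predicate, with no aggregation clause in the query at all. The BUILTINS
table, the ex: sample data, and the use of Python ints for xsd:integer
literals are assumptions for illustration; x:lessThan is the name used
in the quoted example.

```python
import operator

# Hypothetical table of value predicates the processor evaluates itself.
BUILTINS = {
    "x:lessThan": operator.lt,
}


def holds(store, subject, predicate, obj):
    """Check one fully bound triple against the store plus implicit built-ins."""
    test = BUILTINS.get(predicate)
    if test is not None:
        return test(subject, obj)  # computed on demand, never stored
    return (subject, predicate, obj) in store


def quantities_below(store, limit):
    """Subjects whose ex:quantity value v satisfies { v x:lessThan limit }."""
    return sorted(
        s
        for (s, p, o) in store
        if p == "ex:quantity" and holds(store, o, "x:lessThan", limit)
    )


# Made-up sample data; xsd:integer literals are modeled as Python ints.
store = {
    ("ex:order1", "ex:quantity", 3),
    ("ex:order2", "ex:quantity", 12),
}
small_orders = quantities_below(store, 8)
```

The user writes an ordinary triple pattern; whether it is answered from
the store or by the datatype code is decided by the predicate alone.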

> * 4.8 "Literal Search"
> 
>    Like datatype support, literal search can just be a specific instance
> of extensible value testing.  Provide a graph that defines the
> substring predicate on plain literals:
> 
>    "cat" x:substring "c"
>    "cat" x:substring "a"
>    "cat" x:substring "t"
>    "cat" x:substring "ca"
>    "cat" x:substring "at"
>    "cat" x:substring "cat"
>    ... etc ...
> 
>    It would seem most convenient to include these triples as part of the
> same graph that provides the limited XSD support, forming something
> similar to the "standard library" in a programming language.

I'm a broken record: you can't get this functionality by simply
aggregating the graphs, because the graphs are infinite.
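
For illustration, here is a small sketch of the substring case. Once the
subject literal is known, the matching objects are finite and cheap to
compute, whereas the graph covering every possible literal can never be
written down. The function names are hypothetical, and treating only
non-empty substrings as matches is an assumption.

```python
def substring_objects(literal):
    """Enumerate every non-empty o such that { literal x:substring o } holds."""
    return {
        literal[i:j]
        for i in range(len(literal))
        for j in range(i + 1, len(literal) + 1)
    }


def substring_holds(literal, candidate):
    """Test one fully bound { literal x:substring candidate } pattern."""
    return candidate != "" and candidate in literal


cat_substrings = substring_objects("cat")
```

For "cat" this yields the six triples listed in the quoted example, but
they are generated per query rather than fetched from a merged graph.
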
Received on Tuesday, 24 August 2004 16:27:01 UTC