[Fwd: SPARQL: SOURCE is suboptimal]

This is not an editorial comment.

http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2004Nov/0020.html

The current sections 8 and 9 are a big change from the working draft in terms of 
describing FROM and SOURCE.  I try to avoid the details, which have changed or 
been clarified, and stick to extracting the high-level issues for DAWG as I see 
them.

This is part 1 - discussion of the note.

To follow: a proposal.

	Andy

Tim Berners-Lee wrote:
   > Reading the draft of 2004-10-13
   >
   > The current specification of SOURCE assumes a particular sort of
   > application, which will not necessarily be more common than any other.
   > As a result, SPARQL as a query language lacks the flexibility to do the
   > general job of giving or querying metadata about the source of
   > information.
   >
   > SOURCE and FROM are muddled, and bite off part of  a general question
   > without solving it in general.

First issue: assembling graphs

   > . Behind the SOURCE feature is the implicit notion that the database
   > being queried is a conjunction of graphs each corresponding to web
   > resources.  The concept of the graph itself is not surfaced, but the
   > URI of the graph is the thing bound to. Meanwhile, servers have the
   > option of ignoring that structure and ignoring the binding of the
   > SOURCE variable.   This seems to me fuzzy.
   >
   > In fact, the database being queried may be generated in many ways, in
   > particular a triple may have arisen from a combination of triples in
   > different databases.
   >
   > Random example 0:
   >
   > foo.rdf:       mary  foaf:phone   1234.
   > bar.rdf:	    mary   owl:sameAs   maryJ
   >
   > query includes
   >
   > 		 SOURCE  ?s  {  maryJ   foaf:phone   ?y }.
   >
   > The natural result is to bind s to  a bnode expressing the virtual
   > graph which was formed

This can be tested by what happens when the query says:

     SELECT ?s WHERE ...

or

     CONSTRUCT (?s :p :q) WHERE ...

   >
   >
   > 	<foo.rdf>    log:semantics   ?f.
   >         <bar.rdf>    log:semantics   ?g.
   >        ( ?f ?g  )     log:conjunction  ?h.
   >          ?h            owldl:closure     ?s.
   >
   > There are a lot of combinations possible here of course, and many
   > complex things which will happen in the future.

I read the issue as being the dynamic construction of the graph in ?s.  So far,
SPARQL has said that a query is over an RDF graph or the merge of named graphs -
how the graphs came about is outside the query language.  In particular, the OWL
closure of some graph with URI <g> is some other graph (with a different URI) and
that would be the data graph.

The SPARQL FROM clause, and the underlying model of graphs, gives one, and only 
one, way of having constructed graphs.  The case being made is that there are 
others.  It is also arguing for dynamic creation of the (virtual) graphs.

   >
   > That sort of graph could be returned in the query.

That would seem to be making a SPARQL service rather more of a general
computation.  We might enable such things but I don't think we can hope to
define such, whatever syntax is used.  This seems to be on the boundary with
rules anyway and a solution would need to work in that domain as well.

   >  It could also be
   > sent with the query to describe what has to be done.  If you like, it
   > is a clear RDF expression of the sort of thing which will otherwise get
   > relegated to more and more complex non-RDF syntax or server command
   > line out of band forms.
   >

Second issue: merge of graphs implies trust, so the untrusted use cases can't be done

   >
   > There is an assumption, in the SOURCE feature, that when multiple
   > graphs exist, then  they are all believed.  This is IMHO a major and
   > quite unnecessary flaw.  Many systems will need to be distrustful of
   > most data.  So I'd like to be able to use the SOURCE feature, which
   > overlaps with the FROM feature, so that *either* one is talking about
   > explicitly mentioned resources as the source to be queried, *OR* there
   > is a default knowledge base for the service.
   >
   > When both are used, then the default KB can be a meta-kb which allows
   > the kbs being processed to be constrained and defined.
   >
   > The feature of returning NULL but continuing should be dropped. The
   > whole idea of having things continue when a requested feature wasn't
   > implemented is, I think, asking for interoperability problems.
   >
   > One way to clean it up is to require that a SOURCE variable be bound
   > elsewhere.  This would mean that the set of resources which are queried
   > becomes explicit.
   > Otherwise we have added two implicit things to the SPARQL service --
   > the implicit set of sources and the implicit kb.

FROM <u1> <u2> ... currently means that all the triples from the two graphs get
asserted into the merged graph.  This does two things: it means patterns that
match by spanning triples in different graphs will work, but it also means that
all the statements are asserted as "true" in the data graph.  We should think 
about this.
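
To make the two effects concrete, here is a rough sketch in Python.  It is not
SPARQL and not any implementation's behaviour: graphs are modelled as plain
sets of (subject, predicate, object) tuples, the merge is a set union (there
are no blank nodes to rename here), and the match() helper and the two-pattern
query are made up for illustration.  The data is the foo.rdf / bar.rdf pair
from Random example 0.

```python
# Hypothetical model: a graph is a set of (subject, predicate, object) tuples.
foo = {("mary", "foaf:phone", "1234")}      # stands in for foo.rdf
bar = {("mary", "owl:sameAs", "maryJ")}     # stands in for bar.rdf


def match(graph, patterns, binding=None):
    """Yield variable bindings for a list of triple patterns.

    Terms starting with '?' are variables; everything else must match exactly.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(graph, rest, candidate)


# A pattern that needs one triple from each source graph.
patterns = [("?x", "foaf:phone", "?y"), ("?x", "owl:sameAs", "?z")]

merged = foo | bar  # roughly what FROM <foo.rdf> <bar.rdf> gives in this model

spanning = list(match(merged, patterns))   # matches: the pattern spans both graphs
only_foo = list(match(foo, patterns))      # no match in foo.rdf alone
only_bar = list(match(bar, patterns))      # no match in bar.rdf alone
```

In this model the merged set no longer records which file a triple came from,
and every triple in it is treated as asserted.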

   >
   > Random Example 1:
   >
   > SELECT ?x ,...
   > WHERE
   >              ?y   roogle:search    "Mary".
   >              SOURCE ?y      {  ?x  firstName "Mary" ...
   >
   > So the default KB is defined for this server to know about
   > roogle:search which relates documents which contain strings to those
   > strings.

What is wanted is no automatic connection between the default graph and the
named graphs.  In other words, no automatic merge.  Seems reasonable.

The query is over a default (unnamed) graph, accessed by patterns (?x ?y ?z),
and also a collection of named graphs, accessed by SOURCE (log:includes in cwm).
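
A rough Python sketch of that model, following the shape of Random Example 2.
Everything here is an assumption for illustration: the names match_one() and
source(), the example.org URIs, and the data are invented, and SOURCE is
approximated as "match the pattern inside one named graph at a time".

```python
def match_one(graph, pattern, binding):
    """Yield extended bindings for one triple pattern against one graph."""
    for triple in graph:
        candidate = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield candidate


# Trusted default (unnamed) graph: says where each profile lives, nothing more.
default_graph = {
    ("alice", "foaf:personalProfile", "http://example.org/alice.rdf"),
    ("bob", "foaf:personalProfile", "http://example.org/bob.rdf"),
}

# Untrusted named graphs, kept separate: no automatic merge with the default.
named_graphs = {
    "http://example.org/alice.rdf": {("alice", "diet:preference", "vegetarian")},
    # bob's profile makes a claim about alice, not about bob.
    "http://example.org/bob.rdf": {("alice", "diet:preference", "carnivore")},
}


def source(var, pattern, binding):
    """Approximate SOURCE var { pattern }: match inside named graphs only."""
    for name, graph in named_graphs.items():
        if binding.get(var, name) != name:
            continue  # var is already bound to a different graph
        yield from match_one(graph, pattern, {**binding, var: name})


# ?x foaf:personalProfile ?p .  SOURCE ?p { ?x diet:preference ?z }
results = [
    b2
    for b1 in match_one(default_graph, ("?x", "foaf:personalProfile", "?p"), {})
    for b2 in source("?p", ("?x", "diet:preference", "?z"), b1)
]

# A plain pattern sees only the default graph, so the untrusted claims are invisible.
plain = list(match_one(default_graph, ("?x", "diet:preference", "?z"), {}))
```

Only alice's own profile contributes a diet:preference for alice; the claim in
bob.rdf is never consulted because ?p is already bound from the trusted graph.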

The people with systems that address this might like to comment on the utility
of querying over all the graphs uniformly.

   >
   > Random Example 2:
   >
   >
   > SELECT ?x, ...
   > WHERE
   >        ?x    rdf:type QualifiedIndividual.
   >        ?x    address:countrycode "fr".
   >        ...
   >        ?x    foaf:personalProfile  ?p.
   >        SOURCE ?p       { ?x diet:preference  ?z   }
   >       ...
   >
   > Here the main database is trusted. The mass of FOAF out there isn't.
   > Just for one item, the query tests the person's personal profile to see
   > whether they declare themselves a vegetarian.  The bulk of the query is
   > on a trusted database, and by default only that database is trusted.
   > This is an application where the idea that all known graphs are trusted
   > by default breaks.
   >
   > Conclusion:
   >
   > The current specification of SOURCE assumes a particular sort of
   > application, which will not necessarily be more common than any other.
   > As a result, SPARQL as a query language lacks the flexibility to do the
   > general job of giving or querying metadata about the source of
   > information.
   >
   > A better solution is to use RDF graphs for the metadata in query
   > and/or in the returned information.

There is a usage pattern of collecting RDF from a number of data sources,
creating the aggregation (loose sense) and querying that.  In Enterprise
situations, data sources are often trusted because they are part of the
enterprise or come from reputable providers.

For better or worse, aggregation is also used internally in systems: data is
received from one place, extra material is added or another source is used to
augment the original, and the whole thing is "published" by query, but the
provider wishes to keep the different pieces separate for data management reasons.

As such, all these use cases need to be considered in any solution.

Received on Thursday, 25 November 2004 16:02:22 UTC