Re: SOURCE - Choosing what to query and querying the origin of statements from Seaborne, Andy on 2004-11-09 (public-rdf-dawg@w3.org from October to December 2004)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Tue, 09 Nov 2004 17:21:30 +0000
To: RDF Data Access Working Group <public-rdf-dawg@w3.org>, Dave Beckett <dave.beckett@bristol.ac.uk>
Message-ID: <4190FC9A.3070708@hp.com>
Dave - I think this is a good replacement for what we have in sections 8 and 9. 
I'd like to use this as the basis for further discussions if everyone thinks it 
is in the right sort of place because we can work on the exact definitions.

	Andy

Comments and questions:


My understanding is that the cases we wish to handle include:

1/ Query over a collection of graphs
2/ Query over an RDF merge of graphs
3/ Query over a single graph (the easy case!) which may or may not be named.



Dave Beckett wrote:
> Here is a rough proposal to update to 
>   http://www.w3.org/TR/2004/WD-rdf-sparql-query-20041012/
> sections 8 and 9 with respect to the SOURCE issue.
> 
> There are plenty of issues related to this discussed already but
> let's see how this does.
> 
> A quick comparison to named graphs/named containers/earlier work
> - No access to individual named graphs (i.e. no SOURCE <uri>)

Could you explain this point - I don't understand why this makes it easier.

Can't this be indirectly achieved by

SELECT * WHERE
      (<blah> :p ?src)   // i.e anything that matches and binds ?src
      SOURCE ?src ....

> - It does imply dynamic RDF-merging, but it could be de-emphasised
>   that it's not required at run time, but the result must be as-if
>   that had been done.
> - No bnode graph names (issue)
> - Left out DISTINCT for now, that's a result thing

I thought the debate had established that the combination of graph union (no
bNode label overlaps) + distinct makes a query over a collection of graphs
and a query over the RDF merge have the same results.
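
As a rough illustration of that claim (a sketch only, not part of the proposal):
the Python below assumes triples are plain tuples, that bNode labels are already
disjoint across the graphs, and that the query is a single triple pattern.  The
helper names and the ex:/dc: triple are made up for the example.

```python
def match(graph, pattern):
    """Return the triples matching one pattern; None acts as a wildcard."""
    return [t for t in graph
            if all(p is None or p == x for p, x in zip(pattern, t))]

def query_collection(graphs, pattern):
    """Query each graph separately and concatenate (duplicates are kept)."""
    results = []
    for g in graphs:
        results.extend(match(g, pattern))
    return results

def query_merge(graphs, pattern):
    """Query the union; with disjoint bNode labels this is the RDF merge."""
    return match(set().union(*graphs), pattern)

def distinct(results):
    """Drop duplicate solutions, as DISTINCT would."""
    return sorted(set(results))

# Two graphs with disjoint bNode labels and one ground triple in common.
g1 = {("_:a1", "foaf:mbox", "mailto:alice@work.example"),
      ("ex:doc", "dc:creator", "Alice")}
g2 = {("_:b1", "foaf:mbox", "mailto:bob@work.example"),
      ("ex:doc", "dc:creator", "Alice")}

everything = (None, None, None)
per_graph = query_collection([g1, g2], everything)
merged = query_merge([g1, g2], everything)
```

The shared triple appears twice in per_graph and once in merged; after
distinct() the two result sets are the same.  This only covers the
single-pattern case; it says nothing about patterns that join across graphs.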

> 
> Dave
> 
> 
> -----
> 
> 
> 8 Choosing What to Query
> 
> A SPARQL query is against a single RDF *Query Graph*.

Minor point but "query graph" might be misinterpreted as a graph that is the
query itself.  I know that isn't what your definition says but the name does
suggest it.  [In the telecon you suggested "data graph" which avoids the 
potential confusion]


>  This graph may
> be constructed through logical inference, and never materialized.  It
> can be arbitrarily large or infinite.  The Query Graph is a virtual
> RDF-merge operation over a set of *RDF Graphs*:
> 
>   Definition: Query Graph
> 
>   Given a set of RDF Graphs {RG1, ..., RGn}, the Query Graph QG is
>   an RDF graph formed from the RDF-merge of the set {RG1, ..., RGn}.
> 
>   All of the graphs RG1...RGn have *Graph Names* GN1...GNn which are
>   URI References (URIrefs) 
> 
> where RDF-merge is defined in RDF Semantics 0.3 Graph Definitions
>   http://www.w3.org/TR/2004/REC-rdf-mt-20040210/#graphdefs 

Can the data graph have triples in it which are not part of any subgraph?  The 
definition says "no" - it's important to know as it affects where provenance 
information might go.

Does this definition make it the same as a single RDF graph where the triples 
have zero or more labels, so that there is a series of overlapping regions in 
the graph identified by label?  There is only one graph (so maybe no issues 
about bNodes).

If this is so, then we might be able to frame the thing more simply and, in 
particular, only talk about querying one graph throughout the doc.  It means 
that the rest of the doc does not need to change based on changes to the SOURCE 
section because everything is in terms of one graph.  I think it is the same as 
you are proposing - it is something Eric mentioned a while back.

bNodes become possible because it is well defined as to which graph they are
associated with - the data graph of which there is only one.
This just leaves use case 1) where there isn't a single graph underneath - but 
as it is defined as a merge (alternatively a union but keeping the bNodes apart) 
I think it's all right.
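
To show what I mean by "one graph with labelled regions", here is a minimal
Python sketch.  It is an assumption about how such a structure could look, not
anything in the proposal: each triple maps to the set of graph names that
contain it, and a triple may have no label at all.  The function names and the
ex:bob data are hypothetical.

```python
from collections import defaultdict

def build_labelled_graph(named_graphs, unlabelled=()):
    """Map each triple to the set of graph names (labels) it belongs to."""
    labels = defaultdict(set)
    for name, triples in named_graphs.items():
        for triple in triples:
            labels[triple].add(name)
    for triple in unlabelled:
        labels.setdefault(triple, set())  # in the graph, but in no region
    return labels

def source_bindings(labels, pattern):
    """Yield (graph_name, triple) for every label of every matching triple."""
    for triple, names in labels.items():
        if all(p is None or p == x for p, x in zip(pattern, triple)):
            for name in sorted(names):
                yield name, triple

shared = ("ex:bob", "foaf:mbox", "mailto:bob@work.example")
graph = build_labelled_graph(
    {"<g1>": [shared, ("ex:bob", "foaf:age", 32)],
     "<g2>": [shared]},
    unlabelled=[("ex:bob", "foaf:name", "Bob")])

mbox_sources = list(source_bindings(graph, ("ex:bob", "foaf:mbox", None)))
name_sources = list(source_bindings(graph, (None, "foaf:name", None)))
```

Here the mbox triple lies in two overlapping regions, so it yields two source
bindings, while the unlabelled name triple is in the data graph but yields
none.  Whether unlabelled triples should be allowed is exactly the question
above.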


> 
> 
> The Query Graph can be defined in the following ways:
> 
> 1) In the SPARQL query language using the FROM clause
> 
>    See below.
> 
> 2) By the SPARQL protocol
> 
>    ISSUE: Depends on protocol doc.  Probably works by giving the set
>    of URIrefs of the graphs? or giving a URIref for the query graph?
>    or query service?
> 
> 3) Against a default query graph if neither of 1) or 2) are given.
>    This is application-specific.
> 

I would like the query context to override what the query might say - the use
case is that the query might identify the target but (e.g. testing, caching,
redeploying an application on a different dataset) the writer wants to execute
the query on a different target.

This is useful in the test suite as well, as the query may go to a locally
cached copy of the graph.
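
The precedence I have in mind could be sketched like this in Python.  The
function name, parameter names and the default graph URI are all hypothetical;
the only point is the order: query context first, then FROM, then the
application default.

```python
def resolve_target(from_uris, context_graphs=None,
                   default_graphs=("urn:example:default-graph",)):
    """Pick the graphs to query: context overrides FROM, FROM overrides default."""
    if context_graphs:          # e.g. supplied by a test harness or the protocol
        return list(context_graphs)
    if from_uris:               # what the query text itself asked for
        return list(from_uris)
    return list(default_graphs)

rss = "http://www.w3.org/2000/08/w3c-synd/home.rss"

# Normal execution: the FROM clause is honoured.
normal = resolve_target([rss])

# Test suite: the same query is redirected to a locally cached copy.
cached = resolve_target([rss], context_graphs=["file:cache/home.rss"])

# Neither given: fall back to the application-specific default.
fallback = resolve_target([])
```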

> 
> In the SPARQL query language the FROM clause can specify the set of
> graphs by either giving their names or giving the URIs for a resource
> that can be used to retrieve the graph.
> 
> (Q8.1) The query
> 
>   SELECT *
>   FROM <http://www.w3.org/2000/08/w3c-synd/home.rss>
>   WHERE ( ?x ?y ?z )
> 
> creates a Query Graph by using the resource at URI 
>   http://www.w3.org/2000/08/w3c-synd/home.rss
> to provide RDF triples, making an RDF graph RG1.  Graph RG1 is named
> by the URI and constructs a query graph from the set {RG1}.

I thought the discussions on FROM had got to the point where it is more of a
hint ("use this graph"), not a loading operation.

Could you say some more about "creates a graph"?

Some systems either have the graph or don't. Would that normally be handled by
the query context, not using FROM, NAMED etc?

> 
> 
> (Q8.2) The query
> 
>   SELECT *
>   FROM <http://www.w3.org/2000/08/w3c-synd/home.rss> NAMED <http://example.org/>
>   WHERE ( ?x ?y ?z )
> 
> Constructs the same query graph but names the graph RG1 <http://example.org/>

I thought the need for alternative names for graphs arose because they could be
read in more than once (at different times and hence have different triples).

If so, I don't see how the NAMED modifier changes the proposal as the subgraph 
can only have one FROM name and one NAMED name so why not the same?  Have you an 
example of its use?

(although, oddly, the NAMED URI looks rather like a ddname in IBM JCL! Sort 
of names instead of numbers for file descriptors in UNIX).

> 
> (Q8.3) The query
> 
>   SELECT *
>   FROM NAMED <http://example.org/>
>   WHERE ( ?x ?y ?z )
> 
> Creates a query graph from a set of 1 graph named <http://example.org/>
> The URI here is not for resource retrieval.

If FROM has the weaker hint meaning then

    FROM <http://example.org/>

that is, some local name for <http://www.w3.org/2000/08/w3c-synd/home.rss>
and the form

   FROM <http://www.w3.org/2000/08/w3c-synd/home.rss>

seem to contain the same information - one name, one graph.  The FROM does not 
allow the same FROM URI to be read at different times as I understand the proposal.

There can't really be a confusion over the names because <http://example.org/>
is just another URI for home.rss especially if FROM is more hint-like.

> 
> 
> When multiple graphs are given in FROM, the RDF-merge of the set of
> graphs is performed to create the query graph.
> 
> The query
> (Q8.4)
>   SELECT *
>   FROM <uri1>, <uri2>
>   WHERE ( ?x ?y ?z )
> 
> creates a query graph from the RDF-merge from the set of graphs {RG1, RG2} 
> where
>   RG1 is the RDF graph formed by retrieving the resource at uri1 and
>     named uri1
>   RG2 is the RDF graph formed by retrieving the resource at uri2 and
>     named uri2
> 
> 
> A SPARQL implementation MAY not support graph names in which case the
> queries that use only the NAMED keyword will fail - Q8.3
> 
> 
>   Possible extension:
> 
>   Allow graphs with a local name (blank node label)
> 
>   (Q8.5)
>     SELECT *
>     FROM NAMED _:a, NAMED _:b
>     WHERE ( ?x ?y ?z )
> 
>   rather than relying on the application-specific choice 3) above.
> 
>   However details below would have to be changed to forbid returning
>   the blank nodes of the names in results.

Could you say why the returning of bNodes is banned here but isn't in other 
results (or, strictly, not returning bNodes but returning the encoding with 
document-wide bNode ids)?

Even in the XML form, we'll need <var bnodeId="foo"/> so that two results can be 
distinguished (the graph had different bNodes so the results don't collapse the 
set of results).

[Aside: note to editor - write that special section on bNodes 2.5]

> 
> 
> 9 Querying the Origin of Statements
> 
> While the RDF data model is limited to expressing triples with a
> subject, predicate and object, many RDF data stores augment this with
> a notion of the source of each triple.  Typically, implementations
> associate RDF triples or graphs with a URI specifying their real or
> virtual origin.  The SOURCE keyword allows you to query or constrain
> the source of the following triple pattern or nested graph
> pattern. The general form of the SOURCE query is:
> 
>  SOURCE ?var (?s ?p ?o)
> 
> When SOURCE ?var is given before a triple, the variable will be bound
> to all of the known *Graph Names* for that triple.  A data store that
> does not support graph names SHOULD provide no binding for the SOURCE
> variables.

but the (?s ?p ?o) still matches?

We could require all graphs be named - it does not seem a burden to have a URI 
for a graph at all times even if locally generated.

> 
>   D9.1 Data:
> 
>   Graph G1 named <aliceFoaf.n3>
>   @prefix  foaf:  <http://xmlns.com/foaf/0.1/> .
> 
>   _:1 foaf:mbox <mailto:alice@work.example>.
>   _:1 foaf:knows _:2.
>   _:2 foaf:mbox <mailto:bob@work.example>.
>   _:2 foaf:age 32.
> 
>   Graph G2 named <bobFoaf.n3>
>   @prefix  foaf:  <http://xmlns.com/foaf/0.1/> .
> 
>   _:1 foaf:mbox <mailto:bob@work.example>.
>   _:1 foaf:PersonalProfileDocument <bobFoaf.n3>.
>   _:1 foaf:age 35.
> 
> 
>   The Query Graph is the RDF-merge of {G1, G2}
> 
> 
>   Q9.1 Query:
> 
>   PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
>   SELECT ?mbox ?age ?ppd
>   WHERE       ( ?alice foaf:mbox <mailto:alice@work.example> )
> 	      ( ?alice foaf:knows ?whom )
> 	      ( ?whom foaf:mbox ?mbox )
> 	      ( ?whom foaf:PersonalProfileDocument ?ppd )
>   SOURCE ?ppd ( ?whom foaf:age ?age )

Would this also allow asking for a pattern in the SOURCE part:

e.g.  SOURCE ?ppd { ( ?whom foaf:age ?age ) ( ?whom foaf:birthday ?bday ) }

which is the same as:

SOURCE ?ppd ( ?whom foaf:age ?age )
SOURCE ?ppd ( ?whom foaf:birthday ?bday )

This is treating ?ppd just like any other variable.

It is at this point that I don't understand the restriction forbidding SOURCE 
<uri> ...  because

SOURCE <bobFoaf.n3> ( ?whom foaf:age ?age )

looks reasonable and implementable to me (e.g. the case where there are several
PersonalProfileDocument documents and you want to force one).

(?x foaf:PersonalProfileDocument ?ppd)
SOURCE ?ppd ( ?whom foaf:age ?age )

already fixed ?ppd as <bobFoaf.n3> anyway.


Ah - more I don't understand.  The bNode _:1 in bobFoaf.n3 is not the same as
the bnode _:1 in aliceFoaf.n3 so I am unsure what ?whom is getting set to.  The 
"one graph with regions" explanation may help here.
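
To check my reading, here is a small Python sketch that evaluates Q9.1 over
the D9.1 data.  It rests on assumptions of mine, not on the proposal: each
triple is stored with its graph name as a quad, the bNodes have been renamed
apart (_:a1, _:a2 for aliceFoaf.n3 and _:b1 for bobFoaf.n3), and SOURCE is
treated as an ordinary fourth position in the pattern.  The evaluator itself
is purely illustrative.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match_pattern(quads, pattern, binding):
    """Yield bindings extended by one (source, s, p, o) pattern; None = any."""
    for quad in quads:
        new = dict(binding)
        ok = True
        for pat, val in zip(pattern, quad):
            if pat is None:
                continue
            if is_var(pat):
                # Bind the variable, or check it agrees with an earlier binding.
                if new.setdefault(pat, val) != val:
                    ok = False
                    break
            elif pat != val:
                ok = False
                break
        if ok:
            yield new

def evaluate(quads, patterns):
    """Join the patterns left to right, starting from one empty binding."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings
                    for b2 in match_pattern(quads, pattern, b)]
    return bindings

A, B = "<aliceFoaf.n3>", "<bobFoaf.n3>"
quads = [
    (A, "_:a1", "foaf:mbox", "<mailto:alice@work.example>"),
    (A, "_:a1", "foaf:knows", "_:a2"),
    (A, "_:a2", "foaf:mbox", "<mailto:bob@work.example>"),
    (A, "_:a2", "foaf:age", 32),
    (B, "_:b1", "foaf:mbox", "<mailto:bob@work.example>"),
    (B, "_:b1", "foaf:PersonalProfileDocument", "<bobFoaf.n3>"),
    (B, "_:b1", "foaf:age", 35),
]

q91 = [
    (None, "?alice", "foaf:mbox", "<mailto:alice@work.example>"),
    (None, "?alice", "foaf:knows", "?whom"),
    (None, "?whom", "foaf:mbox", "?mbox"),
    (None, "?whom", "foaf:PersonalProfileDocument", "?ppd"),
    ("?ppd", "?whom", "foaf:age", "?age"),
]

first_three = evaluate(quads, q91[:3])   # ?whom is aliceFoaf's _:a2
full = evaluate(quads, q91)              # no PersonalProfileDocument on _:a2
fixed_source = evaluate(quads, [(B, "?whom", "foaf:age", "?age")])
```

Under these assumptions the first three patterns bind ?whom to the bNode from
aliceFoaf.n3, which has no PersonalProfileDocument triple, so the full query
returns nothing rather than R9.1.  The last line is the SOURCE <bobFoaf.n3>
form with a constant source, which evaluates without any special handling.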

> 
>   R9.1 Result:
>   mbox                      	age 	ppd
>   <mailto:bob@work.example> 	35 	<bobFoaf.n3>
> 
> This query returns the email addresses of people that Alice knows. It
> also returns their age according to their PersonalProfileDocument
> documents, as well as the URI of the graph. Alice's guess of Bob's
> age (32) is not returned.
> 
> 
> Any variable that is not bound must not match another variable that
> is not bound. Thus,
> 
>   Query Q9.2:
>   PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
>   SELECT ?given ?family
>   WHERE SOURCE ?ppd ( ?whom foaf:given ?given )
>         SOURCE ?ppd ( ?whom foaf:family ?family )
> 
> will match only if the source of both triples are known and the same.
> 
> A SPARQL implementation MAY not support graph names in which case the
> SOURCE ?var parts are ignored.

I need to think about this more - I'm worried that the ?var in "SOURCE ?var" 
behaves differently from other variables which might be a real nuisance when 
?var can appear elsewhere in the query.

Or can you assure me it is a regular variable?

> 
> -----------------
> 
> References
> 
> Named Containers
> http://lists.w3.org/Archives/Public/public-rdf-dawg/2004JulSep/0581.html

idea originally from:
http://lists.w3.org/Archives/Public/www-rdf-interest/2004Aug/0225.html

> 
> Named Graphs and TriX
> http://www.w3.org/2004/03/trix/
> 
> Named Graphs, Provenance and Trust
> Carroll, Jeremy J.; Bizer, Christian; Hayes, Patrick; Stickler, Patrick
> HPL-2004-57, 20040513 
> http://hpl.hp.com/techreports/2004/HPL-2004-57.html
> 
> ...
>
Received on Tuesday, 9 November 2004 17:21:45 UTC