Source and provenance words. from Dave Beckett on 2004-09-06 (public-rdf-dawg@w3.org from July to September 2004)

From: Dave Beckett <dave.beckett@bristol.ac.uk>
Date: Mon, 6 Sep 2004 17:32:53 +0100
To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-Id: <20040906173253.0acd2c9b@hoth.ilrt.bris.ac.uk>
I volunteered to own this issue, recorded as:
  ACTION: DaveB to repropose source in both results and restrictions.
The words below are based on the feedback to my earlier email:
  http://lists.w3.org/Archives/Public/public-rdf-dawg/2004JulSep/0307.html

Referring to
  http://www.w3.org/2001/sw/DataAccess/rq23/
  $Revision: 1.52 $ of $Date: 2004/09/03 15:04:38 $

I have written words for section:
  9. Querying the Origin of Statements
  http://www.w3.org/2001/sw/DataAccess/rq23/#source
but I also needed something about data sources, so I've got
some additional words for section 6.

There is also some terminology change compared to the above email,
now using Origin rather than Source to try to distinguish what might
be a DAWG service access point (DAWG protocol API, WSDL1.x portType,
WSDL2 instance of interface, target, file or implicit graph) from a
URI of content.

Some comments below after the words.

----------------------------------------------------------------------

6. Choosing What to Query

...

DAWG queries operate against an RDF graph.  The graph may be given
implicitly, where the graph context is known by the application or
known externally, such as by the DAWG protocol.  A FROM clause can
also explicitly give a Source URI:

  FROM <http://www.w3.org/2000/08/w3c-synd/home.rss>
  SELECT * WHERE
    (?x ?y ?z)

The URI is retrieved and the resulting representation should encode
RDF triples in some syntax, such as RDF/XML.  These triples provide
the query graph.

Aggregate graphs may also be queried by using multiple source URIs in
the FROM clause such as:

  FROM <uri1>, <uri2>
  SELECT ...

However this is implemented, the result must be equivalent to
retrieving the Source URIs and forming the aggregate graph from the
triples returned.  Implementations may provide a single web service
target that aggregates multiple Source URIs, accessed by the DAWG
protocol or some other mechanism.
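
To illustrate the equivalence requirement, here is a minimal Python
sketch.  It assumes triples are plain 3-tuples of strings and that each
Source URI has already been retrieved and parsed; the `aggregate`
helper, the `ex:` names and the sample data are hypothetical and not
part of the draft.  Blank nodes would need to be kept distinct per
source in a real merge, which this sketch ignores.

```python
# Sketch only: triples are modelled as (subject, predicate, object) tuples.
# Retrieval and parsing of each Source URI is assumed to have happened already.

def aggregate(graphs_by_source):
    """Form the aggregate graph as the set union of each source's triples."""
    merged = set()
    for triples in graphs_by_source.values():
        merged |= set(triples)
    return merged

# Hypothetical data standing in for the representations of <uri1> and <uri2>.
retrieved = {
    "uri1": [("ex:a", "ex:p", "ex:b"), ("ex:a", "ex:q", "ex:c")],
    "uri2": [("ex:a", "ex:p", "ex:b"), ("ex:d", "ex:p", "ex:e")],
}

graph = aggregate(retrieved)
print(len(graph))  # the triple shared by both sources is counted once
```

Note that a plain union like this forgets which source each triple
came from, which is the gap the Origin wording in section 9 addresses.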

  Issue: Referring to the DAWG protocol a lot here without checking
  that the requirements for the protocol match.

The RDF graph may be constructed through inference rather than
retrieval, or may never be materialized.

  Issue: Does the use of Source URI and representation for graphs
  make sense with this?



9. Working with the Origin of Triples
[Note change of section title]


The Origin of an RDF triple in a query graph is the RDF URI Reference
(ref) from which a resource representation providing that triple was
retrieved.  The triple may be in an aggregate graph.

  Issue: Could allow blank nodes here which would help with the
  inferred or non-materialized graphs.

  TRiX http://www.w3.org/2004/03/trix/ removed this after originally
  allowing named graphs to be named by blank nodes.  I think this was
  due to scoping issues.  FIXME Find reference to why it was removed.
  The ISWC2004 paper?

A triple in an RDF graph may have zero or more Origins.  A BRQL
application is not required to support recording and providing origin
information.
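
One way to picture "zero or more Origins" per triple is the following
Python sketch.  The `OriginGraph` class, its method names and the
example.org URIs are hypothetical illustrations, not anything the draft
defines; it is an implementation technique only and says nothing about
the query language.

```python
from collections import defaultdict

class OriginGraph:
    """Hypothetical store: a set of triples plus a set of Origin URIs per triple."""

    def __init__(self):
        self.triples = set()
        self.origins = defaultdict(set)  # triple -> set of Origin URIs

    def add(self, triple, origin=None):
        # origin=None models a triple with no recorded Origin (e.g. inferred).
        self.triples.add(triple)
        if origin is not None:
            self.origins[triple].add(origin)

    def origins_of(self, triple):
        # .get avoids creating an empty entry for triples without Origins.
        return set(self.origins.get(triple, ()))

g = OriginGraph()
item = ("ex:item1", "rdf:type", "rss:item")
inferred = ("ex:item1", "ex:inferred", "ex:x")

g.add(item, "http://example.org/feed-a.rss")
g.add(item, "http://example.org/feed-b.rss")  # same triple, second Origin
g.add(inferred)                               # no Origin recorded

print(sorted(g.origins_of(item)))
print(g.origins_of(inferred))
```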

  Issue: Making origin information optional does not help
  interoperability.

The Origin of a triple may be used in queries with the SOURCE
clause before a triple pattern:

Example 9.1:
  Find all triples in a graph of aggregated RSS 1.0 feeds which were
  retrieved from the W3C's feed.

Data:
  An aggregated graph of RSS 1.0 feeds including the triples
  retrieved from Source URI http://www.w3.org/2000/08/w3c-synd/home.rss

Query:
  SELECT ?x,?y,?z WHERE
    SOURCE <http://www.w3.org/2000/08/w3c-synd/home.rss> (?x ?y ?z)

Result:
  the RDF triples with origin
    <http://www.w3.org/2000/08/w3c-synd/home.rss> 


If the application does not support origin information or no origin
information was recorded when the aggregated graph was created, no
results are returned.
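
The behaviour in Example 9.1, including the empty result when no
origin information exists, can be sketched in Python as below.  The
`match_source` function, the (triple, origins) pair layout and the
sample items are assumptions made for illustration only.

```python
W3C_FEED = "http://www.w3.org/2000/08/w3c-synd/home.rss"

def match_source(store, origin):
    """Sketch of SOURCE <origin> (?x ?y ?z): keep triples recorded with that Origin.

    `store` is an iterable of (triple, origins) pairs, where `origins` is the
    possibly empty set of Origin URIs recorded for the triple.
    """
    return [triple for triple, origins in store if origin in origins]

# Hypothetical aggregate graph with origin information recorded.
with_origins = [
    (("ex:i1", "rdf:type", "rss:item"), {W3C_FEED}),
    (("ex:i2", "rdf:type", "rss:item"), {"http://example.org/other.rss"}),
]

# The same triples, but with no origin information recorded at all.
without_origins = [(triple, set()) for triple, _ in with_origins]

print(match_source(with_origins, W3C_FEED))     # only the W3C feed's triple
print(match_source(without_origins, W3C_FEED))  # nothing matches
```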

  Issue: Change to ORIGIN keyword?

  Issue: Make RDF triples become non-RDF quads.

Origin information can be returned by queries using a variable
with the SOURCE clause.

Example 9.2:

  An aggregate graph contains aggregated RSS 1.0 feeds and the query
  wants to return all items indicating where they were originally
  retrieved from, even with duplicates:

Data:
  an aggregate graph of RSS 1.0 feeds

Query:
  SELECT ?s,?x WHERE
    SOURCE ?s (?x rdf:type rss:item)

Results:
  ?s= ... ?x=...
  for each RSS 1.0 item in the aggregate graph.

  In each result, ?s binds to an Origin URI of the triple that matched.

If an implementation does not support Origin information, the SOURCE
?s clause is ignored and no binding value is returned for ?s.

  ?s=null ?x=...
  ...

  Issue: this means adding a null value definition which I know is
    tricky.  The alternative is to not give a result for ?s:

      ?x=...
      ...
    which means the returned result set is not regular.
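
The two options for Example 9.2 can be compared with a small Python
sketch.  Everything here is hypothetical: the `select_items` function,
its `supports_origin` and `use_null` flags, the (triple, origins) pair
layout and the example.org URIs.  It also assumes that, when origin
information is supported, a triple with no recorded Origin yields no
row, which the words above do not settle.

```python
def select_items(store, supports_origin=True, use_null=True):
    """Sketch of a query with pattern: SOURCE ?s (?x rdf:type rss:item)."""
    rows = []
    for (subj, pred, obj), origins in store:
        if (pred, obj) != ("rdf:type", "rss:item"):
            continue
        if supports_origin:
            # One row per Origin, so an item seen in two feeds is reported twice.
            # Assumption: a matching triple with no Origin produces no row.
            for origin in sorted(origins):
                rows.append({"s": origin, "x": subj})
        elif use_null:
            rows.append({"s": None, "x": subj})  # regular rows, null for ?s
        else:
            rows.append({"x": subj})             # irregular rows, ?s missing
    return rows

store = [
    (("ex:i1", "rdf:type", "rss:item"),
     {"http://example.org/a.rss", "http://example.org/b.rss"}),
    (("ex:i1", "rss:title", "Hello"), {"http://example.org/a.rss"}),
]

print(select_items(store))
print(select_items(store, supports_origin=False))
print(select_items(store, supports_origin=False, use_null=False))
```

With the null option every row has the same keys, at the cost of
defining null; with the other option a consumer has to cope with rows
that lack ?s.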


----------------------------------------------------------------------

My opinion on RDF quads for provenance in queries is well known and
I've discussed this before in depth[1].  I don't like to see them in
RDF query languages since there is little consensus on what the fourth
item is - that is one thing we are considering.

They are of course fine as implementation techniques.  What's inside
your application is up to you.

Dave

[1]
http://lists.w3.org/Archives/Public/public-rdf-dawg/2004JulSep/0305.html
Received on Monday, 6 September 2004 16:35:02 UTC