Re: comment "Named- and background graphs, triples vs quads, trust, etc." on SOURCE, fromUnionQuery from Seaborne, Andy on 2005-03-31 (public-rdf-dawg@w3.org from January to March 2005)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Thu, 31 Mar 2005 10:17:43 +0100
To: Dan Connolly <connolly@w3.org>
CC: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-ID: <424BC037.4020802@hp.com>
Dan Connolly wrote:
> Here's another comment that I'm not quite sure what
> to do with...
> 
>  Named- and background graphs, triples vs quads, trust, etc.
> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2005Mar/0097.html

See also:
http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2004Nov/0020.html

> 
> It is perhaps a request that we reconsider the SOURCE issue...
>   http://www.w3.org/2001/sw/DataAccess/issues#SOURCE
> 
> I'm not in a good position to advocate the WG's decision on that issue;
> that was the first of N issues that I tried, without success, to get
> the WG to postpone. (hmm... I'm not on record as abstaining on the
> decision we took... I wonder why not...)
> 
> The comment suggests "move the choice of arrangement into the
> query language," which I don't think we considered. Perhaps that's
> sufficient new information to re-open the issue.

I read that as a request for FROM/WITH in the query language which we decided
not to do.  In another comments list message, they were pointed at the protocol
spec:

http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2005Mar/0072.html

> 
> The comment says it's a follow-up from discussion with Andy, so I doubt
> he's in a position to defend the current design to the satisfaction
> of the commentor; it seems he's already tried.

The discussions were mainly about counting and what it means to count bNodes -
the only mention of named graphs was for attaching probabilities to triples
(statings, not statements, presumably).  We haven't been talking much about
datasets.

> 
> DaveB, you were involved in some proposals that led up to the WG's
> decision... you're more than welcome to give it a try.
> 
> The comment is also perhaps input to our most long-standing open issue
> fromUnionQuery.
>   http://www.w3.org/2001/sw/DataAccess/issues#fromUnionQuery

Without FROM/WITH in the query language, I think this is a protocol issue (sorry
Kendall!).

> 
> I don't have any actions assigned about that one... I don't really
> have any plan for addressing it. I'm all ears.
> 
> 

-------------------------------------------------------------

I believe that the message from Arjohn (2005Mar/0097.html) does not take into
account a difference between a closed system and a web system.

There is no mention of the information publisher, just the tool maker and the query.

Arjohn wrote:
  > My main concern with the current spec is that it leaves the choice of
  > the arrangement for RDF Datasets up to the implementer of the query
  > engine.

The choice of the arrangement is up to the person/organisation publishing the
data, not the query engine.

If the publisher is building a system where they wish to have all triples
in the background graph, they will choose their query engine provider in one
way; if they wish not to make any trust claim about the triples in the named
graphs, they will choose their query engine another way.

Arjohn wrote:
  > Keeping the
  > SPARQL spec as it is today can have disastrous effects on the
  > interoperability of SPARQL-aware tools.

Interoperability is about the same dataset behaving the same.  If one system
automatically merges all the named graphs and one doesn't, it isn't the same
dataset.


If we have a query like:

   SELECT * WHERE { ?s ?p ?o }

then whether it is answered from information that the publisher asserts, or from
something the publisher is just serving up, should not depend on whether there
are any named graphs in the dataset.

In a closed system, the application and the publisher are often the same or part
of the same organisation.  So the rule "you must check the origin of all triples"
can be applied.

On the web, this is not true.  The user/application/client can be unconnected to
the publisher/server.


If the default is to access all triples, all queries are "caveat emptor" - no
client can rely on trusting any publisher.

  > I see two possible ways to solve this issue:
  >   1/ standardize on a single arrangement (preferably the latter), or
  >   2/ move the choice of arrangement into the query language.

We do have a single arrangement - the background graph is separate from the
named graphs.  The publisher is free to create a background graph based on their
beliefs of who to trust and who not to.

Maybe I should make one of the examples in rq23 have no background graph.

2/ places the choice with the application, not the information publisher.  But
it's the information publisher who is asserting the statements.

  > If the first arrangement of named and background graphs is considered,
  > then this query mechanism essentially is a mechanism for querying quads,
  > not triples! The graph name is no longer just an ignorable attribute of
  > triples, but is now an essential part of it. It appears to me that there
  > is a mismatch between RDF and SPARQL here.

This seems key - the graph name is not ignorable.

If a data provider publishes an RDF graph without further information,
then that data provider is responsible for that information.   That is the
background graph (default knowledge base).

The unnamed graph is being published without further information (it's just a
graph on the web) and as such it is the data provider who is publishing it.

By providing named graphs, we provide a way to export a graph without it going
under the label of coming from the data provider.

So there are two choices.  One is to require the application to check all
information ("caveat emptor": the default is to trust, until something is proven
not to be trustworthy).  The other is not to trust information until its
provenance is verified (publishers are responsible for the information they
publish; applications add things into the space of things they trust, rather
than removing them later).

[[
To refer to a different area, see The Guardian newspaper's style guide:
   http://www.guardian.co.uk/styleguide/article/0,5817,354123,00.html
for a discussion of naming sources in the newspaper: the question is how the
reader can evaluate who to believe without information about the source.
]]


Automatically putting all triples in the unnamed graph is defaulting to
trusting them, because

SELECT * WHERE { ?s ?p ?o }

is taking the default for triples.  That query should work whether there are
additional named graphs in the dataset (which the application may not be aware
of) or not, and also whether named graphs are added to the dataset later. 
Having the answer vary by whether the publisher has chosen not to place
the necessary information for checking in the dataset is very dangerous.
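
As a rough illustration of that point (a sketch only: the Dataset class, the
function names and the example.org graph name are all hypothetical, and plain
Python sets stand in for RDF graphs), compare what { ?s ?p ?o } returns before
and after a named graph is added, under the two arrangements:

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    """Hypothetical dataset: one unnamed (background) graph plus named graphs."""
    background: set = field(default_factory=set)
    named: dict = field(default_factory=dict)

def select_all_separate(ds):
    """{ ?s ?p ?o } matched against the background graph only."""
    return set(ds.background)

def select_all_merged(ds):
    """{ ?s ?p ?o } when the engine folds every named graph into the default."""
    return ds.background.union(*ds.named.values())

# The publisher asserts one triple in the background graph.
ds = Dataset(background={(":a", ":p", ":b")})
before_separate = select_all_separate(ds)
before_merged = select_all_merged(ds)

# Later, a graph the publisher makes no claim about is added as a named graph.
ds.named["http://example.org/unvetted"] = {(":a", ":p", ":c")}

after_separate = select_all_separate(ds)
after_merged = select_all_merged(ds)

print(before_separate == after_separate)  # answers unchanged by the addition
print(before_merged == after_merged)      # answers changed by the addition
```

With the background graph kept separate, the answers do not move when the named
graph appears; with automatic merging, the unvetted triple shows up in the
default answers.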

It then comes down to whether the application writer is responsible for
checking all triples (the legal principle of caveat emptor) or whether the
publisher is responsible for the background graphs they publish.



There is a further technical issue as well:

SELECT * WHERE { :foo :p ?x . :foo :q ?v }

may find solutions.  But if the first triple pattern matches only in one graph
and the second triple pattern matches only in a second graph, then no single
graph matches the full pattern, and you can't ask where the match came from, yet
the query still returns variable bindings.  Why is the combined pattern a graph
match?  Because the publisher put all the triples together.
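
A small sketch of this case, under stated assumptions: the graph names and the
literal values "1" and "2" are made up, match_bgp is a hypothetical helper
hard-wired to this one pattern, and Python sets stand in for RDF graphs.

```python
# Hypothetical data: two named graphs, each holding one triple about :foo.
named_graphs = {
    "http://example.org/g1": {(":foo", ":p", "1")},
    "http://example.org/g2": {(":foo", ":q", "2")},
}

def match_bgp(graph):
    """Solutions for { :foo :p ?x . :foo :q ?v } against one set of triples."""
    xs = [o for (s, p, o) in graph if s == ":foo" and p == ":p"]
    vs = [o for (s, p, o) in graph if s == ":foo" and p == ":q"]
    # Both patterns must match in the same graph, so pair up the bindings.
    return [{"x": x, "v": v} for x in xs for v in vs]

# Every named graph merged into the unnamed graph.
merged = set().union(*named_graphs.values())
merged_solutions = match_bgp(merged)

# The same pattern matched against each named graph on its own.
per_graph_solutions = {name: match_bgp(g) for name, g in named_graphs.items()}

print(merged_solutions)     # one solution, with no single source graph
print(per_graph_solutions)  # no solutions in either graph alone
```

The merged graph yields bindings for ?x and ?v, but neither named graph on its
own matches the whole pattern, so there is no graph name to report for the
solution.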

 Andy
Received on Thursday, 31 March 2005 09:18:36 UTC