Re: comment "Named- and background graphs, triples vs quads, trust, etc." on SOURCE, fromUnionQuery from Seaborne, Andy on 2005-03-31 (public-rdf-dawg@w3.org from January to March 2005)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Thu, 31 Mar 2005 15:59:11 +0100
To: "Thompson, Bryan B." <BRYAN.B.THOMPSON@saic.com>
CC: "'Dan Connolly '" <connolly@w3.org>, "'RDF Data Access Working Group '" <public-rdf-dawg@w3.org>
Message-ID: <424C103F.1020408@hp.com>

Thompson, Bryan B. wrote:
> Let me suggest that perhaps this entire notion of "publisher" is somewhat
> orthogonal to RDF query.  If you take SQL as an analogy (as is done often
> in this WG), then database management is the level at which data sets are
> shaped and made available for integration within a query.  This is, I
> presume, the "web" side to which Andy is referring - that these RDF graphs
> may exist 'on the web' rather than solely within the confines of a managed
> database.

Just some clarification. The publisher is the person/organisation making the 
data available.  So making the NCI ontology queryable over the web is
"publishing".  The simplest case is a fixed dataset provided by a publisher.
The service description could say what's in the dataset.  It's a "take it or 
leave it" option to the querying application.

It may not be the best term - but I wanted something that did not imply that the
application consuming the information and the entity providing the information
were necessarily connected.

Putting up a web page is publishing. The data management is not necessarily web
accessible - it is not something the information consumer has any access to.

 Andy

> 
> I think that an implementor is faced with two bad choices with respect
> to the "web" aspect of the system (the ability to cause RDF graphs on
> the web (vs some notion of managed RDF graphs) to be queried).  First, 
> one can fetch the data, which pretty much rules out scale since this
> will NOT work for large datasets.  Second, one can provide some level
> of non-transparent caching over fetched data.  Neither of these is a
> good choice, as neither is subject to the control of the person writing
> or submitting a query (within the current framework).
> 
> Kendall has proposed a variety of protocol operations which could help
> here by providing applications with a way to create and populate RDF
> resources that could be "local" to the query engine.  This is essentially
> a proposal for REST-ful operations to support applications interested in
> managed data.  For example, if you want to write queries using the NCI
> ontology, you could cause a known version of that ontology to be stored "at"
> the query engine, which would make it possible for the implementation to
> do some smart things in a manner that is relatively transparent to the
> application.
> 
> There is no manner that I can see in which the inability of the
> application to control which data are present in the "dataset"
> can lead to anything except a lack of interoperability.  We need to
> give the application control over this.  If a particular server would
> like to introduce "additional" data into the dataset, then perhaps this
> can be accomplished by interposing an intermediary on connections leading
> to the server - one which modifies the protocol request so as to make
> explicit the data to be incorporated rather than leaving this "up to the
> server".
> 
> I think that there is a real difference here.  In one world the server
> can fudge the contract concerning its data.  In the other, the contract
> is made explicit by the query and/or protocol request, even if that is
> being modified by an intermediary.
> 
> Just some thoughts.
> 
> -bryan
> 
> -----Original Message-----
> From: public-rdf-dawg-request@w3.org
> To: Dan Connolly
> Cc: RDF Data Access Working Group
> Sent: 3/31/2005 4:18 AM
> Subject: Re: comment "Named- and background graphs, triples vs quads, trust,
> etc." on SOURCE, fromUnionQuery
> 
> 
> 
> 
> Dan Connolly wrote:
> 
>>Here's another comment that I'm not quite sure what
>>to do with...
>>
>> Named- and background graphs, triples vs quads, trust, etc.
>>
> 
> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2005Mar/0097.html
> 
> See also:
> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2004Nov/0020.html
> 
> 
>>It is perhaps a request that we reconsider the SOURCE issue...
>>  http://www.w3.org/2001/sw/DataAccess/issues#SOURCE
>>
>>I'm not in a good position to advocate the WG's decision on that issue;
>>that was the first of N issues that I tried, without success, to get
>>the WG to postpone. (hmm... I'm not on record as abstaining on the
>>decision we took... I wonder why not...)
>>
>>The comment suggests "move the choice of arrangement into the
>>query language," which I don't think we considered. Perhaps that's
>>sufficient new information to re-open the issue.
> 
> 
> I read that as a request for FROM/WITH in the query language which we
> decided not to do.  In another comments list message, they were pointed
> at the protocol spec:
> 
> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2005Mar/0072.html
> 
> 
>>The comment says it's a follow-up from discussion with Andy, so I doubt
>>he's in a position to defend the current design to the satisfaction
>>of the commentor; it seems he's already tried.
> 
> 
> The discussions were mainly about counting and what it means to count
> bNodes - the only mention of named graphs was for attaching
> probabilities to triples (statings, not statements, presumably).  We
> haven't been talking much about datasets.
> 
> 
>>DaveB, you were involved in some proposals that led up to the WG's
>>decision... you're more than welcome to give it a try.
>>
>>The comment is also perhaps input to our most long-standing open issue
>>fromUnionQuery.
>>  http://www.w3.org/2001/sw/DataAccess/issues#fromUnionQuery
> 
> 
> Without FROM/WITH in the query language, I think this is a protocol
> issue (sorry Kendall!).
> 
> 
>>I don't have any actions assigned about that one... I don't really
>>have any plan for addressing it. I'm all ears.
>>
>>
> 
> 
> -------------------------------------------------------------
> 
> I believe that the message from Arjohn (2005Mar/0097.html) does not take
> into account a difference between a closed system and a web system.
> 
> There is no mention of the information publisher, just tool maker and
> the query.
> 
> Arjohn wrote:
>   > My main concern with the current spec is that it leaves the choice of
>   > the arrangement for RDF Datasets up to the implementer of the query
>   > engine.
> 
> The choice of the arrangement is up to the person/organisation
> publishing the data, not the query engine.
> 
> If the publisher is building a system where they wish to have all
> triples in the background graph, they will choose their query engine
> provider in one way; if they wish not to make any trust claim about the
> triples in the named graphs, they will choose their query engine another
> way.
> 
> Arjohn wrote:
>   > Keeping the
>   > SPARQL spec as it is today can have disastrous effects on the
>   > interoperability of SPARQL-aware tools.
> 
> Interoperability is about the same dataset behaving the same.  If one
> system automatically merges all the named graphs and one doesn't, it
> isn't the same dataset.
> 
> 
> If we have a query like:
> 
>    SELECT * WHERE { ?s ?p ?o }
> 
> then whether it is answered from information that the publisher asserts,
> or from something the publisher is just serving up, should not depend on
> whether there are any named graphs in the dataset.
> 
> In a closed system, the application and the publisher are often the same
> or part of the same organisation.  So saying "you must check the origin
> of all triples" can be applied.
> 
> On the web, this is not true.  The user/application/client can be
> unconnected to the publisher/server.
> 
> 
> By defaulting to accessing all triples, all queries are "caveat emptor"
> - no client can rely on trusting any publisher.
> 
>   > I see two possible ways to solve this issue:
>   >   1/ standardize on a single arrangement (preferably the latter), or
>   >   2/ move the choice of arrangement into the query language.
> 
> We do have a single arrangement - the background graph is separate from
> the named graphs.  The publisher is free to create a background graph
> based on their beliefs of who to trust and who not to.
> 
> Maybe I should make one of the examples in rq23 have no background
> graph.
> 
> 2/ places the choice with the application, not the information
> publisher.  But it's the information publisher who is asserting the
> statements.
> 
>   > If the first arrangement of named and background graphs is considered,
>   > then this query mechanism essentially is a mechanism for querying
>   > quads, not triples! The graph name is no longer just an ignorable
>   > attribute of triples, but is now an essential part of it. It appears
>   > to me that there is a mismatch between RDF and SPARQL here.
> 
> This seems key - the graph name is not ignorable.
> 
> If a data provider publishes an RDF graph without further information,
> then that data provider is responsible for that information.  That is
> the background graph (default knowledge base).
> 
> The unnamed graph is being published without further information (it's
> just a graph on the web) and as such it is the data provider who is
> publishing it.
> 
> By providing named graphs, we provide a way to export a graph without it
> going under the label of coming from the data provider.
> 
> So the two choices are to require all information to be checked ("caveat
> emptor" - or trust until proven not to be trustworthy, default is to
> trust) or to not trust information until its provenance is verified
> (publishers are responsible for information they publish - applications
> add things into the space of things they trust, not remove them later).
> 
> [[
> To refer to a different area: The Guardian newspaper's styleguide:
>    http://www.guardian.co.uk/styleguide/article/0,5817,354123,00.html
> for a discussion on naming sources in the newspaper: the question is
> how can the reader evaluate who to believe without information about the
> source]]
> 
> 
> Automatically putting all triples in the unnamed graph is defaulting to
> trusting them because
> 
> SELECT * WHERE { ?s ?p ?o }
> 
> is taking the default for triples.  That query should work whether there
> are additional named graphs in the dataset (which the application may
> not be aware of) or not, and also whether named graphs are added to the
> dataset later.  Having it vary by whether the publisher has chosen not
> to place the necessary information for checking in the dataset is very
> dangerous.
> 
> It then comes down to whether the application writer is responsible for
> checking all triples (the legal principle of caveat emptor) or whether
> the publisher is responsible for the background graphs they publish.
> 
> 
> 
> There is a further technical issue as well:
> 
> SELECT * WHERE { :foo :p ?x . :foo :q ?v }
> 
> may find solutions but if the first triple pattern matches only in one
> graph and the second triple pattern only matches in a second graph, then
> there is no graph that matches the full pattern and you can't ask where
> it came from, yet the query returns variable bindings.  Why is the
> combined pattern a graph match?  Because the publisher put all the
> triples together.
> 
>  Andy
> 
> 
>
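
The cross-graph matching issue at the end of the quoted message can be illustrated with a small sketch. This is not SPARQL and not any engine's implementation: it is a toy pattern matcher in Python, and the graph names, triples, and helper functions (match_pattern, match_bgp) are all hypothetical, made up for illustration. It only shows that the two-pattern query matches in no single named graph, yet yields bindings once the graphs are merged.

```python
# Hypothetical toy dataset: graph names and triples are illustrative only.
NAMED_GRAPHS = {
    "http://example.org/g1": {(":foo", ":p", "1")},
    "http://example.org/g2": {(":foo", ":q", "2")},
}


def match_pattern(graph, pattern, binding):
    """Yield bindings that extend `binding` so `pattern` matches a triple."""
    for triple in graph:
        new = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Variable: bind it, or check it agrees with an earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                # Constant term that does not match this triple.
                ok = False
                break
        if ok:
            yield new


def match_bgp(graph, patterns):
    """Return all bindings that satisfy every pattern within one graph."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings
                    for b2 in match_pattern(graph, pattern, b)]
    return bindings


# Corresponds to: SELECT * WHERE { :foo :p ?x . :foo :q ?v }
QUERY = [(":foo", ":p", "?x"), (":foo", ":q", "?v")]

# Evaluate the query against each named graph separately.
per_graph = {name: match_bgp(g, QUERY) for name, g in NAMED_GRAPHS.items()}

# Evaluate it against the merge of all named graphs.
merged = set().union(*NAMED_GRAPHS.values())
merged_result = match_bgp(merged, QUERY)

print(per_graph)
print(merged_result)
```

Under these assumptions, neither named graph alone produces a solution, while the merged graph does, so there is no single source graph to attribute the answer to.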
Received on Thursday, 31 March 2005 15:00:22 UTC