- From: Seaborne, Andy <andy.seaborne@hp.com>
- Date: Thu, 31 Mar 2005 15:59:11 +0100
- To: "Thompson, Bryan B." <BRYAN.B.THOMPSON@saic.com>
- CC: "'Dan Connolly '" <connolly@w3.org>, "'RDF Data Access Working Group '" <public-rdf-dawg@w3.org>
Thompson, Bryan B. wrote:
> Let me suggest that perhaps this entire notion of "publisher" is somewhat
> orthogonal to RDF query.  If you take SQL as an analogy (as is done often
> in this WG), then database management is the level at which data sets are
> shaped and made available for integration within a query.  This is, I
> presume, the "web" side to which Andy is referring - that these RDF graphs
> may exist 'on the web' rather than solely within the confines of a managed
> database.

Just some clarification.

The publisher is the person/organisation making the data available.  So
making the NCI ontology queryable over the web is "publishing".

The simplest case is a fixed dataset provided by a publisher.  The service
description could say what's in the dataset.  It's a "take it or leave it"
option to the querying application.

It may not be the best term - but I wanted something that did not imply
that the application consuming the information and the entity providing the
information were necessarily connected.  Putting up a web page is
publishing.  The data management is not necessarily web accessible - it is
not something the information consumer has any access to.

	Andy

>
> I think that an implementor is faced with two bad choices with respect
> to the "web" aspect of the system (the ability to cause RDF graphs on
> the web (vs some notion of managed RDF graphs) to be queried).  First,
> one can fetch the data, which pretty much rules out scale since this
> will NOT work for large datasets.  Second, one can provide some level
> of non-transparent caching over fetched data.  Neither of these is a
> good choice, as neither is subject to the control of the person writing
> or submitting a query (within the current framework).
>
> Kendall has proposed a variety of protocol operations which could help
> here by providing applications with a way to create and populate RDF
> resources that could be "local" to the query engine.
> This is essentially a proposal for REST-ful operations to support
> applications interested in managed data.  For example, if you want to
> write queries using the NCI ontology, you could cause a known version of
> that ontology to be stored "at" the query engine, which would make it
> possible for the implementation to do some smart things in a manner that
> is relatively transparent to the application.
>
> There is no manner that I can see in which the inability of the
> application to control which data are present in the "dataset"
> can lead to anything except a lack of interoperability.  We need to
> give the application control over this.  If a particular server would
> like to introduce "additional" data into the dataset, then perhaps this
> can be accomplished by interposing an intermediary on connections leading
> to the server - one which modifies the protocol request so as to make
> explicit the data to be incorporated rather than leaving this "up to the
> server".
>
> I think that there is a real difference here.  In one world the server
> can fudge the contract concerning its data.  In the other, the contract
> is made explicit by the query and/or protocol request, even if that is
> being modified by an intermediary.
>
> Just some thoughts.
>
> -bryan
>
> -----Original Message-----
> From: public-rdf-dawg-request@w3.org
> To: Dan Connolly
> Cc: RDF Data Access Working Group
> Sent: 3/31/2005 4:18 AM
> Subject: Re: comment "Named- and background graphs, triples vs quads,
> trust, etc." on SOURCE, fromUnionQuery
>
>
> Dan Connolly wrote:
>
>>Here's another comment that I'm not quite sure what
>>to do with...
>>
>>  Named- and background graphs, triples vs quads, trust, etc.
>>
>
> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2005Mar/0097.html
>
> See also:
> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2004Nov/0020.html
>
>
>>It is perhaps a request that we reconsider the SOURCE issue...
>>  http://www.w3.org/2001/sw/DataAccess/issues#SOURCE
>>
>>I'm not in a good position to advocate the WG's decision on that issue;
>>that was the first of N issues that I tried, without success, to get
>>the WG to postpone. (hmm... I'm not on record as abstaining on the
>>decision we took... I wonder why not...)
>>
>>The comment suggests "move the choice of arrangement into the
>>query language," which I don't think we considered. Perhaps that's
>>sufficient new information to re-open the issue.
>
>
> I read that as a request for FROM/WITH in the query language which we
> decided not to do.  In another comments list message, they were pointed
> at the protocol spec:
>
> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2005Mar/0072.html
>
>
>>The comment says it's a follow-up from discussion with Andy, so I doubt
>>he's in a position to defend the current design to the satisfaction
>>of the commenter; it seems he's already tried.
>
>
> The discussions were mainly about counting and what it means to count
> bNodes - the only mention of named graphs was for attaching probabilities
> to triples (statings, not statements, presumably).  We haven't been
> talking much about datasets.
>
>
>>DaveB, you were involved in some proposals that led up to the WG's
>>decision... you're more than welcome to give it a try.
>>
>>The comment is also perhaps input to our most long-standing open issue
>>fromUnionQuery.
>>  http://www.w3.org/2001/sw/DataAccess/issues#fromUnionQuery
>
>
> Without FROM/WITH in the query language, I think this is a protocol
> issue (sorry Kendall!).
>
>
>>I don't have any actions assigned about that one... I don't really
>>have any plan for addressing it. I'm all ears.
>>
>>
>
>
> -------------------------------------------------------------
>
> I believe that the message from Arjohn (2005Mar/0097.html) does not take
> into account a difference between a closed system and a web system.
>
> There is no mention of the information publisher, just the tool maker and
> the query.
>
> Arjohn wrote:
> > My main concern with the current spec is that it leaves the choice of
> > the arrangement for RDF Datasets up to the implementer of the query
> > engine.
>
> The choice of the arrangement is up to the person/organisation publishing
> the data, not the query engine.
>
> If the publisher is building a system where they wish to have all triples
> in the background graph, they will choose their query engine provider in
> one way; if they wish not to make any trust claim about the triples in the
> named graphs, they will choose their query engine another way.
>
> Arjohn wrote:
> > Keeping the
> > SPARQL spec as it is today can have disastrous effects on the
> > interoperability of SPARQL-aware tools.
>
> Interoperability is about the same dataset behaving the same.  If one
> system automatically merges all the named graphs and one doesn't, it isn't
> the same dataset.
>
>
> If we have a query like:
>
> SELECT * WHERE { ?s ?p ?o }
>
> then whether it is answered from information that the publisher asserts,
> or from something the publisher is just serving up, should not depend on
> whether there are any named graphs in the dataset.
>
> In a closed system, the application and the publisher are often the same
> or part of the same organisation.  So saying "you must check the origin
> of all triples" can be applied.
>
> On the web, this is not true.  The user/application/client can be
> unconnected to the publisher/server.
>
>
> By defaulting to accessing all triples, all queries are "caveat emptor"
> - no client can rely on trusting any publisher.
>
> > I see two possible ways to solve this issue:
> > 1/ standardize on a single arrangement (preferably the latter), or
> > 2/ move the choice of arrangement into the query language.
>
> We do have a single arrangement - the background graph is separate from
> the named graphs.
> The publisher is free to create a background graph based on their
> beliefs about who to trust and who not to.
>
> Maybe I should make one of the examples in rq23 have no background
> graph.
>
> 2/ places the choice with the application, not the information
> publisher.  But it's the information publisher who is asserting the
> statements.
>
> > If the first arrangement of named and background graphs is considered,
> > then this query mechanism essentially is a mechanism for querying quads,
> > not triples! The graph name is no longer just an ignorable attribute of
> > triples, but is now an essential part of it. It appears to me that there
> > is a mismatch between RDF and SPARQL here.
>
> This seems key - the graph name is not ignorable.
>
> If a data provider publishes an RDF graph without further information,
> then that data provider is responsible for that information.  That is
> the background graph (default knowledge base).
>
> The unnamed graph is being published without further information (it's
> just a graph on the web) and as such it is the data provider who is
> publishing it.
>
> By providing named graphs, we provide a way to export a graph without it
> going under the label of coming from the data provider.
>
> So the two choices are to require all information to be checked ("caveat
> emptor" - or trust until proven not to be trustworthy, default is to
> trust) or to not trust information until its provenance is verified
> (publishers are responsible for information they publish - applications
> add things into the space of things they trust, not remove them later).
>
> [[
> To refer to a different area, see The Guardian newspaper's style guide:
> http://www.guardian.co.uk/styleguide/article/0,5817,354123,00.html
> for a discussion on naming sources in the newspaper: the question is
> how the reader can evaluate who to believe without information about the
> source.
> ]]
>
>
> Automatically putting all triples in the unnamed graph is defaulting to
> trusting them because
>
> SELECT * WHERE { ?s ?p ?o }
>
> is taking the default for triples.  That query should work whether there
> are additional named graphs in the dataset (which the application may not
> be aware of) or not, and also whether named graphs are added to the
> dataset later.  Having it vary by whether the publisher has chosen not to
> place the necessary information for checking in the dataset is very
> dangerous.
>
> It then comes down to whether the application writer is responsible for
> checking all triples (the legal principle of caveat emptor) or whether
> the publisher is responsible for the background graphs they publish.
>
>
> There is a further technical issue as well:
>
> SELECT * WHERE { :foo :p ?x . :foo :q ?v }
>
> may find solutions, but if the first triple pattern matches only in one
> graph and the second triple pattern only matches in a second graph, then
> there is no graph that matches the full pattern and you can't ask where
> it came from, yet the query returns variable bindings.  Why is the
> combined pattern a graph match?  Because the publisher put all the
> triples together.
>
> 	Andy
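
The "further technical issue" at the end of the quoted message can be illustrated with a small sketch. It is not SPARQL and not taken from any implementation discussed in the thread: it assumes a toy dataset in plain Python, where the graph names, the literal values and the helper `match_both` are all made up for illustration. It shows the two-pattern query finding nothing in either named graph on its own, yet producing bindings once the graphs are merged.

```python
# Toy dataset: graph name -> set of (subject, predicate, object) triples.
# The graph names and values are hypothetical, chosen only for illustration.
named_graphs = {
    "http://example.org/g1": {(":foo", ":p", "1")},
    "http://example.org/g2": {(":foo", ":q", "2")},
}

def match_both(graph):
    """Bindings for the pattern { :foo :p ?x . :foo :q ?v } within ONE graph."""
    xs = [o for (s, p, o) in graph if s == ":foo" and p == ":p"]
    vs = [o for (s, p, o) in graph if s == ":foo" and p == ":q"]
    # Both triple patterns must match in the same graph to give a solution.
    return [{"x": x, "v": v} for x in xs for v in vs]

# Evaluated graph by graph, the full pattern matches nowhere.
per_graph = {name: match_both(g) for name, g in named_graphs.items()}

# Evaluated over the merge of all named graphs, it yields a solution
# that cannot be attributed to any single source graph.
merged = set().union(*named_graphs.values())
merged_solutions = match_both(merged)

print(per_graph)
print(merged_solutions)
```

In this sketch the merged evaluation returns bindings for `?x` and `?v` even though no single graph contains both triples, which is the situation the message describes: the match exists only because the triples were put together.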
Received on Thursday, 31 March 2005 15:00:22 UTC