RE: LOAD, FROM, GRAPH from Seaborne, Andy on 2005-02-03 (public-rdf-dawg@w3.org from January to March 2005)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Thu, 03 Feb 2005 11:44:33 +0000
To: Eric Prud'hommeaux <eric@w3.org>
Cc: public-rdf-dawg@w3.org
Message-ID: <42020EA1.3030403@hp.com>
-------- Original Message --------
> From: Eric Prud'hommeaux <>
> Date: 28 January 2005 16:26

Eric,

I'd rather not discuss trust models in rq23 and generally try to be neutral on
the issues involved - you provide some compelling examples but there are others
where treating the aggregation uniformly is undesirable - it enables one source
to make assertions about a resource that is definitively reported by another.
See the example in rq23 where the provenance information is only in the
background graph.  Checking later in the query to verify that each and
every triple was asserted by the right place can get very complicated and, more
importantly, suggests "trust until shown otherwise".

I believe the dataset concept allows various setups, including the ones you give
below.

> I argue below that there are implementation/specification costs to
> having a default graph that does not encompass the universe of
> knowledge. I also feel that it is architecturally purest if the KB
> represents what we know of the semantic web.

I don't see the implementation/specification costs:

Specification costs:

The "dataset" idea ensures all matching is graph matching, bNodes
included. It also tries to focus solely on naming, not trust.

If by "we know" you mean "we believe to be true" then, to me, this
argues for keeping the background graph separate from the named graphs,
which are information but not necessarily trusted to make assertions
about anything.

This setup is not precluded by the current design around "RDF datasets"
- it just isn't the only one.  A system that wishes to just handle the
single KB case is fine, as is the other.  It seems it is the directives
for declaring the dataset that cause the bias for one approach or the
other.


Implementation costs:

Case 1 : using quads :

"(s p o)" is either "quad(s p o ANY)" or "quad(s p o BG)" depending on which
case the system provides.  [[BG is the internal identifier for the
background graph.  It is never revealed.]]  Thought of like this, I don't
see a significant difference in implementation costs.
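
To make Case 1 concrete, here is a minimal Python sketch. It is not taken from
any real implementation: the Quad type, the match and match_triple helpers and
the "_:background" identifier are all made up for illustration. The only point
is that the two behaviours differ in a single argument.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Set, Tuple


class Quad(NamedTuple):
    s: str
    p: str
    o: str
    g: str  # name of the graph the triple belongs to


ANY = None           # wildcard: the quad may sit in any graph
BG = "_:background"  # hypothetical internal id, never shown to the user


def match(store: Iterable[Quad], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None,
          g: Optional[str] = ANY) -> Iterator[Quad]:
    """Yield every quad matching the pattern; None acts as a wildcard."""
    wanted = (s, p, o, g)
    for quad in store:
        if all(w is None or w == v for w, v in zip(wanted, quad)):
            yield quad


def match_triple(store: Iterable[Quad], s: Optional[str] = None,
                 p: Optional[str] = None, o: Optional[str] = None,
                 *, merged: bool) -> Set[Tuple[str, str, str]]:
    """Answer "(s p o)" as quad(s p o ANY) or quad(s p o BG)."""
    g = ANY if merged else BG
    return {(q.s, q.p, q.o) for q in match(store, s, p, o, g)}


STORE = [
    Quad("check1", "f:payTo", "alice", BG),
    Quad("check1", "f:amount", "100", "http://accountant.example/bobsBills"),
    Quad("check1", "f:reference", "rent", "http://joe.example/accounts/Bob"),
]
```

With merged=True the pattern sees all three triples; with merged=False it sees
only the one held in the background graph.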

Case 2 : using managed graphs :

This has an explicit collection of graphs and hence has to specially
handle the case of "(s p o)" over the combined collection. This is how I
have done it for maximum flexibility for small-scale use.  I have
implemented both ways round but I happened to have access to a "union
graph" from Jena (it's used a lot in the ontology subsystem for owl:imports and
in the inference subsystem).

For the merged case, the background graph is just the union of the named
graphs.  Otherwise it is a plain, separate graph.  The work is in the union
graph (which was already written in my case) so the union case is slightly more
implementation work and in danger of being expensive at execution time.
But in the large-scale case, I'd move to internal quads for a whole
dataset anyway.
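
A rough Python sketch of the managed-graphs case follows. It is not Jena's
union graph: the union_view, find and background_graph names are hypothetical,
and graphs are modelled as plain sets of (s, p, o) tuples. It shows where the
execution-time cost comes from, since every lookup on the union walks every
member graph.

```python
from typing import Dict, Iterable, Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]


def union_view(graphs: Iterable[Set[Triple]]) -> Iterator[Triple]:
    """Lazily walk all member graphs, yielding each distinct triple once."""
    seen: Set[Triple] = set()
    for graph in graphs:
        for triple in graph:
            if triple not in seen:
                seen.add(triple)
                yield triple


def find(triples: Iterable[Triple], s: Optional[str] = None,
         p: Optional[str] = None, o: Optional[str] = None) -> Set[Triple]:
    """Match "(s p o)" against any iterable of triples; None is a wildcard."""
    wanted = (s, p, o)
    return {t for t in triples
            if all(w is None or w == v for w, v in zip(wanted, t))}


def background_graph(named: Dict[str, Set[Triple]],
                     separate: Optional[Set[Triple]] = None) -> Iterable[Triple]:
    """Merged case: union of the named graphs. Otherwise: the separate graph."""
    if separate is not None:
        return separate
    return union_view(named.values())


NAMED = {
    "http://accountant.example/bobsBills": {
        ("check1", "f:payTo", "alice"),
        ("check1", "f:amount", "100"),
    },
    "http://joe.example/accounts/Bob": {
        ("check1", "f:reference", "rent"),
        ("check1", "f:amount", "100"),
    },
}
```

Calling find(background_graph(NAMED), s="check1") answers over the merged
graphs, while passing a separate set as the second argument keeps the named
graphs out of plain "(s p o)" matching.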

Jena's bNodes have system-wide identity so this gives the meaning to the
union graph when the subgraphs have bNodes (I think a lot of systems do this as
well).  This is a decision in Jena - I can't see that the RDF specs say one way
or the other.

	Andy

> 
> If we decide to create named graphs outside the default KB, I feel we
> should make sure they can express the same aggregate semantics as we
> have adding graphs to the default KB. That is, by post-f2f4 parlance,
> FROM should take both a resource to read and a graph to create/append.
> For clarity, let's call it LOADINTO (called "read" in algae [ALGAE]):
> 
>   LOADINTO <http://...finace> <http://accountant.example/bobsBills>
>   LOADINTO <http://...finace> <http://joe.example/accounts/Bob>
>   WHERE { GRAPH <http://...finace> {
>                     (?check f:payTo ?payee)
>                     (?check f:amount ?amount)
>                     (?check f:reference ?refStr) } }
> 
> This allows the LOAD and LOADINTO to behave the same way, that is,
> queries act on a graph that is the aggregate of its inputs rather than
> a set of individual graphs.
> 
> Despite having put this in the algae language, I don't think this is
> worth specifying now. If the WG does want to specify this, I argue
> very strongly for going all the way with namable graphs. The named
> graphs in algae could also be named with unbound variables, which
> freed the querier of inventing safe names for the graphs:
> 
>   LOADINTO ?g1 <http://accountant.example/bobsBills>
>   LOADINTO ?g1 <http://joe.example/accounts/Bob>
>   WHERE { GRAPH ?g1 { (?check f:payTo ?payee)
>                       (?check f:amount ?amount)
>                       (?check f:reference ?refStr) } }
> 
> 
> On Thu, Jan 27, 2005 at 10:30:25AM -0500, Eric Prud'hommeaux wrote:
> > You may want COFFEE before plowing through this mail on LOAD, FROM and
> > GRAPH.
> > 
> > We have decided as a WG that we need access to provenance information
> > in the knowledge base (KB). Day 1 of the Helsinki (Espoo) face to face
> > ended with a bit of education and a debate on how to use that
> > provenance information. Here are, I believe, both sides of the
> > argument. I favor the SINGLE KB option outlined first:
> > 
> > 
> > == SINGLE KB ==
> > 
> > The simplest way to model provenance is to tag triples in the KB with
> > their origin. One can do this with a KB containing potentially
> > overlapping sets of triples [FORMULAS], a single set of quads [QUADS],
> > or a set of triples with a provenance list associated with each
> > triple. Regardless of the implementation, there is a single KB that
> > knows everything the system knows. (There are probably other practical
> > ways to do this as well.) A query like
> > 
> > 
> >   DEFAULT TRUST
> >   -------------
> >   PREFIX f : <http://accounting.example/schema#>
> >   SELECT ?payee, ?amount, ?refStr
> >     LOAD <http://accountant.example/bobsBills>
> >          <http://joe.example/accounts/Bob>
> >    WHERE { (?check f:payTo ?payee)
> > 	   (?check f:amount ?amount)
> > 	   (?check f:reference ?refStr) }
> > 
> > reads bobsBills and accounts/Bob into the KB where it is available for
> > matching the graphPattern
> >   (?check f:payTo ?payee)
> >   (?check f:amount ?amount)
> >   (?check f:reference ?refStr) .
> > 
> > Per discussions earlier in this WG, the graphPattern still matches
> > if (?check f:payTo ?payee) and (?check f:amount ?amount) come from
> > bobsBills and (?check f:reference ?refStr) comes from accounts/Bob.
> > 
> > If one only trusts statements from the accountant, one can phrase the
> > question as
> > 
> > 
> >   SINGLE TRUST DOMAIN
> >   -------------------
> >   PREFIX f : <http://accounting.example/schema#>
> >   PREFIX a : <http://accountant.example/>
> >   PREFIX h : <http://joe.example/accounts/>
> >   SELECT ?payee, ?amount, ?refStr
> >     LOAD a:bobsBills h:Bob
> >    WHERE { GRAPH a:bobsBills { (?check f:payTo ?payee)
> >                                (?check f:amount ?amount)
> >                                (?check f:reference ?refStr) } }
> > 
> > If, as is more likely, Bob trusts his accountant to write the name and
> > amount on the checks but lets Joe specify what is in the memo
> > field, he can write the query to reflect that predicated trust:
> > 
> > 
> >   PREDICATED TRUST
> >   ----------------
> >   PREFIX f : <http://accounting.example/schema#>
> >   PREFIX a : <http://accountant.example/>
> >   PREFIX h : <http://joe.example/accounts/>
> >   SELECT ?payee, ?amount, ?refStr
> >     LOAD a:bobsBills h:Bob
> >    WHERE { GRAPH a:bobsBills { (?check f:payTo ?payee)
> >                                (?check f:amount ?amount) }
> >            GRAPH h:Bob       { (?check f:reference ?refStr) } }
> > 
> > 
> >   SIDE EFFECTS
> >   ------------
> > In systems that continually learn (Ontaria, the Annotea database,
> > various Googles of the semantic web), the notion of a single trust
> > domain is dangerous. Without controlling what sort of data is in the
> > KB, one shouldn't let it write checks without you checking the
> > data. The users of the KBs listed above have practical queries "give
> > me the annotations for a page X, or, tell me about schema Y" that are
> > well served by "trusting" everyone for the application's notion of
> > trust. Users can rely on predicated trust when they require more
> > security.
> > 
> > 
> > 
> > == MULTIPLE KBS ==
> > 
> > TimBL raised the default trust issue [TIMBL]. The basic issue was that
> > the import of a resource into the KB implied trust in the
> > assertions from that document. We can presume one would not import a
> > document with no potentially interesting statements. Multiple KBs
> > provide a way to query a subset of the statements in a resource
> > without having other statements in that resource give us potentially
> > misleading information. For instance, a semantic google query of
> > documents about X should not cause us to believe everything we read on
> > the net.
> > 
> > This is accomplished by having a verb FROM <resource> that imports
> > data that is only matchable by graphPatterns that explicitly identify
> > that resource. Thus
> >     FROM h:Bob
> >    WHERE { (?check f:reference ?refStr) }
> > will not match any statements from h:Bob. Only
> >     FROM h:Bob
> >    WHERE { GRAPH h:Bob (?check f:reference ?refStr) }
> > will match those statements.
> > 
> > 
> >   DEFAULT TRUST
> >   -------------
> > Without using FROM and explicit GRAPH constraints, queries behave the
> > same as in the single KB model. The default trust query above will
> > still match (?check f:payTo ?payee), (?check f:amount ?amount) and
> > (?check f:reference ?refStr) coming from any combination of
> > a:bobsBills and h:Bob .
> > 
> > 
> >   SINGLE TRUST DOMAIN
> >   -------------------
> >   PREFIX f : <http://accounting.example/schema#>
> >   PREFIX a : <http://accountant.example/>
> >   PREFIX h : <http://joe.example/accounts/>
> >   SELECT ?payee, ?amount, ?refStr
> >     LOAD a:bobsBills
> >     FROM h:Bob
> >    WHERE { (?check f:payTo ?payee)
> > 	   (?check f:amount ?amount)
> > 	   (?check f:reference ?refStr) }
> > 
> > This simplifies trusting a single document. In addition, it makes it
> > possible to trust the interaction between triples in a LOAD'd document
> > with the triples in the default KB, while still not trusting triples
> > from FROM'd documents.
> > 
> > 
> >   PREDICATED TRUST
> >   ----------------
> > Partial trust of a set of documents behaves the same way in either
> > approach.
> > 
> > 
> >   SIDE EFFECTS
> >   ------------
> > 
> > One can't rely on the single trust domain model if the database allows
> > side effects. In fact, the user must specifically know that nothing
> > in the database could be harmful. One query could LOAD <X> into
> > the database and a subsequent query could use FROM <X>, expecting the
> > data from <X> to *not* be in the database.
> > 
> > 
> > 
> > == COMMENTS ==
> > 
> > The Multiple KBs approach creates alternate KBs, or, if you will, creates a
> > subset of the KB which graphPatterns without a GRAPH target can match.
> > (The difference is just a matter of what you call the KB.)
> > 
> > Multiple DB++
> > The single trust domain case is terser and more expressive in the
> > multiple DB. In order to access the interaction between LOAD'd triples
> > and the default DB, the service provider would have to provide a GRAPH
> > name for those triples.
> > 
> > Multiple DB--
> > 
> > It is either impossible to query interaction between triples from
> > FROM'd documents, or it is at least ill-defined. Should
> >   PREFIX f : <http://accounting.example/schema#>
> >   PREFIX a : <http://accountant.example/>
> >   PREFIX h : <http://joe.example/accounts/>
> >   SELECT ?payee, ?amount, ?refStr
> >     FROM a:bobsBills h:Bob
> >    WHERE { GRAPH ?d { (?check f:payTo ?payee)
> > 		      (?check f:amount ?amount)
> > 		      (?check f:reference ?refStr) }
> >     (?d URI= a:bobsBills || ?d URI= h:Bob)
> > ask the graphPattern of both documents, or of the aggregation of
> > those documents? If the former, users will have to be aware that
> > the interaction behavior of FROM is different and less expressive.
> > If that idiom forces the aggregation of the two documents, the
> > burden on implementations is much higher as they need to both
> > detect the aggregation patterns and create an arbitrary number of
> > aggregate KBs (rather than have a single KB and enforce GRAPH
> > constraints as simple row restrictions (also inscrutably called
> > "SELECT" in relational algebra)).
> > 
> > In short, I don't think that the multiple DB approach is worth the
> > implementation/specification burden. I doubt that the aggregation of
> > FROM'd graphs is the only screw case I can come up with.
> > 
> > 
> > [FORMULAS] http://www.w3.org/2001/12/attributions/#formulas
> > [QUADS] http://www.w3.org/2001/12/attributions/#quads
> > [TIMBL] http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2004Nov/0020.html
> [ALGAE] http://www.w3.org/2004/05/06-Algae/#doc-algae-slurpStr
> --
> -eric
> 
> office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
>                         Shonan Fujisawa Campus, Keio University,
>                         5322 Endo, Fujisawa, Kanagawa 252-8520
>                         JAPAN
>         +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
> cell:   +81.90.6533.3882
> 
> (eric@w3.org)
> Feel free to forward this message to any list for any purpose other than
> email address distribution.
Received on Thursday, 3 February 2005 11:44:50 UTC