LOAD, FROM, GRAPH from Eric Prud'hommeaux on 2005-01-28 (public-rdf-dawg@w3.org from January to March 2005)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Fri, 28 Jan 2005 11:26:19 -0500
To: Pat Hayes <phayes@ihmc.us>
Cc: public-rdf-dawg@w3.org
Message-ID: <20050128162619.GB3069@w3.org>
I argue below that there are implementation/specification costs to
having a default graph that does not encompass the universe of
knowledge. I also feel that it is architecturally purest if the KB
represents what we know of the semantic web.

If we decide to create named graphs outside the default KB, I feel we
should make sure they can express the same aggregate semantics as we
have when adding graphs to the default KB. That is, by post-f2f4 parlance,
FROM should take both a resource to read and a graph to create/append.
For clarity, let's call it LOADINTO (called "read" in algae [ALGAE]):

  LOADINTO <http://...finace> <http://accountant.example/bobsBills>
  LOADINTO <http://...finace> <http://joe.example/accounts/Bob>
  WHERE { GRAPH <http://...finace> {
                      (?check f:payTo ?payee)
                      (?check f:amount ?amount)
                      (?check f:reference ?refStr) } }

This allows the LOAD and LOADINTO to behave the same way, that is,
queries act on a graph that is the aggregate of its inputs rather than
a set of individual graphs.
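
To make the aggregate behaviour concrete, here is a minimal Python
sketch, not algae's implementation. The triples in DOCUMENTS, the
graph name urn:example:finance and the helper names load_into and
match_checks are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the two documents; a real system
# would fetch and parse them.
DOCUMENTS = {
    "http://accountant.example/bobsBills": [
        ("check1", "f:payTo", "Acme Power"),
        ("check1", "f:amount", "42.00"),
    ],
    "http://joe.example/accounts/Bob": [
        ("check1", "f:reference", "January bill"),
    ],
}

named_graphs = defaultdict(set)  # graph name -> set of (s, p, o) triples


def load_into(graph_name, resource):
    """Append the triples of `resource` to `graph_name`, creating it if needed."""
    named_graphs[graph_name].update(DOCUMENTS[resource])


def match_checks(graph_name):
    """Match the payTo/amount/reference pattern against one named graph.

    Simplification: assumes at most one value per predicate per check.
    """
    wanted = ("f:payTo", "f:amount", "f:reference")
    by_check = defaultdict(dict)
    for s, p, o in named_graphs[graph_name]:
        by_check[s][p] = o
    return sorted(
        tuple(props[p] for p in wanted)
        for props in by_check.values()
        if all(p in props for p in wanted)
    )


FINANCE = "urn:example:finance"  # hypothetical graph name
load_into(FINANCE, "http://accountant.example/bobsBills")
load_into(FINANCE, "http://joe.example/accounts/Bob")
print(match_checks(FINANCE))
```

The pattern matches only because both documents were appended to the
same graph; a graph holding bobsBills alone has no f:reference triple
and yields no rows.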

Despite having put this in the algae language, I don't think this is
worth specifying now. If the WG does want to specify this, I argue
very strongly for going all the way with namable graphs. The named
graphs in algae could also be named with unbound variables, which
freed the querier from inventing safe names for the graphs:

  LOADINTO ?g1 <http://accountant.example/bobsBills>
  LOADINTO ?g1 <http://joe.example/accounts/Bob>
  WHERE { GRAPH ?g1 { (?check f:payTo ?payee)
                      (?check f:amount ?amount)
                      (?check f:reference ?refStr) } }
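
One way to picture the variable-named graphs is as a pre-pass that
binds each unbound graph variable to a fresh name. The Python below is
only a sketch of that idea; bind_graph_variables and the urn:uuid:
naming scheme are assumptions, not what algae does internally.

```python
import uuid


def bind_graph_variables(statements, bindings=None):
    """Replace ?variables in the graph position with fresh, unique names.

    `statements` is a list of (graph_term, resource) pairs, one per
    LOADINTO line. The same variable always maps to the same fresh name.
    """
    bindings = {} if bindings is None else bindings
    resolved = []
    for graph_term, resource in statements:
        if graph_term.startswith("?"):
            if graph_term not in bindings:
                # Hypothetical naming scheme: any collision-free name would do.
                bindings[graph_term] = "urn:uuid:" + str(uuid.uuid4())
            graph_term = bindings[graph_term]
        resolved.append((graph_term, resource))
    return resolved, bindings


resolved, bindings = bind_graph_variables([
    ("?g1", "http://accountant.example/bobsBills"),
    ("?g1", "http://joe.example/accounts/Bob"),
])
print(resolved)
```

Both LOADINTO lines end up targeting the same generated graph, so the
GRAPH ?g1 pattern sees the aggregate of the two documents.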


On Thu, Jan 27, 2005 at 10:30:25AM -0500, Eric Prud'hommeaux wrote:
> You may want COFFEE before plowing through this mail on LOAD, FROM and
> GRAPH.
> 
> We have decided as a WG that we need access to provenance information
> in the knowledge base (KB). Day 1 of the Helsinki (Espoo) face to face
> ended with a bit of education and a debate on how to use that
> provenance information. Here are, I believe, both sides of the
> argument. I favor the SINGLE KB option outlined first:
> 
> 
> == SINGLE KB ==
> 
> The simplest way to model provenance is to tag triples in the KB with
> their origin. One can do this with a KB containing potentially
> overlapping sets of triples [FORMULAS], a single set of quads [QUADS],
> or a set of triples with a provenance list associated with each
> triple. Regardless of the implementation, there is a single KB that
> knows everything the system knows. (There are probably other practical
> ways to do this as well.) A query like
> 
> 
>   DEFAULT TRUST
>   -------------
>   PREFIX f : <http://accounting.example/schema#>
>   SELECT ?payee, ?amount, ?refStr
>     LOAD <http://accountant.example/bobsBills>
>          <http://joe.example/accounts/Bob>
>    WHERE { (?check f:payTo ?payee)
> 	   (?check f:amount ?amount)
> 	   (?check f:reference ?refStr) }
> 
> reads bobsBills and accounts/Bob into the KB where it is available for
> matching the graphPattern
>   (?check f:payTo ?payee)
>   (?check f:amount ?amount)
>   (?check f:reference ?refStr) .
> 
> Per discussions earlier in this WG, the graphPattern still matches
> if (?check f:payTo ?payee) and (?check f:amount ?amount) come from
> bobsBills and (?check f:reference ?refStr) comes from accounts/Bob.
> 
> If one only trusts statements from the accountant, one can phrase the
> question as
> 
> 
>   SINGLE TRUST DOMAIN
>   -------------------
>   PREFIX f : <http://accounting.example/schema#>
>   PREFIX a : <http://accountant.example/>
>   PREFIX h : <http://joe.example/accounts/>
>   SELECT ?payee, ?amount, ?refStr
>     LOAD a:bobsBills h:Bob
>    WHERE { GRAPH a:bobsBills { (?check f:payTo ?payee)
>                                (?check f:amount ?amount)
>                                (?check f:reference ?refStr) } }
> 
> If, as is more likely, Bob trusts his accountant to write the name and
> amount on the checks but lets Joe specify what is in the memo
> field, he can write the query to reflect that predicated trust:
> 
> 
>   PREDICATED TRUST
>   ----------------
>   PREFIX f : <http://accounting.example/schema#>
>   PREFIX a : <http://accountant.example/>
>   PREFIX h : <http://joe.example/accounts/>
>   SELECT ?payee, ?amount, ?refStr
>     LOAD a:bobsBills h:Bob
>    WHERE { GRAPH a:bobsBills { (?check f:payTo ?payee)
>                                (?check f:amount ?amount) }
>            GRAPH h:Bob       { (?check f:reference ?refStr) } }
> 
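
The three trust levels can be sketched over a single set of quads,
with a GRAPH constraint acting as a plain row restriction. This is an
illustrative Python toy, not a proposed implementation; the sample
data (including Joe's extra f:payTo statement) and the helper names
rows and checks are assumptions.

```python
A = "http://accountant.example/bobsBills"
H = "http://joe.example/accounts/Bob"

# Single KB of quads: (subject, predicate, object, source document).
KB = [
    ("check1", "f:payTo", "Acme Power", A),
    ("check1", "f:amount", "42.00", A),
    ("check1", "f:reference", "January bill", H),
    ("check1", "f:payTo", "Joe", H),  # a statement Bob may not want to trust
]


def rows(predicate, graph=None):
    """Rows for one triple pattern; `graph` restricts rows by their source."""
    return [
        (s, o)
        for s, p, o, g in KB
        if p == predicate and (graph is None or g == graph)
    ]


def checks(pay_graph=None, ref_graph=None):
    """Join the three patterns on ?check, with optional GRAPH restrictions."""
    return sorted(
        (payee, amount, ref)
        for c1, payee in rows("f:payTo", pay_graph)
        for c2, amount in rows("f:amount", pay_graph)
        for c3, ref in rows("f:reference", ref_graph)
        if c1 == c2 == c3
    )


default_trust = checks()          # any source may supply any triple
single_domain = checks(A, A)      # everything must come from the accountant
predicated_trust = checks(A, H)   # payee/amount from A, memo from H
print(default_trust, single_domain, predicated_trust)
```

With this data, default trust also returns the row built from Joe's
f:payTo statement, the single trust domain returns nothing (the
accountant supplies no memo), and predicated trust returns only the
accountant's payee with Joe's memo.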
> 
>   SIDE EFFECTS
>   ------------
> In systems that continually learn (Ontaria, the Annotea database,
> various Googles of the semantic web), the notion of a single trust
> domain is dangerous. Without controlling what sort of data is in the
> KB, one shouldn't let it write checks without you checking the
> data. The users of the KBs listed above have practical queries "give
> me the annotations for a page X, or, tell me about schema Y" that are
> well served by "trusting" everyone for the application's notion of
> trust. Users can rely on predicated trust when they require more
> security.
> 
> 
> 
> == MULTIPLE KBS ==
> 
> TimBL raised the default trust issue [TIMBL]. The basic issue was that
> the import of a resource into the KB implied trust in the
> assertions from that document. We can presume one would not import a
> document with no potentially interesting statements. Multiple KBs
> provides a way to query a subset of the statements in a resource
> without having other statements in that resource give us potentially
> misleading information. For instance, a semantic google query of
> documents about X should not cause us to believe everything we read on
> the net.
> 
> This is accomplished by having a verb FROM <resource> that imports
> data that is only matchable by graphPatterns that explicitly identify
> that resource. Thus
>     FROM h:Bob
>    WHERE { (?check f:reference ?refStr) }
> will not match any statements from h:Bob. Only
>     FROM h:Bob
>    WHERE { GRAPH h:Bob { (?check f:reference ?refStr) } }
> will match those statements.
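
The visibility rule for FROM can be sketched as a dataset with one
default graph plus named graphs. The Python below is a toy under
stated assumptions: the Dataset class and its method names are
hypothetical, and it assumes LOAD'd documents stay addressable by
GRAPH as well as being merged into the default graph.

```python
class Dataset:
    """Toy model of the multiple-KB proposal."""

    def __init__(self):
        self.default = set()  # triples visible to patterns without GRAPH
        self.named = {}       # resource name -> set of triples

    def load(self, name, triples):
        """LOAD: merge into the default graph and keep the named copy."""
        self.default |= set(triples)
        self.named[name] = set(triples)

    def from_(self, name, triples):
        """FROM: keep the triples only in their own named graph."""
        self.named[name] = set(triples)

    def match(self, predicate, graph=None):
        """Match (?s predicate ?o) in the default graph or a named graph."""
        source = self.default if graph is None else self.named.get(graph, set())
        return sorted((s, o) for s, p, o in source if p == predicate)


H_BOB = "http://joe.example/accounts/Bob"
ds = Dataset()
ds.from_(H_BOB, [("check1", "f:reference", "January bill")])
print(ds.match("f:reference"))                # pattern without GRAPH
print(ds.match("f:reference", graph=H_BOB))   # pattern inside GRAPH h:Bob
```

The pattern without GRAPH finds nothing because FROM never touches
the default graph; only the GRAPH-qualified pattern sees the memo.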
> 
> 
>   DEFAULT TRUST
>   -------------
> Without using FROM and explicit GRAPH constraints, queries behave the
> same as in the single KB model. The default trust query above will
> still match (?check f:payTo ?payee), (?check f:amount ?amount) and
> (?check f:reference ?refStr) coming from any combination of
> a:bobsBills and h:Bob .
> 
> 
>   SINGLE TRUST DOMAIN
>   -------------------
>   PREFIX f : <http://accounting.example/schema#>
>   PREFIX a : <http://accountant.example/>
>   PREFIX h : <http://joe.example/accounts/>
>   SELECT ?payee, ?amount, ?refStr
>     LOAD a:bobsBills
>     FROM h:Bob
>    WHERE { (?check f:payTo ?payee)
> 	   (?check f:amount ?amount)
> 	   (?check f:reference ?refStr) }
> 
> This simplifies trusting a single document. In addition, it makes it
> possible to trust the interaction between triples in a LOAD'd document
> with the triples in the default KB, while still not trusting triples
> from FROM'd documents.
> 
> 
>   PREDICATED TRUST
>   ----------------
> Partial trust of a set of documents behaves the same way in either
> approach.
> 
> 
>   SIDE EFFECTS
>   ------------
> 
> One can't rely on the single trust domain model if the database allows
> side effects. In fact, the user must specifically know that nothing
> in the database could be harmful. One query could LOAD <X> into
> the database and a subsequent query could use FROM <X>, expecting the
> data from <X> to *not* be in the database.
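
The hazard can be reproduced with a toy KB whose default graph
persists between queries. This Python sketch is illustrative only;
PersistentKB, its query signature and the sample triple are
assumptions.

```python
class PersistentKB:
    """Toy KB whose default graph survives between queries."""

    def __init__(self):
        self.default = set()
        self.named = {}

    def query(self, predicate, load=None, from_=None):
        """Answer a GRAPH-less pattern; `load`/`from_` map names to triples."""
        for name, triples in (load or {}).items():
            self.default |= set(triples)     # side effect: persists
            self.named[name] = set(triples)
        for name, triples in (from_ or {}).items():
            self.named[name] = set(triples)  # kept out of the default graph
        return sorted(o for s, p, o in self.default if p == predicate)


X_DOC = {"urn:example:X": [("check1", "f:reference", "January bill")]}

kb = PersistentKB()
before = kb.query("f:reference", from_=X_DOC)  # FROM alone: X stays hidden
kb.query("f:reference", load=X_DOC)            # someone else LOADs X
after = kb.query("f:reference", from_=X_DOC)   # same FROM query, X now leaks
print(before, after)
```

The second FROM query is textually identical to the first, yet it now
matches data from X because an earlier LOAD left it in the default
graph.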
> 
> 
> 
> == COMMENTS ==
> 
> The multiple KBs approach creates alternate KBs, or, if you will, creates a
> subset of the KB which graphPatterns without a GRAPH target can match.
> (The difference is just a matter of what you call the KB.)
> 
> Multiple DB++
> The single trust domain case is terser and more expressive in the
> multiple DB. In order to access the interaction between LOAD'd triples
> and the default DB, the service provider would have to provide a GRAPH
> name for those triples.
> 
> Multiple DB--
> 
> It is either impossible to query interaction between triples from
> FROM'd documents, or it is at least ill-defined. Should 
>   PREFIX f : <http://accounting.example/schema#>
>   PREFIX a : <http://accountant.example/>
>   PREFIX h : <http://joe.example/accounts/>
>   SELECT ?payee, ?amount, ?refStr
>     FROM a:bobsBills h:Bob
>    WHERE { GRAPH ?d { (?check f:payTo ?payee)
>                       (?check f:amount ?amount)
>                       (?check f:reference ?refStr) } }
>     (?d URI= a:bobsBills || ?d URI= h:Bob)
> ask the graphPattern of both documents, or of the aggregation of
> those documents? If the former, users will have to be aware that
> the interaction behavior of FROM is different and less expressive.
> If that idiom forces the aggregation of the two documents, the
> burden on implementations is much higher as they need to both
> detect the aggregation patterns and create an arbitrary number of
> aggregate KBs (rather than have a single KB and enforce GRAPH
> constraints as simple row restrictions, also inscrutably called
> "SELECT" in relational algebra).
> 
> In short, I don't think the multiple DB approach is worth the
> implementation/specification burden. I doubt that the aggregation of
> FROM'd graphs is the only screw case I can come up with.
> 
> 
> [FORMULAS] http://www.w3.org/2001/12/attributions/#formulas
> [QUADS] http://www.w3.org/2001/12/attributions/#quads
> [TIMBL] http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2004Nov/0020.html
[ALGAE] http://www.w3.org/2004/05/06-Algae/#doc-algae-slurpStr
-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Friday, 28 January 2005 16:26:20 UTC