LOAD, FROM, GRAPH and COFFEE from Eric Prud'hommeaux on 2005-01-27 (public-rdf-dawg@w3.org from January to March 2005)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Thu, 27 Jan 2005 10:30:25 -0500
To: public-rdf-dawg@w3.org
Message-ID: <20050127153025.GA28735@w3.org>
You may want COFFEE before plowing through this mail on LOAD, FROM and
GRAPH.

We have decided as a WG that we need access to provenance information
in the knowledge base (KB). Day 1 of the Helsinki (Espoo) face to face
ended with a bit of education and a debate on how to use that
provenance information. Here are, I believe, both sides of the
argument. I favor the SINGLE KB option outlined first:


== SINGLE KB ==

The simplest way to model provenance is to tag triples in the KB with
their origin. One can do this with a KB containing potentially
overlapping sets of triples [FORMULAS], a single set of quads [QUADS],
or a set of triples with a provenance list associated with each
triple. Regardless of the implementation, there is a single KB that
knows everything the system knows. (There are probably other practical
ways to do this as well.) A query like


  DEFAULT TRUST
  -------------
  PREFIX f : <http://accounting.example/schema#>
  SELECT ?payee, ?amount, ?refStr
    LOAD <http://accountant.example/bobsBills>
         <http://joe.example/accounts/Bob>
   WHERE { (?check f:payTo ?payee)
	   (?check f:amount ?amount)
	   (?check f:reference ?refStr) }

reads bobsBills and accounts/Bob into the KB, where they are available
for matching the graphPattern
  (?check f:payTo ?payee)
  (?check f:amount ?amount)
  (?check f:reference ?refStr) .

Per discussions earlier in this WG, the graphPattern still matches
if (?check f:payTo ?payee) and (?check f:amount ?amount) come from
bobsBills and (?check f:reference ?refStr) comes from accounts/Bob.

If one only trusts statements from the accountant, one can phrase the
question as


  SINGLE TRUST DOMAIN
  -------------------
  PREFIX f : <http://accounting.example/schema#>
  PREFIX a : <http://accountant.example/>
  PREFIX h : <http://joe.example/accounts/>
  SELECT ?payee, ?amount, ?refStr
    LOAD a:bobsBills h:Bob
   WHERE { GRAPH a:bobsBills { (?check f:payTo ?payee)
                               (?check f:amount ?amount)
                               (?check f:reference ?refStr) } }

If, as is more likely, Bob trusts his accountant to write the name and
amount on the checks but lets Joe specify what is in the memo
field, he can write the query to reflect that predicated trust:


  PREDICATED TRUST
  ----------------
  PREFIX f : <http://accounting.example/schema#>
  PREFIX a : <http://accountant.example/>
  PREFIX h : <http://joe.example/accounts/>
  SELECT ?payee, ?amount, ?refStr
    LOAD a:bobsBills h:Bob
   WHERE { GRAPH a:bobsBills { (?check f:payTo ?payee)
                               (?check f:amount ?amount) }
           GRAPH h:Bob       { (?check f:reference ?refStr) } }
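
The three single-KB queries differ only in which source each triple
pattern may draw from. Here is a minimal Python sketch of that idea,
assuming a hypothetical in-memory list of quads. The check data, the
ex:check1 identifier and the solutions/check_query helpers are all
invented for illustration and are not part of any proposal:

```python
# Hypothetical quads: (subject, predicate, object, source document).
QUADS = [
    ("ex:check1", "f:payTo", "Acme", "a:bobsBills"),
    ("ex:check1", "f:amount", "100", "a:bobsBills"),
    ("ex:check1", "f:reference", "invoice 7", "h:Bob"),
]

def solutions(patterns, quads, binding=None):
    """Return bindings satisfying every (s, p, o, graph) pattern.

    Terms starting with '?' are variables; graph None means "any source".
    """
    binding = binding or {}
    if not patterns:
        return [binding]
    (*terms, graph), rest = patterns[0], patterns[1:]
    found = []
    for *values, source in quads:
        if graph is not None and graph != source:
            continue  # GRAPH constraint acting as a simple row restriction
        new = dict(binding)
        if all(new.setdefault(t, v) == v if t.startswith("?") else t == v
               for t, v in zip(terms, values)):
            found.extend(solutions(rest, quads, new))
    return found

def check_query(pay_graph, ref_graph):
    """Run the check query with the given per-pattern source restrictions."""
    return solutions([
        ("?check", "f:payTo", "?payee", pay_graph),
        ("?check", "f:amount", "?amount", pay_graph),
        ("?check", "f:reference", "?refStr", ref_graph),
    ], QUADS)

default_trust = check_query(None, None)
single_domain = check_query("a:bobsBills", "a:bobsBills")
predicated = check_query("a:bobsBills", "h:Bob")
```

With this sample data the single trust domain query finds nothing,
because the accountant's document carries no f:reference triple, while
the default and predicated trust queries each find the one check.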


  SIDE EFFECTS
  ------------
In systems that continually learn (Ontaria, the Annotea database,
various Googles of the semantic web), the notion of a single trust
domain is dangerous. Unless one controls what sort of data goes into
the KB, one shouldn't let it write checks without first checking the
data. The users of the KBs listed above have practical queries ("give
me the annotations for a page X", or "tell me about schema Y") that
are well served by "trusting" everyone for the application's notion of
trust. Users can rely on predicated trust when they require more
security.



== MULTIPLE KBS ==

TimBL raised the default trust issue [TIMBL]. The basic issue was that
the import of a resource into the KB implied trust in the
assertions from that document. We can presume one would not import a
document with no potentially interesting statements. Multiple KBs
provide a way to query a subset of the statements in a resource
without having other statements in that resource give us potentially
misleading information. For instance, a semantic google query of
documents about X should not cause us to believe everything we read on
the net.

This is accomplished by having a verb FROM <resource> that imports
data that is only matchable by graphPatterns that explicitly identify
that resource. Thus
    FROM h:Bob
   WHERE { (?check f:reference ?refStr) }
will not match any statements from h:Bob. Only
    FROM h:Bob
   WHERE { GRAPH h:Bob { (?check f:reference ?refStr) } }
will match those statements.
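
The difference between the two verbs can be sketched in a few lines of
Python. This assumes a hypothetical store with one default KB plus one
named KB per resource; the names load, from_ and match and the sample
triple are illustrative only, and the sketch further assumes that
LOAD'd data stays addressable by name:

```python
# Hypothetical store: a default KB plus one named KB per resource.
default_kb = set()
named_kbs = {}

def load(name, triples):
    """LOAD: triples join the default KB and stay addressable by name."""
    default_kb.update(triples)
    named_kbs[name] = set(triples)

def from_(name, triples):
    """FROM: triples are reachable only through an explicit GRAPH name."""
    named_kbs[name] = set(triples)

def match(predicate, graph=None):
    """Return triples with this predicate; graph=None reads the default KB."""
    kb = default_kb if graph is None else named_kbs.get(graph, set())
    return sorted(t for t in kb if t[1] == predicate)

from_("h:Bob", [("ex:check1", "f:reference", "invoice 7")])
without_graph = match("f:reference")               # no GRAPH target
with_graph = match("f:reference", graph="h:Bob")   # GRAPH h:Bob { ... }
```

Here without_graph comes back empty and with_graph holds the one
f:reference triple, mirroring the two queries above.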


  DEFAULT TRUST
  -------------
Without using FROM and explicit GRAPH constraints, queries behave the
same as in the single KB model. The default trust query above will
still match (?check f:payTo ?payee), (?check f:amount ?amount) and
(?check f:reference ?refStr) coming from any combination of
a:bobsBills and h:Bob .


  SINGLE TRUST DOMAIN
  -------------------
  PREFIX f : <http://accounting.example/schema#>
  PREFIX a : <http://accountant.example/>
  PREFIX h : <http://joe.example/accounts/>
  SELECT ?payee, ?amount, ?refStr
    LOAD a:bobsBills
    FROM h:Bob
   WHERE { (?check f:payTo ?payee)
	   (?check f:amount ?amount)
	   (?check f:reference ?refStr) }

This simplifies trusting a single document. In addition, it makes it
possible to trust the interaction between triples in a LOAD'd document
with the triples in the default KB, while still not trusting triples
from FROM'd documents.


  PREDICATED TRUST
  ----------------
Partial trust of a set of documents behaves the same way in either
approach.


  SIDE EFFECTS
  ------------

One can't rely on the single trust domain model if the database allows
side effects. In fact, the user must specifically know that nothing
in the database could be harmful. One query could LOAD <X> into
the database and a subsequent query could use FROM <X>, expecting the
data from <X> to *not* be in the database.
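
A small Python sketch makes the hazard concrete. It assumes a
hypothetical Store that persists between queries; the DOCS content,
the class and its query method are invented for illustration:

```python
# Hypothetical content of the document <X>.
DOCS = {"X": {("ex:check1", "f:payTo", "Mallory")}}

class Store:
    """A KB that keeps what earlier queries imported (side effects)."""

    def __init__(self):
        self.default = set()   # matched by patterns without a GRAPH target
        self.named = {}        # matched only via an explicit GRAPH name

    def query(self, load=(), from_=()):
        """Import the listed documents, then return what a pattern
        without a GRAPH target is able to match."""
        for name in load:
            self.default |= DOCS[name]
            self.named[name] = DOCS[name]
        for name in from_:
            self.named[name] = DOCS[name]
        return set(self.default)

# On a fresh store, FROM <X> keeps X out of the default KB.
fresh = Store().query(from_=["X"])

# On a shared store, an earlier LOAD <X> has already put X there.
shared = Store()
shared.query(load=["X"])
leaked = shared.query(from_=["X"])
```

The second FROM query sees X's triples in the default KB (leaked is
non-empty) even though its author asked for them to be kept out.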



== COMMENTS ==

The multiple KBs approach creates alternate KBs, or, if you will, creates a
subset of the KB which graphPatterns without a GRAPH target can match.
(The difference is just a matter of what you call the KB.)

Multiple DB++
The single trust domain case is terser and more expressive in the
multiple DB approach. With a single KB, in order to access the
interaction between LOAD'd triples and the default DB, the service
provider would have to provide a GRAPH name for those triples.

Multiple DB--

It is either impossible to query interaction between triples from
FROM'd documents, or it is at least ill-defined. Should 
  PREFIX f : <http://accounting.example/schema#>
  PREFIX a : <http://accountant.example/>
  PREFIX h : <http://joe.example/accounts/>
  SELECT ?payee, ?amount, ?refStr
    FROM a:bobsBills h:Bob
   WHERE { GRAPH ?d { (?check f:payTo ?payee)
                      (?check f:amount ?amount)
                      (?check f:reference ?refStr) }
           (?d URI= a:bobsBills || ?d URI= h:Bob) }
ask the graphPattern of each document separately, or of the aggregation
of those documents? If the former, users will have to be aware that
the interaction behavior of FROM is different and less expressive.
If that idiom forces the aggregation of the two documents, the
burden on implementations is much higher as they need to both
detect the aggregation patterns and create an arbitrary number of
aggregate KBs (rather than have a single KB and enforce GRAPH
constraints as simple row restrictions (also inscrutably called
"SELECT" in relational algebra)).

In short, I don't think that the multiple DB approach is worth the
implementation/specification burden. I doubt that the aggregation of
FROM'd graphs is the only screw case I can come up with.
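
The two readings of the FROM a:bobsBills h:Bob query under Multiple
DB-- can give different answers. A minimal Python sketch, assuming
hypothetical document contents and an illustrative solutions helper
(none of these names come from the proposals):

```python
# Hypothetical contents of the two FROM'd documents.
DOCS = {
    "a:bobsBills": [("ex:check1", "f:payTo", "Acme"),
                    ("ex:check1", "f:amount", "100")],
    "h:Bob": [("ex:check1", "f:reference", "invoice 7")],
}

PATTERN = [
    ("?check", "f:payTo", "?payee"),
    ("?check", "f:amount", "?amount"),
    ("?check", "f:reference", "?refStr"),
]

def solutions(patterns, triples, binding=None):
    """Return variable bindings that satisfy every pattern in triples."""
    binding = binding or {}
    if not patterns:
        return [binding]
    found = []
    for triple in triples:
        new = dict(binding)
        if all(new.setdefault(t, v) == v if t.startswith("?") else t == v
               for t, v in zip(patterns[0], triple)):
            found.extend(solutions(patterns[1:], triples, new))
    return found

# Reading 1: ask the graphPattern of each document on its own.
per_document = {name: solutions(PATTERN, triples)
                for name, triples in DOCS.items()}

# Reading 2: ask it of the aggregation of the documents.
aggregate = solutions(PATTERN,
                      [t for triples in DOCS.values() for t in triples])
```

With this data, per-document matching finds nothing in either
document, while the aggregate yields the one check, so the choice of
reading changes the result set.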


[FORMULAS] http://www.w3.org/2001/12/attributions/#formulas
[QUADS] http://www.w3.org/2001/12/attributions/#quads
[TIMBL] http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2004Nov/0020.html
-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Thursday, 27 January 2005 15:30:26 UTC