Proposal: querying untrusted graphs

-- Situation

This note describes one way in which we might change the SPARQL
query spec to handle untrusted graphs.  It does not provide
everything - graphs cannot be loaded mid-query based on earlier
parts of the pattern matching process.  A query executes only in
the environment set at the start of query execution.

See also: UC&R:
http://www.w3.org/2001/sw/DataAccess/UseCases#d4.2

There are some straw poll questions at the end of this message.

-- Conceptual changes

A query is executed against a single unnamed graph (the default graph)
and a collection of named graphs.  The change is that there is no merge
relationship between the default graph and the named graphs.

Call this default graph plus named graphs an "RDF dataset".
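
To make the "no merge" point concrete, here is a minimal Python sketch of such a
dataset.  The class name Dataset, its methods, and the example URIs and triples
are all hypothetical, invented for illustration; nothing here is taken from rq23.

```python
from dataclasses import dataclass, field


def _match(graph, s, p, o):
    """Return the triples in `graph` matching the pattern; None is a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


@dataclass
class Dataset:
    """Hypothetical model: one unnamed default graph plus named graphs."""
    default: set = field(default_factory=set)   # set of (s, p, o) tuples
    named: dict = field(default_factory=dict)   # graph name -> set of triples

    def match_default(self, s=None, p=None, o=None):
        # Only the default graph is consulted; named graphs are NOT merged in.
        return _match(self.default, s, p, o)

    def match_named(self, name, s=None, p=None, o=None):
        # A named graph is only reached by asking for it by name.
        return _match(self.named.get(name, set()), s, p, o)


# Illustrative data only.
ds = Dataset()
ds.default.add(("ex:alice", "foaf:name", "Alice"))
ds.named["http://example.org/w1"] = {("ex:bob", "foaf:name", "Bob")}

print(ds.match_default(p="foaf:name"))
print(ds.match_named("http://example.org/w1", p="foaf:name"))
```

A pattern matched against the default graph never sees the triple about ex:bob;
it is only found by naming the graph that holds it.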

UC&R Design Objective 4.2 points 1 & 2 are covered.  The 3rd point
suggests a single (trusted) interpretation when specifying multiple graphs
(while it does not actually say that the sources need be available in the
aggregation/merge case, I think this was the intent).

The test framework needs changing to express this but it needs changing anyway.

SOURCE is unchanged; much of this message is examples of changes to FROM.

-- Usages

There are two classes of use I have in mind: querying against existing
datasets and querying against ad hoc datasets.

Large data providers may well have a fixed dataset that is implicit in
using the query service.  The dataset is not named in the query - no FROM
needed.  The only possible use of FROM is restriction within the graphs
already available at the service point and even this is optional.

The other case is more query-as-script: the dataset is built for the
query and so there is a need for some construction mechanism to describe it.

-- SOURCE Changes

None.  SOURCE works as it always does - it accesses labels of graphs.

It's the thing being accessed that has changed.

The examples in rq23 9.1 and 9.2 need to change though.  The current
example in 9.3 looks OK.

-- FROM Changes

If we want syntax for construction of the dataset, then we have to consider
placing graphs in the dataset and defining the default graph.

Version 1:
Uses new keywords to define the dataset.
    FROM for adding to the default graph
    GRAPH to add a (named) graph

Version 2:
Uses a compact syntax in the FROM clause.

The design should work well in the simple cases, and be tolerable for more
complex examples (and it need not cover all cases) - I'm assuming that the
more complicated setups would be the ones where the dataset is passed in
from the query context, and less often defined by the query.  In other
words, if your dataset definitions are longer than your query patterns, it
may be time for a redesign :-)

Use of a URI means "the graph associated with" - not necessarily
"load current"; it does not imply access at query time. It may be a
restriction over the graphs in the query context and causes an error
if it can't be satisfied.
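
One possible reading of "restriction over the graphs in the query context" is
sketched below in Python.  The function restrict, the DatasetError exception and
the shape of the context (a dict from URI to a set of triples) are assumptions
made for this illustration, not part of any proposal text.

```python
class DatasetError(Exception):
    """Raised when a graph named in the query cannot be supplied."""


def restrict(available, default_uris, named_uris):
    """Select the query's dataset from graphs the service already holds.

    available    -- dict: URI -> set of (s, p, o) triples (the query context)
    default_uris -- URIs whose graphs go into the default graph
    named_uris   -- URIs whose graphs are kept as named graphs

    Nothing is fetched here: a URI only selects "the graph associated with" it.
    """
    wanted = list(default_uris) + list(named_uris)
    missing = [u for u in wanted if u not in available]
    if missing:
        # The request cannot be satisfied, so the query fails up front.
        raise DatasetError("no graph associated with: " + ", ".join(missing))

    default = set()
    for u in default_uris:
        # Simplification: plain set union.  A true RDF merge would also keep
        # blank nodes from different graphs apart.
        default |= available[u]
    named = {u: set(available[u]) for u in named_uris}
    return default, named
```

Because the check happens before any pattern matching, this fits the constraint
that the environment is fixed at the start of query execution.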

-- Keyword syntax for datasets

Examples:

# Ex1 - put the graph identified by <u1> in the default graph
FROM <u1>

# Ex2 - put the graphs identified by <u1> and <u2> in the default graph
FROM <u1> <u2>

That is, merge them into the default graph.  Unnamed.  Other RDF triples may
be present in the default graph.

# Ex3 - use graphs associated with <w1> and <w2>
# as named graphs with names <w1> and <w2>

GRAPH <w1> <w2>

-- Compact syntax for datasets

Compact representation: the dataset description

FROM  <u1> (<w1> <w2>)

is the same as:

FROM <u1>
GRAPH <w1> <w2>

I think this is rather cryptic when URIs are long, and I (mildly) prefer the
keyword form.
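
The equivalence of the two forms can be checked mechanically.  The Python sketch
below reads each form into the same (default graph URIs, named graph URIs) pair.
The function names parse_keyword and parse_compact are hypothetical, and the
grammar handled is only what the examples in this message show, not a full
query grammar.

```python
import re

# A URI reference written as <...>, as in the examples in this message.
URI = re.compile(r"<([^<>\s]*)>")


def parse_keyword(text):
    """Read FROM / GRAPH lines into (default_uris, named_uris)."""
    default, named = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword, _, rest = line.partition(" ")
        uris = URI.findall(rest)
        if keyword == "FROM":
            default.extend(uris)   # merged into the default graph
        elif keyword == "GRAPH":
            named.extend(uris)     # kept as named graphs
        else:
            raise ValueError("unexpected clause: " + line)
    return default, named


def parse_compact(text):
    """Read 'FROM <u1> ... (<w1> ...)' into (default_uris, named_uris)."""
    m = re.fullmatch(r"\s*FROM\s+([^()]*?)\s*(?:\(([^()]*)\))?\s*", text)
    if m is None:
        raise ValueError("not a compact FROM clause: " + text)
    # Group 2 is absent when no parenthesised list of named graphs is given.
    return URI.findall(m.group(1)), URI.findall(m.group(2) or "")


compact = parse_compact("FROM  <u1> (<w1> <w2>)")
keyword = parse_keyword("FROM <u1>\nGRAPH <w1> <w2>")
print(compact == keyword)
```

Either form ends up as the same two lists, so the choice between them is about
readability rather than expressiveness.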

-- What's lost

In the trusted graph (the default graph) there is no tracking of where
triples came from.  The data provider should publish a dataset and let the
client decide whether they trust the named graphs or not.  If the publisher
is publishing the believed aggregation, it should put its name on it.

-- Protocol

The protocol will need to reflect the construction of datasets or leave
handling of it to the query language.  In local use there isn't a protocol
proper, so dataset construction would be useful in the query language.

This suggests a service oriented protocol paradigm.  Either the dataset is
implicit because the request was directed to a particular service instance,
or the query language expresses the dataset and the service offers various
degrees of dataset formation.

-- Summary

This note outlines a possible solution so that we are deciding between two
proposals:  the current situation with merge and a change to no default merge.
(please express your opinion here)

+1 from Andy to change to no default merge

Within that we can choose to drop FROM and protocol graph naming.
At the moment, I think we should explore a mechanism for dataset building
(FROM and in the protocol).
(please express your opinion here)

+1 from Andy to at least attempt a design here.  If it does
not look like converging by LC, we drop it.

-- Next steps

If this looks plausible, based on WG members' opinions (and those of anyone
else reading this), then we can start with alternative versions of rq23
sections 8 and 9 and work on test cases.  Both versions of sections 8 & 9 could
be published in the next working draft, then we pick one and go with that.

Received on Monday, 29 November 2004 09:53:33 UTC