Re: Replies to substantive comments on SPARQL Query Language for RDF from Peter F. Patel-Schneider on 2007-05-29 (public-rdf-dawg-comments@w3.org from May 2007)

From: Peter F. Patel-Schneider <pfps@research.bell-labs.com>
Date: Tue, 29 May 2007 12:43:48 -0400 (EDT)
To: eric@w3.org
Cc: public-rdf-dawg-comments@w3.org
Message-Id: <20070529.124348.126767512.pfps@research.bell-labs.com>
> Peter, thank you for your comments. This reply includes reponses to
> substantive comments 1, 3, 8 and 9.
> 
> >1/ A question on the basic notion of RDF.
> >
> >>From the document: 
> >
> >Abstract: RDF is a directed, labeled graph data format for representing
> >information in the Web. 
> >
> >>From RDF Concepts:
> >
> >Abstract: The Resource Description Framework (RDF) is a framework for
> >representing information in the Web. 
> >
> >How can these two different, to me, views of RDF be reconciled?  If
> >SPARQL treats RDF as simply a "graph data format" then what is the
> >status of the RDF semantics, which goes much further?   Suppose I write
> >a system for handling RDF that respects the RDF recommendations, is this
> >system going to be useful for SPARQL?  For example, if I store RDF
> >graphs in some internal canonical form (for example, changing
> >"042"^^xsd:integer to "42"^^xsd:integer) then I have changed the SPARQL
> >answers.
> 
> While there are many choices the WG could have made here, the charter
> specified that the SPARQL be defined in terms of a graph, 

[...]

Well, then the document should start with something like:

	Although the Resource Description Framework (RDF) is a framework
	for representing information in the Web, the SPARQL Query
	Language is concerned only with RDF graphs, i.e., the RDF data
	model.  These RDF graphs may in practice be a virtual graph
	which would follow from some form of inference from a stored
	graph, but the definition of the SPARQL Query Language itself is
	silent on where or how the graph was produced.

However, I still think that this decision is a large mistake.


>   http://www.w3.org/2003/12/swa/dawg-charter#scope
>   [[
>   The principal task of the RDF Data Access Working Group is to gather
>   requirements and to define an HTTP and/or SOAP-based protocol for
>   selecting instances of subgraphs from an RDF graph.
>   ]]
> 
> and that the entailments of various semantics be reflected in the
> graph.
> 
>   http://www.w3.org/2003/12/swa/dawg-charter#rdfs-owl-queries
>   [[
>   The protocol will allow access to a notional RDF graph. This may in
>   practice be the virtual graph which would follow from some form of
>   inference from a stored graph. This does not affect the data access
>   protocol, but may affect the description of the data access
>   service. For example, if OWL DL semantics are supported by a
>   service, that may be evident in the description of the service or
>   the virtual graph which is queried, but it will not affect the
>   protocol designed under this charter.
>   ]]

I don't go nearly so far in interpreting this part of the SPARQL charter.

> In this regard, SPARQL's graph matching corresponds to Graph
> Equivalence in section 6.3 of the Concepts and Abstract Syntax.
> 
>  
> http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-graph-equality
>   [[
>   Two RDF graphs G and G' are equivalent if there is a bijection M
>   between the sets of nodes of the two graphs, such that:
> 
>    1. M maps blank nodes to blank nodes.
>    2. M(lit)=lit for all RDF literals lit which are nodes of G.
>    3. M(uri)=uri for all RDF URI references uri which are nodes of G.
>    4. The triple ( s, p, o ) is in G if and only if the triple ( M(s), p, 
> M(o) ) is in G'
> 
>   With this definition, M shows how each blank node in G can be
>   replaced with a new blank node to give G'.
>   ]]
> 
> With this in mind, SPARQL is defined for simple entailment. From the 
> Status of the Document:
> 
> [[
> Compared to previous versions, this document adds an algebra for SPARQL 
> that 
> provides semantics for evaluating graph patterns and solution modifiers. 
> This 
> algebra is defined over basic graph pattern matching for simple 
> entailment. 
> This document also provides conditions for extending SPARQL basic graph 
> pattern matching.
> ]]

Hmm.    I don't see how simple entailment fits into the SPARQL story.
See also below.

> The document does anticipate richer semantics, but the Working Group notes 
> that querying richer semantic models is an open research problem, and has 
> declined to standardize beyond simple entailment at this time:
> 
> [[
> 12.6 Extending SPARQL Basic Graph Matching
> 
> The overall SPARQL design can be used for queries which assume a more 
> elaborate form of entailment than simple entailment, by re-writing the 
> matching conditions for basic graph patterns. Since it is an open research 
> 
> problem to state such conditions in a single general form which applies to 
> all 
> forms of entailment and optimally eliminates needless or inappropriate 
> redundancy, this document only gives necessary conditions which any such 
> solution should satisfy. These will need to be extended to full 
> definitions 
> for each particular case.
> ]]
> 
> As you observed a while back ( 
> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2005Sep/0036.html 
> ), this behavior is in line with the group's charter:
> 
> [[
> ISSUE 1 (show-stopper):  Non-respect for RDF Semantics
> 
> I realize that this interacts with the charter of the group.  What this 
> means
> to me is that the charter is ill-formed with respect to RDF.
> ]]

Yes, and I stand by that earlier comment.

At the very least there need to be adequate explanation *and* tests to
fully demonstrate the effects of this decision.


> > 3/ Matching literals
> > 
> > I was very surprised to see that the exact literal form of an RDF
> > literal is significant (Section 2.3.3).  Imagine what would happen if an
> > SQL query depended on the exact literal form in which numbers were
> > entered into a database!
> 
> SPARQL is defined for simple entailment as noted above. As per 6.5.1 of
> RDF concepts:
> 
> http://www.w3.org/TR/rdf-concepts/#section-Graph-Literal
> [[
> 6.5.1 Literal Equality
> 
> Two literals are equal if and only if all of the following hold:
> 
>      * The strings of the two lexical forms compare equal, character by
> character.
>      * Either both or neither have language tags.
>      * The language tags, if any, compare equal.
>      * Either both or neither have datatype URIs.
>      * The two datatype URIs, if any, compare equal, character by
> character.
> ]]
> 
> SPARQL does accomodate value matching with FILTER expressions, which are
> defined by XPath Functions and Operators.
> 
> RDF semantics talks about datatyped interpretations and D-entailment:
> http://www.w3.org/TR/rdf-mt/#dtype_interp
> 
> Based on current implementation practice, the working group decided to
> leave D-entailment as a research problem. This includes richer semantics
> for graph patterns such as:
> 
>   [] rdf:type xsd:nonNegativeInteger .
> 
> The Working Group has approved tests in this space. See, for example:
> 
>   http://www.w3.org/2001/sw/DataAccess/tests/#open-eq-01
> 
> (But note that that's in the old test space and will be moving in the next
> few weeks.)

Are there other tests showing this?  I would expect at least one for
rdf:XMLLiteral and one showing that no integer matches any int.

> > 8/ Basic Definition of SPARQL
> > 
> > The definition of Solution Sequence is inadequately grounded.  A
> > Solution Sequence is defined as "a list of solutions, possibly
> > unordered" (Section 12.1.6).  The common formal definitions of lists
> > depend on an ordering.  If SPARQL is using some other definition, then
> > this other definition must be at least referenced.  The terminology used
> > to refer to solutions is much to varied.  It includes at least
> > sequences, lists, unordered collections, multisets, sets.
> > 
> > ToList "turns a multiset into a sequence, with the same elements and
> > cardinality" (Section 12.2.3).  Aside from the question about
> > cardinality of what, this is not a functional mapping, as there are many
> > sequences that could correspond to a multiset (or set) if the order of
> > the sequence is ignored.  The formal definition of ToList implicitly
> > mentions this non-functionality.
> > 
> > Given that ToList is a fundamental part of the definition of SPARQL it
> > requires a better definition.  Further, there needs to be proofs that
> > the choice in ToList does not make a difference anywhere in SPARQL, for
> > example, in further processing 
> 
> It is not intended to be functional. The specification leaves the order of 
> the sequence generated by ToList unspecified.

But there needs to be some demonstration that this does not affect the
results where order does matter.

> > The definition of SPARQL BGP mapping importantly depends on the order
> > that the RDF instance mapping and solution mapping are performed.  This
> > should be documented.
> 
> The Working Group does not believe this to be the case, but we'd welcome a 
> test case/example that demonstrates the difference.

Hmm.  I can't imagine what I was thinking when I claimed this.  Given
that instance maps map blank variables into terms and solution maps map
nodes into terms and blank nodes and variables are disjoint, then it is
obvious that they would commute.  Sorry.

> > The definition of BGP Matching is not specified in the document.  The
> > definition in Section 12.3.1 defines a "solution" reasonably, although
> > presumably mu is *the* "restriction of P to the query variables in BGP.
> > However, the last bit of the definition doesn't make sense?  What is
> > omega there?  What is mu there?  What is theta?  What is mu(theta)?
> > Where then is the definition of the match of a BGP against an RDF graph?
> 
> The last part of the definition says that the cardinality of a solution 
> (mu) in a multiset of solution mappings (omega) is the number of distinct 
> RDF instance mappings (sigma) that compose with mu to give a pattern 
> instance mapping (P) that, when applied to the BGP, produces a subset of 
> G.

Perhaps, but even this English gloss doesn't make sense to me.

Perhaps you want to say something like:

	The match of BGP against G, Omega_BGP,G, is the multiset of
	solution mappings where the cardinality of a solution mapping mu
	in Omega_BGP,G is the number of distinct RDF instance mappings
	sigma on the blank nodes of BGP such that P = mu ( sigma (BGP) )
	is a subgraph of G.

This definition cleans up a number of problems with the definition in
the document.  However, it should also go along with a change to the
definition of card just above, removing

	If mu is not a member of Omega, cardinality is not defined.

which doesn't make sense.


> > Section 12.5 does not provide the missing glue, as it just defers to
> > Section 12.3.1.  Section 12.5 doesn't even get to a BGP and an RDF
> > graph.
> > 
> > What do the [ ] and { } notations mean in Section 12.4?
> 
> { } is standard set notation.
> 
> The card[...] notation is introduced at the beginning of 12.3.

Actually I was referring to the [ ] notation for sequences.  There is an
incomplete definition of it in 12.4.  However, the use of [ ] doesn't
always follow the definition.  In any case it would be clearer to just
write it out, as in OrderBy(Theta,condition) is a sequence with the same
cardinality as Theta but ...

> > 9/ A Fundamental Disagreement on SPARQL
> > 
> > I still object to the fact that SPARQL can produce different results for
> > equivalent RDF graphs, as described in Section 12.3.2.
> 
> We have recorded your previous objection to this as part of our 
> rdfSemantics issue:
> 
>   http://www.w3.org/2001/sw/DataAccess/issues#rdfSemantics
> 
> The Working Group has considered alternatives; Bijan Paria described
> forms of leanness with respect to the data graph and the subgraph
> matching a query pattern.
>   http://www.w3.org/mid/DF0BA59F-7E8C-4CDB-BF2C-C391D05CEB4D@cs.man.ac.uk
> 
> The results of a SPARQL query with respect to a given data graph are
> defined, and specifically do not include leaning the matching
> subgraph. SPARQL neither prohibits nor requires the reduction of
> equivalent graphs to the minimal entailment
> 
> Most current implementations do not lean the input graph. The WG
> consensus was not to impose this on implementations, noting
> significant efficiency and scaling costs. Coupled with extension to
> other entailment systems, the fundamental problem is outside current
> practice. Implementations are free to process data before exposing it
> (e.g. apply leaning).


I still object to this decision.

Further, if the definition of SPARQL admits different results for graphs
that are simple-equivalent, then this should be so stated more
prominently, particularly as the document talks about simple entailment
so prominently.  It is misleading to talk about SPARQL being
based on simple entailment when it is not and such references should be
removed from the document.

 
Peter F. Patel-Schneider
Bell Labs Research
Received on Tuesday, 29 May 2007 16:43:09 UTC