Summary of my comments: close many, please add some to issue list from Fred Zemke on 2006-08-02 (public-rdf-dawg@w3.org from July to September 2006)

From: Fred Zemke <fred.zemke@oracle.com>
Date: Wed, 02 Aug 2006 12:24:46 -0700
To: public-rdf-dawg@w3.org
Message-ID: <44D0FBFE.7040200@oracle.com>
Master list of my comments

In the telecon on 1 Aug 2006, I was asked to state which of my
comments I regard as closed.  I also learned about the issue
list.  Following up on the telecon, I have assembled
all my comments in this message (a very long one, I'm afraid) and
interspersed remarks (beginning with ++) on subsequent developments,
in many cases stating that the matter is closed in my mind.

As for what is left, I would like to get the important issues
onto the issue list.  Here is my proposed additions to the
issues list:

1. Material on entailment and general framework needs to be
rewritten.  One objective of the rewrite is to extend the scope
of blank node identifiers to include FILTER clause in
rule [21] FilteredBasicGraphPattern.  

2. Should we rearrange rule [14] SolutionModifier to place
OffsetClause before LimitClause, given that the OffsetClause
is processed first.

3. Should items in the SELECT list be separated by commas.
I have heard that this is part of an existing issue on punctuation
which is being re-opened.

4. Duplicates from UNION: do we require a result sequence to
have a precise count of duplicates, or is it more lax?

5. The domain of solutions is not clearly specified. This is
particularly an issue for OPTIONAL and UNION.

6. Formal semantics of OPTIONAL is not clear.  The current
wording "if S is a pattern solution of A and of B otherwise
if S is a solution to A but not to B" appears to reduce
logically to just "S is a solution of A".  This issue is
likely to be handled at the same time as the issue on the
domain of solutions to OPTIONAL, though I'd like to list it as
a separate issue.

7. How does filter evaluation work if there is an unbound
variable that is not within a BOUND function?

8. There is no bridge from the syntax to the semantics.

In addition, reviewing
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0129.html
I suggest one further editorial action, for rq24
5.4 "Basic graph patterns in SPARQL syntax", as follows:
This gives an example of a basic graph pattern
with a blank node identifier, and it says " with the scope of
the blank node label being the basic graph pattern".  What is
still missing is an example of the results of such a query,
and a contrasting example showing how the query behaves differently
when the triple patterns are placed into separate graph
patterns.  That is, I would like to see a contrast between
{ _:x :p >v . _x :q ?w } and { { _:x :p >v } { _x :q ?w } }
so the the reader can really appreciate what it means that
in the latter example the blank node labels are in different scopes.
Can this be handled editorially?

****************************************************

Recap of all my comments:


http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0114.html
editorial comments on CR dated 6 april 2006

Some (hopefully) editorial comments:

2.1.1 Syntax for IRIs
It would be helpful to have examples of the two abbreviated syntaxes.
One might think that abbreviated syntaxes must be written in angle
brackets.

++ solved in rq24 3.1.1 "Syntax for IRIs"


2.3 Triple patterns
The last sentence of this section says "This definition also
allows blank nodes in the predicate position."  But the
second component of a triple pattern is a member of I union V,
where I is the set of IRIs and V is the set of variables.
Thus it seems that SPARQL blank nodes are not permitted in
the second position.  Also the BNF for Verb says it may be
VarOrIRIref or 'a', so it seems that a SPARQL blank node can not
be a predicate.

++ solved; I don't see this sentence any more in rq24
4.2 "Triple patterns"


2.4 Pattern solutions
The first two sentences in the box use the following terms:
"variable solution", "substitution function", "pattern solution",
and "variable substitution".  How are these terms related?
Taken literally, it says that "a variable solution is a substitution
function" and a "pattern solution is a variable substitution".
Thus the first and second sentence seemingly have nothing to do
with one another. In that case, why is the first sentence, about
"variable solution", found in a box called "Definition: pattern
solution"?  This is very confusing.  The reader is left
suspecting that in fact all four terms are interchangeable.
Furthermore, scanning the rest of the document, one finds that
"variable solution" is never used at all, and the
term "solution" is frequently used without qualification (neither
as "variable solution" nor as "pattern solution"). We should use
our terminology consistently.

++ solved in rq24 4.3 "Pattern solutions", where
"variable substitution" replaces the term "variable solution"
that I complained about.  In addition I see the new sentence
that says that "solution" is short for "pattern solution".


2.5.1 General framework
Definition of "basic graph pattern E-matching", first bullet
talks about BGP and BGP' being "graph-equivalent".  This term
is not defined; instead "equivalent" is.  we should use our
terminology consitently.

++ not addressed yet in rq24, where the term is defined as
"equivalent" and used as "graph-equivalent".  There is a hot link
from "graph-equivalent" which takes you to "RDF Concepts" Rec.
Andy Seaborne in
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0122.html
asks if that link satisfies my concern.  My answer is that
I think this link is incorrect because BGP and BGP' are not graphs,
they are basic graph patterns, and so the link should take you
to the definition of "equivalent" in this specification.
It is true that the definitions of "graph-equivalent" and
"equivalent" (for basic graph patterns) are very similar,
but they still have different domains.  Alternatively, the text
can be read as saying that the term "graph-equivalent" is being
extended so that its domain includes basic graph patterns.  That
would be fine, but in that case the definition should be reworded to
call it "graph-equivalent" rather than "equivalent", and the hot
link still needs to take the reader to the extended definition,
not the definition whose domain is only graphs.  Perhaps the
editor can make these changes editorially.
However, I think the expectation is that the entire section will
be rewritten, so I am not proposing that this be added to the
issue list.


9 Specifying RDF datasets
A graph is specified using an IRI, which can be a QName, but there
are no examples of using a QName to specify a graph.  This would
be helpful.

++ solved in rq24 9.3.3 "Restricting possible graph IRIs"


10.1 solution sequences and result forms
The first formal definition says that a "solution sequence" is
a "list".  Both of these terms imply ordering.  Then the last
sentence in the first box says "The solution sequence from
matching the query pattern is an unordered collection...".  
This is contradictory.  What we probably mean is that "The
solution sequence from matching the query pattern is in an
implementation-dependent order".  

++ solved in rq24 10.1 "Solution sequences and result forms"
by moving the sentence outside the box and rewording it.


10.1 Solution sequences and result forms
the last sentence in the first box says "The solution sequence
from matching the query pattern is an unordered collection...".
This sentence is not part of the definition
of "solution sequence" so it should be placed outside the box.

++ same solution as previous item.


10.1 Solution sequence and result forms
It would be helpful if the stages of processing solution sequences
were always mentioned in the same order.  The order is given as
ORDER BY, project, DISTINCT, LIMIT, OFFSET.  The text to be
rearranged is:

a) In the box for the definition of solution sequence modifier
(order modifier should be moved ahead of projection modifier)
b) The arrangement of subsections (10.1.3 ORDER BY should be
moved ahead of 10.1.1 Projection)

++ solved in rq24 10.1 "solution sequences and result forms"


A.5 Escape sequences in strings
It says that \U may only be used for Unicode code points in the range
U+10000 through U+10FFFF.  So the first two HEX digits must always be 00?
If so, why not show the syntax as '\U00' HEX HEX HEX HEX HEX HEX ?
A.6 "excape sequences in IRI references" - same comment.

++ not done in rq24, but this was only a suggestion so I don't
care.

*********************************************************

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0115.html
Comments on 2.5.1 "General framework" in CR dated 6 April 2006

2.5.1 General framework
This section is difficult to read. All we are defining in the
current edition of SPARQL is with simple entailment, so why bother
to lay out a more general framework at all?  The only practical
consequence from this section seems to be that there is a scoping set which
is the set of RDF terms in the dataset.

If we must have a general framework,
perhaps we can dispense with defining entailment, contenting
ourselves with referencing [RDF-MT] for a discussion of entailment
in general.
 

2.5 Basic graph patterns
The definition of E-entailment regime, taken literally, says that
any subset of P(RDFG) x P(RDFG) is an entailment regime, where
P(RDFG) is the power set of RDFG, the set of all RDF graphs.
In contrast, reference [RDF-MT] "RDF Semantics" never actually
defines entailment, but I see two consequences from the discussion
there:

a) Entailment is a relation from a set of graphs to a single graph,
not to a set of graphs.  Thus we want a subset of P(RDFG) x RDFG.
It seems the text also has this in mind, since the definition of
"well-formed" contemplates that the range of E (ie, the projection
of E on its second component) is not a set of sets of graphs but
a set of graphs.

b) Entailment involves the notion of correct inferencing: given
a set of graphs S, they entail graph G if all interpretations of S
are interpretations of G.  Thus it is not true that just
any subset of P(RDFG) x RDFG constitutes an entailment.


2.5 Basic graph patterns
The definition of "well-formed for simple entailment" is difficult
to apply.  It seems that all RDF graphs are well-formed for simple
entailment.  For example, every graph G is entailed by the singleton
set { G }.  Therefore every RDF graph is in the range of simple
entailment.  Therefore every RDF graph is well-formed.  So what
is the point to this concept?

Looking ahead, it seems that the reason for introducing "well-formed"
is for the third bullet in the definition of "basic graph pattern
e-matching".  However, the fourth bullet of that definition already
implies the third bullet.  That is, if G E-entails (G' union S(BGP')),
then G' union S(BGP') must be in the range of the entailment E,
and so must be well-formed.  Hence it seems that we can dispense
with the notion of well-formed, since the only use of it is
superfluous.


2.5.1 General framework
After the definition of scoping set, it says "The scoping set may be
characterized differently by different entailment regimes".
I don't know what "characterized" means here.  Does it mean that
the entailment implies the scoping set?  So that simple entailment
uses one scoping set and RDF-entailment uses another scoping set?
If in fact the scoping set is not an independent parameter, then
the scoping set should be mentioned in the definition
of "E-entailment regime".


2.5.1 General framework
Definition of scoping graph says that "*the* scoping graph ... is
*an* RDF graph".  This does not make sense.  One cannot use
"the" to refer to something that is not pinned down uniquely.
There is not one graph that is uniquely graph-equivalent to G.
What we mean is "A scoping graph... is an RDF graph...".

After making this correction, the definition of scoping graph does
not imply the sentence immediately following the definition,
"The [sic, should be "a"] scoping graph makes the graph to be matched
independent of the chosen blank node names".  For example,
G is graph-equivalent to G, trivially, therefore G is a scoping
graph for itself.  But if G has a problem with blank node names,
then the scoping graph G also does, so we cannot conclude that
the scoping graph automatically solves this problem.  The best that
can be said that a suitably chosen scoping graph will make
the graph to be matched independent of the chosen blank node names.

Looking ahead to the use of the notion, it is currently treated as
a parameter in the definition of "basic graph pattern E-matching".
However, I doubt that one has a specific scoping graph in mind when
one does a basic graph pattern E-match.  That is, treating
"basic graph pattern E-match" as a boolean-valued function, you don't
want the scoping graph as one of the arguments to this function.

Proposed resolution: Delete the definition of "scoping graph", and
in the definition of "basic graph pattern E-matching", delete the
phrase "with scoping graph G'", replacing it with a new bullet
reading "there exists a graph G' that is graph-equivalent to G".



2.5.1 General framework
The pattern solution S is not being treated with the
same respect shown to the other parameters in the definition
of "basic graph pattern e-matching".   
It should be listed as one of the "givens" at the start of the
sentence.


2.5.1 General framework
In the definition of "Basic graph pattern E-matching", regarding
the first bullet:

a) it would be clearer if it were
reworded "There exists BGP' such that BGP' is a basic graph pattern
...".

b) The notion of graph-equivalent is not defined for basic graph
patterns, only for graphs.  Rather, the defined term is just
"equivalent".

All together, the first bullet should read
"There exists BGP' such that BGP' is a basic graph pattern that is
equivalent to BGP"


2.5.1 General framework
Summary of my proposed rewrite to the definition of "Basic graph pattern
E-matching":

Given:
- an entailment regime E,
- a basic graph pattern BGP,
- an RDF graph G,
- a pattern solution S, and
- a scoping set B
then BGP E-matches with pattern solution S on graph G with
respect to B if:
- There exists a basic graph pattern BGP' and a graph G' such that:
  + BGP' is equivalent to BGP
  + G' is graph-equivalent to G
  + G' and BGP' do not share any blank node labels
- G E-entails (G' union S(BGP'))
- The RDF terms introduced by S all occur in B

++ I don't believe these comments have been addressed in rq24.
This does not surprise or annoy me, since it seems that we are
waiting for the entailment experts, especially Pat Hayes, to
weigh in on a rewrite of the material on entailment and the
general framework.  My comments may be of interest to them;
in any case, I am waiting
to see that proposed rewrite.  In the meantime, I suggest that
the topic of rewriting the material on entailment and the
general framework should be added to the issues list.

******************************************************

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0116.html
comment on CR 10.1 "solution sequences and result forms"

10.1 Solution sequences and result forms
The proposed order for processing LIMIT and OFFSET seems
counterintuitive.  Suppose the solution sequence has solutions
(S1, S2, ..., S9).  Suppose the user asks for LIMIT 3 and OFFSET 4.
With the current rules, LIMIT 3 will truncate the solution
sequence to (S1, S2, S3) and then the offset is greater than the
number of solutions so the final result is empty.  Instead,
in this scenario, I think the user expects to get (S5, S6, S7).
Thus the offset should be applied first, reducing the solution
sequence to (S5, S6, ..., S9), and then the LIMIT should be
applied.  From this standpoint, the BNF for SolutionModifier
should be rearranged to put OffsetClause ahead of LimitClause.

++ mostly solved in rq24 10.1 "Solution sequences and result forms"
The part that is not addressed is whether we should rearrange
rule [14] SolutionModifier to place OffsetClause? before
LimitClause? .  I suggest adding this to the issues list.

***********************************************************

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0117.html
comment on A.7 "Grammar"

A.7 Grammar
There are no commas in the SELECT list.  This looks like a poor
design because it will be an obstacle to allowing arbitrary
expressions in the SELECT list in a future version.  I asked for
arbitrary expressions in a previous round of comments, and was told
that this was being deferred to the future.  That is fine, but I want
to insure that the ground is ready for that extension.

++ I believe this comment triggered reopening an issue on
punctuation.  

************************************************************

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0128.html
Re: Draft response to: Re: major technical: blank nodes

This is a response to Pat Hayes's email in the archive
dated 26 Jan 2006 16:50:59 -0600.  Thank you for your
detailed comments.  They have been very helpful to me personally
in understanding the draft.

The reason for my reply is that I believe we can do a better job
in our treatment of blank nodes in SPARQL.  I first came to an
earlier draft after reading the RDF Recommendations.  I found
the SPARQL draft very confusing and frustrating.  My essential
complaint was that SPARQL uses one term for two concepts:

a) RDF blank nodes, which are nodes in a graph with no label, and

b) SPARQL blank nodes, which are lexical tokens in a SPARQL query.

Pat Hayes's email rejects this interpretation.
However, let me give the reasons that I held it, based on my
reading of RDF and SPARQL both:

a) According to the "RDF Concepts and abstract syntax" Recommendation,
section 6.6 "Blank nodes", the set of RDF blank nodes is distinct
from the set of IRIs and "Otherwise, this set of blank nodes
is arbitrary.  RDF makes no reference to any internal structure
of blank nodes".  That is, RDF blank nodes have no label.

b) The RDF Primer section 2.3 "Structured property values and blank
nodes" Figure 6 "Using a blank node" shows a blank node as having
no label.  It goes on to describe "blank node identifiers" of which
it says "...blank node identifiers are not considered to be actual
parts of the RDF graph."

c) In our own working draft Section 2.5.3 "Example of basic graph
pattern matching" second sentence under the first box, it says
"The label information is not in the graph."

d) Section 2.8.3 "Blank nodes" says "Blank nodes have labels
which are scoped to the query".  However, RDF blank nodes have
no notion of scope (they simply exist, just as IRIs and literals
exist, with no notion of scope).  Scope is a lexical concept
(the portion of a query text in which an identifier has a single
referent).

My summary is that the consistent stance of the RDF Recommendations
is that blank node identifiers are an artefact of serialization.

Now if a reader comes to the SPARQL draft with that model, he
finds it very confusing (certainly I did).  For example, section
2.4 talks about how to extend a pattern solution S to graph
patterns.  It says "If v is not in the domain of S, then S(v)
is defined to be v."  Applied to SPARQL blank nodes such as
_:a, this says S(_:a) is _:a.  Fine; it is still a lexical
token; there has been no mention of creating a blank node
corresponding to the label _:a.  As a result, the mapping of
a triple pattern, such as

  ?x :v _:a

is

  (S(?x), :v, _:a)

and there still is no RDF blank node. Consequently, the result
of the mapping is not an RDF triple.  Then we come to
setion 2.5.1 "General framework" and the definition of "basic
graph pattern E-matching".  This definition posits a basic
graph pattern BGP' and a scoping graph G' such that "G' and
BGP' do not share any blank node labels".  But how can they?
BGP' is a triple pattern and might contain SPARQL blank nodes;
G' is an RDF graph and as such does not contain anything that
can be called a blank node label at all (though serializations of G'
might).

After studying Pat Hayes's email, my conclusion is that the
text is using blank node identifiers as proxies or surrogates
for the blank nodes themselves.

To clarify our text, my proposed resolution is as follows:

a) We should adopt the term "blank node identifier" for what I have
been calling SPARQL blank nodes.  This would harmonize with RDF
Recommendations, which use this term when talking about
character strings associated with blank nodes for identification
purposes.  For example, section 2.1.4 would be renamed "Syntax
for blank node identifiers".  We should scan the document for
other occurrences of "blank node", and, as appropriate, change to
"blank node identifier".

b) We state explicitly that for each distinct blank node identifier,
a distinct blank node is created for the purposes of processing
the query, different from any blank node in the graphs in the
query's dataset.  We can also say that the reader may wish to
think of the blank node identifiers as proxies or surrogates for
these created blank nodes.  Perhaps this might go in Section 2.1.4.

c) In section 2.1.8 "Result descriptions used in this document"
in the definition of RDF term, the created blank nodes  should be
explicitly listed as part of RDF-B.  (Note that even if one
believed that blank node identifiers were blank nodes all along,
this did not put them in RDF-B because they were not part of
any graph.)

d) In section 2.4 "Pattern solutions", definition of "pattern
solution", we say that the domain of S is extended to include
blank node identifiers by mapping each blank node identifier to
the blank node that was created for it in item b) above.

e) Somewhere we make the observation that the result of
applying a pattern solution S to a triple pattern is an RDf triple.
Thus if BGP is a basic graph pattern, then S(BGP)
is an RDF graph.

f) delete the definition of "basic graph pattern equivalence"
(changes proposed below make it dispensable).

g) delete the definition of "scoping graph", also unneeded.

h) Reword the definition of basic graph pattern matching
to use the notion of graph merge found in the RDF Recommendations.
The revised definition is something like this: "Given an
entailment regime E, a basic graph pattern BGP, an RDF graph
G and a pattern solution S whose range is a subset of B, then
BGP E-matches with pattern solution S on graph G with
respect to scoping set B if G E-entails the graph merge of G
and S(BGP)."  Actually, with the statement that the created
blank nodes are distinct from all blank nodes in the dataset,
a simple set union will suffice, though we may wish to stick
with the RDF notion of merge for consistency with RDF.

i) If we want to keep the technique of renaming blank node
identifiers, we move that outside the boxed definition into
explanatory text.  For example, "The graph merge referred to
in the preceding definition can be thought of as using
blank node identifiers as proxies for the blank nodes.  In that
case, care must be taken to ensure that the blank node identifiers
of G are different from all blank node identifiers in BGP.
Let G' and BGP' be serializations of G and BGP, respectively,
such that all blank node identifiers in G' are different from
all blank node identifiers in BGP'.   Then G' UNION BGP'
is the serialization of some graph G2.  S is a solution for BGP using
E entailment if G E-entails G2."

j) In section 2.5.2 "SPARQL basic graph pattern matching" last
paragraph, we can clarify that pattern solutions are unique,
not just unique up to blank node renaming.  The so-called "blank
node renaming" is an artefact of serialization.  The last sentence
is thus "the serialization of a set of all pattern solutions
is unique up to blank node identifiers".  We can also delete the phrase
"...possibly with blank nodes renamed" earlier in the paragraph,
because a pattern solution is not actually concerned with
assigning blank node identifiers.

++ Pat Hayes responded in
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0139.html
He disagreed with me in places, but agreed with my overall contention
that it would be good to distinguish clearly between blank nodes
in the query and blank nodes in the data.  I see that rq24
is already headed in this direction; possibly, this is solved, or
else it is just a matter of on-going editorial vigilance.
No issue proposed at this time.


***************************************************

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0129.html
Comments on 2.1.4 "Syntax for blank nodes" in 6 April 2006 CR

2.1.4 Syntax for blank nodes
It says "Blank nodes...will take part in the pattern matching".
There are no examples of how pattern matching with SPARQL
blank nodes works.  Section 2.5.4 "Basic graph patterns in SPARQL
syntax" gives an example of the syntax only, but does not discuss
the semantics of the example that it presents.  Section 2.8.3
"Blank nodes" and 2.8.4 "RDF collections" show how the abbreviations
are expanded into SPARQL blank nodes, but do not show how the
expanded patterns behave either.

++ partially solved in rq24 5.4 "Basic graph patterns in the
SPARQL syntax".  This gives an example of a basic graph pattern
with a blank node identifier, and it says " with the scope of
the blank node label being the basic graph pattern".  What is
still missing is an example of the results of such a query,
and a contrasting example showing how the query behaves differently
when the triple patterns are placed into separate graph
patterns.  That is, I would like to see a contrast between
{ _:x :p >v . _x :q ?w } and { { _:x :p >v } { _x :q ?w } }
so the the reader can really appreciate what it means that
in the latter example the blank node labels are in different scopes.
Can this be handled editorially?

2.1.4 Syntax for blank nodes
The preceding section 2.1.3 "Syntax for variables" says
"Variables in SPARQL have global scope".  Section 2.8.3 "blank nodes"
says "Blank nodes have labels which are scoped to the query".  
It is not clear to me what the difference between the "global scope"
of a variable vs. being "scoped to the query" means in practice.
At any rate, I think it would be good to state the scope of blank
nodes (or what I prefer to call blank node identifiers)
in section 2.1.4, as a parallelism with 2.1.3, and possibly
harmonize the terminology.

++ solved in rq24 3.1.4 "Syntax for blank nodes"

************************************************

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0130.html
Comments on UNION matching (CR 6 Apr 2006)

6. Matching alternatives
Second sentence says "If more than one of the alternatives
matches, all the possible pattern solutions will be found."
Does this mean that if a solution is a solution of both
patterns, then the solution occurs twice in the solution sequence?
There are no examples of solutions with multiple cardinality.
Such examples would be helpful.

++ not resolved; I don't see any examples in rq24 that address the
question of duplicates.  I am asking that the question of
duplicates be added to the issues list.


6.2 Union matching - formal definition
The definition is unclear about whether there are any constraints
on the value of a solution on a variable that
appears in one pattern but not in the other.  Example: what is
the result of

SELECT ?x ?y WHERE { FILTER (?x = ?x) } UNION { FILTER (?y = ?y) }

Suppose there is only one RDF term in the graph, <http:a>.
There are all together four partial functions from the
set of variables in the query {?x, ?y} and the set of RDF terms,
namely:
S1 (?x) = <http:a>, S1(?y) = <http:a>
S2 (?x) = <http:a>, S2(?y) undefined
S3 (?x) undefined,  S3 (?y) = <http:a>
S4 (?x) undefined,  S4 undefined

I believe that the desired set of solutions is {S2, S3},
i.e., S1 is not a solution of this query.  However, arguably,
S1 is a solution of FILTER (?x = ?x), and therefore belongs in
the result set according to the definition as written.

My proposed fix is: let P be pattern1 UNION pattern2.
Then S is a solution of P if either of the following is true:
1. S is a solution of pattern1 and S is undefined on every
variable that is contained in pattern2 but not in pattern1; or
2. S is a solution of pattern2 and S is undefined on every
variable that is contained in pattern1 but not in pattern2.

++ not addressed in rq24.  I hope to make a comprehensive proposal
on the formal semantics, so I am not agitating for this specific
solution.  I think the issue of the domain of solutions to
union patterns should be added to the issue list.


6.2 Union matching - formal definition
The definition is unclear about duplicates.  If s is a solution
of GP1 and S is a solution of GP2, does the solution sequence
contain a copy of S for each of GP1 and GP2?  I believe the
answer should be that duplicates are maintained because they
might be meaningful to the user; if the user wishes to eliminate
duplicates, the user can specify DISTINCT.  In that case, the
definition proposed in a separate comment needs to be rewritten
because it would eliminate duplicates.  I think the best
approach would be to recognize that the UNION operator is
constructing a solution sequence from the solution sequences of
each operand.  The proposed rewording is then:

Let P be pattern1 UNION pattern2.  Let V1 be the set of variables
that appear in pattern1 and let V2 be the set of variables that
appear in pattern2.  S  = (S1, S2, ... Sn) be a sequence of all
partial functions on V1 that are solutions of pattern1.  Let
T = (T1, T2, ... Tm) be a sequence of all
partial functions on V2 that are solutions of pattern2.
Then a solution sequence of P is any permutation of the sequence
(S1, ..., Sn, T1, ..., Tm).

(Note: This definition involves a trick concerning partial
functions.  For example, each Si is a partial function on V1,
therefore it is a partial function on the set of all variables
in P that happens to be undefined on the variables that belong
only to V2.)

++ not addressed in rq24.  I think the issue of the cardinality
of solutions to union patterns should be added to the issue
list.

************************************************

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0132.html
Comments on optional pattern matching (CR 6 apr 2006)

5.1 Optional pattern matching
last sentence of the largest paragraph says "The whole graph
pattern of an optional graph pattern must match for the
optional graph pattern to add to the query solution."
The term "query solution" is not defined, but it also occurs
in 10.1.3 "ORDER BY" where it refers to the solution sequence.
In that case, the phrase "add to the query solution" presumbably
means to add another solution to the solution sequence. However,
that meaning does not always make sense here, because, given
a pattern P of the form Pattern1 OPTIONAL { Pattern2 }, the
cardinality of the solution sequence of P can be the same as the
cardinality of the solution sequence of Pattern1, if
for every match to Pattern1, either Pattern2 has a unique match
or no match.

Instead of using the verb "add", may I suggest the verb "widen"?
It is true that OPTIONAL may increase the number of results,
but the primary role of OPTIONAL is to widen the results with
additional variables and SPARQL blank nodes found in the second
pattern.

++ solved in rq24 7.1 "Optional pattern matching", where the
wording has been changed to say "affect the query solution".  

 

5.4 Optional matching - formal definition
The definition is problematic because it uses two terms,
"pattern solution" and "solution".  It is not clear whether
these are distinct concepts or the same concept.

I believe most readers will think they are the same concept.
In that case, the definition does not work because it is logically
equivalent to "S is a solution of optional graph pattern if S is a
pattern solution of A".  Proof: Let S be a pattern solution of A.
Now either S is a pattern solution of B or not.
If S is a pattern solution of B, then S is a pattern solution
of A and B and meets the criterion.  If S is not a pattern
solution of B, then S is a pattern solution of A but not of B, so
S meets the "otherwise" part of the criterion.  Thus in either
case S satisfies the criterion.

If "pattern solution" and "solution" are separate concepts,
then we need to be more explicit about the distinction,
preferably by coming up with a two-word phrase for the latter
concept.

If this is our intent, it still does not rescue the definition
logically.  A close reading of the actual words shows that
we are defining "solution" as a recursive definition built
on "pattern solution".  For S to be a "solution" of Opt(A,B),
then the first possibility is that S is a "pattern solution"
of A and B.  However, the intent is that B might itself be
an optional graph pattern, and this close reading of the
definition leaves the notion of being a "pattern solution" of
an optional graph pattern undefined, so the recursion breaks
down when defining Opt (A, Opt (B, C)).

In addition, the definition does not work because the definition
of a pattern solution is a total function whose domain is all
variables.  For example, consider the query
{ ?x ?y ?z . OPTIONAL { ?x ?z ?w } } applied to the graph
with a single triple G = { (ex:a, ex:b, ex:c) }.
Here is a trial solution:
S(x) = ex:a, S(y) = ex:b, S(z) = ex:c, S(w) = ex:a.
Note that I have deliberately defined S as a total function on all
variables in the query.  Now let's try to apply the definition of
optional matching to this trial solution.  We see that
S is a solution for ?x ?y ?z and S is not a solution for ?x ?z ?w.
Thus it would seem that S qualifies as a solution because it
satisfies "S is a solution to A but not to A and B".
However, I don't think S should be regarded as a solution to
the query.  Instead, I think that when S fails as a solution to B,
then S should be undefined on any variables that occur only in B.
Thus S2 defined by S2(x) = ex:a, S2(y) = ex;b, S2(z) = ex:c, and
S2(w) undefined should be a solution to the example.

My tentative fix is:

given syntax pattern1 OPTIONAL { pattern2 }, call pattern1 the
mandatory pattern and pattern2 the optional pattern.  Any variable
that appears in the mandatory pattern is called a mandatory variable.
Any variable that appears in the optional pattern and not in the
mandatory pattern is called an optional variable.  A pattern solution S is
a partial function from the variables in the query to RDF terms
such that the following hold:

a) S restricted to the mandatory variables is a pattern solution of
pattern1.

b) One of the following two cases is true:
   i) S is undefined on all optional variables, or
   ii) S is a pattern solution of pattern2

++ my tentative fix was shown not to handle a desired capability.
I am still working on the best way to formulate this.  In the mean
time, I request that the formal semantics of optional graph
patterns should be added to the issue list.


**************************************************

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0136.html
Comments on the proper domain for solutions (CR 6 Apr 2006)

2.4 Pattern solutions
The definition defines "variable solution" as a partial function and
"pattern solution" as a total function.  Since the heading on the
box calls out "pattern solution", and only the definition of
"pattern solution" is in bold, and "variable solution" never appears
elsewhere in the document, the reader is led to believe that the
important notion is "pattern solution", the total function.
However, focusing on total functions is a mistake, as shown by
the OPTIONAL and UNION syntaces, which explicitly require partial
functions as solutions to patterns.

++ not resolved; rq24 4.3 "pattern solutions" still says that the
domain of a pattern solution is V, the set of all variables.
I am asking to add an issue on the domain of solutions,
which pertains here, as well as other patterns.


2.4 Pattern solutions
It says that a pattern solution is a total function on V, an infinite
set.  As pointed out in a separate comment, solutions in general
are partial functions on V, when OPTIONAL and UNION are considered.
The issue to be raised in this comment is that V is not the
appropriate domain, even in the case of matching a triple pattern.
Consider SELECT ?a ?b ?c WHERE { ?a ?b ?c } evaluated on a graph
containing a single triple, (n:s n:v n:o).
Then, according to the definition, a pattern solution is a total
function mapping F:V -> {n:s, n:v, n:o} such that F(?a) = n:s,
F(?b) = n:v, F(?c) = n:o, and F(v) is unconstrained for all other
variables v.  There are a countable infinity of these total
functions.  Now assemble these pattern solutions into a solution
sequence and project to retain just the variables ?a, ?b and ?c.
You still have an infinite sequence.  Thus the result of the query
appears to be an infinite sequence (whose every member is the
same function on { ?a, ?b, ?c }).

If one replies that the projection is to be done before assembling
the solution sequence, I observe first that that is not the
description in Section 10.1 "solution sequences and result forms",
but more importantly, that will produce the wrong cardinality on
other queries, for example, SELECT ?a ?b WHERE { ?a ?b ?c }
on a dataset with two triples (n:s n:v, n:o1), (n:s, n:v, n:o2).
In this example, projecting before assembling the solution
sequence will result in only a single solution, whereas there
should be two.

Instead, I believe that the correct algorithm is to look at
pattern solutions whose domain is the set of variables
that appear in the specific query to be evaluated (not the infinite
set V).  And, as pointed out in a separate comment, the focus
should be on partial functions rather than total functions.

++ not resolved.  I request an issue on domain of solutions.

*********************************************************

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0135.html
Comment on 11.2 "filter evaluation" (CR 6 Apr 2006)

11.2 Filter evaluation
The rules do not say how to handle an unbound variable
in an expression.  Clearly it must be possible
for the argument of BOUND to be unbound.  I believe the desired
semantics are that it is an error to have an unbound argument
for any other function.  This should be stated explicitly.
Also, note that the first bullet says
"SPARQL functions do not process node sequences.  When interpreting
the semantics of XPath functions assume that each argument is
a sequence of a single node".  Given that an argument might be
unbound, it is not true that arguments are always a sequence
of length 1; I think the truth is that the argument may be a
sequence of length 0 or 1.

++ Andy pointed me to
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0012.html
evidently someone else thinks that this could be specified
better.  I request an issue for this.

***********************************************************

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0137.html
Comments on 9 "Specifying RDF datasets" (CR 6 Apr 2006)

9 Specifying RDF datasets
The text could be clearer about where the default graph comes from.
The first paragraph hints at this, but the description is cloudy
because of the use of "may".  Thus the first sentence says
"A SPARQL query may specify the dataset
...".  The use of "may" in this sentence evidently refers to the
user, who "may" choose to use the FROM clause.  Third sentence
"The RDF dataset may also be specified in a SPARQL protocol
request".  Does this mean that it is the user's responsibility
to use at least one of these two techniques?  What happens if the
user uses neither?  Is this an error?  Or perhaps this would mean
the default graph is empty?  Or does some implementation-defined
default kick in?  The precise rules for determining the default
graph need to be specified.

++ I am satisfied with Andy's reply in  
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0029.html


9 Specifying RDF datasets
There is no statement of the formal semantics of the FROM clauses.

++ I am not satisfied with Andy's reply in  
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0029.html
My reasons are:

1. There is no bridge from the syntax rule [9] DatsetClause
to the abstract construct called a dataset in rq24 9.3
"Querying the dataset".   This is an instance of my general
issue that there is no bridge from syntax to semantics.

2. Andy says that it is implementation-defined whether to merge
graphs to produce the default graph if there are multiple
FROM clauses.  However, the last sentence of rq24 9.2.1
"Specifying the default graph" says that the default graph is
the merge.  If Andy is correct, then that sentence needs to
change.  I am agreeable to an editorial resolution of this.

*************************************************************

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0131.html
Comments on 10.1 "Solution sequences and result forms" (CR 6 Apr 2006)

10.1.1 Projection
The last sentence of the formal definition uses set notation
for the result of projecting a solution sequence into a new
solution sequence.  This is not desired, because:
a) sets are not ordered, but solution sequences are
b) sets do not permit duplicates, but the intent is that the
result of a projection might have duplicates.

This can be corrected by using some notation denoting a sequence.
Earlier we used (S1, ..., Sn) to denote a sequence, and that
could be done here, for example,
( (project (S1, VS), ... project (Sn, VS) ).
Or we can use the mathematical definition of a sequence as a
function whose domain is the positive integers, in which case
the sequence is represented { (i, project (Si, SV) ) | i = 1, ..., n }

++ solved in rq24 10.1.2 "Projection"


10.1.3 ORDER BY
The formal definition does not support the following features:
a) ordering in descending order
b) ordering by multiple sort keys.

++ Andy replied in
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0027.html
He noted that he had made an editorial improvement to the first
sentence of rq24 10.1.1 "ORDER BY".  I agree that the editorial
change is an improvment, and perhaps that is all that can be done
editorially, but I still feel that there is an unspecified gap
between the syntax and the semantics.  How does one know that
one uses the increasing order with ASC and reverses that order
with DESC?  How does one know to do the lexicographic
order when there are multiple sort keys?  These are not weighty
issues, but other specifications have done this, and so can we.
I regard this as part of my general issue of connecting the dots
from the syntax to the semantics.

**********************************************************

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0134.html
Comments on 2.6 "Multiple Matches" (CR 6 Apr 2006)

2.6 Multiple matches
First sentence: "The results of a query is the set of all pattern
solutions that match the query pattern, giving all the ways a query
can match the graph queried."  But sets eliminate duplicates,
and we have the DISTINCT operator as an optional syntactic choice
about whether to eliminate duplicates.  Instead, this should say
that the result of a query is a sequence of solutions.  See
section 10.1 "Solution sequences and result forms".

++ solved in rq24 2.2 "Multiple matches"


2.6 Multiple matches
The semantics of the empty graph pattern has not been defined.
I think the following queries are instructive:

a) SELECT ?a
   FROM graph
   WHERE { }

b) SELECT ?a ?b
   FROM graph
   WHERE { }

c) SELECT ?a ?b
   FROM graph
   WHERE { ?a foaf:verb foaf:noun }

d) SELECT ?a ?b
   FROM graph
   WHERE { ?a foaf:verb foaf:noun .
   OPTIONAL { ?a foaf:verb2 ?b } }

One's initial impulse is that query a) should result in the set of
all mappings of { ?a } to the scoping set (not the set of all
total mappings of V to the scoping set; see related comment).
Or equivalently, the user might view the result as an enumeration
of the scoping set of the graph.

Then query b) would result in the set of all mappings of { ?a, ?b }
to the scoping set, or, naively, the cross product of the scoping
set with itself.

However, I believe that c) and d) should result in a subset of the
result of b).  Now in the case of d) in particular, OPTIONAL is
intended to allow for a result which is a partial binding, ie, one
that binds ?a but does not bind ?b.  If it happens that there is
no binding for ?b, then the result would not be a subset of the
cross product of the scoping set with itself.

My conclusion is that in order to support OPTIONAL and UNION, we
have to permit a result that is a partial mapping.

Coming back to query b), in order for it to contain query d) as a
subset, the result of b) must be all partial functions from {?a, ?b}
to the scoping set.  Alterantively, a naive view might
imagine augmenting the scoping set with a single "missing" element,
distinct from all other elements, in which case the result of b)
is the cross product of the augmented scoping set.

And as for a), it seems the result must be the set of all partial
functions of { ?a } to the scoping set, or equivalently, an
enumeration of the augmented scoping set.

As a different approach to this issue, consider these two queries:

a1) SELECT ?a
    FROM graph
    WHERE { BOUND (?a) }

a2) SELECT ?a
    FROM graph
    WHERE { !BOUND (?a) }

I believe the following things:
-- the result of a) should be the union of the result of a1) and a2)
-- the result of a1 should be an enumeration of the scoping set,
-- the result of a2 should be a single solution, in which ?a is
not bound.

++ solved in rq24 6.2 "Empty graph pattern"

****************************************************

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0133.html
General problem with cardinality of results in CR 6 April 2006

The specification is imprecise about the cardinality of solutions
in several places.  I have heard it argued that all that matters is
the set of solutions (ie, duplicates can be dropped as an
implementation dependent or defined detail).  The problem with that
position is that the number of duplicates may be semantically
meaninful to the user.

This issue will become acute when we add aggregates, which has been
deferred to a future release.  When a user performs a count or sum,
the user expects precise semantics about the number of duplicates
to be counted or summed.  If we fail to specify those semantics in
this edition, then we will encourage differing implementations, which
then will have entrenched positions on cardinality issues when we try
to add the aggregates.  Therefore it is important to resolve the
cardinality issues at this stage.

Even in advance of adding aggregates, the user may be interested
in computing them on his own using the results of a SPARQL query.
To assure portable and interoperable results, we need to define
the number of duplicates precisely.

I see at least the following issues on cardinality:

a) the number of solutions to the empty pattern,
SELECT ?A WHERE {}.  I can see arguments for 0 solutions,
1 solution (the one that makes ?A undefined), n solutions where
n is the cardinality of the scoping set (one for each possible
binding of ?A) or n+1 (one for each possible binding, plus one in
which ?A is not bound).

++ solved in rq24 6.2 "empty graph pattern".

b) the number of solutions to a UNION, for example,
SELECT ?A WHERE { { ?A ?A ?A } UNION { ?A ?A ?A } }

c) the number of solutions when a triple pattern includes a blank
node. For example SELECT ?A WHERE { ?A n:v _:B } on the graph
G = { (n:a, n:v, "1"), (n:a, n:v, "2") }.
Does this have one solution or two?

Argument in favor of one solution:
there is only one mapping S from the set of variables to the set of
RDF terms such that S( ?A n:v _:B ) is a triple that can be merged
into G to produce a new graph that is simply entailed by G.  This is
using the definition of basic graph pattern E-matching in section
2.5.1 "General framework".

Argument in favor of two solutions: Section 2.5.2 "SPARQL basic
graph pattern matching" last paragraph says that under simple
entailment, pattern matching can be done by mapping both variables
and SPARQL blank nodes to RDF terms, testing to see if the result of
the mapping is a subgraph of G.  Under this formulation, there are
two solutions, one that maps _:B to "1" and the other that maps _:B
to "2".  This paragraph goes on to say that solutions are formed
by restricting such mappings to just the set of variables.  What
is unclear is whether the act of restricting involves discarding
duplicates.  Note that the fact that there is a DISTINCT modifier
shows that one can not presume that duplicates are discarded.

I am posting separate, more detailed comments on specific sections
with cardinality issues.

++ I think we need an issue on cardinality of UNION.
As for cardinality of a basic graph pattern with a blank node
identifier, my current interpretation of the document is that
the general framework is normative and therefore the cardinality
must be 1.


************************************************************

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0141.html
Comment on 4.1 "Group graph patterns" (CR 6 Apr 2006)


4.1 Group graph patterns
The formal semantics of group graph patterns
does not work in conjunction with the
definiton of basic graph pattern matching (section 2.5.1 "General
framework").  For example, let G be the graph

  <s> <v1> <o1> . <s> <v2> <o2> .

Consider the pattern

  { ?x <v1> _:a } { ?x <v2> _:a }

The pattern is a group graph pattern consisting of two triple
patterns.  According to the definition, we are looking for a
solution S that is a solution of { ?x <v1> _:a } and a
solution of { ?x <v2> _:a }.  As a trial solution, consider the
function S that maps ?x to <s>.  I claim that S is a solution of
both subpatterns. For the first pattern, according to the
definition of basic graph pattern E-matching the question is
whether the following graph

  <s> <v1> <o1> . <s> <v2> <o2> . <s> <v1> _:a .

is entailed by G.  The answer is yes.  Similarly, s is a solution
of the second pattern as well.

The problem in logical terms is that
"for all x there exists y such that P(x, y)" is not as strong
an assertion as "there exists y such that for all x, P(x, y)".
The current definition only supports the weaker assertion
"for all there exist" rather than the desired assertion
"there exists for all".  Mathematicians generally use the term
"uniform" to describe "for all there exists" situations
(for example, the definition of uniform continuity).

On the other hand, if one uses the alternative definition of
triple matching in section 2.5.2 "SPARQL basic graph pattern
matching", the uniform treatment of blank node identifiers
in graph patterns is assured.  This definition says that
a solution is found by mapping variables and blank node identifiers
to RDF terms.  In that case the trial solution S must specify
which node in G the blank node identifier _:a is bound to.
Since there is no single choice that works, there is no solution
to the pattern.

++ This comment as worded above is mistaken.  It assumes that
using the mapping algorithm for basic pattern matching makes
the scope of a blank node identifier be the entire query.
However, rereading rq24 5.2 "SPARQL basic graph pattern matching",
I see that the scope is still just a basic graph pattern.

Whether this is advisable from a usability standpoint is a
separate question.  I have not decided whether to raise that
as an issue.

***************************************************

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0189.html
Blank node identifiers in FILTER clauses

The scope of blank node identifiers is not clearly specified.
However, as I have understood conversations in email and
telecon, the definition of basic graph pattern E-matching in
2.5.1 "General framework" provides the only definition for the
semantics of blank node identifiers, and therefore
the scope of a blank node identifier
is a basic graph pattern.  My question is whether the scope
can also extend into a Constraint in a FilteredBasicGraphPattern.

For example, consider the data set with these three triples:

<s1> <v> <o1> .
<s2> <v> <o2a> .
<s2> <v> <o2b> .

The user wants to find those subjects which are related via the
verb <v> to at least two objects.  The desired solution
sequence is { <s2> }.  The user writes his query this way:

SELECT ?x
WHERE { ?x <v> _:a . ?x <v> _:b . FILTER (_:a != _:b) }

Does this do what the user wants?

It seems that the definitions in 2.5 "Basic graph patterns"
only explain how to solve the basic graph pattern

?x <v> _:a . ?x <v> _:b .

The solutions of this basic graph pattern are ?x = <s1>
and ?x = <s2>.  In the case of ?x = <s1>, this is because
the dataset entails the addition of these triples:

<s1> <v> _:a .
<s1> <v> _:b .

or in predicate calculus terms, it is possible to conclude
from the dataset that

(exists _:a, _:b) [ <s1> <v> _:a . <s1> <v> _b . ]

Or using the mapping technique for simple entailment, map
?x -> <s1>, _:a -> <o1>, _:b -> <o1> and then restrict to
just the mapping of ?x.

Note that the definitions of section 2.5, using either
entailment or mapping, do not provide for evaluating a
Constraint during the process of finding solutions to a
basic graph pattern.

So both solutions ?x -> <s1> and ?x -> <s2> come to the
FILTER clause, and the FILTER clause is unaware of any bindings
to _:a or _:b.  I do not know whether the result of
FILTER (_:a != _:b) is true, false or error, but whatever
the semantics of the FILTER clause is, it appears that it will
treat the two solutions identically.  If true, then both
<s1> and <s2> are solutions; if false or error, then neither
are.  Thus the solution set appears to be either { <s1>, <s2> }
or the empty set.  Not what was desired!

I see four possible resolutions:

1. (My preference) the scope of a blank node identifier is
an entire FilteredBasicGraphPattern, not just a basic graph
pattern.  To do this, we need to extend the definitions in
section 2.5 so that they define the solutions of a
FilteredBasicGraphPattern rather than just the solutions of a
basic graph pattern.  I can see how to do this with the
simple entailment mapping definition; I don't see how to do
this with the general E-entailment definition.

2. We prohibit blank node identifiers in FILTER clauses as
inherently meaningless or deceptive syntax.

3. We allow blank node identifiers in FILTER clauses, but they
always raise an error, so that such FILTERs always fail.
But in that case, why did we permit the syntax?

4. We allow blank node identifiers in FILTER clauses, and
they reference distinct blank nodes, distinct from all blank
nodes in the dataset.  Thus _:a = _:b is false, and _:a != _:b
is true.

++ this comment started an extended dialog.  My summary is that
it seems there is a consensus for choice 1.  I suggest that
we add this to the issue list.  Presumably it will be resolved
by the anticipated rewrite of the general framework.


*****************************************************

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0008.html
Concrete vs. existential semantics

I have been advocating for strict definitions of the number of
rows returned by queries.  As I understand it, Andy Seaborne
has advocated an opposite view, that SPARQL should not define
precisely how many duplicates are returned by a query.
For example, in
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0005.html
"In general, it isn't possible to conclude anything about numbers
of things in RDF.  It is in OWL."
I have also heard the opinion that it does not matter whether
duplicates are eliminated from a UNION or not; I don't have
a name or message to cite for that opinion.  More generally,
I think there is an opinion that all SPARQL cares about is
that the result sequence, after eliminating duplicates, is correct.
Thus the result of a SELECT is not precisely defined;
only SELECT DISTINCT is.

In this message I want to start a discussion on this.
As an initial foray, I will frame the question in terms
of "concrete" vs. "existential" semantics.

I grant that it is difficult to impossible to be sure that
two seemingly-different IRIs refer to distinct things.
I also grant that it is difficult to impossible to be sure that
two seemingly distinct blank nodes, conceived of as existentials,
are known to be distinct. However, I wonder whether it is a
good idea to base our semantics exclusively on these "existential"
insights.

I think the naive view is that two things are distinct if they
look distinct.  Two IRIs that are spelled differently are
different.  Two blank nodes with different node identity
are different.  (Blank node identifiers are proxies for node
identity; two blank nodes with different identifiers are
different).

I think that in many instances, the users will want this kind of
concrete interpretation of an RDF graph.  Further, I believe
that when one is working with a concrete interpretation,
duplicates may carry semantic meaning and it is important to
define precisely how many duplicates are returned. I especially
believe this is true when there are financial figures involved.

For example, imagine a purchase order encoded in RDF.
Each purchase order has an IRI.  Various facts about the PO
are assembled using verbs: bill-to, ship-to, and the line items.
Since bill-to, ship-to and line items are all compound objects,
they may be represented by blank nodes, which in turn connect
via various verbs to literals or IRIs.  Let's look at the line items
in particular.  A line item consists of a part number (an IRI),
a quantity (an xsd:integer), and a unit price (an xsd:decimal).
The user wants to find the total price of a particular PO.
The query looks something like this:

SELECT ?quantity ?price
WHERE some:IRI po:po _:lineitem .
      _:lineitem po:quantity ?quantity .
      _:lineitem po:price ?price .

Since SPARQL has no aggregates or expressions in its SELECT
list, the user intends to simply fetch all rows, multiply
?quantity * ?price and take the sum himself.

Now it can happen in a PO that the quantity and price of
two line items are identical.  However, suppressing such
duplicates would be fatal to this application.

Note that adding the part number to the SELECT list will
not necessarily save the query, since the combination of
part number, quantity and price is still not a guaranteed
unique key for line items.  The user is relying on distinct
blank nodes to represent distinct line items.

Of course, from the point of view of "RDF Semantics"
that would be a redundant graph, for example, one that
asserts "There exists a line item whose part is XYZ,
quantity is 1 and price is 10.99" and asserts again
"There exists a line item whose part is XYZ, quantity is 1 and
price is 10.99".  Thus one could say that this is a misuse
of RDF.  This may be technically true, but I wonder if insisting
on this point will really serve the users.  If you read the RDF
Primer, the application design above makes sense. You have a line
item; you don't want to bother creating an IRI for each line
item; so you make a blank node for each line item.
"RDF Semantics", on the other hand, is a dense document
with talk about hypothetical universes that are interpretations
of a graph.  This is not the kind of material that will make
its way into seminars, courses, how-to books, etc.

The early days of relational databases encountered the same
problem.  The theorists said a relational table is a set,
therefore it can have no duplicates, therefore it is up to
the user to insert some additional piece of information to
distinguish two otherwise-identical line items, to provide a
unique key.  Sounds great in theory; however, the vendors
discovered that they had to accomodate the naive view that
each row has its own identity and is distinct, without requiring
a unique key.

A slightly different response is that RDF and SPARQL are not
targeted at such applications.  However, the introduction to
"OWL web ontology language guide" poses this scenario:
"consider actually assigning a software agent the task of
making a coherent set of travel arrangements."  If eventually
RDF databases and SPARQL queries are part of such a software
agent, then it will be necessary to make concrete assurances
about the total price of a travel plan.  In addition, the vision
is that the dataset will be aggregated from many sites, which
means that there will not be a central authority to impose
strict existential semantics.

My suggestion is that we consider some syntactic way to
support both a "concrete" interpretation and an "existential"
interpretation.

My tentative initial solution is a three-way switch:
SELECT DISTINCT, SELECT ALL and SELECT LAX.  
SELECT DISTINCT promises to remove duplicates,
SELECT ALL promises to deliver all duplicates,
and SELECT LAX makes no promises either way.  (Anyone have
a better keyword for this choice?)

I don't believe this is the complete solution to the issue.
The reason is that the issue of duplicates becomes more
complicated when using OWL entailment.  OWL permits the
deduction that two seemingly distinct IRIs or blank nodes
are in fact equal.  For example, if the reasoner can deduce
that some:IRI1 = some:IRI2, what should the reasoner return
for SELECT ALL?  Does it return both even though it knows they
are equal?  If not, how does the user frame a query to ask
for all synonyms of some:IRI1?  What should the
reasoner return for SELECT DISTINCT?  Does it pick one of the
two arbitrarily?

++ my comment was vigorously rebutted and I did not receive
any messages in support of my comment.  I understood the
replies and I can see the opposing point of view.  I continue to be
concerned that we may lose the confidence of our users if
SPARQL cannot be relied on to return results that can be
meaningfully counted and summed.  However I am not pushing to
add this to the issue list at this time.

*******************************************************

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0042.html
ready to exit CR?

I am strongly opposed to exiting CR because of the issues
I have raised with the specification, which I regard as serious
and fatal.  In my view the purpose of a specification is to specify.
Examples do not constitute a normative specification.
The document (both the CR and rq24) fails to specify in the following
important ways:

1. There is no bridge from the concrete syntax to the abstract
semantics.  Consequently the document can not actually be said
to specify the language at all, except that A.7 "Grammar" really
does specify the syntax.

2. The scope of blank node identifiers has not been stated clearly.
The consensus in an email thread appears to be that the scope
is a FilteredBasicGraphPattern (rule [21]) but the definitions in
2.5.1 "General framework" do not support this and need to be
rewritten.

3. The abstract semantics does not pay attention to the critical
issue of the domain of solutions.  Consequently the notion of
"solution" is not well-defined.

4. The preceding problems are perhaps at their worst in the
case of optional graph patterns.  The grammar does not indicate
what the first operand of a graph pattern is, and there is no
discursive text on the subject either.  Thus there is no bridge from
the syntax to the abstract semantics.  As for the abstract semantics,
the definition of OPT(A,B) appears to reduce to just solving A
with no role for B.

5. It is not clear whether UNION requires an implementation to
count duplicate solutions precisely, which I personally advocate,
though I could live with the alternative of stating explicitly that
it is implementation-defined or -dependent how many duplicates
are returned.

++ this message recapitulates some of my comments and indicates
some that I feel are obstacles to progression.  These issues are
all enumerated earlier in this file and no additional issues
need to be created.

Fred
Received on Wednesday, 2 August 2006 19:25:55 UTC