Re: Draft response to: Re: major technical: blank nodes

This is a response to Pat Hayes's email in the archive
dated 26 Jan 2006 16:50:59 -0600.  Thank you for your
detailed comments.  They have been very helpful to me personally
in understanding the draft. 

The reason for my reply is that I believe we can do a better job
in our treatment of blank nodes in SPARQL.  I first came to an
earlier draft after reading the RDF Recommendations.  I found
the SPARQL draft very confusing and frustrating.  My essential
complaint was that SPARQL uses one term for two concepts:

a) RDF blank nodes, which are nodes in a graph with no label, and

b) SPARQL blank nodes, which are lexical tokens in a SPARQL query.

Pat Hayes's email rejects this interpretation.
However, let me give the reasons that I held it, based on my
reading of both RDF and SPARQL:

a) According to the "RDF Concepts and abstract syntax" Recommendation,
section 6.6 "Blank nodes", the set of RDF blank nodes is distinct
from the set of IRIs and "Otherwise, this set of blank nodes
is arbitrary.  RDF makes no reference to any internal structure
of blank nodes".  That is, RDF blank nodes have no label. 

b) The RDF Primer section 2.3 "Structured property values and blank
nodes" Figure 6 "Using a blank node" shows a blank node as having
no label.  It goes on to describe "blank node identifiers" of which
it says "...blank node identifiers are not considered to be actual
parts of the RDF graph." 

c) In our own working draft Section 2.5.3 "Example of basic graph
pattern matching" second sentence under the first box, it says
"The label information is not in the graph."

d) Section 2.8.3 "Blank nodes" says "Blank nodes have labels
which are scoped to the query".  However, RDF blank nodes have
no notion of scope (they simply exist, just as IRIs and literals
exist, with no notion of scope).  Scope is a lexical concept
(the portion of a query text in which an identifier has a single
referent).

My summary is that the consistent stance of the RDF Recommendations
is that blank node identifiers are an artefact of serialization. 

Now if a reader comes to the SPARQL draft with that model, he
finds it very confusing (certainly I did).  For example, section
2.4 talks about how to extend a pattern solution S to graph
patterns.  It says "If v is not in the domain of S, then S(v)
is defined to be v."  Applied to SPARQL blank nodes such as
_:a, this says S(_:a) is _:a.  Fine; it is still a lexical
token; there has been no mention of creating a blank node
corresponding to the label _:a.  As a result, the mapping of
a triple pattern, such as

  ?x :v _:a

is

  (S(?x), :v, _:a)

and there still is no RDF blank node. Consequently, the result
of the mapping is not an RDF triple.  Then we come to
section 2.5.1 "General framework" and the definition of "basic
graph pattern E-matching".  This definition posits a basic
graph pattern BGP' and a scoping graph G' such that "G' and
BGP' do not share any blank node labels".  But how can they?
BGP' is a basic graph pattern and might contain SPARQL blank nodes;
G' is an RDF graph and as such does not contain anything that
can be called a blank node label at all (though serializations of G'
might).
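
To make the difficulty concrete, here is a minimal Python sketch of
the reading just described.  The class names Var, IRI and BNodeLabel
and the function apply_solution are hypothetical names of my own, not
anything defined in the draft; the only rule taken from the text is
the one quoted from section 2.4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A query variable such as ?x (hypothetical helper type)."""
    name: str


@dataclass(frozen=True)
class IRI:
    """An IRI, which is an RDF term (hypothetical helper type)."""
    value: str


@dataclass(frozen=True)
class BNodeLabel:
    """The lexical token _:a as it appears in the query text."""
    label: str


def apply_solution(S, term):
    # Section 2.4 as quoted: if v is not in the domain of S, S(v) is v.
    return S.get(term, term)


S = {Var("x"): IRI("http://example.org/alice")}
pattern = (Var("x"), IRI("http://example.org/v"), BNodeLabel("a"))
result = tuple(apply_solution(S, term) for term in pattern)
```

Under this reading the third component of result is still the token
from the query text, so the tuple is not an RDF triple.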

After studying Pat Hayes's email, my conclusion is that the
text is using blank node identifiers as proxies or surrogates
for the blank nodes themselves. 

To clarify our text, my proposed resolution is as follows:

a) We should adopt the term "blank node identifier" for what I have
been calling SPARQL blank nodes.  This would harmonize with RDF
Recommendations, which use this term when talking about
character strings associated with blank nodes for identification
purposes.  For example, section 2.1.4 would be renamed "Syntax
for blank node identifiers".  We should scan the document for
other occurrences of "blank node", and, as appropriate, change to
"blank node identifier".

b) We state explicitly that for each distinct blank node identifier,
a distinct blank node is created for the purposes of processing
the query, different from any blank node in the graphs in the
query's dataset.  We can also say that the reader may wish to
think of the blank node identifiers as proxies or surrogates for
these created blank nodes.  Perhaps this might go in Section 2.1.4.

c) In section 2.1.8 "Result descriptions used in this document"
in the definition of RDF term, the created blank nodes should be
explicitly listed as part of RDF-B.  (Note that even if one
believed that blank node identifiers were blank nodes all along,
this did not put them in RDF-B because they were not part of
any graph.)

d) In section 2.4 "Pattern solutions", definition of "pattern
solution", we say that the domain of S is extended to include
blank node identifiers by mapping each blank node identifier to
the blank node that was created for it in item b) above.

e) Somewhere we make the observation that the result of
applying a pattern solution S to a triple pattern is an RDF triple.
Thus if BGP is a basic graph pattern, then S(BGP)
is an RDF graph.

f) Delete the definition of "basic graph pattern equivalence"
(changes proposed below make it dispensable).

g) Delete the definition of "scoping graph", also unneeded.

h) Reword the definition of basic graph pattern matching
to use the notion of graph merge found in the RDF Recommendations.
The revised definition is something like this: "Given an
entailment regime E, a basic graph pattern BGP, an RDF graph
G, a scoping set B, and a pattern solution S whose range is a
subset of B, then BGP E-matches with pattern solution S on graph
G with respect to scoping set B if G E-entails the graph merge of G
and S(BGP)."  Actually, with the statement that the created
blank nodes are distinct from all blank nodes in the dataset,
a simple set union will suffice, though we may wish to stick
with the RDF notion of merge for consistency with RDF.

i) If we want to keep the technique of renaming blank node
identifiers, we move that outside the boxed definition into
explanatory text.  For example, "The graph merge referred to
in the preceding definition can be thought of as using
blank node identifiers as proxies for the blank nodes.  In that
case, care must be taken to ensure that the blank node identifiers
of G are different from all blank node identifiers in BGP.
Let G' and BGP' be serializations of G and BGP, respectively,
such that all blank node identifiers in G' are different from
all blank node identifiers in BGP'.  Then G' UNION BGP'
is the serialization of some graph G2.  S is a solution for BGP using
E entailment if G E-entails G2."

j) In section 2.5.2 "SPARQL basic graph pattern matching" last
paragraph, we can clarify that pattern solutions are unique,
not just unique up to blank node renaming.  The so-called "blank
node renaming" is an artefact of serialization.  The last sentence
would thus read "the serialization of a set of all pattern solutions
is unique up to blank node identifiers".  We can also delete the phrase
"...possibly with blank nodes renamed" earlier in the paragraph,
because a pattern solution is not actually concerned with
assigning blank node identifiers.
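
Items b), d), e) and h) above can be sketched in Python as follows.
This is only an illustration under stated assumptions: terms are
plain strings (variables start with "?", blank node identifiers with
"_:"), the class BNode and the functions is_bnode_identifier,
extend_solution and apply_solution are hypothetical names, and the
E-entailment check itself is not implemented.

```python
class BNode:
    """An RDF blank node: it has identity but no label or internal structure."""


def is_bnode_identifier(term):
    # Assumption of this sketch: identifiers are strings starting with "_:".
    return isinstance(term, str) and term.startswith("_:")


def extend_solution(S, bgp):
    """Items b) and d): map each distinct identifier to a newly created blank node."""
    extended = dict(S)
    for triple in bgp:
        for term in triple:
            if is_bnode_identifier(term) and term not in extended:
                extended[term] = BNode()
    return extended


def apply_solution(S, bgp):
    """Item e): once every variable and identifier is in the domain of S,
    the result is a set of RDF triples, that is, an RDF graph."""
    return {tuple(S.get(term, term) for term in triple) for triple in bgp}


# A graph G with one blank node of its own.
g_node = BNode()
G = {(":alice", ":v", g_node)}

# A basic graph pattern that uses the identifier _:a twice.
BGP = [("?x", ":v", "_:a"), ("_:a", ":w", "?y")]

S = extend_solution({"?x": ":alice", "?y": ":bob"}, BGP)
S_BGP = apply_solution(S, BGP)

# Item h): the created blank node differs from every blank node in G,
# so a plain set union stands in for the RDF merge.
merged = G | S_BGP
```

Because the created blank node is a new object, it cannot coincide
with g_node, which is why the set union is safe here.  Whether G
E-entails the merged graph would still have to be decided by an
entailment checker, which is outside this sketch.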

Fred

Received on Thursday, 8 June 2006 23:32:01 UTC