Re: Draft response to: Re: major technical: blank nodes from Pat Hayes on 2006-06-09 (public-rdf-dawg@w3.org from April to June 2006)

From: Pat Hayes <phayes@ihmc.us>
Date: Fri, 9 Jun 2006 11:39:56 -0500
To: Fred Zemke <fred.zemke@oracle.com>
Cc: public-rdf-dawg@w3.org
Message-Id: <p06230902c0af455804a0@[10.100.0.24]>
>This is a response to Pat Hayes's email in the archive
>dated 26 Jan 2006 16:50:59 -0600.  Thank you for your
>detailed comments.  They have been very helpful to me personally
>in understanding the draft.
>The reason for my reply is that I believe we can do a better job
>in our treatment of blank nodes in SPARQL.

I honestly don't think we can, given the many constraints we need to 
satisfy. What we can do a better job of, however, is *explaining* the 
treatment.

>   I first came to an
>earlier draft after reading the RDF Recommendations.  I found
>the SPARQL draft very confusing and frustrating.  My essential
>complaint was that SPARQL uses one term for two concepts:
>
>a) RDF blank nodes, which are nodes in a graph with no label, and
>
>b) SPARQL blank nodes, which are lexical tokens in a SPARQL query.
>
>Pat Hayes's email rejects this interpretation.

Well, it wasn't the one I had in mind when we were writing the spec, 
put it that way, and it wasn't my intention. Unfortunately I can't 
speak for what was in anyone else's mind.

BTW, my experience on SPARQL has led me to think that we didn't do a 
good enough job of explaining the idea of blank nodes in the RDF 
spec. You seem to have grokked it thoroughly.

>However, let me give the reasons that I held it, based on my
>reading of RDF and SPARQL both:
>
>a) According to the "RDF Concepts and abstract syntax" Recommendation,
>section 6.6 "Blank nodes", the set of RDF blank nodes is distinct
>from the set of IRIs and "Otherwise, this set of blank nodes
>is arbitrary.  RDF makes no reference to any internal structure
>of blank nodes".  That is, RDF blank nodes have no label.

They have no label *in an RDF graph*. Other document conventions 
might 'label' them in ways determined by the specs for those 
documents.

>b) The RDF Primer section 2.3 "Structured property values and blank
>nodes" Figure 6 "Using a blank node" shows a blank node as having
>no label.  It goes on to describe "blank node identifiers" of which
>it says "...blank node identifiers are not considered to be actual
>parts of the RDF graph."
>c) In our own working draft Section 2.5.3 "Example of basic graph
>pattern matching" second sentence under the first box, it says
>"The label information is not in the graph."
>
>d) Section 2.8.3 "Blank nodes" says "Blank nodes have labels
>which are scoped to the query".  However, RDF blank nodes have
>no notion of scope (they simply exist, just as IRIs and literals
>exist, with no notion of scope)

Exactly. But look at that sentence that you quote: the LABEL is 
scoped to the query, not the blank node. Scoping, as you say, is a 
lexical matter.

>.  Scope is a lexical concept
>(the portion of a query text in which an identifier has a single
>referent).
>
>My summary is that the consistent stance of the RDF Recommendations
>is that blank node identifiers are an artefact of serialization.

Exactly. We assumed we would be allowed a slight 
abuse of terminology, in using the serialization token to refer to 
the blank node it is a token of (in the context of the document under 
discussion: in the above case, the query document): after all, that 
is what these tokens are FOR, to refer to blank nodes. This kind of 
abuse of terminology is widely used and familiar, and it avoids what 
would otherwise be rather tedious circumlocutions like "the blank 
node whose token is", which is like saying "the person whose name is 
Fred" rather than just saying "Fred".

But given your confusion after what is clearly an extremely careful 
reading, we should perhaps have been more pedantic, indeed.

>Now if a reader comes to the SPARQL draft with that model, he
>finds it very confusing (certainly I did).

Apologies. SPARQL is indeed based on that model, and so a reader 
making your voyage should find it more transparent.

>  For example, section
>2.4 talks about how to extend a pattern solution S to graph
>patterns.  It says "If v is not in the domain of S, then S(v)
>is defined to be v."  Applied to SPARQL blank nodes such as
>_:a, this says S(_:a) is _:a.  Fine; it is still a lexical
>token; there has been no mention of creating a blank node
>corresponding to the label _:a.

Oh, but come come, surely now you are being a little TOO pedantic. If 
a document claims to be using (near) RDF conventions and uses an RDF 
blank node identifier syntax, surely it is not unreasonable to 
presume that this is intended to indicate a blank node in an RDF 
graph(-like) structure of which the document is a lexicalization. We 
have earlier defined SPARQL patterns as RDF-graph-like things 
containing genuine blank nodes. True, we do not formally distinguish 
patterns from their lexicalizations, but it seems clear that the 
intention here is to continue and slightly extend the RDF model to 
similar structures containing variables. No?

>  As a result, the mapping of
>a triple pattern, such as
>
>  ?x :v _:a
>
>is
>
>  (S(?x), :v, _:a)
>
>and there still is no RDF blank node.

What else would _:a be considered to be a lexicalization of?

>Consequently, the result
>of the mapping is not an RDF triple.  Then we come to
>section 2.5.1 "General framework" and the definition of "basic
>graph pattern E-matching".  This definition posits a basic
>graph pattern BGP' and a scoping graph G' such that "G' and
>BGP' do not share any blank node labels".

Whoops. That shouldn't say 'label', indeed.

>  But how can they?
>BGP' is a triple pattern and might contain SPARQL blank nodes;
>G' is an RDF graph and as such does not contain anything that
>can be called a blank node label at all (though serializations of G'
>might).
>
>After studying Pat Hayes's email, my conclusion is that the
>text is using blank node identifiers as proxies or surrogates
>for the blank nodes themselves.

Yes, a fair diagnosis. That is exactly the 'abuse of notation' I 
mentioned above.

>To clarify our text, my proposed resolution is as follows:
>
>a) We should adopt the term "blank node identifier" for what I have
>been calling SPARQL blank nodes.  This would harmonize with RDF
>Recommendations, which use this term when talking about
>character strings associated with blank nodes for identification
>purposes.  For example, section 2.1.4 would be renamed "Syntax
>for blank node identifiers".  We should scan the document for
>other occurrences of "blank node", and, as appropriate, change to
>"blank node identifier".

Good idea.

>
>b) We state explicitly that for each distinct blank node identifier,
>a distinct blank node is created for the purposes of processing
>the query, different from any blank node in the graphs in the
>query's dataset.

Er...be careful. I don't think we should phrase this in terms of 
*creation* of blank nodes; that's a bit like saying that when you 
write a numeral, you create a number. Documents written using RDF 
lexicalization conventions indicate RDF abstract graph structures 
which might contain blank nodes: OK so far. The only question that 
we have to settle about blank nodes (the only question that can be 
asked about them, in fact) has to do with their identity. If a single 
document scope uses several occurrences of a bnode identifier, then 
they identify the same bnode in whatever structure is indicated by 
the document. Otherwise, all that SPARQL has to say is when two 
bnodes are NOT the same, which is what the 'scoping graph' 
definitions are all about.
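
A minimal sketch of that identity rule, in Python. The names BNode and 
read_document, and the representation of triples as tuples of strings, 
are assumptions made for illustration only; nothing here is taken from 
the RDF or SPARQL specs. It shows that repeated occurrences of a bnode 
identifier within one document scope indicate one bnode, and that the 
same identifier in another scope indicates a different one.

```python
class BNode:
    """A blank node: no label, no internal structure, only identity."""
    __slots__ = ()


def read_document(triples):
    """Map the lexical triples of ONE document to abstract triples.

    Tokens starting with "_:" are treated as blank node identifiers whose
    scope is this single call; every other token is passed through as-is.
    """
    scope = {}  # bnode identifier -> blank node, for this document only

    def term(token):
        if token.startswith("_:"):
            if token not in scope:
                scope[token] = BNode()
            return scope[token]
        return token

    return [tuple(term(t) for t in triple) for triple in triples]


doc = [("_:a", ":knows", "_:b"), ("_:a", ":name", '"Fred"')]
g1 = read_document(doc)  # first reading: one scope
g2 = read_document(doc)  # second reading: a separate scope

# Both occurrences of _:a in one scope indicate the same blank node.
same_within_scope = g1[0][0] is g1[1][0]
# The same identifier read in another scope indicates a different node.
distinct_across_scopes = g1[0][0] is not g2[0][0]
```

Note that nothing is "created" by the identifier itself in this sketch: 
the identifier is only a key, and the only thing that can be asked of 
the resulting nodes is whether two of them are the same.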

>We can also say that the reader may wish to
>think of the blank node identifiers as proxies or surrogates for
>these created blank nodes.

Why do we need to say this? This is just part of the RDF graph 
syntax/lexicalization model. At most, I think we might say explicitly 
that SPARQL syntax is an extension of RDF syntax, and inherits the 
RDF distinction between lexical scope of bnodeIDs, and the actual 
occurrence of a bnode in an RDF graph.

>  Perhaps this might go in Section 2.1.4.
>
>c) In section 2.1.8 "Result descriptions used in this document"
>in the definition of RDF term, the created blank nodes  should be
>explicitly listed as part of RDF-B.  (Note that even if one
>believed that blank node identifiers were blank nodes all along,
>this did not put them in RDF-B because they were not part of
>any graph.)

I agree we should be more explicit about the bnode/id distinction here.

>
>d) In section 2.4 "Pattern solutions", definition of "pattern
>solution", we say that the domain of S is extended to include
>blank node identifiers by mapping each blank node identifier to
>the blank node that was created for it in item b) above.

Both the above are supposed to be handled by the 'scoping graph' 
idea. The scoping graph's sole purpose is to be the source of bnodes 
substituted for pattern variables in the answer document, to allow 
this source to be something different from (but isomorphic to) the 
source graph, and to be unique for each query. So rather than 
'creating' bnodes, SPARQL technically 'creates' a scoping graph and 
then simply *uses* the bnodes in it. I guess it comes to the same 
thing, but this way of talking about it makes sure that the bnodes in 
the scoping graph fit into a structure isomorphic to the target 
graph, so that the answer document is obliged to treat these "bnodes 
from the (current bnode-substituted version of the) target graph" in 
a way that makes sense across several answers. It is hard to express 
this as a condition on bnodeIDs.
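
A toy illustration of the scoping graph idea, in Python. This is only a 
sketch under loud assumptions: BNode, scoping_graph and match are 
hypothetical names, matching is done for a single triple pattern with 
no entailment beyond literal lookup, and none of it is the definition 
in the draft. The point it tries to show is that answers take their 
bnodes from an isomorphic copy of the target graph, so one bnode in the 
target corresponds to one bnode across several answers.

```python
class BNode:
    """A blank node: nothing but its own identity."""


def scoping_graph(source):
    """Copy source, replacing each blank node by a fresh one, consistently,
    so that the copy is isomorphic to source but shares no blank nodes."""
    fresh = {}

    def term(t):
        if isinstance(t, BNode):
            if t not in fresh:
                fresh[t] = BNode()
            return fresh[t]
        return t

    return {tuple(term(t) for t in triple) for triple in source}


def match(pattern, graph):
    """Solutions for one triple pattern; strings starting with '?' are
    variables, everything else must equal the graph term exactly."""
    solutions = []
    for triple in graph:
        binding = {}
        for p, t in zip(pattern, triple):
            if isinstance(p, str) and p.startswith("?"):
                if binding.setdefault(p, t) != t:
                    break  # variable already bound to something else
            elif p != t:
                break      # constant term does not match
        else:
            solutions.append(binding)
    return solutions


b = BNode()
target = {(b, ":name", '"Fred"'), (b, ":age", '"42"')}
scoped = scoping_graph(target)

names = match(("?x", ":name", "?n"), scoped)
ages = match(("?x", ":age", "?a"), scoped)
```

Here names[0]["?x"] and ages[0]["?x"] are the same bnode, and that 
bnode is not b: it comes from the scoping graph, not from the target.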

>e) Somewhere we make the observation that the result of
>applying a pattern solution S to a triple pattern is an RDF triple.
>Thus if BGP is a basic graph pattern, then S(BGP)
>is an RDF graph.
>
>f) delete the definition of "basic graph pattern equivalence"
>(changes proposed below make it dispensable).
>
>g) delete the definition of "scoping graph", also unneeded.

I think it (or something like it) is needed, see above.

>h) Reword the definition of basic graph pattern matching
>to use the notion of graph merge found in the RDF Recommendations.
>The revised definition is something like this: "Given an
>entailment regime E, a basic graph pattern BGP, an RDF graph
>G and a pattern solution S whose range is a subset of B, then
>BGP E-matches with pattern solution S on graph G with
>respect to scoping set B if G E-entails the graph merge of G
>and S(BGP)."  Actually, with the statement that the created
>blank nodes are distinct from all blank nodes in the dataset,
>a simple set union will suffice, though we may wish to stick
>with the RDF notion of merge for consistency with RDF.

It's not that simple, unfortunately. We went through a huge 
discussion over this, and simply using merging doesn't cut it. The 
scoping graph idea was one result of this long discussion (which is 
in the email record, should you wish to peruse it, though I'm not sure 
it's a good idea :-)

>i) If we want to keep the technique of renaming blank node
>identifiers, we move that outside the boxed definition into
>explanatory text.  For example, "The graph merge referred to
>in the preceding definition can be thought of as using
>blank node identifiers as proxies for the blank nodes.  In that
>case, care must be taken to ensure that the blank node identifiers
>of G are different from all blank node identifiers in BGP.
>Let G' and BGP' be serializations of G and BGP, respectively,
>such that all blank node identifiers in G' are different from
>all blank node identifiers in BGP'.   Then G' UNION BGP'
>is the serialization of some graph G2.  S is a solution for BGP using
>E entailment if G E-entails G2."
>
>j) In section 2.5.2 "SPARQL basic graph pattern matching" last
>paragraph, we can clarify that pattern solutions are unique,
>not just unique up to blank node renaming.  The so-called "blank
>node renaming" is an artefact of serialization.

Well, strictly yes: but there is a similar notion which we might call 
blank node substitution, and they aren't proof against that, so in 
fact they aren't *unique*, strictly speaking. The point is that blank 
nodes do have an identity, according to the RDF model, so if one 
takes an RDF graph and a set of blank nodes which do not occur in it, 
and substitutes these for the bnodes in the graph, then you do have a 
*different* graph. Isomorphic, true: but different, all the same. And 
this is not an artefact of serialization, but is inherent in the idea 
that blank nodes can be distinguished from one another, which if you 
think about it is about all that can be done with them. This is why 
RDF needed to distinguish between merging and taking a simple union 
when talking about graphs (not serializations of a graph).
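
The difference can be sketched in Python. Everything named here (BNode, 
bnodes, substitute, isomorphic, merge) is hypothetical and chosen for 
illustration; graphs are modelled as frozensets of tuples, and the 
isomorphism check is brute force, usable only on tiny graphs. The 
sketch shows that substituting fresh bnodes gives a different but 
isomorphic graph, and that a merge keeps apart bnodes which a simple 
union would identify.

```python
from itertools import permutations


class BNode:
    """A blank node: nothing but its own identity."""


def bnodes(graph):
    """The set of blank nodes occurring anywhere in graph."""
    return {t for triple in graph for t in triple if isinstance(t, BNode)}


def substitute(graph, mapping):
    """Replace blank nodes according to mapping; other terms are kept."""
    return frozenset(tuple(mapping.get(t, t) for t in triple) for triple in graph)


def isomorphic(g, h):
    """Brute force: is there a bijection on blank nodes taking g onto h?"""
    bg, bh = list(bnodes(g)), list(bnodes(h))
    if len(bg) != len(bh):
        return False
    return any(substitute(g, dict(zip(bg, p))) == h for p in permutations(bh))


def merge(g, h):
    """Union after standardizing apart any blank nodes the graphs share."""
    shared = bnodes(g) & bnodes(h)
    return g | substitute(h, {b: BNode() for b in shared})


x = BNode()
g = frozenset({(x, ":name", '"Fred"')})
h = frozenset({(x, ":age", '"42"')})

# Substituting a bnode that does not occur in g gives a different graph...
g_copy = substitute(g, {x: BNode()})
different = g_copy != g
# ...which is nevertheless isomorphic to g.
still_isomorphic = isomorphic(g, g_copy)

# A simple union identifies the shared bnode; a merge does not.
union_bnode_count = len(bnodes(g | h))
merge_bnode_count = len(bnodes(merge(g, h)))
```

No identifiers or serializations appear anywhere in this sketch, which 
is the point: the distinction lives at the level of the graphs.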

>  The last sentence
>is thus "the serialization of a set of all pattern solutions
>is unique up to blank node identifiers".  We can also delete the phrase
>"...possibly with blank nodes renamed" earlier in the paragraph,
>because a pattern solution is not actually concerned with
>assigning blank node identifiers.

But I agree we should go through the text carefully and try to remove 
as many traces as possible of the token/node ambiguity. There is 
definitely a serious muddle in the current version of 2.5.1, which 
is why I voted against it.

Pat


>Fred


-- 
---------------------------------------------------------------------
IHMC		(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32502			(850)291 0667    cell
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Friday, 9 June 2006 16:40:12 UTC