Re: comments on SPARQL Query Language for RDF from Pat Hayes on 2007-05-30 (public-rdf-dawg-comments@w3.org from May 2007)

From: Pat Hayes <phayes@ihmc.us>
Date: Tue, 29 May 2007 19:54:09 -0700
To: Bob MacGregor <bmacgregor@siderean.com>
Cc: public-rdf-dawg-comments@w3.org, "Eric Prud'hommeaux" <eric@w3.org>, "Richard Newman" <rnewman@franz.com>
Message-Id: <p06230923c28292da6050@[192.168.1.4]>
>Hi Pat,
>
>
>On May 29, 2007, at 1412, Pat Hayes wrote:
>
>>
>>>Hi Richard,
>>>
>>>On May 28, 2007, at 1435, Richard Newman wrote:
>>>
>>>>Hi Bob,
>>>>
>>>><snip>
>>>>
>>>>   Regarding point 2: yes, AllegroGraph allows 
>>>>you to store whatever you like in the graph 
>>>>field of a triple. Other stores might not. 
>>>>I'm not sure that I agree with you about 
>>>>naming -- why not mint URIs, or use UUID 
>>>>URNs? You can cram almost anything into a 
>>>>URI! -- but you can certainly use variables 
>>>>in your queries.
>>>>
>>>
>>>The phrase "mint URIs" raises a red flag, 
>>>since it is frequently contrary to the whole 
>>>point of a URI.  That is definitely true in 
>>>this case.
>>>Suppose I have two graphs with identical 
>>>triples, and identical provenance attached to 
>>>their "graph names".  I claim that these
>>>two graphs should be considered equivalent. 
>>>If the graphs are identified with blank nodes, 
>>>then that is indeed the case. Otherwise,
>>>it's not.  The presence of a URI overdefines 
>>>the semantics of the provenance.  Does this 
>>>matter?  Indeed it does.  Our quad store
>>>does union and collapsing operations on 
>>>provenance to increase performance (sometimes 
>>>by orders of magnitude).  The operations
>>>it performs are not valid if URIs are present. 
>>>I would not be surprised if AllegroGraph does 
>>>not yet incorporate these optimizations.
>>>However, once you start to use sufficiently 
>>>aggressive provenance, it's likely you will 
>>>want to do the same.
>>>
>
>>?? Bob, what are you talking about? Let's agree 
>>for the moment with your claim that the two 
>>graphs should be equivalent (though I'm having 
>>trouble understanding how they can have 
>>*identical* provenance information if one is a 
>>copy of another; perhaps we mean something 
>>different by 'provenance'). You say that if 
>>they have different names, they cannot be 
>>equivalent. Why not? The entire RDF/URI model 
>>allows a single entity to have more than one 
>>name. The point of URIs is to identify, but not 
>>to identify uniquely. So in fact the two graphs 
>>can be identical, if you like, like two 
>>imprints of the same edition of a novel.
>>
>
>I guess I need to be a bit more explicit about 
>the phrase 'equivalent';  since we deal with 
>quads in our own system instead of
>triples, our notion of equivalence has evolved. 
> So I will be more careful here:
>
>I didn't say that one graph was a copy of the 
>other.  I said that they had identical triples, 
>i.e., an equivalence test that
>ignored provenance would return true.

OK.

>  If the graph names are N1 and N2, I also 
>asserted that provenance assertions/triples 
>about N1 and N2
>are also the same (same dc:source, same dc:date, etc.).

OK again.

>  I'm not asking if the two graphs can be identical, I'm asking if they ARE
>identical.  If names matter (and they do), then 
>absent an owl:sameAs assertion between N1 and 
>N2, the graphs cannot be
>assumed to be the same.

Their identity is not entailed by anything. But 
it would not be a contradiction to assume they 
were identical. Are you afraid to make this 
assumption? Why would you be? If their 
provenances are identical, what could possibly 
distinguish them?

>  If names don't matter, e.g., if blank nodes are 
>substituted for the names, then logically the 
>graphs,
>including their provenance, are indistinguishable.

If you use a blank node as a name, I don't think 
that means anything at all according to the RDF 
semantics. If you treat blank node IDs as real 
identifiers - which isn't strictly RDF legal, but 
if you do - then different bnodes are just as 
different as different URIs. Either way, using 
bnodes as names doesn't get you anywhere.

>
>In general, the kind of merging we want to do to 
>preserve scalability in the presence of large 
>scale provenance includes the
>ability to merge two graphs into one when their 
>provenance triples are the same.

Well then, go ahead and do that. I don't see what 
is stopping you. Nothing in the RDF or SPARQL 
specs would prohibit this.
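
To make the merging step concrete, here is a minimal sketch in Python, 
assuming graphs are plain sets of triples and provenance is a set of 
(property, value) pairs per graph. The names `graphs`, `provenance`, 
`merge_by_provenance` and all the `urn:example:` / `ex:` identifiers are 
hypothetical, not taken from any real store. The only point it illustrates 
is that the grouping key is the provenance, so it makes no difference 
whether the graphs happen to carry URI names.

```python
from collections import defaultdict

# Hypothetical data: graph name -> set of (subject, predicate, object) triples.
graphs = {
    "urn:example:g1": {("ex:a", "ex:p", "ex:b")},
    "urn:example:g2": {("ex:a", "ex:p", "ex:b"), ("ex:c", "ex:p", "ex:d")},
    "urn:example:g3": {("ex:e", "ex:p", "ex:f")},
}

# Hypothetical provenance: graph name -> set of (property, value) pairs.
provenance = {
    "urn:example:g1": {("dc:source", "ex:feed1"), ("dc:date", "2007-05-29")},
    "urn:example:g2": {("dc:source", "ex:feed1"), ("dc:date", "2007-05-29")},
    "urn:example:g3": {("dc:source", "ex:feed2"), ("dc:date", "2007-05-29")},
}


def merge_by_provenance(graphs, provenance):
    """Collapse graphs whose provenance sets are equal.

    Returns a dict mapping each distinct provenance set to a pair
    (names of the graphs that were collapsed, union of their triples).
    """
    merged_triples = defaultdict(set)
    merged_names = defaultdict(set)
    for name, triples in graphs.items():
        # The provenance, not the name, decides which graphs are merged.
        key = frozenset(provenance.get(name, ()))
        merged_triples[key] |= triples
        merged_names[key].add(name)
    return {
        key: (frozenset(merged_names[key]), frozenset(merged_triples[key]))
        for key in merged_triples
    }


collapsed = merge_by_provenance(graphs, provenance)
```

The original names are kept alongside each merged graph, so a graph may 
end up with more than one name without becoming a different graph.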

>  Specifically, we don't usually care about
>equivalence between the contents of two graphs, 
>but we do care about equivalence between the 
>provenance statements attached to
>graphs.

OK, fair enough. I guess the sharp edge here is 
knowing that you have *all* the provenance 
information.

>
>>Why are your optimizing collapsings not valid 
>>if URIs are present? You can simply declare 
>>that your identity criteria on graphs allow a 
>>graph (not a named graph, but an RDF graph) to 
>>have more than one name without being a 
>>different graph. You are free to impose extra 
>>semantics on the basic RDF model if you find it 
>>useful.
>>
>
>I could also declare that for us, URIs don't 
>matter within a graph, and we can collapse 
>arbitrary triples if the literals are the same. 
>But that
>would be absurd.

Of course it would. But nobody is suggesting 
that. You want to do an optimisation which you 
feel is reasonable, to merge isomorphic graphs 
with the same provenance. As far as I can see, 
that amounts to your having the confidence to 
assume that identical provenances guarantee 
identity. OK, then it still does no matter what 
names are used to refer to the graphs. The use of 
a name is just that: the use of a name. It does 
not imply anything.

>  I am assuming that if URIs are used to name 
>graphs, then there is some reason why they are 
>used

Don't assume that. THAT assumption is in 
violation of the RDF semantics, ironically. The 
name used to refer to something says nothing at 
all about the thing it refers to. It is just a 
name.

>in preference to
>blank nodes, which are currently illegal, as far as I understand.

They are meaningless rather than illegal. Ask 
yourself, what would it mean to use an 
existentially bound variable as a name?

>  Of course, I'm not using a blank node to name a graph, I'm using it
>to refer to a graph.

Name, refer, identify, whatever: it doesn't do 
any of these. Think of it as an existentially 
bound variable, with the quantifier 'outside' the 
entire Web, and different from any other such 
variable.

>
>>Nothing in RDF or SPARQL suggests that 
>>different names cannot denote the same thing.
>>
>
>I never said or implied that they can't.

But you seem to be assuming that because two 
names are used to refer, that this multiple-name 
usage alone is enough to make you lose 
confidence in your reasons for assuming identity 
(based on identical provenance). This loss of 
confidence is misplaced, and isn't based on 
anything in the RDF or SPARQL semantics.

>
>>
>>A further puzzle is that you are happy if the 
>>name is a blank node... do I have that right? 
>>That simply does not make sense to me. Blank 
>>nodes cannot be used as names or identifiers. 
>>The meaning of a blank node is to express an 
>>existential assertion. Using a blank node as an 
>>identifier is meaningless.
>>
>
>My claim is that I should be able to manipulate 
>graphs and assign them provenance without the 
>need for naming the graphs.
>We are dealing with applications where we 
>may have 150,000 graphs (give or take an order 
>of magnitude).   There is no benefit
>to be derived by naming them.

That may be so, but that is a different point. 
You were claiming that the presence of names 
somehow prevented you from applying an 
optimization step. I'm saying that it does not.

However, I am at a loss to understand how you 
refer to these 150,000 graphs if you have no way 
to name them. How do you even know how many you 
have?  (It sounds from your description that you 
are in effect treating the provenance as *being* 
the name of the graph. Does that perspective help 
reconcile things? )

>
>I've observed that people's thinking is 
>frequently circumscribed by the nomenclature 
>they use.  This is likely the case for
>"named graphs".  The SPARQL spec says that we 
>can have only one unnamed graph; all of the 
>others must have names.

There has to be some way for the query to refer 
to them. If you can think of a way of doing this 
without somehow naming them, please explain it.

>In our applications, we have very large numbers of unnamed graphs.

OK. Do you always query against the same set of 
unnamed graphs? If so, you can treat this as a 
single graph for purposes of defining a SPARQL 
query answer. If not, how do you propose that a 
query will specify which of the 150,000 are 
supposed to be used in answering the query?
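
One possible answer to that question, sketched in Python under the 
assumption that provenance itself does the job of a name: the query 
selects graphs by a provenance description and is then answered against 
their union. This is an illustration only, not SPARQL and not anything 
either spec defines; `select_graphs`, `query_union` and the sample data 
are hypothetical.

```python
# Hypothetical data: internal graph key -> set of (s, p, o) triples.
graphs = {
    "g1": {("ex:a", "ex:p", "ex:b")},
    "g2": {("ex:c", "ex:p", "ex:d")},
    "g3": {("ex:e", "ex:q", "ex:f")},
}

# Hypothetical provenance: internal graph key -> set of (property, value) pairs.
provenance = {
    "g1": {("dc:source", "ex:feed1")},
    "g2": {("dc:source", "ex:feed1")},
    "g3": {("dc:source", "ex:feed2")},
}


def select_graphs(graphs, provenance, required):
    """Return the keys of graphs whose provenance contains every required pair."""
    required = set(required)
    return {name for name in graphs if required <= provenance.get(name, set())}


def query_union(graphs, names, pattern):
    """Match one triple pattern (None is a wildcard) against the union
    of the selected graphs."""
    union = set().union(*(graphs[name] for name in names))
    return {
        triple
        for triple in union
        if all(want is None or want == got for want, got in zip(pattern, triple))
    }


selected = select_graphs(graphs, provenance, {("dc:source", "ex:feed1")})
answers = query_union(graphs, selected, (None, "ex:p", None))
```

Note that the store still needs some internal handle for each graph (the 
dictionary keys here), even if no query ever mentions it.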

Pat
-- 
---------------------------------------------------------------------
IHMC		(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32502			(850)291 0667    cell
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Wednesday, 30 May 2007 02:54:27 UTC