Re: comments on SPARQL Query Language for RDF from Bob MacGregor on 2007-05-30 (public-rdf-dawg-comments@w3.org from May 2007)

From: Bob MacGregor <bmacgregor@siderean.com>
Date: Tue, 29 May 2007 17:03:10 -0700
To: Pat Hayes <phayes@ihmc.us>
Cc: public-rdf-dawg-comments@w3.org, "Eric Prud'hommeaux" <eric@w3.org>, "Richard Newman" <rnewman@franz.com>
Message-Id: <47D358C3-730D-4FCF-B17F-BF1FE01EC8EE@siderean.com>
Hi Pat,


On May 29, 2007, at 14:12, Pat Hayes wrote:

>
>> Hi Richard,
>>
>> On May 28, 2007, at 14:35, Richard Newman wrote:
>>
>>> Hi Bob,
>>>
>>> <snip>
>>>
>>>   Regarding point 2: yes, AllegroGraph allows you to store  
>>> whatever you like in the graph field of a triple. Other stores  
>>> might not. I'm not sure that I agree with you about naming -- why  
>>> not mint URIs, or use UUID URNs? You can cram almost anything  
>>> into a URI! -- but you can certainly use variables in your queries.
>>>
>>
>> The phrase "mint URIs" raises a red flag, since it is frequently  
>> contrary to the whole point of a URI.  That is definitely true in  
>> this case.
>> Suppose I have two graphs with identical triples, and identical  
>> provenance attached to their "graph names".  I claim that these
>> two graphs should be considered equivalent.  If the graphs are  
>> identified with blank nodes, then that is indeed the case. Otherwise,
>> it's not.  The presence of a URI overdefines the semantics of the  
>> provenance.  Does this matter?  Indeed it does.  Our quad store
>> does union and collapsing operations on provenance to increase  
>> performance (sometimes by orders of magnitude).  The operations
>> it performs are not valid if URIs are present.  I would not be  
>> surprised if AllegroGraph does not yet incorporate these  
>> optimizations.
>> However, once you start to use sufficiently aggressive provenance,  
>> it's likely you will want to do the same.

> ?? Bob, what are you talking about? Let's agree for the moment with  
> your claim that the two graphs should be equivalent (though I'm  
> having trouble understanding how they can have *identical*  
> provenance information if one is a copy of another; perhaps we mean  
> something different by 'provenance'). You say that if they have  
> different names, they cannot be equivalent. Why not? The entire RDF/URI  
> model allows a single entity to have more than one name. The  
> point of URIs is to identify, but not to identify uniquely. So in  
> fact the two graphs can be identical, if you like, like two  
> imprints of the same edition of a novel.
>

I guess I need to be a bit more explicit about the term  
'equivalent'; since we deal with quads in our own system instead of
triples, our notion of equivalence has evolved.   So I will be more  
careful here:

I didn't say that one graph was a copy of the other.  I said that  
they had identical triples, i.e., an equivalence test that
ignored provenance would return true.  If the graph names are N1 and  
N2, I also asserted that provenance assertions/triples about N1 and N2
are also the same (same dc:source, same dc:date, etc.).  I'm not  
asking if the two graphs can be identical, I'm asking if they ARE
identical.  If names matter (and they do), then absent an owl:sameAs  
assertion between N1 and N2, the graphs cannot be
assumed to be the same.  If names don't matter, e.g., if blank nodes  
are substituted for the names, then logically the graphs,
including their provenance, are indistinguishable.
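
To make the distinction concrete, here is a minimal sketch in Python. It is a toy model, not our quad store and not a statement of RDF semantics: the `Graph` class, the use of `None` to stand for a blank-node reference, the `same_as` set, and the `ex:` identifiers are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

Triple = Tuple[str, str, str]
ProvPair = Tuple[str, str]  # (predicate, object) asserted about the graph


@dataclass(frozen=True)
class Graph:
    name: Optional[str]              # None models a blank-node reference (assumption)
    triples: FrozenSet[Triple]
    provenance: FrozenSet[ProvPair]


def content_matches(a: Graph, b: Graph) -> bool:
    """True when triples and provenance statements are the same, ignoring names."""
    return a.triples == b.triples and a.provenance == b.provenance


def can_treat_as_same(a: Graph, b: Graph, same_as=frozenset()) -> bool:
    """Decide whether two graphs may be treated as one in this toy model.

    same_as is a set of frozenset({name1, name2}) pairs standing in for
    owl:sameAs assertions between graph names (hypothetical representation).
    """
    if not content_matches(a, b):
        return False
    if a.name is None and b.name is None:
        return True      # no names: nothing left to tell the graphs apart
    if a.name == b.name:
        return True
    # Distinct names: sameness needs an explicit assertion.
    return frozenset((a.name, b.name)) in same_as


triples = frozenset({("ex:s", "ex:p", "ex:o")})
prov = frozenset({("dc:source", "ex:feed"), ("dc:date", "2007-05-29")})
n1 = Graph("ex:N1", triples, prov)
n2 = Graph("ex:N2", triples, prov)
```

With these definitions, `can_treat_as_same(n1, n2)` is false unless the pair `{"ex:N1", "ex:N2"}` appears in `same_as`, whereas two graphs built with `name=None` and the same contents are treated as the same.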

In general, the kind of merging we want to do to preserve scalability  
in the presence of large scale provenance includes the
ability to merge two graphs into one when their provenance triples  
are the same.  Specifically, we don't usually care about
equivalence between the contents of two graphs, but we do care about  
equivalence between the provenance statements attached to
graphs.
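
The kind of merge described above can be sketched roughly as follows, again in Python. This is only an illustration under stated assumptions, not our actual implementation: the function name `merge_by_provenance`, the dictionary layout, and the internal graph keys are hypothetical.

```python
from collections import defaultdict


def merge_by_provenance(graphs, provenance):
    """Collapse graphs whose provenance statements are identical.

    graphs:     dict mapping an internal graph key to a set of triples
    provenance: dict mapping the same keys to a set of (predicate, object)
                pairs asserted about that graph
    Returns a dict mapping each distinct provenance signature (a frozenset
    of pairs) to the union of the triples of all graphs that carry it.
    """
    merged = defaultdict(set)
    for key, triples in graphs.items():
        signature = frozenset(provenance.get(key, ()))
        merged[signature] |= set(triples)
    return dict(merged)


graphs = {
    "g1": {("ex:a", "ex:p", "ex:b")},
    "g2": {("ex:c", "ex:p", "ex:d")},
    "g3": {("ex:e", "ex:p", "ex:f")},
}
provenance = {
    "g1": {("dc:source", "ex:feed"), ("dc:date", "2007-05-29")},
    "g2": {("dc:source", "ex:feed"), ("dc:date", "2007-05-29")},
    "g3": {("dc:source", "ex:other")},
}
result = merge_by_provenance(graphs, provenance)
```

Here `g1` and `g2` share a provenance signature and end up as one merged graph, while `g3` stays separate. Note that the graph keys never appear in the result, which is the point: only the provenance statements are used to decide what can be collapsed.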

> Why are your optimizing collapsings not valid if URIs are present?  
> You can simply declare that your identity criteria on graphs allow  
> a graph (not a named graph, but an RDF graph) to have more than one  
> name without being a different graph. You are free to impose extra  
> semantics on the basic RDF model if you find it useful.

I could also declare that for us, URIs don't matter within a graph,
and we can collapse arbitrary triples if the literals are the same.
But that would be absurd.  I am assuming that if URIs are used to name
graphs, then there is some reason why they are used in preference to
blank nodes, which are currently illegal, as far as I understand.  Of
course, I'm not using a blank node to name a graph, I'm using it
to refer to a graph.

> Nothing in RDF or SPARQL suggests that different names cannot  
> denote the same thing.

I never said or implied that they can't.
>
> A further puzzle is that you are happy if the name is a blank  
> node... do I have that right? That simply does not make sense to  
> me. Blank nodes cannot be used as names or identifiers. The meaning  
> of a blank node is to express an existential assertion. Using a  
> blank node as an identifier is meaningless.
>

My claim is that I should be able to manipulate graphs and assign  
them provenance without the need for naming the graphs.
We are dealing with applications where we may have 150,000 graphs  
(give or take an order of magnitude).   There is no benefit
to be derived by naming them.

I've observed that people's thinking is frequently circumscribed by  
the nomenclature they use.  This is likely the case for
"named graphs".  The SPARQL spec says that we can have only one  
unnamed graph; all of the others must have names.
In our applications, we have very large numbers of unnamed graphs.   
To follow what I'm driving at, you will have to adjust your
thinking, and toss out the notion of naming a graph.

Cheers, Bob

> Pat
>
> -- 
> ---------------------------------------------------------------------
> IHMC		(850)434 8903 or (650)494 3973   home
> 40 South Alcaniz St.	(850)202 4416   office
> Pensacola			(850)202 4440   fax
> FL 32502			(850)291 0667    cell
> phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
>
>

Bob MacGregor
Chief Scientist
Siderean Software, Inc.
310.647.5690
bmacgregor@siderean.com
Received on Wednesday, 30 May 2007 00:03:31 UTC