- From: Pat Hayes <phayes@ihmc.us>
- Date: Tue, 29 May 2007 19:54:09 -0700
- To: Bob MacGregor <bmacgregor@siderean.com>
- Cc: public-rdf-dawg-comments@w3.org, "Eric Prud'hommeaux" <eric@w3.org>, "Richard Newman" <rnewman@franz.com>
>Hi Pat,
>
>On May 29, 2007, at 1412, Pat Hayes wrote:
>
>>>Hi Richard,
>>>
>>>On May 28, 2007, at 1435, Richard Newman wrote:
>>>
>>>>Hi Bob,
>>>>
>>>><snip>
>>>>
>>>>Regarding point 2: yes, AllegroGraph allows you to store whatever
>>>>you like in the graph field of a triple. Other stores might not.
>>>>I'm not sure that I agree with you about naming -- why not mint
>>>>URIs, or use UUID URNs? You can cram almost anything into a
>>>>URI! -- but you can certainly use variables in your queries.
>>>
>>>The phrase "mint URIs" raises a red flag, since it is frequently
>>>contrary to the whole point of a URI. That is definitely true in
>>>this case.
>>>Suppose I have two graphs with identical triples, and identical
>>>provenance attached to their "graph names". I claim that these two
>>>graphs should be considered equivalent. If the graphs are
>>>identified with blank nodes, then that is indeed the case.
>>>Otherwise, it's not. The presence of a URI overdefines the
>>>semantics of the provenance. Does this matter? Indeed it does. Our
>>>quad store does union and collapsing operations on provenance to
>>>increase performance (sometimes by orders of magnitude). The
>>>operations it performs are not valid if URIs are present. I would
>>>not be surprised if AllegroGraph does not yet incorporate these
>>>optimizations. However, once you start to use sufficiently
>>>aggressive provenance, it's likely you will want to do the same.
>>
>>?? Bob, what are you talking about? Let's agree for the moment with
>>your claim that the two graphs should be equivalent (though I'm
>>having trouble understanding how they can have *identical*
>>provenance information if one is a copy of another; perhaps we mean
>>something different by 'provenance'). You say that if they have
>>different names, they cannot be equivalent. Why not? The entire
>>RDF/URI model allows a single entity to have more than one name.
>>The point of URIs is to identify, but not to identify uniquely. So
>>in fact the two graphs can be identical, if you like, like two
>>imprints of the same edition of a novel.
>
>I guess I need to be a bit more explicit about the phrase
>'equivalent'; since we deal with quads in our own system instead of
>triples, our notion of equivalence has evolved. So I will be more
>careful here:
>
>I didn't say that one graph was a copy of the other. I said that
>they had identical triples, i.e., an equivalence test that ignored
>provenance would return true.

OK.

>If the graph names are N1 and N2, I also asserted that provenance
>assertions/triples about N1 and N2 are also the same (same
>dc:source, same dc:date, etc.).

OK again.

>I'm not asking if the two graphs can be identical, I'm asking if
>they ARE identical. If names matter (and they do), then absent an
>owl:sameAs assertion between N1 and N2, the graphs cannot be assumed
>to be the same.

Their identity is not entailed by anything. But it would not be a
contradiction to assume they were identical. Are you afraid to make
this assumption? Why would you be? If their provenances are
identical, what could possibly distinguish them?

>If names don't matter, e.g., if blank nodes are substituted for the
>names, then logically the graphs, including their provenance, are
>indistinguishable.

If you use a blank node as a name, I don't think that means anything
at all according to the RDF semantics. If you treat blank node IDs as
real identifiers - which isn't strictly RDF legal, but if you do -
then different bnodes are just as different as different URIs. Either
way, using bnodes as names doesn't get you anywhere.

>In general, the kind of merging we want to do to preserve
>scalability in the presence of large scale provenance includes the
>ability to merge two graphs into one when their provenance triples
>are the same.

Well then, go ahead and do that. I don't see what is stopping you.
Nothing in the RDF or SPARQL specs would prohibit this.

>Specifically, we don't usually care about equivalence between the
>contents of two graphs, but we do care about equivalence between the
>provenance statements attached to graphs.

OK, fair enough. I guess the sharp edge here is knowing that you have
*all* the provenance information.

>>Why are your optimizing collapsings not valid if URIs are present?
>>You can simply declare that your identity criteria on graphs allow
>>a graph (not a named graph, but an RDF graph) to have more than one
>>name without being a different graph. You are free to impose extra
>>semantics on the basic RDF model if you find it useful.
>
>I could also declare that for us, URIs don't matter within a graph,
>and we can collapse arbitrary triples if the literals are the same.
>But that would be absurd.

Of course it would. But nobody is suggesting that. You want to do an
optimisation which you feel is reasonable, to merge isomorphic graphs
with the same provenance. As far as I can see, that amounts to your
having the confidence to assume that identical provenances guarantee
identity. OK, then it still does no matter what names are used to
refer to the graphs. The use of a name is just that: the use of a
name. It does not imply anything.

>I am assuming that if URIs are used to name graphs, then there is
>some reason why they are used

Don't assume that. THAT assumption is in violation of the RDF
semantics, ironically. The name used to refer to something says
nothing at all about the thing it refers to. It is just a name.

>in preference to blank nodes, which are currently illegal, as far as
>I understand.

They are meaningless rather than illegal. Ask yourself, what would it
mean to use an existentially bound variable as a name?

>Of course, I'm not using a blank node to name a graph, I'm using it
>to refer to a graph.

Name, refer, identify, whatever: it doesn't do any of these.
Think of it as an existentially bound variable, with the quantifier
'outside' the entire Web, and different from any other such variable.

>>Nothing in RDF or SPARQL suggests that different names cannot
>>denote the same thing.
>
>I never said or implied that they can't.

But you seem to be assuming that because two names are used to refer,
this multiple-name usage alone is enough to make you lose confidence
in your reasons for assuming identity (based on identical
provenance). This loss of confidence is misplaced, and isn't based on
anything in the RDF or SPARQL semantics.

>>A further puzzle is that you are happy if the name is a blank
>>node... do I have that right? That simply does not make sense to
>>me. Blank nodes cannot be used as names or identifiers. The meaning
>>of a blank node is to express an existential assertion. Using a
>>blank node as an identifier is meaningless.
>
>My claim is that I should be able to manipulate graphs and assign
>them provenance without the need for naming the graphs. We are
>dealing with applications where we may have 150,000 graphs (give or
>take an order of magnitude). There is no benefit to be derived by
>naming them.

That may be so, but that is a different point. You were claiming that
the presence of names somehow prevented you from applying an
optimization step. I'm saying that it does not. However, I am at a
loss to understand how you refer to these 150,000 graphs if you have
no way to name them. How do you even know how many you have? (It
sounds from your description that you are in effect treating the
provenance as *being* the name of the graph. Does that perspective
help reconcile things?)

>I've observed that people's thinking is frequently circumscribed by
>the nomenclature they use. This is likely the case for "named
>graphs". The SPARQL spec says that we can have only one unnamed
>graph; all of the others must have names.

There has to be some way for the query to refer to them.
If you can think of a way of doing this without somehow naming them,
please explain it.

>In our applications, we have very large numbers of unnamed graphs.

OK. Do you always query against the same set of unnamed graphs? If
so, you can treat this as a single graph for purposes of defining a
SPARQL query answer. If not, how do you propose that a query will
specify which of the 150,000 are supposed to be used in answering the
query?

Pat

-- 
---------------------------------------------------------------------
IHMC                    (850)434 8903 or (650)494 3973   home
40 South Alcaniz St.    (850)202 4416   office
Pensacola               (850)202 4440   fax
FL 32502                (850)291 0667   cell
phayesAT-SIGNihmc.us    http://www.ihmc.us/users/phayes
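
The merge discussed in this message (unioning graphs whose provenance statements are identical, whatever names they carry) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual behaviour of any quad store mentioned in the thread: the function name `merge_by_provenance`, the dictionary layout, and the sample triples are all hypothetical, and the sketch assumes that *all* provenance for each graph is known. It in effect treats the provenance as the key for the graph, while keeping every name as an alias.

```python
from collections import defaultdict


def merge_by_provenance(graphs, provenance):
    """Union graphs whose provenance statements are identical.

    graphs:     {graph_name: set of (s, p, o) triples}
    provenance: {graph_name: set of (predicate, value) pairs about that graph}
    Returns {frozenset of provenance pairs: (sorted names, merged triples)}.
    """
    merged = defaultdict(lambda: (set(), set()))
    for name, triples in graphs.items():
        # The provenance set, not the name, decides which bucket a graph joins.
        key = frozenset(provenance.get(name, ()))
        names, union = merged[key]
        names.add(name)      # keep every name as an alias of the merged graph
        union |= triples     # set union collapses duplicate triples
    return {key: (sorted(names), union) for key, (names, union) in merged.items()}


# Hypothetical data: N1 and N2 share provenance, N3 does not.
graphs = {
    "N1": {("ex:a", "ex:knows", "ex:b")},
    "N2": {("ex:a", "ex:knows", "ex:b"), ("ex:b", "ex:knows", "ex:c")},
    "N3": {("ex:x", "ex:knows", "ex:y")},
}
provenance = {
    "N1": {("dc:source", "ex:feed1"), ("dc:date", "2007-05-29")},
    "N2": {("dc:source", "ex:feed1"), ("dc:date", "2007-05-29")},
    "N3": {("dc:source", "ex:feed2"), ("dc:date", "2007-05-29")},
}

for key, (names, triples) in merge_by_provenance(graphs, provenance).items():
    print(names, len(triples))
```

Nothing here depends on whether the graph names are URIs or internal identifiers: the names are carried along as aliases and play no part in deciding what gets merged.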
Received on Wednesday, 30 May 2007 02:54:27 UTC