Re: RDF-ISSUE-17 (graph merge): How are RDF datasets to be merged? [RDF Graphs] from Axel Polleres on 2011-03-29 (public-rdf-dawg@w3.org from January to March 2011)

From: Axel Polleres <axel.polleres@deri.org>
Date: Tue, 29 Mar 2011 21:29:43 +0100
To: Souripriya Das <SOURIPRIYA.DAS@oracle.com>
Cc: "SPARQL Working Group" <public-rdf-dawg@w3.org>
Message-Id: <D431C1D0-DC69-4571-99C8-92FB3CE58259@deri.org>
Hi Souri, 

there seems to be a misunderstanding....

On 29 Mar 2011, at 16:07, Souripriya Das wrote:

> I am a bit rusty on the exact constructs right now (was too busy with the RDB2RDF spec), but it is relevant to construct(s) that let one append triples (from file, from other graphs, ...) into a graph.
> If there was no reuse-bnode option, we add new <property, value> pairs to a resource that was represented originally by a bnode identifier (instead of a URI).

Again: reuse of bnodes is the current default behaviour for INSERT/ADD.
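
To make the distinction concrete, here is a minimal Python sketch, assuming triples are plain tuples of strings and a bnode is any term starting with "_:". The names dataset_union and dataset_merge and the "_:fresh" label prefix are hypothetical, chosen only for this illustration; they are not taken from the SPARQL Update document, and this merge renames only the incoming batch so that bnodes already in the store are left untouched:

```python
from itertools import count


def is_bnode(term):
    # Assumption for this sketch: bnodes are written "_:label".
    return term.startswith("_:")


def dataset_union(graph, batch):
    # Label-preserving union: "_:b1" in the batch is the same node
    # as "_:b1" already in the graph (the "reuse bnode" behaviour).
    return graph | batch


def dataset_merge(graph, batch):
    # Standardize the batch's bnodes apart before adding them
    # (the "do not reuse bnode" behaviour).
    used = {t for triple in graph | batch for t in triple if is_bnode(t)}
    fresh = (f"_:fresh{i}" for i in count())
    renaming = {}

    def rename(term):
        if not is_bnode(term):
            return term
        if term not in renaming:
            # Pick the next generated label that clashes with nothing.
            renaming[term] = next(l for l in fresh if l not in used)
        return renaming[term]

    return graph | {tuple(rename(t) for t in triple) for triple in batch}


g1 = {("_:b1", ":name", "<Rambo>")}
batch = {("_:b1", ":age", '"60"^^xsd:integer')}

reused = dataset_union(g1, batch)    # both triples share subject _:b1
separate = dataset_merge(g1, batch)  # the :age triple gets a fresh subject
```

With the union, the :age triple attaches to the existing node; with the merge, it ends up on a new, unrelated bnode.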

> 
> So, take the example again:
> 
> In graph G1, someone had originally used bnode _:b1 to describe a resource:
> _:b1   :name   <Rambo>
> 
> Later, s/he found out that the guy is 60 years old and s/he wants to add this info to graph G1:
> _:b1       :age         "60"^^xsd:integer
> 
> But this can only be achieved if we allow reuse of the bnode identifier. The bnode identifier, _:b1, in the new batch (of 1) triple being appended, MUST BE identified with the bnode _:b1 that has already been used in the graph. If _:b1 in the batch gets standardized to a separate identifier, say _:b123456, then the result of the append operation is completely different from what the user wanted.

You can do exactly that with 

INSERT { ?X   :age    "60"^^xsd:integer }
WHERE { ?X   :name   <Rambo> }
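
As a rough illustration of why this works, here is a Python sketch of what such a pattern-based update does over a set of triples. It assumes triples are plain tuples of strings, and the helper name insert_age_where_name is hypothetical. The point is that ?X gets bound to whatever node already carries the :name, so no bnode label has to be written in the request at all:

```python
def insert_age_where_name(graph, name, age):
    # WHERE { ?X :name <name> }: collect every binding for ?X.
    bindings = {s for (s, p, o) in graph if p == ":name" and o == name}
    # INSERT { ?X :age age }: instantiate the template once per binding.
    return graph | {(x, ":age", age) for x in bindings}


g1 = {("_:b1", ":name", "<Rambo>")}
g1 = insert_age_where_name(g1, "<Rambo>", '"60"^^xsd:integer')
# g1 now also holds the :age triple, with the existing bnode as subject.
```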


> This is a common use case and so I thought I'd suggest use of an option that lets the user get this functionality of reusing (or merging?) bnode identifiers in a batch (being appended) into the pre-existing bnode identifiers in a graph.
> 
> Thanks,
> - Souri.
> 
> Axel Polleres wrote:
>> On 29 Mar 2011, at 15:39, Souripriya Das wrote:
>> 
>>> Axel,
>>> 
>>> Given that loading in batches into the same graph is common in practice, should we consider adding an option, REUSE BNODE (or something similar), to the relevant SPARQL update statements  (e.g., LOAD, etc.) to allow specifying "reuse bnode" intention without having to do it artificially?
>> 
>> Note: Reuse is *not* the default for *LOAD*, but for *ADD*. When you mentioned "append" below, I assumed you meant ADD.
>> given that, I am not 100% sure whether I understand what you're asking for now. 
>> 
>>> I think such an option will be very simple to add. The default could be the "NOREUSE BNODE" option.
>> 
>> You mean that option for ADD, yes?
>> The only problem I see here is that such an ADD would be no longer 
>> a straightforward shortcut for INSERT... it would indeed need to be defined in terms of Dataset-MERGE it seems.
>> 
>> Thanks,
>> Axel
>> 
>>> Thanks,
>>> - Souri.
>>> 
>>> Axel Polleres wrote:
>>> 
>>>> Hi Souri,
>>>> 
>>>> answer below...
>>>> 
>>>> On 29 Mar 2011, at 15:10, Souripriya Das wrote:
>>>> 
>>>>> A related comment regarding bnodes in the context of appending triples to an existing graph:
>>>>> 
>>>>> When user wants to append some triples into an existing graph, s/he may have one of two intentions regarding bnode identifiers:
>>>>> 	• "reuse bnode":  reuse the bNode identifiers
>>>>> 	• "do not reuse bnode":  not reuse the bnode identifiers
>>>>> Given a graph G1 with the following triple content:
>>>>> 
>>>>> _:b1   :name   <Rambo>
>>>>> 
>>>>> if user wants to append the following triple to G1:
>>>>> 
>>>>> _:b1   :age  "60"^^xsd:integer
>>>>> 
>>>>> One should be able to get the appropriate end result based on his/her intention:
>>>>> 	• if user wants "reuse bnode", then the final content of graph <G1> should be:
>>>>> 		• _:b1   :name   <Rambo>
>>>>> 		• _:b1   :age  "60"^^xsd:integer
>>>>> 	• if user wants "do not reuse bnode", then the final content of graph <G1> should be
>>>>> 		• _:b1   :name   <Rambo>
>>>>> 		• _:b123456   :age  "60"^^xsd:integer
>>>>> I am not sure if our current spec allows a user to get both of these behaviours.
>>>>> 
>>>> the latter can only be achieved somewhat "artificially" e.g. by using the BNODE() function in a subquery.
>>>> the former is the default behavior in update at the moment.
>>>> 
>>>> HTH,
>>>> Axel
>>>> 
>>>>> Thanks,
>>>>> - Souri.
>>>>> 
>>>>> Steve Harris wrote:
>>>>> 
>>>>>> On 2011-03-29, at 14:54, Axel Polleres wrote:
>>>>>> 
>>>>>>> On 29 Mar 2011, at 14:52, Steve Harris wrote:
>>>>>>> 
>>>>>>>> On 2011-03-29, at 14:28, Axel Polleres wrote:
>>>>>>>> 
>>>>>>>>> I fwd this discussion from the RDFWG concerning dataset merge, as it may affect some of the definitions we have in SPARQL Update...
>>>>>>>>> 
>>>>>>>>> Short version:
>>>>>>>>> 
>>>>>>>>> We use Dataset-UNION now in SPARQL update, which is used for INSERT operations, and - deliberately - keeps bnode labels.
>>>>>>>>> 
>>>>>>>> Sorry if I missed a discussion, but I've been unusually busy recently.
>>>>>>>> 
>>>>>>>> Does that mean that given:
>>>>>>>> 
>>>>>>>> G1:
>>>>>>>> _:b1 a <Thing> .
>>>>>>>> 
>>>>>>>> INSERT {
>>>>>>>>  GRAPH <G2> { ?x a <Thing> }
>>>>>>>> }
>>>>>>>> WHERE {
>>>>>>>>  GRAPH <G1> { ?x a <Thing> }
>>>>>>>> }
>>>>>>>> 
>>>>>>>> You will end up with:
>>>>>>>> 
>>>>>>>> G1:
>>>>>>>> _:b1 a <Thing> .
>>>>>>>> 
>>>>>>>> G2:
>>>>>>>> _:b1 a <Thing> .
>>>>>>>> 
>>>>>>>> And that
>>>>>>>> 
>>>>>>>> SELECT DISTINCT ?x
>>>>>>>> WHERE {
>>>>>>>>  ?x a <Thing> .
>>>>>>>> }
>>>>>>>> 
>>>>>>>> will return just one row?!
>>>>>>>> 
>>>>>>> It seems, that one will return nothing, since you query the default graph here (which is empty?)
>>>>>>> 
>>>>>> Gah, sorry too used to 4/5store, where default graph = union of named graphs. 
>>>>>> 
>>>>>> SELECT DISTINCT ?x
>>>>>> WHERE {
>>>>>>  GRAPH ?g {
>>>>>>    ?x a <Thing> .
>>>>>>   }
>>>>>> }
>>>>>> 
>>>>>> - Steve
>>>>>> 
>>>>>>>>> the RDF-WG rather seems to need something like Dataset-MERGE where bnodes are standardized apart.
>>>>>>>>> On first thought, I don't think we need that in SPARQL Update, because (as mentioned) we need to preserve bnodes
>>>>>>>>> to get meaningful results, in most cases. However, one might think of reasonable exceptions such as:
>>>>>>>>> 
>>>>>>>>> a) the LOAD operation... we use Dataset-UNION here, but may actually wish to guarantee that no bnode interferences happen between bnodes already in the graph store and the loaded bnodes.
>>>>>>>>> 
>>>>>>>>> However, I suggest keeping Dataset-UNION for OpLoad and instead stressing that the graph() function, which retrieves the graph to be loaded, is responsible for keeping its bnode labels disjoint from those already used in the graph store. The reason is that Dataset-MERGE would also affect the bnodes in the graph store, which we don't want to change. So this one is easy to fix...
>>>>>>>>> 
>>>>>>>>> b) The second example where one might think it makes sense would be to have ADD not preserve bnodes... what worries me a bit here is that different named graphs may have overlapping bnode labels, so an ADD (and likewise any INSERT that transfers data between graphs in the graph store) may result in unexpected new co-references... Example:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> graph <a>   _:b1 :p _:b2 .
>>>>>>>>> graph <b>   _:b2 :p _:b1 .
>>>>>>>>> 
>>>>>>>>> Now note that
>>>>>>>>>  ADD <a> TO <b>
>>>>>>>>> will result in:
>>>>>>>>> 
>>>>>>>>> graph <a>   _:b1 :p _:b2 .
>>>>>>>>> graph <b>   _:b2 :p _:b1 . _:b1 :p _:b2 .
>>>>>>>>> 
>>>>>>>>> that is, bnode labels matter...  since we have now created a co-reference in graph <b> that wouldn't have happened if ADD relied on MERGE, i.e. where the result would be something like:
>>>>>>>>> 
>>>>>>>>> graph <a>   _:b1 :p _:b2 .
>>>>>>>>> graph <b>   _:b3 :p _:b4 . _:b5 :p _:b6 .
>>>>>>>>> 
>>>>>>>>> Opinions?
>>>>>>>>> 
>>>>>>>>> Axel
>>>>>>>>> 
>>>>>>>>> Begin forwarded message:
>>>>>>>>> 
>>>>>>>>>> From: Axel Polleres <axel.polleres@deri.org>
>>>>>>>>>> Date: 29 March 2011 14:04:46 GMT+01:00
>>>>>>>>>> To: Steve Harris <steve.harris@garlik.com>
>>>>>>>>>> Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, RDF Working Group WG <public-rdf-wg@w3.org>
>>>>>>>>>> Subject: Re: RDF-ISSUE-17 (graph merge): How are RDF datasets to be merged? [RDF Graphs]
>>>>>>>>>> 
>>>>>>>>>> Hi all,
>>>>>>>>>> 
>>>>>>>>>> just to understand... Having an additional Dataset-MERGE definition in the SPARQL-Update doc, doesn't seem to be a problem, I agree.
>>>>>>>>>> However, it wouldn't necessarily be used in SPARQL itself (probably not a big problem)
>>>>>>>>>> 
>>>>>>>>>> </rdfwg>
>>>>>>>>>> ... wait a second... thinking out loud here (RDF-WG ignore this; SPARQL WG members, I will post this on SPARQL separately)
>>>>>>>>>> 
>>>>>>>>>> thinking about e.g. the LOAD operation in SPARQL update [1], we might actually prefer to define it in terms of Dataset-MERGE instead of Dataset-UNION,
>>>>>>>>>> i.e. we don't want a graph loaded from an external source to intermingle with the bnodes already in a graph store.
>>>>>>>>>> 
>>>>>>>>>> Note that this may also affect the ADD operation...  where you may actually wish to keep bnode labels separate? But that's not necessarily an issue, because I guess we may assume that bnode labels are disjoint within disjoint graphs within a graph store anyway...
>>>>>>>>>> 
>>>>>>>>>> [1] http://www.w3.org/2009/sparql/docs/update-1.1/Overview.xml#def_loadoperation
>>>>>>>>>>
>>>>>>>>>> <rdfwg>
>>>>>>>>>> 
>>>>>>>>>> That actually brings me back to RDF: can/shall we assume that graphs within the same RDF dataset don't share bnode identifiers? I think this could be a useful assumption and make many things easier.
>>>>>>>>>> 
>>>>>>>>>> Axel
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 29 Mar 2011, at 09:57, Steve Harris wrote:
>>>>>>>>>> 
>>>>>>>>>>> Yeah, I agree - it's the most logical place to look for it.
>>>>>>>>>>> 
>>>>>>>>>>> - Steve
>>>>>>>>>>> 
>>>>>>>>>>> On 2011-03-28, at 20:38, Andy Seaborne wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Wouldn't it be better to put the RDF datasets merge definition along side the RDF dataset definition to put everything in one place? Splitting across docs isn't great.
>>>>>>>>>>>>
>>>>>>>>>>>> http://www.w3.org/2009/sparql/docs/update-1.1/Overview.xml#def_datasetUnion
>>>>>>>>>>>>
>>>>>>>>>>>> and s/union/merge/g ; s/graph store/RDF dataset/g ;
>>>>>>>>>>>> 
>>>>>>>>>>>>   Andy
>>>>>>>>>>>> 
>>>>>>>>>>>> On 28/03/11 16:07, RDF Working Group Issue Tracker wrote:
>>>>>>>>>>>>               
>>>>>>>>>>>>> RDF-ISSUE-17 (graph merge): How are RDF datasets to be merged? [RDF Graphs]
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://www.w3.org/2011/rdf-wg/track/issues/17
>>>>>>>>>>>>>
>>>>>>>>>>>>> Raised by: David Wood
>>>>>>>>>>>>> On product: RDF Graphs
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The "RDF Semantics" spec defines how to merge two or more RDF graphs,
>>>>>>>>>>>>> the pain is caused by blank nodes, otherwise it's a trivial operation.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The "SPARQL Query Language for RDF" spec defines the notion of RDF
>>>>>>>>>>>>> dataset as a set of "one graph, the default graph, which does not have
>>>>>>>>>>>>> a name, and zero or more named graphs, where each named graph is
>>>>>>>>>>>>> identified by an IRI".
>>>>>>>>>>>>> 
>>>>>>>>>>>>> How do we define how to merge RDF datasets?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> One obvious answer is we merge all the default graphs and all the
>>>>>>>>>>>>> named graphs with the same IRI using the procedure defined by the "RDF
>>>>>>>>>>>>> Semantics" to merge RDF graphs.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> NB: This issue will also relate to the "Cleanup Tasks" product if the RDF Semantics document will need to change in relation to named graphs.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> At Talis, within the Talis Platform, we want to enable people to
>>>>>>>>>>>>> easily merge RDF graphs into an RDF dataset and perhaps RDF datasets
>>>>>>>>>>>>> into another RDF dataset. We also want to have these merge happen in
>>>>>>>>>>>>> real-time (i.e. as you add/remove triples from the graphs you update
>>>>>>>>>>>>> all the derived graphs/datasets).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks to Paolo Castagna of Talis for providing input to this issue.
>>>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Steve Harris, CTO, Garlik Limited
>>>>>>>>>>> 1-3 Halford Road, Richmond, TW10 6AW, UK
>>>>>>>>>>> +44 20 8439 8203  
>>>>>>>>>>> http://www.garlik.com/
>>>>>>>>>>> Registered in England and Wales 535 7233 VAT # 849 0517 11
>>>>>>>>>>> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Tuesday, 29 March 2011 20:30:20 UTC