- From: Juan Sequeda <juanfederico@gmail.com>
- Date: Thu, 26 Apr 2012 12:14:31 +0200
- To: Richard Cyganiak <richard@cyganiak.de>
- Cc: Ivan Herman <ivan@w3.org>, "ashok.malhotra@oracle.com" <ashok.malhotra@oracle.com>, "public-rdb2rdf-wg@w3.org" <public-rdb2rdf-wg@w3.org>
- Message-ID: <CAMVTWDyYG3r+KF5mZVAJxXicwBU-k=1WdXSCd=045Wf4dK-OfQ@mail.gmail.com>
no need to state the effect: lean vs non lean rdf graph

On Thu, Apr 26, 2012 at 12:02 PM, Richard Cyganiak <richard@cyganiak.de> wrote:

> Ivan,
>
> See inline for responses and a PROPOSAL.
>
> On 26 Apr 2012, at 08:43, Ivan Herman wrote:
> >>> [[[
> >>> In general, for duplicate rows with identical values, implementations should use fresh blank nodes for each duplicate row. However, if the underlying database system does not provide any means to reliably differentiate among the rows via, eg, row ids, it is acceptable for implementations to reuse blank nodes.
> >>> ]]]
> >>
> >> I'm ok with that. I would rather remove the mention of ROWIDs, to make the hidden translation a bit less obvious (“Oracle should implement it with fresh blank nodes; for everyone else, it is acceptable to re-use the same blank node for duplicate rows.”)
> >
> > I am fine if you find a suitable technical term there; or simply drop the "eg, row ids,"
>
> Let's drop it then.
>
> >>> I wonder whether we should not add that in such a case a warning should also be issued.
> >>
> >> An implementation would either have to always show the warning, or never. That's not helpful to anyone. It's also unclear how warnings would be delivered and to whom.
> >
> > I am not sure whether warning system is referred to anywhere else in the doc. But something with MAY is neutral enough. That being said, this is a side issue.
>
> Or we could just say that systems SHOULD document/advertise their choice of implementation strategy. Sending warnings at runtime would be one way of doing that I suppose ;-)
>
> >> We could specify two different conformance levels or conformance modes (lean/non-lean), and make conforming implementations declare explicitly which one they support.
> >
> > The original question was whether this would lead to new LC or not. I think that if we use the formulation above, it is fine to go ahead to PR.
> > Introducing new conformance modes definitely sends back the document to LC. I am not sure it is worth it, to be honest.
>
> I agree, not worth it. To put it all together (with minor rewording):
>
> PROPOSAL: In the DM spec, replace the following text:
>
> [[
> If the table has no primary key, the row node is a fresh blank node that is unique to this row.
> ]]
>
> with this:
>
> [[
> If the table has no primary key, the row node is a blank node. Distinct blank nodes MUST be generated for rows with distinct column values. For duplicate rows with identical values, implementations SHOULD generate a fresh blank for each duplicate row. However, if the underlying database system does not provide any means to reliably differentiate among the rows, then implementations MAY re-use the same blank node for multiple duplicate rows. Implementations SHOULD document and advertise their chosen behavior.
> ]]
>
> Best,
> Richard
>
> > Ivan
> >
> >> Best,
> >> Richard
> >>
> >>> The wording on how to describe the corner case probably needs refining, but you get what I mean, I guess.
> >>>
> >>> If that is the only change, I guess it could be argued that such a change is reflecting implementation experience, and would not constitute a change warranting a second LC.
> >>>
> >>> Ivan
> >>>
> >>> ---
> >>> Ivan Herman
> >>> Tel:+31 641044153
> >>> http://www.ivan-herman.net
> >>>
> >>> (Written on mobile, sorry for brevity and misspellings...)
> >>>
> >>> On 25 Apr 2012, at 17:08, Ivan Herman <ivan@w3.org> wrote:
> >>>
> >>>> The way I read this, and if my understanding is correct, it clarifies a potential ambiguity in the spec. As Michael put it, this is what CR is for, and I would not go to another LC for this.
> >>>>
> >>>> Ivan
> >>>>
> >>>> On Apr 25, 2012, at 15:48 , ashok malhotra wrote:
> >>>>
> >>>>> Ivan:
> >>>>> We need your guidance on this
> >>>>>
> >>>>> Re.
> >>>>> Whether this needs another Last Call, the proposal is to replace
> >>>>> [[
> >>>>> If the table has no primary key, the row node is a fresh blank node that is unique to this row
> >>>>> ]]
> >>>>> with this wording:
> >>>>> [[
> >>>>> If the table has no primary key, the row node is a blank node. Distinct blank nodes must be generated for rows with distinct column values. For duplicate rows with identical values, it is left to the implementation whether to generate distinct blank nodes for each duplicate row.
> >>>>> ]]
> >>>>>
> >>>>> As I see it, this offers the implementation additional freedom in a corner case.
> >>>>> Not sure if that constitutes a material change in the semantics.
> >>>>> All the best, Ashok
> >>>>>
> >>>>> On 4/25/2012 6:05 AM, Juan Sequeda wrote:
> >>>>>> You got my vote and Marcelo's. So
> >>>>>>
> >>>>>> +2
> >>>>>>
> >>>>>> My question now is... do we have to go back to last call?
> >>>>>>
> >>>>>> In addition to adding this, we would need to do a minor change in the appendix to reflect this change. For the Direct Mapping as Rules section, we would just need to change a bit the definition of generateRowBlankNode predicate.
> >>>>>>
> >>>>>> For the Denotational semantics, in line 37
> >>>>>>
> >>>>>> [[
> >>>>>> else
> >>>>>> a BlankNode unique to r
> >>>>>> ]]
> >>>>>>
> >>>>>> would need to be changed to reflect the change. Not sure exactly how it would be done. Eric?
> >>>>>>
> >>>>>> Juan Sequeda
> >>>>>> +1-575-SEQ-UEDA
> >>>>>> www.juansequeda.com
> >>>>>>
> >>>>>> On Wed, Apr 25, 2012 at 2:52 PM, Richard Cyganiak <richard@cyganiak.de> wrote:
> >>>>>> Hi Juan,
> >>>>>>
> >>>>>> This direction works for me. I would reword it slightly.
> >>>>>> How about replacing the current spec text:
> >>>>>>
> >>>>>> [[
> >>>>>> If the table has no primary key, the row node is a fresh blank node that is unique to this row
> >>>>>> ]]
> >>>>>>
> >>>>>> with this wording:
> >>>>>>
> >>>>>> [[
> >>>>>> If the table has no primary key, the row node is a blank node. Distinct blank nodes must be generated for rows with distinct column values. For duplicate rows with identical values, it is left to the implementation whether to generate distinct blank nodes for each duplicate row.
> >>>>>> ]]
> >>>>>>
> >>>>>> and adding an informative NOTE:
> >>>>>>
> >>>>>> [[
> >>>>>> NOTE: In the case of duplicate rows in tables without primary key, if one blank node is generated for each row, then the result is a *non-lean* RDF graph [RDF Semantic]. If one blank node is generated for each distinct set of column values, then the result is a *lean* RDF graph. The lean version is equivalent to the non-lean version under RDF Semantics, but does not maintain the relational table's cardinalities, and hence gives different answers under certain SPARQL queries. The lean version is easily expressible in R2RML [R2RML].
> >>>>>> ]]
> >>>>>>
> >>>>>> I think this is the same in spirit as your version, but says less about implementation concerns, and motivates the two versions more in terms of compatibility with other specs (SPARQL and R2RML).
> >>>>>>
> >>>>>> Best,
> >>>>>> Richard
> >>>>>>
> >>>>>> On 25 Apr 2012, at 09:25, Juan Sequeda wrote:
> >>>>>>> What caught my attention was: "let implementers choose whether they want to implement the lean or non-lean direct mapping." I like how you phrased that. This would imply that there could be two DM: a lean and non-lean.
> >>>>>>>
> >>>>>>> I would propose to change
> >>>>>>>
> >>>>>>> "If the table has no primary key, the row node is a fresh blank node that is unique to this row"
> >>>>>>>
> >>>>>>> to
> >>>>>>>
> >>>>>>> "If the table has no primary key, the row node is a blank node. "
> >>>>>>>
> >>>>>>> And then have a note/warning.
> >>>>>>>
> >>>>>>> [[
> >>>>>>> If you generate a fresh blank node that is unique to this row, then the result is a non-lean RDF graph.
> >>>>>>>
> >>>>>>> If you generate the same blank node for repeated tuples, then the result is a lean RDF graph.
> >>>>>>>
> >>>>>>> The non-lean DM preserves the cardinality of the tuples, but it is hard/inefficient to implement in a SPARQL to SQL translator.
> >>>>>>>
> >>>>>>> The lean DM does not preserve the cardinality of the tuples, but the implementation is easier/more efficient in a SPARQL to SQL translator.
> >>>>>>>
> >>>>>>> If you are implementing a dumping tool, the recommendation is to create a non-lean DM in order to maintain the cardinality.
> >>>>>>> ]]
> >>>>>>>
> >>>>>>> Juan Sequeda
> >>>>>>> +1-575-SEQ-UEDA
> >>>>>>> www.juansequeda.com
> >>>>>>>
> >>>>>>> On Tue, Apr 24, 2012 at 10:15 PM, Richard Cyganiak <richard@cyganiak.de> wrote:
> >>>>>>> So, Eric challenged me to present an example of a query over a direct-mapped PK-less table that I believe cannot be evaluated in standard SQL without materializing the entire table outside of the DB.
> >>>>>>>
> >>>>>>> First let me say that I've puzzled over this non-PK issue for more than a day, trying to come up with some scheme based on cursors or ROWNUM or local variables to make it work, and failed.
> >>>>>>> Now, making a leap from “I couldn't do it in a day” to “It's impossible” is certainly not quite appropriate, but after that experience I felt justified to send an implementation experience report to the WG, stating my belief that the costs of implementing this scheme are not worth the benefits. Hence my proposal to let implementers choose whether they want to implement the lean or non-lean direct mapping.
> >>>>>>>
> >>>>>>> So here we go.
> >>>>>>>
> >>>>>>> IOU
> >>>>>>> BORROWER | AMOUNT
> >>>>>>> ---------+-------
> >>>>>>> Alice    | 10
> >>>>>>> Bob      | 5
> >>>>>>> Charlie  | 10
> >>>>>>> Charlie  | 10
> >>>>>>>
> >>>>>>> The equivalent non-lean direct mapping graph (minus rdf:type triples):
> >>>>>>>
> >>>>>>> _:1 <IOU#BORROWER> "Alice".
> >>>>>>> _:1 <IOU#AMOUNT> 10.
> >>>>>>> _:2 <IOU#BORROWER> "Bob".
> >>>>>>> _:2 <IOU#AMOUNT> 5.
> >>>>>>> _:3 <IOU#BORROWER> "Charlie".
> >>>>>>> _:3 <IOU#AMOUNT> 10.
> >>>>>>> _:4 <IOU#BORROWER> "Charlie".
> >>>>>>> _:4 <IOU#AMOUNT> 10.
> >>>>>>>
> >>>>>>> Now here's a simple SPARQL query:
> >>>>>>>
> >>>>>>> SELECT * {
> >>>>>>>   {
> >>>>>>>     ?x <IOU#BORROWER> "Charlie".
> >>>>>>>     ?x ?property ?value.
> >>>>>>>   } UNION {
> >>>>>>>     ?x <IOU#AMOUNT> 10.
> >>>>>>>   }
> >>>>>>> }
> >>>>>>>
> >>>>>>> The solution should be:
> >>>>>>>
> >>>>>>> ?x  | ?property      | ?value
> >>>>>>> ----+----------------+----------
> >>>>>>> _:3 | <IOU#BORROWER> | "Charlie"
> >>>>>>> _:4 | <IOU#BORROWER> | "Charlie"
> >>>>>>> _:3 | <IOU#AMOUNT>   | 10
> >>>>>>> _:4 | <IOU#AMOUNT>   | 10
> >>>>>>> _:1 |                |
> >>>>>>> _:3 |                |
> >>>>>>> _:4 |                |
> >>>>>>>
> >>>>>>> Can you outline an algorithm that produces this result without materializing the table? (Ordering, the difference between literals/IRIs/bNodes, and the specific labels for the bNodes don't matter.)
> >>>>>>>
> >>>>>>> Bonus points if the algorithm is expressed as an R2RML mapping. We can assume that we already have an algorithm for evaluating any SPARQL query over an R2RML mapping.
> >>>>>>>
> >>>>>>> Here's my non-standard solution using ROWID, which only works on Oracle:
> >>>>>>>
> >>>>>>> SELECT ROWID x, '<IOU#BORROWER>' property, BORROWER value
> >>>>>>> FROM IOU
> >>>>>>> WHERE BORROWER='Charlie'
> >>>>>>> UNION
> >>>>>>> SELECT ROWID x, '<IOU#AMOUNT>' property, AMOUNT value
> >>>>>>> FROM IOU
> >>>>>>> WHERE BORROWER='Charlie'
> >>>>>>> UNION
> >>>>>>> SELECT ROWID x, NULL, NULL
> >>>>>>> FROM IOU
> >>>>>>> WHERE AMOUNT=10
> >>>>>>>
> >>>>>>> Earning the R2RML bonus points:
> >>>>>>>
> >>>>>>> <#map> a rr:TriplesMap;
> >>>>>>>   rr:logicalTable [
> >>>>>>>     rr:sqlQuery "SELECT ROWID, BORROWER, AMOUNT FROM IOU";
> >>>>>>>   ];
> >>>>>>>   rr:subjectMap [
> >>>>>>>     rr:column "ROWID";
> >>>>>>>     rr:termType rr:BlankNode
> >>>>>>>   ];
> >>>>>>>   rr:predicateObjectMap [
> >>>>>>>     rr:predicate <IOU#BORROWER>;
> >>>>>>>     rr:objectMap [ rr:column "BORROWER" ];
> >>>>>>>   ];
> >>>>>>>   rr:predicateObjectMap [
> >>>>>>>     rr:predicate <IOU#AMOUNT>;
> >>>>>>>     rr:objectMap [ rr:column "AMOUNT" ];
> >>>>>>>   ].
> >>>>>>>
> >>>>>>> Now, how to do this without the ROWID vendor extension???
> >>>>>>>
> >>>>>>> ----
> >>>>>>>
> >>>>>>> For the record. With a lean direct mapping, the desired output graph would be:
> >>>>>>>
> >>>>>>> _:1 <IOU#BORROWER> "Alice".
> >>>>>>> _:1 <IOU#AMOUNT> 10.
> >>>>>>> _:2 <IOU#BORROWER> "Bob".
> >>>>>>> _:2 <IOU#AMOUNT> 5.
> >>>>>>> _:3 <IOU#BORROWER> "Charlie".
> >>>>>>> _:3 <IOU#AMOUNT> 10.
> >>>>>>>
> >>>>>>> The query result would be:
> >>>>>>>
> >>>>>>> ?x  | ?property      | ?value
> >>>>>>> ----+----------------+----------
> >>>>>>> _:3 | <IOU#BORROWER> | "Charlie"
> >>>>>>> _:3 | <IOU#AMOUNT>   | 10
> >>>>>>> _:1 |                |
> >>>>>>> _:3 |                |
> >>>>>>>
> >>>>>>> The standard-compliant SQL query would be as above, but replace ROWID with something like (BORROWER || '@@@separator@@@' || AMOUNT), and add DISTINCT to each SELECT.
> >>>>>>>
> >>>>>>> The R2RML query would be the same as above with the following changes:
> >>>>>>>
> >>>>>>>   rr:logicalTable [
> >>>>>>>     rr:tableName "IOU";
> >>>>>>>   ];
> >>>>>>>   rr:subjectMap [
> >>>>>>>     rr:template "{BORROWER}@@@separator@@@{AMOUNT}";
> >>>>>>>     rr:termType rr:BlankNode;
> >>>>>>>   ];
> >>>>>>>
> >>>>>>> So, implementing the lean direct mapping is not hard using just standard SQL.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Richard
> >>>>
> >>>> ----
> >>>> Ivan Herman, W3C Semantic Web Activity Lead
> >>>> Home: http://www.w3.org/People/Ivan/
> >>>> mobile: +31-641044153
> >>>> FOAF: http://www.ivan-herman.net/foaf.rdf
> >
> > ----
> > Ivan Herman, W3C Semantic Web Activity Lead
> > Home: http://www.w3.org/People/Ivan/
> > mobile: +31-641044153
> > FOAF: http://www.ivan-herman.net/foaf.rdf
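
The two SQL translations quoted in the thread can be tried side by side with a small sketch like the one below. It is only an illustration under stated assumptions: it uses Python's sqlite3 module with an in-memory copy of the IOU table, and SQLite's implicit rowid stands in for Oracle's ROWID (SQLite is not mentioned in the thread). The helper name run_query is hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite copy of the PK-less IOU table from the thread.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IOU (BORROWER TEXT, AMOUNT INTEGER)")
conn.executemany(
    "INSERT INTO IOU VALUES (?, ?)",
    [("Alice", 10), ("Bob", 5), ("Charlie", 10), ("Charlie", 10)],
)


def run_query(row_key: str, distinct: bool = False) -> list:
    """Run the three-branch UNION query, using row_key as the row node label.

    row_key is a trusted SQL expression (hypothetical helper, illustration only).
    """
    select = "SELECT DISTINCT" if distinct else "SELECT"
    sql = f"""
        {select} {row_key} AS x, '<IOU#BORROWER>' AS property, BORROWER AS value
        FROM IOU WHERE BORROWER = 'Charlie'
        UNION
        {select} {row_key} AS x, '<IOU#AMOUNT>' AS property, AMOUNT AS value
        FROM IOU WHERE BORROWER = 'Charlie'
        UNION
        {select} {row_key} AS x, NULL, NULL
        FROM IOU WHERE AMOUNT = 10
    """
    return conn.execute(sql).fetchall()


# Non-lean: one row node per physical row (SQLite rowid as a stand-in for ROWID).
non_lean = run_query("rowid")

# Lean: one row node per distinct combination of column values.
lean = run_query("BORROWER || '@@@separator@@@' || AMOUNT", distinct=True)

print(len(non_lean), len(lean))  # 7 4
```

The row counts match the two result tables in the quoted message: seven solutions when each duplicate row gets its own node, four when duplicates share one.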
Received on Thursday, 26 April 2012 10:15:25 UTC