Re: Re 2: Brain teaser for non-PK tables from Juan Sequeda on 2012-04-26 (public-rdb2rdf-wg@w3.org from April 2012)

From: Juan Sequeda <juanfederico@gmail.com>
Date: Thu, 26 Apr 2012 14:05:40 +0200
To: Richard Cyganiak <richard@cyganiak.de>
Cc: Ivan Herman <ivan@w3.org>, "ashok.malhotra@oracle.com" <ashok.malhotra@oracle.com>, "public-rdb2rdf-wg@w3.org" <public-rdb2rdf-wg@w3.org>
Message-ID: <CAMVTWDwJvgaqR7jAf5EzZMuTLLY05yeAy8PyUT6Vpuj1XQdpHA@mail.gmail.com>
exactly!

Juan Sequeda
+1-575-SEQ-UEDA
www.juansequeda.com


On Thu, Apr 26, 2012 at 1:10 PM, Richard Cyganiak <richard@cyganiak.de> wrote:

> Juan,
>
> On 26 Apr 2012, at 11:14, Juan Sequeda wrote:
> > no need to state the effect: lean vs non lean rdf graph
>
> Sorry, but this is very terse and I don't understand what you are saying.
>
> Are you agreeing to the PROPOSAL? Or are you saying something like this
> should be added:
>
> “… implementations SHOULD generate a fresh blank node for each duplicate row
> (resulting in a non-lean RDF graph [RDF Semantics]).”
> “… implementations MAY re-use the same blank node for multiple duplicate
> rows (resulting in a lean RDF graph).”
>
> Best,
> Richard
>
>
> >
> > On Thu, Apr 26, 2012 at 12:02 PM, Richard Cyganiak <richard@cyganiak.de>
> wrote:
> > Ivan,
> >
> > See inline for responses and a PROPOSAL.
> >
> > On 26 Apr 2012, at 08:43, Ivan Herman wrote:
> > >>> [[[
> > >>> In general, for duplicate rows with identical values,
> implementations should use fresh blank nodes for each duplicate row.
> However, if the underlying database system does not provide any means to
> reliably differentiate among the rows via, eg, row ids, it is acceptable for
> implementations to reuse blank nodes.
> > >>> ]]]
> > >>
> > >> I'm ok with that. I would rather remove the mention of ROWIDs, to
> make the hidden translation a bit less obvious (“Oracle should implement it
> with fresh blank nodes; for everyone else, it is acceptable to re-use the
> same blank node for duplicate rows.”)
> > >
> > > I am fine if you find a suitable technical term there; or simply drop
> the "eg, row ids,"
> >
> > Let's drop it then.
> >
> > >>> I wonder whether we should not add that in such a case a warning
> should also be issued.
> > >>
> > >> An implementation would either have to always show the warning, or
> never. That's not helpful to anyone. It's also unclear how warnings would
> be delivered and to whom.
> > >
> > > I am not sure whether a warning system is referred to anywhere else in
> the doc. But something with MAY is neutral enough. That being said, this is
> a side issue.
> >
> > Or we could just say that systems SHOULD document/advertise their choice
> of implementation strategy. Sending warnings at runtime would be one way of
> doing that I suppose ;-)
> >
> > >> We could specify two different conformance levels or conformance
> modes (lean/non-lean), and make conforming implementations declare
> explicitly which one they support.
> > >
> > > The original question was whether this would lead to new LC or not. I
> think that if we use the formulation above, it is fine to go ahead to PR.
> Introducing new conformance modes definitely sends back the document to LC.
> I am not sure it is worth it, to be honest.
> >
> > I agree, not worth it. To put it all together (with minor rewording):
> >
> > PROPOSAL: In the DM spec, replace the following text:
> >
> > [[
> > If the table has no primary key, the row node is a fresh blank node that
> is unique to this row.
> > ]]
> >
> > with this:
> >
> > [[
> > If the table has no primary key, the row node is a blank node. Distinct
> blank nodes MUST be generated for rows with distinct column values. For
> duplicate rows with identical values, implementations SHOULD generate a
> fresh blank node for each duplicate row. However, if the underlying database
> system does not provide any means to reliably differentiate among the rows,
> then implementations MAY re-use the same blank node for multiple duplicate
> rows. Implementations SHOULD document and advertise their chosen behavior.
> > ]]
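
The rule in the PROPOSAL above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `assign_row_nodes`, the flag `can_distinguish_duplicates`, and the `_:bN` label scheme are hypothetical and appear nowhere in the DM spec. The flag stands for whether the underlying database can reliably tell duplicate rows apart.

```python
from itertools import count


def assign_row_nodes(rows, can_distinguish_duplicates):
    """Return one blank node label per row of a table without a primary key.

    rows -- list of tuples of column values.
    can_distinguish_duplicates -- hypothetical flag: True if the database
        offers some reliable way to tell duplicate rows apart.
    """
    labels = (f"_:b{n}" for n in count(1))
    if can_distinguish_duplicates:
        # SHOULD branch: a fresh blank node for every row, duplicates included.
        return [next(labels) for _ in rows]
    # MAY branch: rows with identical column values share one blank node,
    # while rows with distinct values still get distinct blank nodes (MUST).
    seen = {}
    result = []
    for row in rows:
        if row not in seen:
            seen[row] = next(labels)
        result.append(seen[row])
    return result


IOU = [("Alice", 10), ("Bob", 5), ("Charlie", 10), ("Charlie", 10)]
non_lean = assign_row_nodes(IOU, True)
lean = assign_row_nodes(IOU, False)
print(non_lean)
print(lean)
```

On the IOU table the first call gives the two Charlie rows different labels, and the second call gives them the same label. Either way, Alice, Bob and Charlie never share a node.
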
> >
> > Best,
> > Richard
> >
> >
> >
> > >
> > > Ivan
> > >
> > >
> > >> Best,
> > >> Richard
> > >>
> > >>
> > >>
> > >>>
> > >>> The wording on how to describe the corner case probably needs
> refining, but you get what I mean, I guess.
> > >>>
> > >>> If that is the only change, I guess it could be argued that such a
> change is reflecting implementation experience, and would not constitute a
> change warranting a second LC.
> > >>>
> > >>> Ivan
> > >>>
> > >>> ---
> > >>> Ivan Herman
> > >>> Tel:+31 641044153
> > >>> http://www.ivan-herman.net
> > >>>
> > >>> (Written on mobile, sorry for brevity and misspellings...)
> > >>>
> > >>>
> > >>>
> > >>> On 25 Apr 2012, at 17:08, Ivan Herman <ivan@w3.org> wrote:
> > >>>
> > >>>> The way I read this, and if my understanding is correct, it
> clarifies a potential ambiguity in the spec. As Michael put it, this is
> what CR is for, and I would not go to another LC for this.
> > >>>>
> > >>>> Ivan
> > >>>>
> > >>>> On Apr 25, 2012, at 15:48 , ashok malhotra wrote:
> > >>>>
> > >>>>> Ivan:
> > >>>>> We need your guidance on this
> > >>>>>
> > >>>>> Re.  Whether this needs another Last Call, the proposal is to
> replace
> > >>>>> [[
> > >>>>> If the table has no primary key, the row node is a fresh blank
> node that is unique to this row
> > >>>>> ]]
> > >>>>> with this wording:
> > >>>>> [[
> > >>>>> If the table has no primary key, the row node is a blank node.
> Distinct blank nodes must be generated for rows with distinct column
> values. For duplicate rows with identical values, it is left to the
> implementation whether to generate distinct blank nodes for each duplicate
> row.
> > >>>>> ]]
> > >>>>>
> > >>>>> As I see it, this offers the implementation additional freedom in
> a corner case.
> > >>>>> Not sure if that constitutes a material change in the semantics.
> > >>>>> All the best, Ashok
> > >>>>>
> > >>>>> On 4/25/2012 6:05 AM, Juan Sequeda wrote:
> > >>>>>> You got my vote and Marcelo's. So
> > >>>>>>
> > >>>>>> +2
> > >>>>>>
> > >>>>>> My question now is... do we have to go back to last call?
> > >>>>>>
> > >>>>>> In addition to adding this, we would need to make a minor change in
> the appendix to reflect this change. For the Direct Mapping as Rules
> section, we would just need to change the definition of the
> generateRowBlankNode predicate slightly.
> > >>>>>>
> > >>>>>> For the Denotational semantics, in line 37
> > >>>>>>
> > >>>>>> [[
> > >>>>>> else
> > >>>>>> a BlankNode unique to r
> > >>>>>> ]]
> > >>>>>>
> > >>>>>> would need to be changed to reflect the change. Not sure exactly
> how it would be done. Eric?
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> Juan Sequeda
> > >>>>>> +1-575-SEQ-UEDA
> > >>>>>> www.juansequeda.com
> > >>>>>>
> > >>>>>>
> > >>>>>> On Wed, Apr 25, 2012 at 2:52 PM, Richard Cyganiak <
> richard@cyganiak.de> wrote:
> > >>>>>> Hi Juan,
> > >>>>>>
> > >>>>>> This direction works for me. I would reword it slightly. How
> about replacing the current spec text:
> > >>>>>>
> > >>>>>> [[
> > >>>>>> If the table has no primary key, the row node is a fresh blank
> node that is unique to this row
> > >>>>>> ]]
> > >>>>>>
> > >>>>>> with this wording:
> > >>>>>>
> > >>>>>> [[
> > >>>>>> If the table has no primary key, the row node is a blank node.
> Distinct blank nodes must be generated for rows with distinct column
> values. For duplicate rows with identical values, it is left to the
> implementation whether to generate distinct blank nodes for each duplicate
> row.
> > >>>>>> ]]
> > >>>>>>
> > >>>>>> and adding an informative NOTE:
> > >>>>>>
> > >>>>>> [[
> > >>>>>> NOTE: In the case of duplicate rows in tables without a primary
> key, if one blank node is generated for each row, then the result is a
> *non-lean* RDF graph [RDF Semantics]. If one blank node is generated for
> each distinct set of column values, then the result is a *lean* RDF graph.
> The lean version is equivalent to the non-lean version under RDF Semantics,
> but does not maintain the relational table's cardinalities, and hence gives
> different answers under certain SPARQL queries. The lean version is easily
> expressible in R2RML [R2RML].
> > >>>>>> ]]
> > >>>>>>
> > >>>>>> I think this is the same in spirit as your version, but says less
> about implementation concerns, and motivates the two versions more in terms
> of compatibility with other specs (SPARQL and R2RML).
> > >>>>>>
> > >>>>>> Best,
> > >>>>>> Richard
> > >>>>>>
> > >>>>>>
> > >>>>>> On 25 Apr 2012, at 09:25, Juan Sequeda wrote:
> > >>>>>>> What caught my attention was: "let implementers choose whether
> they want to implement the lean or non-lean direct mapping." I like how you
> phrased that. This would imply that there could be two DM: a lean and
> non-lean.
> > >>>>>>>
> > >>>>>>> I would propose to change
> > >>>>>>>
> > >>>>>>> "If the table has no primary key, the row node is a fresh blank
> node that is unique to this row"
> > >>>>>>>
> > >>>>>>> to
> > >>>>>>>
> > >>>>>>> "If the table has no primary key, the row node is a blank node. "
> > >>>>>>>
> > >>>>>>
> > >>>>>>> And then have a note/warning.
> > >>>>>>>
> > >>>>>>
> > >>>>>>> [[
> > >>>>>>> If you generate a fresh blank node that is unique to this row,
> then the result is a non-lean RDF graph.
> > >>>>>>>
> > >>>>>>> If you generate the same blank node for repeated tuples, then
> the result is a lean RDF graph.
> > >>>>>>>
> > >>>>>>> The non-lean DM preserves the cardinality of the tuples, but it
> is hard/inefficient to implement in a SPARQL to SQL translator.
> > >>>>>>>
> > >>>>>>> The lean DM does not preserve the cardinality of the tuples, but
> the implementation is easier/more efficient in a SPARQL to SQL translator.
> > >>>>>>>
> > >>>>>>> If you are implementing a dumping tool, the recommendation is to
> create a non-lean DM in order to maintain the cardinality.
> > >>>>>>> ]]
> > >>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Juan Sequeda
> > >>>>>>> +1-575-SEQ-UEDA
> > >>>>>>> www.juansequeda.com
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Tue, Apr 24, 2012 at 10:15 PM, Richard Cyganiak <
> richard@cyganiak.de> wrote:
> > >>>>>>> So, Eric challenged me to present an example of a query over a
> direct-mapped PK-less table that I believe cannot be evaluated in standard
> SQL without materializing the entire table outside of the DB.
> > >>>>>>>
> > >>>>>>> First let me say that I've puzzled over this non-PK issue for
> more than a day, trying to come up with some scheme based on cursors or
> ROWNUM or local variables to make it work, and failed. Now, making a leap
> from “I couldn't do it in a day” to “It's impossible” is certainly not
> quite appropriate, but after that experience I felt justified to send an
> implementation experience report to the WG, stating my belief that the costs
> of implementing this scheme are not worth the benefits. Hence my proposal
> to let implementers choose whether they want to implement the lean or
> non-lean direct mapping.
> > >>>>>>>
> > >>>>>>> So here we go.
> > >>>>>>>
> > >>>>>>>      IOU
> > >>>>>>> BORROWER | AMOUNT
> > >>>>>>> ---------+-------
> > >>>>>>> Alice    |     10
> > >>>>>>> Bob      |      5
> > >>>>>>> Charlie  |     10
> > >>>>>>> Charlie  |     10
> > >>>>>>>
> > >>>>>>> The equivalent non-lean direct mapping graph (minus rdf:type
> triples):
> > >>>>>>>
> > >>>>>>> _:1 <IOU#BORROWER> "Alice".
> > >>>>>>> _:1 <IOU#AMOUNT> 10.
> > >>>>>>> _:2 <IOU#BORROWER> "Bob".
> > >>>>>>> _:2 <IOU#AMOUNT> 5.
> > >>>>>>> _:3 <IOU#BORROWER> "Charlie".
> > >>>>>>> _:3 <IOU#AMOUNT> 10.
> > >>>>>>> _:4 <IOU#BORROWER> "Charlie".
> > >>>>>>> _:4 <IOU#AMOUNT> 10.
> > >>>>>>>
> > >>>>>>> Now here's a simple SPARQL query:
> > >>>>>>>
> > >>>>>>> SELECT * {
> > >>>>>>>  {
> > >>>>>>>     ?x <IOU#BORROWER> "Charlie".
> > >>>>>>>     ?x ?property ?value.
> > >>>>>>>  } UNION {
> > >>>>>>>     ?x <IOU#AMOUNT> 10.
> > >>>>>>>  }
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>> The solution should be:
> > >>>>>>>
> > >>>>>>> ?x  | ?property      | ?value
> > >>>>>>> ----+----------------+----------
> > >>>>>>> _:3 | <IOU#BORROWER> | "Charlie"
> > >>>>>>> _:4 | <IOU#BORROWER> | "Charlie"
> > >>>>>>> _:3 | <IOU#AMOUNT>   | 10
> > >>>>>>> _:4 | <IOU#AMOUNT>   | 10
> > >>>>>>> _:1 |                |
> > >>>>>>> _:3 |                |
> > >>>>>>> _:4 |                |
> > >>>>>>>
> > >>>>>>> Can you outline an algorithm that produces this result without
> materializing the table? (Ordering, the difference between
> literals/IRIs/bNodes, and the specific labels for the bNodes don't matter.)
> > >>>>>>>
> > >>>>>>> Bonus points if the algorithm is expressed as an R2RML mapping.
> We can assume that we already have an algorithm for evaluating any SPARQL
> query over an R2RML mapping.
> > >>>>>>>
> > >>>>>>> Here's my non-standard solution using ROWID, which only works on
> Oracle:
> > >>>>>>>
> > >>>>>>> SELECT ROWID x, '<IOU#BORROWER>' property, BORROWER value
> > >>>>>>>     FROM IOU
> > >>>>>>>     WHERE BORROWER='Charlie'
> > >>>>>>> UNION
> > >>>>>>> SELECT ROWID x, '<IOU#AMOUNT>' property, AMOUNT value
> > >>>>>>>     FROM IOU
> > >>>>>>>     WHERE BORROWER='Charlie'
> > >>>>>>> UNION
> > >>>>>>> SELECT ROWID x, NULL, NULL
> > >>>>>>>     FROM IOU
> > >>>>>>>     WHERE AMOUNT=10
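
For anyone who wants to try the ROWID approach without an Oracle instance, here is a rough sketch using Python's built-in `sqlite3` module. It relies on SQLite's implicit `rowid` column as a stand-in for Oracle's ROWID, which is an assumption of this sketch. It is just as vendor-specific, so it does not answer the brain teaser. The column types in the CREATE TABLE statement are also assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed column types; the thread only shows the table contents.
conn.execute("CREATE TABLE IOU (BORROWER TEXT, AMOUNT INTEGER)")
conn.executemany(
    "INSERT INTO IOU VALUES (?, ?)",
    [("Alice", 10), ("Bob", 5), ("Charlie", 10), ("Charlie", 10)],
)

# Same shape as the ROWID query, with SQLite's implicit rowid as row identity.
NON_LEAN_SQL = """
SELECT rowid AS x, '<IOU#BORROWER>' AS property, BORROWER AS value
    FROM IOU
    WHERE BORROWER = 'Charlie'
UNION
SELECT rowid, '<IOU#AMOUNT>', AMOUNT
    FROM IOU
    WHERE BORROWER = 'Charlie'
UNION
SELECT rowid, NULL, NULL
    FROM IOU
    WHERE AMOUNT = 10
"""

non_lean_rows = conn.execute(NON_LEAN_SQL).fetchall()
for row in non_lean_rows:
    print(row)
```

Because each of the two Charlie rows carries its own rowid, the duplicates survive the UNION, and the result has the same seven solutions as the table above (the row ids play the role of the blank node labels).
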
> > >>>>>>>
> > >>>>>>> Earning the R2RML bonus points:
> > >>>>>>>
> > >>>>>>> <#map> a rr:TriplesMap;
> > >>>>>>>  rr:logicalTable [
> > >>>>>>>     rr:sqlQuery "SELECT ROWID, BORROWER, AMOUNT FROM IOU";
> > >>>>>>>  ];
> > >>>>>>>  rr:subjectMap [
> > >>>>>>>     rr:column "ROWID";
> > >>>>>>>     rr:termType rr:BlankNode
> > >>>>>>>  ];
> > >>>>>>>  rr:predicateObjectMap [
> > >>>>>>>     rr:predicate <IOU#BORROWER>;
> > >>>>>>>     rr:objectMap [ rr:column "BORROWER" ];
> > >>>>>>>  ];
> > >>>>>>>  rr:predicateObjectMap [
> > >>>>>>>     rr:predicate <IOU#AMOUNT>;
> > >>>>>>>     rr:objectMap [ rr:column "AMOUNT" ];
> > >>>>>>>  ].
> > >>>>>>>
> > >>>>>>> Now, how to do this without the ROWID vendor extension???
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> ----
> > >>>>>>>
> > >>>>>>> For the record. With a lean direct mapping, the desired output
> graph would be:
> > >>>>>>>
> > >>>>>>> _:1 <IOU#BORROWER> "Alice".
> > >>>>>>> _:1 <IOU#AMOUNT> 10.
> > >>>>>>> _:2 <IOU#BORROWER> "Bob".
> > >>>>>>> _:2 <IOU#AMOUNT> 5.
> > >>>>>>> _:3 <IOU#BORROWER> "Charlie".
> > >>>>>>> _:3 <IOU#AMOUNT> 10.
> > >>>>>>>
> > >>>>>>> The query result would be:
> > >>>>>>>
> > >>>>>>> ?x  | ?property      | ?value
> > >>>>>>> ----+----------------+----------
> > >>>>>>> _:3 | <IOU#BORROWER> | "Charlie"
> > >>>>>>> _:3 | <IOU#AMOUNT>   | 10
> > >>>>>>> _:1 |                |
> > >>>>>>> _:3 |                |
> > >>>>>>>
> > >>>>>>> The standard-compliant SQL query would be as above, but replace
> ROWID with something like (BORROWER || '@@@separator@@@' || AMOUNT), and
> add DISTINCT to each SELECT.
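
Spelled out, the lean query described above might look like the following sketch, run here through Python's built-in `sqlite3` module. The choice of SQLite and the column types are assumptions made only so that the example can be executed. The query itself uses nothing beyond `||` concatenation, DISTINCT and UNION.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed column types; the thread only shows the table contents.
conn.execute("CREATE TABLE IOU (BORROWER TEXT, AMOUNT INTEGER)")
conn.executemany(
    "INSERT INTO IOU VALUES (?, ?)",
    [("Alice", 10), ("Bob", 5), ("Charlie", 10), ("Charlie", 10)],
)

# The row identity is derived from the column values instead of a row id.
KEY = "(BORROWER || '@@@separator@@@' || AMOUNT)"

LEAN_SQL = f"""
SELECT DISTINCT {KEY} AS x, '<IOU#BORROWER>' AS property, BORROWER AS value
    FROM IOU
    WHERE BORROWER = 'Charlie'
UNION
SELECT DISTINCT {KEY}, '<IOU#AMOUNT>', AMOUNT
    FROM IOU
    WHERE BORROWER = 'Charlie'
UNION
SELECT DISTINCT {KEY}, NULL, NULL
    FROM IOU
    WHERE AMOUNT = 10
"""

lean_rows = conn.execute(LEAN_SQL).fetchall()
for row in lean_rows:
    print(row)
```

The two Charlie rows collapse into a single key, so the result has four rows, matching the lean query result listed above.
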
> > >>>>>>>
> > >>>>>>> The R2RML mapping would be the same as above with the following
> changes:
> > >>>>>>>
> > >>>>>>>  rr:logicalTable [
> > >>>>>>>     rr:tableName "IOU";
> > >>>>>>>  ];
> > >>>>>>>  rr:subjectMap [
> > >>>>>>>     rr:template "{BORROWER}@@@separator@@@{AMOUNT}";
> > >>>>>>>     rr:termType rr:BlankNode;
> > >>>>>>>  ];
> > >>>>>>>
> > >>>>>>> So, implementing the lean direct mapping is not hard using just
> standard SQL.
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Richard
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>
> > >>>>
> > >>>> ----
> > >>>> Ivan Herman, W3C Semantic Web Activity Lead
> > >>>> Home: http://www.w3.org/People/Ivan/
> > >>>> mobile: +31-641044153
> > >>>> FOAF: http://www.ivan-herman.net/foaf.rdf
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>
> > >>
> > >>
> > >
> > >
> >
> >
> >
>
>
Received on Thursday, 26 April 2012 12:06:34 UTC