Re: Brain teaser for non-PK tables from Souripriya Das on 2012-05-04 (public-rdb2rdf-wg@w3.org from May 2012)

From: Souripriya Das <SOURIPRIYA.DAS@oracle.com>
Date: Fri, 4 May 2012 05:53:07 -0700 (PDT)
To: <eric@w3.org>
Cc: <ashok.malhotra@oracle.com>, <public-rdb2rdf-wg@w3.org>, <juanfederico@gmail.com>, <richard@cyganiak.de>, <michael.hausenblas@deri.org>, <ivan@w3.org>
Message-ID: <5bf7cb6e-3b47-4e30-a861-89bb5845ffe0@default>
I am okay with Eric's proposed actions:

"
  1. strike "is intended to provide a default behavior for R2RML: RDB
     to RDF Mapping Language" from DM

  2. add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject
     labels are generated from column values, R2RML mappings do not
     preserve repeated rows in SQL databases."
"

Thanks,
- Souri.

----- Original Message -----
From: eric@w3.org
To: juanfederico@gmail.com
Cc: richard@cyganiak.de, ashok.malhotra@oracle.com, michael.hausenblas@deri.org, ivan@w3.org, public-rdb2rdf-wg@w3.org
Sent: Friday, May 4, 2012 7:44:20 AM GMT -05:00 US/Canada Eastern
Subject: Re: Brain teaser for non-PK tables

* Juan Sequeda <juanfederico@gmail.com> [2012-05-03 20:04-0500]
> All,
> 
> 1) Technically we could (and maybe should) add this to the standard (both
> DM and R2RML) however...
> 2) We only just became aware of the problem, and somebody (Eric/Richard) came
> up with A solution. The rest of the standard has been built on years of
> experience. If this problem came up just now, at the last minute, it
> means that nobody cared much about this before. That doesn't mean that they
> won't want it now. But it does mean that we should look into it with more
> detail, given that we know the issue exists. Down the road, we will know if
> it is feasible, etc

We could move along more quickly if we:

  1. strike "is intended to provide a default behavior for R2RML: RDB
     to RDF Mapping Language" from DM

  2. add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject
     labels are generated from column values, R2RML mappings do not
     preserve repeated rows in SQL databases."

Adding a per-row blank node identifier in v1.1 will be completely
backward-compatible with v1.0.
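
As a rough illustration of the Note in item 2 and of the per-row
identifier mentioned above, here is a minimal Python sketch. It is not
R2RML and not any implementation's behaviour: the table contents, the
label scheme and the helper names (column_label, build_graph, subjects)
are all made up for the example. The only assumption it relies on is
that an RDF graph is a set of triples, so identical triples collapse.

```python
# Hypothetical IOUs table with no primary key; the first two rows are identical.
ROWS = [
    {"name": "Bob", "date": "2011-01-15", "addr": 18},
    {"name": "Bob", "date": "2011-01-15", "addr": 18},
    {"name": "Sue", "date": "2011-03-02", "addr": 23},
]


def column_label(row):
    """Blank node label built only from the row's column values."""
    return "_:" + "_".join(f"{key}-{row[key]}" for key in sorted(row))


def build_graph(rows, label_for):
    """Return the graph as a set of (subject, predicate, object) triples."""
    graph = set()
    for index, row in enumerate(rows):
        subject = label_for(index, row)
        for column, value in row.items():
            graph.add((subject, f"<IOUs#{column}>", str(value)))
    return graph


def subjects(graph):
    """Distinct subject labels that occur in the graph."""
    return {s for s, _, _ in graph}


# Labels generated from column values: the repeated row yields the same
# subject and the same triples, so it disappears from the graph.
by_value = build_graph(ROWS, lambda index, row: column_label(row))

# A fresh label per row (the idea behind a per-row blank node): the
# repeated row keeps its own subject and its own triples.
by_row = build_graph(ROWS, lambda index, row: f"_:row{index}")

print(len(subjects(by_value)), len(by_value))
print(len(subjects(by_row)), len(by_row))
```

With value-derived labels the three rows produce two subjects; with one
label per row they produce three, which is the cardinality a non-PK
table would need preserved.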


> Juan Sequeda
> +1-575-SEQ-UEDA
> www.juansequeda.com
> 
> 
> On Thu, May 3, 2012 at 7:27 PM, Richard Cyganiak <richard@cyganiak.de> wrote:
> 
> > Hi Eric,
> >
> > My short response is: The proposal is *optional*. You don't have to
> > implement it. You don't have to use implementations that don't support it.
> > It's just an extra sentence or two in the spec. There is clear guidance
> > which option implementers should support. What harm is there in allowing
> > the option?
> >
> > You offered one argument against providing this optional feature, and
> > that's the point about backwards compatibility. Future WGs may find it
> > difficult to remove this option even if the option becomes obsolete due to
> > a possible R2RML 1.1 update. I'll address this below.
> >
> > On 3 May 2012, at 22:36, Eric Prud'hommeaux wrote:
> > > * ashok malhotra <ashok.malhotra@oracle.com> [2012-05-03 12:22-0700]
> > >> +1 for option 2.  Seems less onerous.   Eric?
> > >
> > > It pains me that folks see me as obstructionist when I may well be
> > > saving us a 3rd LC. In June of 2006, Fred Zemke spotted a similar
> > > problem in the semantics of SPARQL which took us six months to fix
> > > <http://www.w3.org/mid/4488B936.10705@oracle.com>.
> >
> > The problem in SPARQL was that it specified that implementations MUST NOT
> > use multiset semantics.
> >
> > The proposal on our table is to RECOMMEND multiset semantics, but state
> > that implementations MAY use set semantics for compatibility. This is not
> > comparable to the SPARQL situation.
> >
> > I also note that the 1st LC period and the CR period have passed without
> > any comments on issues of cardinality.
> >
> > > Speaking with Sam Madden, this seems like less of a corner case than
> > > we originally thought. He and Zemke asserted that while some base
> > > tables may have no uniques, it's more common for views materialized
> > > for performance to preserve only the information required to perform
> > > some aggregates. Before standardization of SQL, some relational DBs
> > > operated on sets, others on multisets, and some (Zemke worked on one
> > > called Britton Lee) preserved repeated rows until one did a
> > > sort. Customers, particularly those using views, had to be very
> > > careful in what order they performed various operations.
> >
> > Well, I can see why customers wouldn't be so happy about this, but it's
> > not quite the same thing here.
> >
> > The order of query operations doesn't matter in the proposed design.
> > SPARQL has multiset semantics, so even if you query a table with discarded
> > duplicates, the query execution is with the usual well-defined SPARQL
> > semantics. It's only in the mapping from non-PK tables to RDF graphs that
> > cardinality is not maintained.
> >
> > > Juan brought up fixing this in v1. It's easy for v1.1 to relax rigid
> > > constraints in v1.0, but most charters promise backward compatibility,
> > > so v1.1 can't impose restrictions not present in v1.0.
> >
> > That all depends on what we write into the spec, doesn't it? The DM spec
> > could state that the permission for discarding duplicate rows may be
> > removed in a future version, provided that a future R2RML adds a way of
> > preserving cardinality on no-PK tables.
> >
> > > Another issue is the performance of very common queries. Under
> > > multiset semantics, any query which reports the name of an
> > > unnamed row requires the complex dance that Richard and I discussed.
> >
> > Yes, these queries are slow.
> >
> > > OTOH, under set semantics, any query which simply restricts or
> > > projects some row attributes requires a distinct subselect, which is
> > > either memory intensive or requires a sort of the table.
> >
> > Well, you forget about query optimization, see below.
> >
> > > For example,
> > > a simple join to get the addresses of folks with year-old debts:
> > >
> > >  SELECT ?name ?city
> > >   WHERE {
> > >     ?debt <IOUs#name> ?name ;
> > >           <IOUs#date> ?date ;
> > >           <IOUs#addr> ?addr .
> > >     ?addr <Addresses#city> ?city
> > >     FILTER (?date < "2011-05-03"^^xsd:date)
> > >   }
> > >
> > > multiset SQL translation:
> > >  SELECT name, city
> > >    FROM IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
> > >   WHERE date < '2011-05-03'
> > >
> > > set SQL translation:
> > >  SELECT name, city
> > >    FROM (
> > >      SELECT DISTINCT name, date, addr, attr4, attr5
> > >        FROM IOUs
> > >       ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
> > >   WHERE date < '2011-05-03'
> >
> > Not having thought about this too hard, the second query doesn't seem
> > particularly bad. Isn't it equivalent to this?
> >
> >  SELECT name, city
> >   FROM (
> >     SELECT DISTINCT name, date, addr, attr4, attr5
> >       FROM IOUs
> >       WHERE date < '2011-05-03'
> >       ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
> >
> > So the duplicate removal is only necessary over the subset of the table
> > that is actually being returned in the end. The INNER JOIN can also be
> > moved inside the DISTINCT, I think. The DISTINCT should then be O(n log n)
> > where n is the number of result rows, which isn't too bad.
> >
> > IIRC, DISTINCT can be moved up in the algebra tree over most other
> > operations, except for projections (which can usually be done last without
> > much performance impact), aggregates (which require more memory than
> > DISTINCT anyways) and LIMIT (which also limits the memory required for
> > DISTINCT).
> >
> > D2RQ is fairly smart about moving DISTINCTs around before generating the
> > final SQL query. I'd expect that most decent query optimizers are even
> > smarter than what we do.
> >
> > > One could make a pretty good case for preserving the intuitive and
> > > efficient query mapping for such common queries.
> >
> > 1. For many of these common queries, the DISTINCT is done on a reduced
> > intermediate result, or even on the final result set, and not on the input
> > data. So it's not that bad.
> >
> > 2. The strange contortions required for returning subjects may well
> > reverse the argument here. You make unproven assumptions about what queries
> > are common.
> >
> > 3. Again, the proposal is *not* to abandon the cardinality-preserving
> > query mapping. The proposal is to allow another query mapping as well, for
> > compatibility.
> >
> > Best,
> > Richard
> >
> >
> >
> > >
> > >
> > >> All the best, Ashok
> > >>
> > >> On 5/3/2012 12:10 PM, Juan Sequeda wrote:
> > >>>
> > >>>
> > >>> On Thu, May 3, 2012 at 2:01 PM, Richard Cyganiak <richard@cyganiak.de> wrote:
> > >>>
> > >>>   On 3 May 2012, at 17:11, Juan Sequeda wrote:
> > >>>> Do you accept eric's proposal (which hasn't been stated yet):
> > >>>>
> > >>>> 1) Leave DM as-is
> > >>>> 2) Add the following to R2RML
> > >>>>
> > >>>> rr:subjectMap [
> > >>>>    rr:termType rr:RowBlankNode
> > >>>>  ];
> > >>>
> > >>>   (I'd prefer calling it rr:BlankNode. The absence of
> > >>>   rr:column/rr:template/rr:constant indicates the new behaviour.)
> > >>>
> > >>>   This is a new feature that was never discussed before. It's not just
> > >>>   a tweak. No existing RDB2RDF mapping language has anything comparable.
> > >>>   How to sensibly implement it is a somewhat open question, AFAIK. Had
> > >>>   this been proposed a few months ago, everyone would have said, “sounds
> > >>>   like an R2RML 1.1 feature” and we would have postponed it without
> > >>>   complaints.
> > >>>
> > >>>   The problem at hand is an incompatibility between two specs,
> > >>>   let's call them A and B, in a corner case. Now given these choices:
> > >>>
> > >>>   1) Add a new and somewhat risky feature to spec A, at a time when we
> > >>>   thought we were just about to enter PR. Send all implementers of A back
> > >>>   to the drawing board. Delay the WG for an indefinite amount of time,
> > >>>   over a barely relevant corner case.
> > >>>
> > >>>   2) Relax a constraint in spec B to say you SHOULD implement the
> > >>>   “correct” behaviour for this corner case, but MAY also implement another
> > >>>   not entirely unreasonable behaviour that is compatible with A as it is.
> > >>>   Add some alarming language and say: “We expect future versions of A to
> > >>>   remove this limitation.” No implementation changes. Go to PR in three
> > >>>   weeks.
> > >>>
> > >>>   To me, 2) makes a lot more sense than 1).
> > >>>
> > >>>
> > >>> I agree with Richard. Option 2 seems more reasonable at the moment.
> > >>>
> > >>> We already have other issues to address for an R2RML and DM 1.1
> > >>> version. This could be part of it. I'm not sure how this works in the
> > >>> standardization process, but as a group, we believe this particular issue
> > >>> is a corner case so it's not imperative to include it in the current
> > >>> version of the standard. However, if users complain about this corner case
> > >>> (and we then realize that it isn't a corner case), we will know we were
> > >>> wrong from the beginning. I'm guessing this sometimes (usually?) happens
> > >>> in standards, right?
> > >>>
> > >>>
> > >>>   Best,
> > >>>   Richard
> > >>>
> > >>>
> > >>>
> > >>>>
> > >>>>
> > >>>> Juan Sequeda
> > >>>> +1-575-SEQ-UEDA
> > >>>> www.juansequeda.com
> > >>>>
> > >>>>
> > >>>> On Thu, May 3, 2012 at 11:08 AM, Michael Hausenblas <michael.hausenblas@deri.org> wrote:
> > >>>>
> > >>>>> Were we close to closing R2RML's CR?
> > >>>>
> > >>>> This was the last issue; all others have been resolved in last week's
> > >>>> meeting (see also my comments when I sent out the minutes [1]). Never
> > >>>> mind, we're not extending CR but entering a second, rather short LC
> > >>>> period.
> > >>>>
> > >>>> Ivan, can you prepare a respective PROPOSAL for next week's meeting
> > >>>> please?
> > >>>>
> > >>>> Cheers,
> > >>>>          Michael
> > >>>>
> > >>>> [1] http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2012May/0005.html
> > >>>>
> > >>>> --
> > >>>> Dr. Michael Hausenblas, Research Fellow
> > >>>> DERI - Digital Enterprise Research Institute
> > >>>> NUIG - National University of Ireland, Galway
> > >>>> Ireland, Europe
> > >>>> Tel.: +353 91 495730
> > >>>> WebID: http://sw-app.org/mic.xhtml#i
> > >>>>
> > >>>> On 3 May 2012, at 17:04, Eric Prud'hommeaux wrote:
> > >>>>
> > >>>>> * Juan Sequeda <juanfederico@gmail.com> [2012-05-03 10:50-0500]
> > >>>>>> Looks like we have to extend CR till
> > >>>>>> we have implementations for this corner case.
> > >>>>>
> > >>>>> Were we close to closing R2RML's CR?
> > >>>>>
> > >>>>>
> > >>>>>> Juan Sequeda
> > >>>>>> www.juansequeda.com
> > >>>>>>
> > >>>>>> On May 3, 2012, at 10:42 AM, Richard Cyganiak <richard@cyganiak.de> wrote:
> > >>>>>>
> > >>>>>>> On 3 May 2012, at 16:25, Eric Prud'hommeaux wrote:
> > >>>>>>>> presumes you can create tables, but yeah, conceptually easier
> > >>>>>>>> query.
> > >>>>>>>
> > >>>>>>> (It looks like most databases have a proprietary method of adding
> > >>>>>>> the indexes that doesn't require write access to the DB.)
> > >>>>>>>
> > >>>>>>>> you can even push the symbol generation down:
> > >>>>>>>
> > >>>>>>> Right.
> > >>>>>>>
> > >>>>>>>>> The big remaining question is: How to handle this in R2RML?
> > >>>>>>>>
> > >>>>>>>> Looking for an analog to:
> > >>>>>>>> rr:subjectMap [
> > >>>>>>>>     rr:column "ROWID";
> > >>>>>>>>     rr:termType rr:BlankNode
> > >>>>>>>>  ];
> > >>>>>>>> I'd propose:
> > >>>>>>>> rr:subjectMap [
> > >>>>>>>>     rr:termType rr:RowBlankNode
> > >>>>>>>>  ];
> > >>>>>>>
> > >>>>>>> That's an option. Even keeping rr:BlankNode would work: the
> > >>>>>>> absence of an rr:column/rr:template/rr:constant might signal that a
> > >>>>>>> fresh blank node must be allocated for each row.
> > >>>>>>>
> > >>>>>>>> Does that complicate things beyond how much a cardinality
> > >>>>>>>> requirement necessarily complicates things?
> > >>>>>>>
> > >>>>>>> Well, the spec only needs to define the graph generated by the
> > >>>>>>> mapping, so in terms of specification it would be a simple enough
> > >>>>>>> change.
> > >>>>>>>
> > >>>>>>> The implications for implementers are quite significant though.
> > >>>>>>> It's a new feature, the implementation costs are not trivial, no
> > >>>>>>> existing implementation does this (AFAIK), so there's a certain amount
> > >>>>>>> of R&D required to show that it's implementable.
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Richard
> > >>>>>
> > >>>>> --
> > >>>>> -ericP
> > >>>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >
> > > --
> > > -ericP
> > >
> >
> >

-- 
-ericP
Received on Friday, 4 May 2012 12:53:49 UTC