- From: Souripriya Das <SOURIPRIYA.DAS@oracle.com>
- Date: Fri, 4 May 2012 05:53:07 -0700 (PDT)
- To: <eric@w3.org>
- Cc: <ashok.malhotra@oracle.com>, <public-rdb2rdf-wg@w3.org>, <juanfederico@gmail.com>, <richard@cyganiak.de>, <michael.hausenblas@deri.org>, <ivan@w3.org>
I am okay with Eric's proposed actions: " 1. strike "is intended to provide a default behavior for R2RML: RDB to RDF Mapping Language" from DM 2. add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject labels are generated from column values, R2RML mappings do not preserve repeated rows in SQL databases. " Thanks, - Souri. ----- Original Message ----- From: eric@w3.org To: juanfederico@gmail.com Cc: richard@cyganiak.de, ashok.malhotra@oracle.com, michael.hausenblas@deri.org, ivan@w3.org, public-rdb2rdf-wg@w3.org Sent: Friday, May 4, 2012 7:44:20 AM GMT -05:00 US/Canada Eastern Subject: Re: Brain teaser for non-PK tables * Juan Sequeda <juanfederico@gmail.com> [2012-05-03 20:04-0500] > All, > > 1) Technically we could (and maybe should) add this to the standard (both > DM and R2RML) however... > 2) We just realized about the problem now and somebody (Eric/Richard) came > up with A solution. The rest of the standard has been built on years of > experience. If this problem came up now just now, at the last minute, it > means that nobody cared much about this before. That doesn't mean that they > won't want it now. But it does mean that we should look into it with more > detail, given that we know the issue exists. Down the road, we will know if > it is feasible, etc We could move along more quickly if we: 1. strike "is intended to provide a default behavior for R2RML: RDB to RDF Mapping Language" from DM 2. add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject labels are generated from column values, R2RML mappings do not preserve repeated rows in SQL databases. Adding a per-row blank node identifier in v1.1 will be completely backward-compatible with v1.0. > Juan Sequeda > +1-575-SEQ-UEDA > www.juansequeda.com > > > On Thu, May 3, 2012 at 7:27 PM, Richard Cyganiak <richard@cyganiak.de>wrote: > > > Hi Eric, > > > > My short response is: The proposal is *optional*. You don't have to > > implement it. You don't have to use implementations that don't support it. > > It's just an extra sentence or two in the spec. There is clear guidance > > which option implementers should support. What harm is there in allowing > > the option? > > > > You offered one argument against providing this optional feature, and > > that's the point about backwards compatibility. Future WGs may find it > > difficult to remove this option even if the option becomes obsolete due to > > a possible R2RML 1.1 update. I'll address this below. > > > > On 3 May 2012, at 22:36, Eric Prud'hommeaux wrote: > > > * ashok malhotra <ashok.malhotra@oracle.com> [2012-05-03 12:22-0700] > > >> +1 for option 2. Seems less onerous. Eric? > > > > > > It pains me that folks see me as obstructionist when I may well be > > > saving us a 3rd LC. In June of 2006, Fred Zemke spotted a similar > > > problem in the semantics of SPARQL wich took us six months to fix > > > <http://www.w3.org/mid/4488B936.10705@oracle.com>. > > > > The problem in SPARQL was that it specified that implementations MUST NOT > > use multiset semantics. > > > > The proposal on our table is to RECOMMEND multiset semantics, but state > > that implementations MAY use set semantics for compatibility. This is not > > comparable to the SPARQL situation. > > > > I also note that the 1st LC period and the CR period have passed without > > any comments on issues of cardinality. > > > > > Speaking with Sam Madden, this seems like less of a corner case than > > > we originally thought. He and Zemke asserted that while some base > > > tables may have no uniques, it's more common for views materialized > > > for performance to preserve only the information required to perform > > > some aggregates. Before standardization of SQL, some relational DBs > > > operated on sets, others on multisets, and some (Zemke worked on one > > > called Britton Lee) preserved repeated rows until one did a > > > sort. Customers, particularly those using views, had to be very > > > careful in what order they performed various operations. > > > > Well, I can see why customers wouldn't be so happy about this, but it's > > not quite the same thing here. > > > > The order of query operations doesn't matter in the proposed design. > > SPARQL has multiset semantics, so even if you query a table with discarded > > duplicates, the query execution is with the usual well-defined SPARQL > > semantics. It's only in the mapping from non-PK tables to RDF graphs that > > cardinality is not maintained. > > > > > Juan brought up fixing this in v1. It's easy for v1.1 to relax rigid > > > constraints in v1.0, but most charters promise backward compatibility, > > > so v1.1 can't impose restrictions not present in v1.0. > > > > That all depends on what we write into the spec, doesn't it? The DM spec > > could state that the permission for discarding duplicate rows may be > > removed in a future version, provided that a future R2RML adds a way of > > preserving cardinality on no-PK tables. > > > > > Another issue is the performance of very common queries. Under > > > multiset semantics, any query which either reports the name of an > > > unnamed row requires the complex dance that Richard and I discussed. > > > > Yes, these queries are slow. > > > > > OTOH, under set semantics, any query which simply restricts or > > > projects some row attributes requires a distinct subselect, which is > > > either memory intensive or requires a sort of the table. > > > > Well, you forget about query optimization, see below. > > > > > For example, > > > a simple join to get the addresses of folks with year-old debts: > > > > > > SELECT ?name ?city > > > WHERE { > > > ?debt <IOUs#name> ?name ; > > > <IOUs#date> ?date ; > > > <IOUs#addr> ?addr . > > > ?addr <Addresses#city> ?city > > > FILTER (?date < "2011-05-03"^^xsd:date) > > > } > > > > > > multiset SQL translation: > > > SELECT name, city > > > FROM IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID > > > WHERE date < "2011-05-03" > > > > > > set SQL translation: > > > SELECT name, city > > > FROM ( > > > SELECT DISTINCT name, date, addr, attr4, attr5 > > > FROM IOUs > > > ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID > > > WHERE date < "2011-05-03" > > > > Not having thought about this too hard, the second query doesn't seem > > particularly bad. Isn't it equivalent to this? > > > > SELECT name, city > > FROM ( > > SELECT DISTINCT name, date, addr, attr4, attr5 > > FROM IOUs > > WHERE date < "2011-05-03" > > ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID > > > > So the duplicate removal is only necessary over the subset of the table > > that is actually being returned in the end. The INNER JOIN can also be > > moved inside the DISTINCT, I think. The DISTINCT should then be O(n log n) > > where n is the number of result rows, which isn't too bad. > > > > IIRC, DISTINCT can be moved up in the algebra tree over most other > > operations, except for projections (which can usually be done last without > > much performance impact), aggregates (which require more memory than > > DISTINCT anyways) and LIMIT (which also limits the memory required for > > DISTINCT). > > > > D2RQ is fairly smart about moving DISTINCTs around before generating the > > final SQL query. I'd expect that most decent query optimizers are even > > smarter than what we do. > > > > > One could make a pretty good case for preserving the intuitive and > > > efficient query mapping for such common queries. > > > > 1. For many of these common queries, the DISTINCT is done on a reduced > > intermediate result, or even on the final result set, and not on the input > > data. So it's not that bad. > > > > 2. The strange contortions required for returning subjects may well > > reverse the argument here. You make unproven assumptions about what queries > > are common. > > > > 3. Again, the proposal is *not* to abandon the cardinality-preserving > > query mapping. The proposal is to allow another query mapping as well, for > > compatibility. > > > > Best, > > Richard > > > > > > > > > > > > > > >> All the best, Ashok > > >> > > >> On 5/3/2012 12:10 PM, Juan Sequeda wrote: > > >>> > > >>> > > >>> On Thu, May 3, 2012 at 2:01 PM, Richard Cyganiak <richard@cyganiak.de<mailto: > > richard@cyganiak.de>> wrote: > > >>> > > >>> On 3 May 2012, at 17:11, Juan Sequeda wrote: > > >>>> Do you accept eric's proposal (which hasn't been stated yet): > > >>>> > > >>>> 1) Leave DM as-is > > >>>> 2) Add the following to R2RML > > >>>> > > >>>> rr:subjectMap [ > > >>>> rr:termType rr:RowBlankNode > > >>>> ]; > > >>> > > >>> (I'd prefer calling it rr:BlankNode. The absence of > > rr:column/rr:template/rr:constant indicates the new behaviour.) > > >>> > > >>> This is a new feature that was never discussed before. It's not just > > a tweak. No existing RDB2RDF mapping language has anything comparable. How > > to sensibly implement it, is a somewhat open question, AFAIK. Had this been > > proposed a few months ago, everyone would have said, “sounds like an R2RML > > 1.1 feature” and we would have postponed it without complaints. > > >>> > > >>> The problem at hand is the an incompatibility between two specs, > > let's call them A and B, in a corner case. Now given these choices: > > >>> > > >>> 1) Add a new and somewhat risky feature to spec A, at a time when we > > thought we were just about to enter PR. Send all implementers of A back to > > the drawing board. Delay the WG for an indefinite amount of time, over a > > barely relevant corner case. > > >>> > > >>> 2) Relax a constraint in spec B to say you SHOULD implement the > > “correct” behaviour for this corner case, but MAY also implement another > > not entirely unreasonable behaviour that is compatible with A as it is. Add > > some alarming language and say: “We expect future versions of A to remove > > this limitation.” No implementation changes. Go to PR in three weeks. > > >>> > > >>> To me, 2) makes a lot more sense than 1). > > >>> > > >>> > > >>> I agree with Richard. Option 2 seems more reasonable at the moment. > > >>> > > >>> We already have other issues to address for a R2RML and DM 1.1 > > version. This could be part of it. I'm not sure how this works in the > > standardization process, but as a group, we believe this particular issue > > is a corner case so it's not imperative to include it in the current > > version of the standard. However, if users complain about this corner case > > (we then realize that it isn't a corner case), we realize we were wrong > > from the beginning. I'm guessing this sometimes (usually?) happens in > > standards, right? > > >>> > > >>> > > >>> Best, > > >>> Richard > > >>> > > >>> > > >>> > > >>>> > > >>>> > > >>>> Juan Sequeda > > >>>> +1-575-SEQ-UEDA > > >>>> www.juansequeda.com <http://www.juansequeda.com> > > >>>> > > >>>> > > >>>> On Thu, May 3, 2012 at 11:08 AM, Michael Hausenblas < > > michael.hausenblas@deri.org <mailto:michael.hausenblas@deri.org>> wrote: > > >>>> > > >>>>> Were we close to closing R2RML's CR? > > >>>> > > >>>> This was the last issue, all other have been resolved in last weeks > > meeting (see also my comments when I sent out the minutes [1]). Never mind, > > we're not extending CR but entering a second, rather short LC period. > > >>>> > > >>>> Ivan, can you prepare a respective PROPOSAL for next week's meeting > > please? > > >>>> > > >>>> Cheers, > > >>>> Michael > > >>>> > > >>>> [1] > > http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2012May/0005.html > > >>>> > > >>>> -- > > >>>> Dr. Michael Hausenblas, Research Fellow > > >>>> DERI - Digital Enterprise Research Institute > > >>>> NUIG - National University of Ireland, Galway > > >>>> Ireland, Europe > > >>>> Tel.: +353 91 495730 <tel:%2B353%2091%20495730> > > >>>> WebID: http://sw-app.org/mic.xhtml#i > > >>>> > > >>>> On 3 May 2012, at 17:04, Eric Prud'hommeaux wrote: > > >>>> > > >>>>> * Juan Sequeda <juanfederico@gmail.com <mailto: > > juanfederico@gmail.com>> [2012-05-03 10:50-0500] > > >>>>>> Looks like we have to extend CR till > > >>>>>> we have implementations for this corner case. > > >>>>> > > >>>>> Were we close to closing R2RML's CR? > > >>>>> > > >>>>> > > >>>>>> Juan Sequeda > > >>>>>> www.juansequeda.com <http://www.juansequeda.com> > > >>>>>> > > >>>>>> On May 3, 2012, at 10:42 AM, Richard Cyganiak <richard@cyganiak.de<mailto: > > richard@cyganiak.de>> wrote: > > >>>>>> > > >>>>>>> On 3 May 2012, at 16:25, Eric Prud'hommeaux wrote: > > >>>>>>>> presumes you can create tables, but yeah, conceptually easier > > query. > > >>>>>>> > > >>>>>>> (It looks like most databases have a proprietary method of adding > > the indexes that doesn't require write access to the DB.) > > >>>>>>> > > >>>>>>>> you can even push the symbol generation down: > > >>>>>>> > > >>>>>>> Right. > > >>>>>>> > > >>>>>>>>> The big remaining question is: How to handle this in R2RML? > > >>>>>>>> > > >>>>>>>> Looking for an analog to: > > >>>>>>>> rr:subjectMap [ > > >>>>>>>> rr:column "ROWID"; > > >>>>>>>> rr:termType rr:BlankNode > > >>>>>>>> ]; > > >>>>>>>> I'd propose: > > >>>>>>>> rr:subjectMap [ > > >>>>>>>> rr:termType rr:RowBlankNode > > >>>>>>>> ]; > > >>>>>>> > > >>>>>>> That's an option. Even keeping rr:BlankNode would work — the > > absence of an rr:column/rr:template/rr:constant might signal that a fresh > > blank node must be allocated for each row. > > >>>>>>> > > >>>>>>>> Does that complicate things beyond how much a cardinality > > requirement necessarily complicates things? > > >>>>>>> > > >>>>>>> Well, the spec only needs to define the graph generated by the > > mapping, so in terms of specification it would be a simple enough change. > > >>>>>>> > > >>>>>>> The implications for implementers are quite significant though. > > It's a new feature, the implementation costs are not trivial, no existing > > implementation does this (AFAIK), so there's a certain amount of R&D > > required to show that it's implementable. > > >>>>>>> > > >>>>>>> Best, > > >>>>>>> Richard > > >>>>> > > >>>>> -- > > >>>>> -ericP > > >>>>> > > >>>> > > >>>> > > >>>> > > >>> > > >>> > > > > > > -- > > > -ericP > > > > > > > -- -ericP
Received on Friday, 4 May 2012 12:53:49 UTC