- From: Ivan Herman <ivan@w3.org>
- Date: Fri, 4 May 2012 15:05:43 +0200
- To: Eric Prud'hommeaux <eric@w3.org>
- Cc: Juan Sequeda <juanfederico@gmail.com>, Richard Cyganiak <richard@cyganiak.de>, ashok malhotra <ashok.malhotra@oracle.com>, Michael Hausenblas <michael.hausenblas@deri.org>, W3C RDB2RDF <public-rdb2rdf-wg@w3.org>
Eric, this seems to be a bit drastic for my taste; I would not want to burn the bridges between the R2RML and the DM. The fact that these two are closely related, that, *in general*, the DM is a default case for R2RML is, I believe, a strong feature, a good 'story'. I would not want to loose that. However, we have to face that there *are* cases when things do not really fit. What about modifying the two documents as follows (note that point #2 is not strictly necessary for the discussion at hand, but it makes the relationships even clearer and stronger): 1. In the DM, instead of "is intended to provide a default behavior for R2RML: RDB to RDF Mapping Language" say "is intended to provide a default behavior for R2RML: RDB to RDF Mapping Language for tables which have at least one unique key" 2. Add to the R2RML document (probably in the intro part): "R2RML implementations are encouraged to provide a default mapping equivalent to the Direct Mapping for tables which have at least one unique key" 3. Add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject labels are generated from column values, R2RML mappings do not preserve repeated rows in SQL databases." How does that sound? Ivan On May 4, 2012, at 13:43 , Eric Prud'hommeaux wrote: > * Juan Sequeda <juanfederico@gmail.com> [2012-05-03 20:04-0500] >> All, >> >> 1) Technically we could (and maybe should) add this to the standard (both >> DM and R2RML) however... >> 2) We just realized about the problem now and somebody (Eric/Richard) came >> up with A solution. The rest of the standard has been built on years of >> experience. If this problem came up now just now, at the last minute, it >> means that nobody cared much about this before. That doesn't mean that they >> won't want it now. But it does mean that we should look into it with more >> detail, given that we know the issue exists. Down the road, we will know if >> it is feasible, etc > > We could move along more quickly if we: > > 1. strike "is intended to provide a default behavior for R2RML: RDB > to RDF Mapping Language" from DM > > 2. add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject > labels are generated from column values, R2RML mappings do not > preserve repeated rows in SQL databases. > > Adding a per-row blank node identifier in v1.1 will be completely > backward-compatible with v1.0. > > >> Juan Sequeda >> +1-575-SEQ-UEDA >> www.juansequeda.com >> >> >> On Thu, May 3, 2012 at 7:27 PM, Richard Cyganiak <richard@cyganiak.de>wrote: >> >>> Hi Eric, >>> >>> My short response is: The proposal is *optional*. You don't have to >>> implement it. You don't have to use implementations that don't support it. >>> It's just an extra sentence or two in the spec. There is clear guidance >>> which option implementers should support. What harm is there in allowing >>> the option? >>> >>> You offered one argument against providing this optional feature, and >>> that's the point about backwards compatibility. Future WGs may find it >>> difficult to remove this option even if the option becomes obsolete due to >>> a possible R2RML 1.1 update. I'll address this below. >>> >>> On 3 May 2012, at 22:36, Eric Prud'hommeaux wrote: >>>> * ashok malhotra <ashok.malhotra@oracle.com> [2012-05-03 12:22-0700] >>>>> +1 for option 2. Seems less onerous. Eric? >>>> >>>> It pains me that folks see me as obstructionist when I may well be >>>> saving us a 3rd LC. In June of 2006, Fred Zemke spotted a similar >>>> problem in the semantics of SPARQL wich took us six months to fix >>>> <http://www.w3.org/mid/4488B936.10705@oracle.com>. >>> >>> The problem in SPARQL was that it specified that implementations MUST NOT >>> use multiset semantics. >>> >>> The proposal on our table is to RECOMMEND multiset semantics, but state >>> that implementations MAY use set semantics for compatibility. This is not >>> comparable to the SPARQL situation. >>> >>> I also note that the 1st LC period and the CR period have passed without >>> any comments on issues of cardinality. >>> >>>> Speaking with Sam Madden, this seems like less of a corner case than >>>> we originally thought. He and Zemke asserted that while some base >>>> tables may have no uniques, it's more common for views materialized >>>> for performance to preserve only the information required to perform >>>> some aggregates. Before standardization of SQL, some relational DBs >>>> operated on sets, others on multisets, and some (Zemke worked on one >>>> called Britton Lee) preserved repeated rows until one did a >>>> sort. Customers, particularly those using views, had to be very >>>> careful in what order they performed various operations. >>> >>> Well, I can see why customers wouldn't be so happy about this, but it's >>> not quite the same thing here. >>> >>> The order of query operations doesn't matter in the proposed design. >>> SPARQL has multiset semantics, so even if you query a table with discarded >>> duplicates, the query execution is with the usual well-defined SPARQL >>> semantics. It's only in the mapping from non-PK tables to RDF graphs that >>> cardinality is not maintained. >>> >>>> Juan brought up fixing this in v1. It's easy for v1.1 to relax rigid >>>> constraints in v1.0, but most charters promise backward compatibility, >>>> so v1.1 can't impose restrictions not present in v1.0. >>> >>> That all depends on what we write into the spec, doesn't it? The DM spec >>> could state that the permission for discarding duplicate rows may be >>> removed in a future version, provided that a future R2RML adds a way of >>> preserving cardinality on no-PK tables. >>> >>>> Another issue is the performance of very common queries. Under >>>> multiset semantics, any query which either reports the name of an >>>> unnamed row requires the complex dance that Richard and I discussed. >>> >>> Yes, these queries are slow. >>> >>>> OTOH, under set semantics, any query which simply restricts or >>>> projects some row attributes requires a distinct subselect, which is >>>> either memory intensive or requires a sort of the table. >>> >>> Well, you forget about query optimization, see below. >>> >>>> For example, >>>> a simple join to get the addresses of folks with year-old debts: >>>> >>>> SELECT ?name ?city >>>> WHERE { >>>> ?debt <IOUs#name> ?name ; >>>> <IOUs#date> ?date ; >>>> <IOUs#addr> ?addr . >>>> ?addr <Addresses#city> ?city >>>> FILTER (?date < "2011-05-03"^^xsd:date) >>>> } >>>> >>>> multiset SQL translation: >>>> SELECT name, city >>>> FROM IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID >>>> WHERE date < "2011-05-03" >>>> >>>> set SQL translation: >>>> SELECT name, city >>>> FROM ( >>>> SELECT DISTINCT name, date, addr, attr4, attr5 >>>> FROM IOUs >>>> ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID >>>> WHERE date < "2011-05-03" >>> >>> Not having thought about this too hard, the second query doesn't seem >>> particularly bad. Isn't it equivalent to this? >>> >>> SELECT name, city >>> FROM ( >>> SELECT DISTINCT name, date, addr, attr4, attr5 >>> FROM IOUs >>> WHERE date < "2011-05-03" >>> ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID >>> >>> So the duplicate removal is only necessary over the subset of the table >>> that is actually being returned in the end. The INNER JOIN can also be >>> moved inside the DISTINCT, I think. The DISTINCT should then be O(n log n) >>> where n is the number of result rows, which isn't too bad. >>> >>> IIRC, DISTINCT can be moved up in the algebra tree over most other >>> operations, except for projections (which can usually be done last without >>> much performance impact), aggregates (which require more memory than >>> DISTINCT anyways) and LIMIT (which also limits the memory required for >>> DISTINCT). >>> >>> D2RQ is fairly smart about moving DISTINCTs around before generating the >>> final SQL query. I'd expect that most decent query optimizers are even >>> smarter than what we do. >>> >>>> One could make a pretty good case for preserving the intuitive and >>>> efficient query mapping for such common queries. >>> >>> 1. For many of these common queries, the DISTINCT is done on a reduced >>> intermediate result, or even on the final result set, and not on the input >>> data. So it's not that bad. >>> >>> 2. The strange contortions required for returning subjects may well >>> reverse the argument here. You make unproven assumptions about what queries >>> are common. >>> >>> 3. Again, the proposal is *not* to abandon the cardinality-preserving >>> query mapping. The proposal is to allow another query mapping as well, for >>> compatibility. >>> >>> Best, >>> Richard >>> >>> >>> >>>> >>>> >>>>> All the best, Ashok >>>>> >>>>> On 5/3/2012 12:10 PM, Juan Sequeda wrote: >>>>>> >>>>>> >>>>>> On Thu, May 3, 2012 at 2:01 PM, Richard Cyganiak <richard@cyganiak.de<mailto: >>> richard@cyganiak.de>> wrote: >>>>>> >>>>>> On 3 May 2012, at 17:11, Juan Sequeda wrote: >>>>>>> Do you accept eric's proposal (which hasn't been stated yet): >>>>>>> >>>>>>> 1) Leave DM as-is >>>>>>> 2) Add the following to R2RML >>>>>>> >>>>>>> rr:subjectMap [ >>>>>>> rr:termType rr:RowBlankNode >>>>>>> ]; >>>>>> >>>>>> (I'd prefer calling it rr:BlankNode. The absence of >>> rr:column/rr:template/rr:constant indicates the new behaviour.) >>>>>> >>>>>> This is a new feature that was never discussed before. It's not just >>> a tweak. No existing RDB2RDF mapping language has anything comparable. How >>> to sensibly implement it, is a somewhat open question, AFAIK. Had this been >>> proposed a few months ago, everyone would have said, “sounds like an R2RML >>> 1.1 feature” and we would have postponed it without complaints. >>>>>> >>>>>> The problem at hand is the an incompatibility between two specs, >>> let's call them A and B, in a corner case. Now given these choices: >>>>>> >>>>>> 1) Add a new and somewhat risky feature to spec A, at a time when we >>> thought we were just about to enter PR. Send all implementers of A back to >>> the drawing board. Delay the WG for an indefinite amount of time, over a >>> barely relevant corner case. >>>>>> >>>>>> 2) Relax a constraint in spec B to say you SHOULD implement the >>> “correct” behaviour for this corner case, but MAY also implement another >>> not entirely unreasonable behaviour that is compatible with A as it is. Add >>> some alarming language and say: “We expect future versions of A to remove >>> this limitation.” No implementation changes. Go to PR in three weeks. >>>>>> >>>>>> To me, 2) makes a lot more sense than 1). >>>>>> >>>>>> >>>>>> I agree with Richard. Option 2 seems more reasonable at the moment. >>>>>> >>>>>> We already have other issues to address for a R2RML and DM 1.1 >>> version. This could be part of it. I'm not sure how this works in the >>> standardization process, but as a group, we believe this particular issue >>> is a corner case so it's not imperative to include it in the current >>> version of the standard. However, if users complain about this corner case >>> (we then realize that it isn't a corner case), we realize we were wrong >>> from the beginning. I'm guessing this sometimes (usually?) happens in >>> standards, right? >>>>>> >>>>>> >>>>>> Best, >>>>>> Richard >>>>>> >>>>>> >>>>>> >>>>>>> >>>>>>> >>>>>>> Juan Sequeda >>>>>>> +1-575-SEQ-UEDA >>>>>>> www.juansequeda.com <http://www.juansequeda.com> >>>>>>> >>>>>>> >>>>>>> On Thu, May 3, 2012 at 11:08 AM, Michael Hausenblas < >>> michael.hausenblas@deri.org <mailto:michael.hausenblas@deri.org>> wrote: >>>>>>> >>>>>>>> Were we close to closing R2RML's CR? >>>>>>> >>>>>>> This was the last issue, all other have been resolved in last weeks >>> meeting (see also my comments when I sent out the minutes [1]). Never mind, >>> we're not extending CR but entering a second, rather short LC period. >>>>>>> >>>>>>> Ivan, can you prepare a respective PROPOSAL for next week's meeting >>> please? >>>>>>> >>>>>>> Cheers, >>>>>>> Michael >>>>>>> >>>>>>> [1] >>> http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2012May/0005.html >>>>>>> >>>>>>> -- >>>>>>> Dr. Michael Hausenblas, Research Fellow >>>>>>> DERI - Digital Enterprise Research Institute >>>>>>> NUIG - National University of Ireland, Galway >>>>>>> Ireland, Europe >>>>>>> Tel.: +353 91 495730 <tel:%2B353%2091%20495730> >>>>>>> WebID: http://sw-app.org/mic.xhtml#i >>>>>>> >>>>>>> On 3 May 2012, at 17:04, Eric Prud'hommeaux wrote: >>>>>>> >>>>>>>> * Juan Sequeda <juanfederico@gmail.com <mailto: >>> juanfederico@gmail.com>> [2012-05-03 10:50-0500] >>>>>>>>> Looks like we have to extend CR till >>>>>>>>> we have implementations for this corner case. >>>>>>>> >>>>>>>> Were we close to closing R2RML's CR? >>>>>>>> >>>>>>>> >>>>>>>>> Juan Sequeda >>>>>>>>> www.juansequeda.com <http://www.juansequeda.com> >>>>>>>>> >>>>>>>>> On May 3, 2012, at 10:42 AM, Richard Cyganiak <richard@cyganiak.de<mailto: >>> richard@cyganiak.de>> wrote: >>>>>>>>> >>>>>>>>>> On 3 May 2012, at 16:25, Eric Prud'hommeaux wrote: >>>>>>>>>>> presumes you can create tables, but yeah, conceptually easier >>> query. >>>>>>>>>> >>>>>>>>>> (It looks like most databases have a proprietary method of adding >>> the indexes that doesn't require write access to the DB.) >>>>>>>>>> >>>>>>>>>>> you can even push the symbol generation down: >>>>>>>>>> >>>>>>>>>> Right. >>>>>>>>>> >>>>>>>>>>>> The big remaining question is: How to handle this in R2RML? >>>>>>>>>>> >>>>>>>>>>> Looking for an analog to: >>>>>>>>>>> rr:subjectMap [ >>>>>>>>>>> rr:column "ROWID"; >>>>>>>>>>> rr:termType rr:BlankNode >>>>>>>>>>> ]; >>>>>>>>>>> I'd propose: >>>>>>>>>>> rr:subjectMap [ >>>>>>>>>>> rr:termType rr:RowBlankNode >>>>>>>>>>> ]; >>>>>>>>>> >>>>>>>>>> That's an option. Even keeping rr:BlankNode would work — the >>> absence of an rr:column/rr:template/rr:constant might signal that a fresh >>> blank node must be allocated for each row. >>>>>>>>>> >>>>>>>>>>> Does that complicate things beyond how much a cardinality >>> requirement necessarily complicates things? >>>>>>>>>> >>>>>>>>>> Well, the spec only needs to define the graph generated by the >>> mapping, so in terms of specification it would be a simple enough change. >>>>>>>>>> >>>>>>>>>> The implications for implementers are quite significant though. >>> It's a new feature, the implementation costs are not trivial, no existing >>> implementation does this (AFAIK), so there's a certain amount of R&D >>> required to show that it's implementable. >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> Richard >>>>>>>> >>>>>>>> -- >>>>>>>> -ericP >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>> >>>> -- >>>> -ericP >>>> >>> >>> > > -- > -ericP > ---- Ivan Herman, W3C Semantic Web Activity Lead Home: http://www.w3.org/People/Ivan/ mobile: +31-641044153 FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Friday, 4 May 2012 13:03:11 UTC