- From: Ivan Herman <ivan@w3.org>
- Date: Fri, 4 May 2012 16:05:27 +0200
- To: Souripriya Das <SOURIPRIYA.DAS@oracle.com>
- Cc: <ashok.malhotra@oracle.com>, <public-rdb2rdf-wg@w3.org>, <juanfederico@gmail.com>, <richard@cyganiak.de>, <michael.hausenblas@deri.org>, <eric@w3.org>
I defer, on that detail, to those of you who know these things better:-)

Ivan

On May 4, 2012, at 15:32 , Souripriya Das wrote:

> Since some commercial DBs allow NULL unique key,
> please consider replacing
> "tables which have at least one unique key"
> with
> "tables which have at least one non-NULL unique key"
>
> Thanks,
> - Souri.
>
> ----- Original Message -----
> From: ivan@w3.org
> To: juanfederico@gmail.com
> Cc: eric@w3.org, richard@cyganiak.de, ashok.malhotra@oracle.com, michael.hausenblas@deri.org, public-rdb2rdf-wg@w3.org
> Sent: Friday, May 4, 2012 9:22:32 AM GMT -05:00 US/Canada Eastern
> Subject: Re: Brain teaser for non-PK tables
>
> On May 4, 2012, at 15:18 , Juan Sequeda wrote:
>
>> This means that we would leave the DM as-is, right?
>
> On the technical side, yes. These changes are clarifications/editorial.
>
> Ivan
>
>>
>> Juan Sequeda
>> +1-575-SEQ-UEDA
>> www.juansequeda.com
>>
>> On Fri, May 4, 2012 at 8:15 AM, Eric Prud'hommeaux <eric@w3.org> wrote:
>> * Juan Sequeda <juanfederico@gmail.com> [2012-05-04 08:10-0500]
>>> On Fri, May 4, 2012 at 8:05 AM, Ivan Herman <ivan@w3.org> wrote:
>>>
>>>> Eric,
>>>>
>>>> this seems to be a bit drastic for my taste; I would not want to burn the
>>>> bridges between the R2RML and the DM. The fact that these two are closely
>>>> related, that, *in general*, the DM is a default case for R2RML is, I
>>>> believe, a strong feature, a good 'story'. I would not want to lose that.
>>>>
>>>> However, we have to face that there *are* cases when things do not really
>>>> fit. What about modifying the two documents as follows (note that point #2
>>>> is not strictly necessary for the discussion at hand, but it makes the
>>>> relationships even clearer and stronger):
>>>>
>>>> 1. In the DM, instead of "is intended to provide a default behavior for
>>>> R2RML: RDB to RDF Mapping Language" say "is intended to provide a default
>>>> behavior for R2RML: RDB to RDF Mapping Language for tables which have at
>>>> least one unique key"
>>>>
>>>
>>> +1
>>
>> +1
>>
>>>> 2. Add to the R2RML document (probably in the intro part): "R2RML
>>>> implementations are encouraged to provide a default mapping equivalent to
>>>> the Direct Mapping for tables which have at least one unique key"
>>>>
>>>
>>> +1
>>
>> +1
>>
>>>> 3. Add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject
>>>> labels are generated from column values, R2RML mappings do not preserve
>>>> repeated rows in SQL databases."
>>>>
>>>
>>> +1
>>
>> +1
>>
>>>> How does that sound?
>>>>
>>>> Ivan
>>>>
>>>> On May 4, 2012, at 13:43 , Eric Prud'hommeaux wrote:
>>>>
>>>>> * Juan Sequeda <juanfederico@gmail.com> [2012-05-03 20:04-0500]
>>>>>> All,
>>>>>>
>>>>>> 1) Technically we could (and maybe should) add this to the standard
>>>>>> (both DM and R2RML) however...
>>>>>> 2) We just realized about the problem now and somebody (Eric/Richard)
>>>>>> came up with A solution. The rest of the standard has been built on
>>>>>> years of experience. If this problem came up just now, at the last
>>>>>> minute, it means that nobody cared much about this before. That doesn't
>>>>>> mean that they won't want it now. But it does mean that we should look
>>>>>> into it with more detail, given that we know the issue exists. Down the
>>>>>> road, we will know if it is feasible, etc
>>>>>
>>>>> We could move along more quickly if we:
>>>>>
>>>>> 1. strike "is intended to provide a default behavior for R2RML: RDB
>>>>> to RDF Mapping Language" from DM
>>>>>
>>>>> 2. add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject
>>>>> labels are generated from column values, R2RML mappings do not
>>>>> preserve repeated rows in SQL databases.
>>>>>
>>>>> Adding a per-row blank node identifier in v1.1 will be completely
>>>>> backward-compatible with v1.0.
>>>>>
>>>>>> Juan Sequeda
>>>>>> +1-575-SEQ-UEDA
>>>>>> www.juansequeda.com
>>>>>>
>>>>>> On Thu, May 3, 2012 at 7:27 PM, Richard Cyganiak <richard@cyganiak.de> wrote:
>>>>>>
>>>>>>> Hi Eric,
>>>>>>>
>>>>>>> My short response is: The proposal is *optional*. You don't have to
>>>>>>> implement it. You don't have to use implementations that don't
>>>>>>> support it. It's just an extra sentence or two in the spec. There is
>>>>>>> clear guidance which option implementers should support. What harm is
>>>>>>> there in allowing the option?
>>>>>>>
>>>>>>> You offered one argument against providing this optional feature, and
>>>>>>> that's the point about backwards compatibility. Future WGs may find it
>>>>>>> difficult to remove this option even if the option becomes obsolete
>>>>>>> due to a possible R2RML 1.1 update. I'll address this below.
>>>>>>>
>>>>>>> On 3 May 2012, at 22:36, Eric Prud'hommeaux wrote:
>>>>>>>> * ashok malhotra <ashok.malhotra@oracle.com> [2012-05-03 12:22-0700]
>>>>>>>>> +1 for option 2. Seems less onerous. Eric?
>>>>>>>>
>>>>>>>> It pains me that folks see me as obstructionist when I may well be
>>>>>>>> saving us a 3rd LC. In June of 2006, Fred Zemke spotted a similar
>>>>>>>> problem in the semantics of SPARQL which took us six months to fix
>>>>>>>> <http://www.w3.org/mid/4488B936.10705@oracle.com>.
>>>>>>>
>>>>>>> The problem in SPARQL was that it specified that implementations MUST
>>>>>>> NOT use multiset semantics.
>>>>>>>
>>>>>>> The proposal on our table is to RECOMMEND multiset semantics, but state
>>>>>>> that implementations MAY use set semantics for compatibility. This is
>>>>>>> not comparable to the SPARQL situation.
>>>>>>>
>>>>>>> I also note that the 1st LC period and the CR period have passed
>>>>>>> without any comments on issues of cardinality.
>>>>>>>
>>>>>>>> Speaking with Sam Madden, this seems like less of a corner case than
>>>>>>>> we originally thought. He and Zemke asserted that while some base
>>>>>>>> tables may have no uniques, it's more common for views materialized
>>>>>>>> for performance to preserve only the information required to perform
>>>>>>>> some aggregates. Before standardization of SQL, some relational DBs
>>>>>>>> operated on sets, others on multisets, and some (Zemke worked on one
>>>>>>>> called Britton Lee) preserved repeated rows until one did a
>>>>>>>> sort. Customers, particularly those using views, had to be very
>>>>>>>> careful in what order they performed various operations.
>>>>>>>
>>>>>>> Well, I can see why customers wouldn't be so happy about this, but it's
>>>>>>> not quite the same thing here.
>>>>>>>
>>>>>>> The order of query operations doesn't matter in the proposed design.
>>>>>>> SPARQL has multiset semantics, so even if you query a table with
>>>>>>> discarded duplicates, the query execution is with the usual
>>>>>>> well-defined SPARQL semantics. It's only in the mapping from non-PK
>>>>>>> tables to RDF graphs that cardinality is not maintained.
>>>>>>>
>>>>>>>> Juan brought up fixing this in v1. It's easy for v1.1 to relax rigid
>>>>>>>> constraints in v1.0, but most charters promise backward compatibility,
>>>>>>>> so v1.1 can't impose restrictions not present in v1.0.
>>>>>>>
>>>>>>> That all depends on what we write into the spec, doesn't it? The DM
>>>>>>> spec could state that the permission for discarding duplicate rows may
>>>>>>> be removed in a future version, provided that a future R2RML adds a way
>>>>>>> of preserving cardinality on no-PK tables.
>>>>>>>
>>>>>>>> Another issue is the performance of very common queries. Under
>>>>>>>> multiset semantics, any query which either reports the name of an
>>>>>>>> unnamed row requires the complex dance that Richard and I discussed.
>>>>>>>
>>>>>>> Yes, these queries are slow.
>>>>>>>
>>>>>>>> OTOH, under set semantics, any query which simply restricts or
>>>>>>>> projects some row attributes requires a distinct subselect, which is
>>>>>>>> either memory intensive or requires a sort of the table.
>>>>>>>
>>>>>>> Well, you forget about query optimization, see below.
>>>>>>>
>>>>>>>> For example,
>>>>>>>> a simple join to get the addresses of folks with year-old debts:
>>>>>>>>
>>>>>>>> SELECT ?name ?city
>>>>>>>> WHERE {
>>>>>>>>   ?debt <IOUs#name> ?name ;
>>>>>>>>         <IOUs#date> ?date ;
>>>>>>>>         <IOUs#addr> ?addr .
>>>>>>>>   ?addr <Addresses#city> ?city
>>>>>>>>   FILTER (?date < "2011-05-03"^^xsd:date)
>>>>>>>> }
>>>>>>>>
>>>>>>>> multiset SQL translation:
>>>>>>>> SELECT name, city
>>>>>>>>   FROM IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
>>>>>>>>  WHERE date < "2011-05-03"
>>>>>>>>
>>>>>>>> set SQL translation:
>>>>>>>> SELECT name, city
>>>>>>>>   FROM (
>>>>>>>>     SELECT DISTINCT name, date, addr, attr4, attr5
>>>>>>>>       FROM IOUs
>>>>>>>>   ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
>>>>>>>>  WHERE date < "2011-05-03"
>>>>>>>
>>>>>>> Not having thought about this too hard, the second query doesn't seem
>>>>>>> particularly bad. Isn't it equivalent to this?
>>>>>>>
>>>>>>> SELECT name, city
>>>>>>>   FROM (
>>>>>>>     SELECT DISTINCT name, date, addr, attr4, attr5
>>>>>>>       FROM IOUs
>>>>>>>      WHERE date < "2011-05-03"
>>>>>>>   ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
>>>>>>>
>>>>>>> So the duplicate removal is only necessary over the subset of the table
>>>>>>> that is actually being returned in the end. The INNER JOIN can also be
>>>>>>> moved inside the DISTINCT, I think. The DISTINCT should then be
>>>>>>> O(n log n) where n is the number of result rows, which isn't too bad.
>>>>>>>
>>>>>>> IIRC, DISTINCT can be moved up in the algebra tree over most other
>>>>>>> operations, except for projections (which can usually be done last
>>>>>>> without much performance impact), aggregates (which require more memory
>>>>>>> than DISTINCT anyways) and LIMIT (which also limits the memory required
>>>>>>> for DISTINCT).
>>>>>>>
>>>>>>> D2RQ is fairly smart about moving DISTINCTs around before generating
>>>>>>> the final SQL query. I'd expect that most decent query optimizers are
>>>>>>> even smarter than what we do.
>>>>>>>
>>>>>>>> One could make a pretty good case for preserving the intuitive and
>>>>>>>> efficient query mapping for such common queries.
>>>>>>>
>>>>>>> 1. For many of these common queries, the DISTINCT is done on a reduced
>>>>>>> intermediate result, or even on the final result set, and not on the
>>>>>>> input data. So it's not that bad.
>>>>>>>
>>>>>>> 2. The strange contortions required for returning subjects may well
>>>>>>> reverse the argument here. You make unproven assumptions about what
>>>>>>> queries are common.
>>>>>>>
>>>>>>> 3. Again, the proposal is *not* to abandon the cardinality-preserving
>>>>>>> query mapping. The proposal is to allow another query mapping as well,
>>>>>>> for compatibility.
>>>>>>>
>>>>>>> Best,
>>>>>>> Richard
>>>>>>>
>>>>>>>>
>>>>>>>>> All the best, Ashok
>>>>>>>>>
>>>>>>>>> On 5/3/2012 12:10 PM, Juan Sequeda wrote:
>>>>>>>>>>
>>>>>>>>>> On Thu, May 3, 2012 at 2:01 PM, Richard Cyganiak <richard@cyganiak.de> wrote:
>>>>>>>>>>
>>>>>>>>>> On 3 May 2012, at 17:11, Juan Sequeda wrote:
>>>>>>>>>>> Do you accept eric's proposal (which hasn't been stated yet):
>>>>>>>>>>>
>>>>>>>>>>> 1) Leave DM as-is
>>>>>>>>>>> 2) Add the following to R2RML
>>>>>>>>>>>
>>>>>>>>>>> rr:subjectMap [
>>>>>>>>>>>   rr:termType rr:RowBlankNode
>>>>>>>>>>> ];
>>>>>>>>>>
>>>>>>>>>> (I'd prefer calling it rr:BlankNode. The absence of
>>>>>>>>>> rr:column/rr:template/rr:constant indicates the new behaviour.)
>>>>>>>>>>
>>>>>>>>>> This is a new feature that was never discussed before. It's not
>>>>>>>>>> just a tweak. No existing RDB2RDF mapping language has anything
>>>>>>>>>> comparable. How to sensibly implement it, is a somewhat open
>>>>>>>>>> question, AFAIK. Had this been proposed a few months ago, everyone
>>>>>>>>>> would have said, “sounds like an R2RML 1.1 feature” and we would
>>>>>>>>>> have postponed it without complaints.
>>>>>>>>>>
>>>>>>>>>> The problem at hand is an incompatibility between two specs,
>>>>>>>>>> let's call them A and B, in a corner case. Now given these choices:
>>>>>>>>>>
>>>>>>>>>> 1) Add a new and somewhat risky feature to spec A, at a time when
>>>>>>>>>> we thought we were just about to enter PR. Send all implementers of
>>>>>>>>>> A back to the drawing board. Delay the WG for an indefinite amount
>>>>>>>>>> of time, over a barely relevant corner case.
>>>>>>>>>>
>>>>>>>>>> 2) Relax a constraint in spec B to say you SHOULD implement the
>>>>>>>>>> “correct” behaviour for this corner case, but MAY also implement
>>>>>>>>>> another not entirely unreasonable behaviour that is compatible with
>>>>>>>>>> A as it is. Add some alarming language and say: “We expect future
>>>>>>>>>> versions of A to remove this limitation.” No implementation changes.
>>>>>>>>>> Go to PR in three weeks.
>>>>>>>>>>
>>>>>>>>>> To me, 2) makes a lot more sense than 1).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I agree with Richard. Option 2 seems more reasonable at the moment.
>>>>>>>>>>
>>>>>>>>>> We already have other issues to address for a R2RML and DM 1.1
>>>>>>>>>> version. This could be part of it. I'm not sure how this works in
>>>>>>>>>> the standardization process, but as a group, we believe this
>>>>>>>>>> particular issue is a corner case so it's not imperative to include
>>>>>>>>>> it in the current version of the standard. However, if users
>>>>>>>>>> complain about this corner case (we then realize that it isn't a
>>>>>>>>>> corner case), we realize we were wrong from the beginning. I'm
>>>>>>>>>> guessing this sometimes (usually?) happens in standards, right?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Richard
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Juan Sequeda
>>>>>>>>>>> +1-575-SEQ-UEDA
>>>>>>>>>>> www.juansequeda.com
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 3, 2012 at 11:08 AM, Michael Hausenblas <michael.hausenblas@deri.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Were we close to closing R2RML's CR?
>>>>>>>>>>>
>>>>>>>>>>> This was the last issue, all others have been resolved in last
>>>>>>>>>>> week's meeting (see also my comments when I sent out the minutes
>>>>>>>>>>> [1]). Never mind, we're not extending CR but entering a second,
>>>>>>>>>>> rather short LC period.
>>>>>>>>>>>
>>>>>>>>>>> Ivan, can you prepare a respective PROPOSAL for next week's meeting
>>>>>>>>>>> please?
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Michael
>>>>>>>>>>>
>>>>>>>>>>> [1] http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2012May/0005.html
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Dr. Michael Hausenblas, Research Fellow
>>>>>>>>>>> DERI - Digital Enterprise Research Institute
>>>>>>>>>>> NUIG - National University of Ireland, Galway
>>>>>>>>>>> Ireland, Europe
>>>>>>>>>>> Tel.: +353 91 495730
>>>>>>>>>>> WebID: http://sw-app.org/mic.xhtml#i
>>>>>>>>>>>
>>>>>>>>>>> On 3 May 2012, at 17:04, Eric Prud'hommeaux wrote:
>>>>>>>>>>>
>>>>>>>>>>>> * Juan Sequeda <juanfederico@gmail.com> [2012-05-03 10:50-0500]
>>>>>>>>>>>>> Looks like we have to extend CR till
>>>>>>>>>>>>> we have implementations for this corner case.
>>>>>>>>>>>>
>>>>>>>>>>>> Were we close to closing R2RML's CR?
>>>>>>>>>>>>
>>>>>>>>>>>>> Juan Sequeda
>>>>>>>>>>>>> www.juansequeda.com
>>>>>>>>>>>>>
>>>>>>>>>>>>> On May 3, 2012, at 10:42 AM, Richard Cyganiak <richard@cyganiak.de> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 3 May 2012, at 16:25, Eric Prud'hommeaux wrote:
>>>>>>>>>>>>>>> presumes you can create tables, but yeah, conceptually easier
>>>>>>>>>>>>>>> query.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (It looks like most databases have a proprietary method of
>>>>>>>>>>>>>> adding the indexes that doesn't require write access to the DB.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> you can even push the symbol generation down:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Right.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The big remaining question is: How to handle this in R2RML?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking for an analog to:
>>>>>>>>>>>>>>> rr:subjectMap [
>>>>>>>>>>>>>>>   rr:column "ROWID";
>>>>>>>>>>>>>>>   rr:termType rr:BlankNode
>>>>>>>>>>>>>>> ];
>>>>>>>>>>>>>>> I'd propose:
>>>>>>>>>>>>>>> rr:subjectMap [
>>>>>>>>>>>>>>>   rr:termType rr:RowBlankNode
>>>>>>>>>>>>>>> ];
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That's an option. Even keeping rr:BlankNode would work — the
>>>>>>>>>>>>>> absence of an rr:column/rr:template/rr:constant might signal
>>>>>>>>>>>>>> that a fresh blank node must be allocated for each row.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Does that complicate things beyond how much a cardinality
>>>>>>>>>>>>>>> requirement necessarily complicates things?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Well, the spec only needs to define the graph generated by the
>>>>>>>>>>>>>> mapping, so in terms of specification it would be a simple
>>>>>>>>>>>>>> enough change.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The implications for implementers are quite significant though.
>>>>>>>>>>>>>> It's a new feature, the implementation costs are not trivial, no
>>>>>>>>>>>>>> existing implementation does this (AFAIK), so there's a certain
>>>>>>>>>>>>>> amount of R&D required to show that it's implementable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Richard
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> -ericP
>>>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> -ericP
>>>>>>>>
>>>>>
>>>>> --
>>>>> -ericP
>>>>>
>>>>
>>>> ----
>>>> Ivan Herman, W3C Semantic Web Activity Lead
>>>> Home: http://www.w3.org/People/Ivan/
>>>> mobile: +31-641044153
>>>> FOAF: http://www.ivan-herman.net/foaf.rdf
>>>>
>>
>> --
>> -ericP
>>
>
> ----
> Ivan Herman, W3C Semantic Web Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> FOAF: http://www.ivan-herman.net/foaf.rdf
>

----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Friday, 4 May 2012 14:02:59 UTC