Re: Brain teaser for non-PK tables from Ivan Herman on 2012-05-04 (public-rdb2rdf-wg@w3.org from May 2012)

From: Ivan Herman <ivan@w3.org>
Date: Fri, 4 May 2012 16:05:27 +0200
To: Souripriya Das <SOURIPRIYA.DAS@oracle.com>
Cc: <ashok.malhotra@oracle.com>, <public-rdb2rdf-wg@w3.org>, <juanfederico@gmail.com>, <richard@cyganiak.de>, <michael.hausenblas@deri.org>, <eric@w3.org>
Message-Id: <BB0B57EC-466A-4B79-8D6C-D52650EE29D3@w3.org>
I defer, on that detail, to those of you who know these things better:-)

Ivan

On May 4, 2012, at 15:32 , Souripriya Das wrote:

> Since some commercial DBs allow NULL unique key,
> please consider replacing 
> "tables which have at least one unique key"
> with
> "tables which have at least one non-NULL unique key"
> 
> Thanks,
> - Souri.
> 
> ----- Original Message -----
> From: ivan@w3.org
> To: juanfederico@gmail.com
> Cc: eric@w3.org, richard@cyganiak.de, ashok.malhotra@oracle.com, michael.hausenblas@deri.org, public-rdb2rdf-wg@w3.org
> Sent: Friday, May 4, 2012 9:22:32 AM GMT -05:00 US/Canada Eastern
> Subject: Re: Brain teaser for non-PK tables
> 
> 
> On May 4, 2012, at 15:18 , Juan Sequeda wrote:
> 
>> This means that we would leave the DM as-is, right?
> 
> On the technical side, yes. These changes are clarifications/editorial.
> 
> Ivan
> 
>> 
>> 
>> Juan Sequeda
>> +1-575-SEQ-UEDA
>> www.juansequeda.com
>> 
>> 
>> On Fri, May 4, 2012 at 8:15 AM, Eric Prud'hommeaux <eric@w3.org> wrote:
>> * Juan Sequeda <juanfederico@gmail.com> [2012-05-04 08:10-0500]
>>> On Fri, May 4, 2012 at 8:05 AM, Ivan Herman <ivan@w3.org> wrote:
>>> 
>>>> Eric,
>>>> 
>>>> this seems to be a bit drastic for my taste; I would not want to burn the
>>>> bridges between the R2RML and the DM. The fact that these two are closely
>>>> related, that, *in general*, the DM is a default case for R2RML is, I
>>>> believe, a strong feature, a good 'story'. I would not want to loose that.
>>>> 
>>>> However, we have to face that there *are* cases when things do not really
>>>> fit. What about modifying the two documents as follows (note that point #2
>>>> is not strictly necessary for the discussion at hand, but it makes the
>>>> relationships even clearer and stronger):
>>>> 
>>>> 1. In the DM, instead of "is intended to provide a default behavior for
>>>> R2RML: RDB to RDF Mapping Language" say "is intended to provide a default
>>>> behavior for R2RML: RDB to RDF Mapping Language for tables which have at
>>>> least one unique key"
>>>> 
>>> 
>>> +1
>> 
>> +1
>> 
>>>> 2. Add to the R2RML document (probably in the intro part): "R2RML
>>>> implementations are encouraged to provide a default mapping equivalent to
>>>> the Direct Mapping for tables which have at least one unique key"
>>>> 
>>> 
>>> +1
>> 
>> +1
>> 
>>>> 3. Add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject
>>>> labels are generated from column values, R2RML mappings do not preserve
>>>> repeated rows in SQL databases."
>>>> 
>>> 
>>> +1
>> 
>> +1
>> 
>>>> How does that sound?
>>>> 
>>>> Ivan
>>>> 
>>>> On May 4, 2012, at 13:43 , Eric Prud'hommeaux wrote:
>>>> 
>>>>> * Juan Sequeda <juanfederico@gmail.com> [2012-05-03 20:04-0500]
>>>>>> All,
>>>>>> 
>>>>>> 1) Technically we could (and maybe should) add this to the standard
>>>> (both
>>>>>> DM and R2RML) however...
>>>>>> 2) We just realized about the problem now and somebody (Eric/Richard)
>>>> came
>>>>>> up with A solution. The rest of the standard has been built on years of
>>>>>> experience. If this problem came up now just now, at the last minute, it
>>>>>> means that nobody cared much about this before. That doesn't mean that
>>>> they
>>>>>> won't want it now. But it does mean that we should look into it with
>>>> more
>>>>>> detail, given that we know the issue exists. Down the road, we will
>>>> know if
>>>>>> it is feasible, etc
>>>>> 
>>>>> We could move along more quickly if we:
>>>>> 
>>>>> 1. strike "is intended to provide a default behavior for R2RML: RDB
>>>>>    to RDF Mapping Language" from DM
>>>>> 
>>>>> 2. add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject
>>>>>    labels are generated from column values, R2RML mappings do not
>>>>>    preserve repeated rows in SQL databases.
>>>>> 
>>>>> Adding a per-row blank node identifier in v1.1 will be completely
>>>>> backward-compatible with v1.0.
>>>>> 
>>>>> 
>>>>>> Juan Sequeda
>>>>>> +1-575-SEQ-UEDA
>>>>>> www.juansequeda.com
>>>>>> 
>>>>>> 
>>>>>> On Thu, May 3, 2012 at 7:27 PM, Richard Cyganiak <richard@cyganiak.de
>>>>> wrote:
>>>>>> 
>>>>>>> Hi Eric,
>>>>>>> 
>>>>>>> My short response is: The proposal is *optional*. You don't have to
>>>>>>> implement it. You don't have to use implementations that don't support
>>>> it.
>>>>>>> It's just an extra sentence or two in the spec. There is clear guidance
>>>>>>> which option implementers should support. What harm is there in
>>>> allowing
>>>>>>> the option?
>>>>>>> 
>>>>>>> You offered one argument against providing this optional feature, and
>>>>>>> that's the point about backwards compatibility. Future WGs may find it
>>>>>>> difficult to remove this option even if the option becomes obsolete
>>>> due to
>>>>>>> a possible R2RML 1.1 update. I'll address this below.
>>>>>>> 
>>>>>>> On 3 May 2012, at 22:36, Eric Prud'hommeaux wrote:
>>>>>>>> * ashok malhotra <ashok.malhotra@oracle.com> [2012-05-03 12:22-0700]
>>>>>>>>> +1 for option 2.  Seems less onerous.   Eric?
>>>>>>>> 
>>>>>>>> It pains me that folks see me as obstructionist when I may well be
>>>>>>>> saving us a 3rd LC. In June of 2006, Fred Zemke spotted a similar
>>>>>>>> problem in the semantics of SPARQL wich took us six months to fix
>>>>>>>> <http://www.w3.org/mid/4488B936.10705@oracle.com>.
>>>>>>> 
>>>>>>> The problem in SPARQL was that it specified that implementations MUST
>>>> NOT
>>>>>>> use multiset semantics.
>>>>>>> 
>>>>>>> The proposal on our table is to RECOMMEND multiset semantics, but state
>>>>>>> that implementations MAY use set semantics for compatibility. This is
>>>> not
>>>>>>> comparable to the SPARQL situation.
>>>>>>> 
>>>>>>> I also note that the 1st LC period and the CR period have passed
>>>> without
>>>>>>> any comments on issues of cardinality.
>>>>>>> 
>>>>>>>> Speaking with Sam Madden, this seems like less of a corner case than
>>>>>>>> we originally thought. He and Zemke asserted that while some base
>>>>>>>> tables may have no uniques, it's more common for views materialized
>>>>>>>> for performance to preserve only the information required to perform
>>>>>>>> some aggregates. Before standardization of SQL, some relational DBs
>>>>>>>> operated on sets, others on multisets, and some (Zemke worked on one
>>>>>>>> called Britton Lee) preserved repeated rows until one did a
>>>>>>>> sort. Customers, particularly those using views, had to be very
>>>>>>>> careful in what order they performed various operations.
>>>>>>> 
>>>>>>> Well, I can see why customers wouldn't be so happy about this, but it's
>>>>>>> not quite the same thing here.
>>>>>>> 
>>>>>>> The order of query operations doesn't matter in the proposed design.
>>>>>>> SPARQL has multiset semantics, so even if you query a table with
>>>> discarded
>>>>>>> duplicates, the query execution is with the usual well-defined SPARQL
>>>>>>> semantics. It's only in the mapping from non-PK tables to RDF graphs
>>>> that
>>>>>>> cardinality is not maintained.
>>>>>>> 
>>>>>>>> Juan brought up fixing this in v1. It's easy for v1.1 to relax rigid
>>>>>>>> constraints in v1.0, but most charters promise backward compatibility,
>>>>>>>> so v1.1 can't impose restrictions not present in v1.0.
>>>>>>> 
>>>>>>> That all depends on what we write into the spec, doesn't it? The DM
>>>> spec
>>>>>>> could state that the permission for discarding duplicate rows may be
>>>>>>> removed in a future version, provided that a future R2RML adds a way of
>>>>>>> preserving cardinality on no-PK tables.
>>>>>>> 
>>>>>>>> Another issue is the performance of very common queries. Under
>>>>>>>> multiset semantics, any query which either reports the name of an
>>>>>>>> unnamed row requires the complex dance that Richard and I discussed.
>>>>>>> 
>>>>>>> Yes, these queries are slow.
>>>>>>> 
>>>>>>>> OTOH, under set semantics, any query which simply restricts or
>>>>>>>> projects some row attributes requires a distinct subselect, which is
>>>>>>>> either memory intensive or requires a sort of the table.
>>>>>>> 
>>>>>>> Well, you forget about query optimization, see below.
>>>>>>> 
>>>>>>>> For example,
>>>>>>>> a simple join to get the addresses of folks with year-old debts:
>>>>>>>> 
>>>>>>>> SELECT ?name ?city
>>>>>>>> WHERE {
>>>>>>>>   ?debt <IOUs#name> ?name ;
>>>>>>>>         <IOUs#date> ?date ;
>>>>>>>>         <IOUs#addr> ?addr .
>>>>>>>>   ?addr <Addresses#city> ?city
>>>>>>>>   FILTER (?date < "2011-05-03"^^xsd:date)
>>>>>>>> }
>>>>>>>> 
>>>>>>>> multiset SQL translation:
>>>>>>>> SELECT name, city
>>>>>>>>  FROM IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
>>>>>>>> WHERE date < "2011-05-03"
>>>>>>>> 
>>>>>>>> set SQL translation:
>>>>>>>> SELECT name, city
>>>>>>>>  FROM (
>>>>>>>>    SELECT DISTINCT name, date, addr, attr4, attr5
>>>>>>>>      FROM IOUs
>>>>>>>>     ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
>>>>>>>> WHERE date < "2011-05-03"
>>>>>>> 
>>>>>>> Not having thought about this too hard, the second query doesn't seem
>>>>>>> particularly bad. Isn't it equivalent to this?
>>>>>>> 
>>>>>>> SELECT name, city
>>>>>>> FROM (
>>>>>>>   SELECT DISTINCT name, date, addr, attr4, attr5
>>>>>>>     FROM IOUs
>>>>>>>     WHERE date < "2011-05-03"
>>>>>>>     ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
>>>>>>> 
>>>>>>> So the duplicate removal is only necessary over the subset of the table
>>>>>>> that is actually being returned in the end. The INNER JOIN can also be
>>>>>>> moved inside the DISTINCT, I think. The DISTINCT should then be O(n
>>>> log n)
>>>>>>> where n is the number of result rows, which isn't too bad.
>>>>>>> 
>>>>>>> IIRC, DISTINCT can be moved up in the algebra tree over most other
>>>>>>> operations, except for projections (which can usually be done last
>>>> without
>>>>>>> much performance impact), aggregates (which require more memory than
>>>>>>> DISTINCT anyways) and LIMIT (which also limits the memory required for
>>>>>>> DISTINCT).
>>>>>>> 
>>>>>>> D2RQ is fairly smart about moving DISTINCTs around before generating
>>>> the
>>>>>>> final SQL query. I'd expect that most decent query optimizers are even
>>>>>>> smarter than what we do.
>>>>>>> 
>>>>>>>> One could make a pretty good case for preserving the intuitive and
>>>>>>>> efficient query mapping for such common queries.
>>>>>>> 
>>>>>>> 1. For many of these common queries, the DISTINCT is done on a reduced
>>>>>>> intermediate result, or even on the final result set, and not on the
>>>> input
>>>>>>> data. So it's not that bad.
>>>>>>> 
>>>>>>> 2. The strange contortions required for returning subjects may well
>>>>>>> reverse the argument here. You make unproven assumptions about what
>>>> queries
>>>>>>> are common.
>>>>>>> 
>>>>>>> 3. Again, the proposal is *not* to abandon the cardinality-preserving
>>>>>>> query mapping. The proposal is to allow another query mapping as well,
>>>> for
>>>>>>> compatibility.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Richard
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> All the best, Ashok
>>>>>>>>> 
>>>>>>>>> On 5/3/2012 12:10 PM, Juan Sequeda wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Thu, May 3, 2012 at 2:01 PM, Richard Cyganiak <
>>>> richard@cyganiak.de<mailto:
>>>>>>> richard@cyganiak.de>> wrote:
>>>>>>>>>> 
>>>>>>>>>> On 3 May 2012, at 17:11, Juan Sequeda wrote:
>>>>>>>>>>> Do you accept eric's proposal (which hasn't been stated yet):
>>>>>>>>>>> 
>>>>>>>>>>> 1) Leave DM as-is
>>>>>>>>>>> 2) Add the following to R2RML
>>>>>>>>>>> 
>>>>>>>>>>> rr:subjectMap [
>>>>>>>>>>>  rr:termType rr:RowBlankNode
>>>>>>>>>>> ];
>>>>>>>>>> 
>>>>>>>>>> (I'd prefer calling it rr:BlankNode. The absence of
>>>>>>> rr:column/rr:template/rr:constant indicates the new behaviour.)
>>>>>>>>>> 
>>>>>>>>>> This is a new feature that was never discussed before. It's not
>>>> just
>>>>>>> a tweak. No existing RDB2RDF mapping language has anything comparable.
>>>> How
>>>>>>> to sensibly implement it, is a somewhat open question, AFAIK. Had this
>>>> been
>>>>>>> proposed a few months ago, everyone would have said, “sounds like an
>>>> R2RML
>>>>>>> 1.1 feature” and we would have postponed it without complaints.
>>>>>>>>>> 
>>>>>>>>>> The problem at hand is the an incompatibility between two specs,
>>>>>>> let's call them A and B, in a corner case. Now given these choices:
>>>>>>>>>> 
>>>>>>>>>> 1) Add a new and somewhat risky feature to spec A, at a time when
>>>> we
>>>>>>> thought we were just about to enter PR. Send all implementers of A
>>>> back to
>>>>>>> the drawing board. Delay the WG for an indefinite amount of time, over
>>>> a
>>>>>>> barely relevant corner case.
>>>>>>>>>> 
>>>>>>>>>> 2) Relax a constraint in spec B to say you SHOULD implement the
>>>>>>> “correct” behaviour for this corner case, but MAY also implement
>>>> another
>>>>>>> not entirely unreasonable behaviour that is compatible with A as it
>>>> is. Add
>>>>>>> some alarming language and say: “We expect future versions of A to
>>>> remove
>>>>>>> this limitation.” No implementation changes. Go to PR in three weeks.
>>>>>>>>>> 
>>>>>>>>>> To me, 2) makes a lot more sense than 1).
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I agree with Richard. Option 2 seems more reasonable at the moment.
>>>>>>>>>> 
>>>>>>>>>> We already have other issues to address for a R2RML and DM 1.1
>>>>>>> version. This could be part of it. I'm not sure how this works in the
>>>>>>> standardization process, but as a group, we believe this particular
>>>> issue
>>>>>>> is a corner case so it's not imperative to include it in the current
>>>>>>> version of the standard. However, if users complain about this corner
>>>> case
>>>>>>> (we then realize that it isn't a corner case), we realize we were wrong
>>>>>>> from the beginning. I'm guessing this sometimes (usually?) happens in
>>>>>>> standards, right?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Richard
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Juan Sequeda
>>>>>>>>>>> +1-575-SEQ-UEDA
>>>>>>>>>>> www.juansequeda.com <http://www.juansequeda.com>
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, May 3, 2012 at 11:08 AM, Michael Hausenblas <
>>>>>>> michael.hausenblas@deri.org <mailto:michael.hausenblas@deri.org>>
>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Were we close to closing R2RML's CR?
>>>>>>>>>>> 
>>>>>>>>>>> This was the last issue, all other have been resolved in last weeks
>>>>>>> meeting (see also my comments when I sent out the minutes [1]). Never
>>>> mind,
>>>>>>> we're not extending CR but entering a second, rather short LC period.
>>>>>>>>>>> 
>>>>>>>>>>> Ivan, can you prepare a respective PROPOSAL for next week's meeting
>>>>>>> please?
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>>        Michael
>>>>>>>>>>> 
>>>>>>>>>>> [1]
>>>>>>> 
>>>> http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2012May/0005.html
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Dr. Michael Hausenblas, Research Fellow
>>>>>>>>>>> DERI - Digital Enterprise Research Institute
>>>>>>>>>>> NUIG - National University of Ireland, Galway
>>>>>>>>>>> Ireland, Europe
>>>>>>>>>>> Tel.: +353 91 495730 <tel:%2B353%2091%20495730>
>>>>>>>>>>> WebID: http://sw-app.org/mic.xhtml#i
>>>>>>>>>>> 
>>>>>>>>>>> On 3 May 2012, at 17:04, Eric Prud'hommeaux wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> * Juan Sequeda <juanfederico@gmail.com <mailto:
>>>>>>> juanfederico@gmail.com>> [2012-05-03 10:50-0500]
>>>>>>>>>>>>> Looks like we have to extend CR till
>>>>>>>>>>>>> we have implementations for this corner case.
>>>>>>>>>>>> 
>>>>>>>>>>>> Were we close to closing R2RML's CR?
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> Juan Sequeda
>>>>>>>>>>>>> www.juansequeda.com <http://www.juansequeda.com>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On May 3, 2012, at 10:42 AM, Richard Cyganiak <
>>>> richard@cyganiak.de<mailto:
>>>>>>> richard@cyganiak.de>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 3 May 2012, at 16:25, Eric Prud'hommeaux wrote:
>>>>>>>>>>>>>>> presumes you can create tables, but yeah, conceptually easier
>>>>>>> query.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> (It looks like most databases have a proprietary method of
>>>> adding
>>>>>>> the indexes that doesn't require write access to the DB.)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> you can even push the symbol generation down:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Right.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The big remaining question is: How to handle this in R2RML?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Looking for an analog to:
>>>>>>>>>>>>>>> rr:subjectMap [
>>>>>>>>>>>>>>>   rr:column "ROWID";
>>>>>>>>>>>>>>>   rr:termType rr:BlankNode
>>>>>>>>>>>>>>> ];
>>>>>>>>>>>>>>> I'd propose:
>>>>>>>>>>>>>>> rr:subjectMap [
>>>>>>>>>>>>>>>   rr:termType rr:RowBlankNode
>>>>>>>>>>>>>>> ];
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> That's an option. Even keeping rr:BlankNode would work — the
>>>>>>> absence of an rr:column/rr:template/rr:constant might signal that a
>>>> fresh
>>>>>>> blank node must be allocated for each row.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Does that complicate things beyond how much a cardinality
>>>>>>> requirement necessarily complicates things?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Well, the spec only needs to define the graph generated by the
>>>>>>> mapping, so in terms of specification it would be a simple enough
>>>> change.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The implications for implementers are quite significant though.
>>>>>>> It's a new feature, the implementation costs are not trivial, no
>>>> existing
>>>>>>> implementation does this (AFAIK), so there's a certain amount of R&D
>>>>>>> required to show that it's implementable.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Richard
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> -ericP
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> -ericP
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> --
>>>>> -ericP
>>>>> 
>>>> 
>>>> 
>>>> ----
>>>> Ivan Herman, W3C Semantic Web Activity Lead
>>>> Home: http://www.w3.org/People/Ivan/
>>>> mobile: +31-641044153
>>>> FOAF: http://www.ivan-herman.net/foaf.rdf
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> --
>> -ericP
>> 
> 
> 
> ----
> Ivan Herman, W3C Semantic Web Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> FOAF: http://www.ivan-herman.net/foaf.rdf
> 
> 
> 
> 
> 
> 
> 


----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Friday, 4 May 2012 14:02:59 UTC