Re: Brain teaser for non-PK tables from Ivan Herman on 2012-05-04 (public-rdb2rdf-wg@w3.org from May 2012)

From: Ivan Herman <ivan@w3.org>
Date: Fri, 4 May 2012 15:05:43 +0200
To: Eric Prud'hommeaux <eric@w3.org>
Cc: Juan Sequeda <juanfederico@gmail.com>, Richard Cyganiak <richard@cyganiak.de>, ashok malhotra <ashok.malhotra@oracle.com>, Michael Hausenblas <michael.hausenblas@deri.org>, W3C RDB2RDF <public-rdb2rdf-wg@w3.org>
Message-Id: <FD9565BB-380D-474B-9453-60C7CAF6072E@w3.org>
Eric,

this seems to be a bit drastic for my taste; I would not want to burn the bridges between the R2RML and the DM. The fact that these two are closely related, that, *in general*, the DM is a default case for R2RML is, I believe, a strong feature, a good 'story'. I would not want to loose that.

However, we have to face that there *are* cases when things do not really fit. What about modifying the two documents as follows (note that point #2 is not strictly necessary for the discussion at hand, but it makes the relationships even clearer and stronger):

1. In the DM, instead of "is intended to provide a default behavior for R2RML: RDB to RDF Mapping Language" say "is intended to provide a default behavior for R2RML: RDB to RDF Mapping Language for tables which have at least one unique key"

2. Add to the R2RML document (probably in the intro part): "R2RML implementations are encouraged to provide a default mapping equivalent to the Direct Mapping for tables which have at least one unique key"

3. Add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject labels are generated from column values, R2RML mappings do not preserve repeated rows in SQL databases."

How does that sound?

Ivan

On May 4, 2012, at 13:43 , Eric Prud'hommeaux wrote:

> * Juan Sequeda <juanfederico@gmail.com> [2012-05-03 20:04-0500]
>> All,
>> 
>> 1) Technically we could (and maybe should) add this to the standard (both
>> DM and R2RML) however...
>> 2) We just realized about the problem now and somebody (Eric/Richard) came
>> up with A solution. The rest of the standard has been built on years of
>> experience. If this problem came up now just now, at the last minute, it
>> means that nobody cared much about this before. That doesn't mean that they
>> won't want it now. But it does mean that we should look into it with more
>> detail, given that we know the issue exists. Down the road, we will know if
>> it is feasible, etc
> 
> We could move along more quickly if we:
> 
>  1. strike "is intended to provide a default behavior for R2RML: RDB
>     to RDF Mapping Language" from DM
> 
>  2. add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject
>     labels are generated from column values, R2RML mappings do not
>     preserve repeated rows in SQL databases.
> 
> Adding a per-row blank node identifier in v1.1 will be completely
> backward-compatible with v1.0.
> 
> 
>> Juan Sequeda
>> +1-575-SEQ-UEDA
>> www.juansequeda.com
>> 
>> 
>> On Thu, May 3, 2012 at 7:27 PM, Richard Cyganiak <richard@cyganiak.de>wrote:
>> 
>>> Hi Eric,
>>> 
>>> My short response is: The proposal is *optional*. You don't have to
>>> implement it. You don't have to use implementations that don't support it.
>>> It's just an extra sentence or two in the spec. There is clear guidance
>>> which option implementers should support. What harm is there in allowing
>>> the option?
>>> 
>>> You offered one argument against providing this optional feature, and
>>> that's the point about backwards compatibility. Future WGs may find it
>>> difficult to remove this option even if the option becomes obsolete due to
>>> a possible R2RML 1.1 update. I'll address this below.
>>> 
>>> On 3 May 2012, at 22:36, Eric Prud'hommeaux wrote:
>>>> * ashok malhotra <ashok.malhotra@oracle.com> [2012-05-03 12:22-0700]
>>>>> +1 for option 2.  Seems less onerous.   Eric?
>>>> 
>>>> It pains me that folks see me as obstructionist when I may well be
>>>> saving us a 3rd LC. In June of 2006, Fred Zemke spotted a similar
>>>> problem in the semantics of SPARQL wich took us six months to fix
>>>> <http://www.w3.org/mid/4488B936.10705@oracle.com>.
>>> 
>>> The problem in SPARQL was that it specified that implementations MUST NOT
>>> use multiset semantics.
>>> 
>>> The proposal on our table is to RECOMMEND multiset semantics, but state
>>> that implementations MAY use set semantics for compatibility. This is not
>>> comparable to the SPARQL situation.
>>> 
>>> I also note that the 1st LC period and the CR period have passed without
>>> any comments on issues of cardinality.
>>> 
>>>> Speaking with Sam Madden, this seems like less of a corner case than
>>>> we originally thought. He and Zemke asserted that while some base
>>>> tables may have no uniques, it's more common for views materialized
>>>> for performance to preserve only the information required to perform
>>>> some aggregates. Before standardization of SQL, some relational DBs
>>>> operated on sets, others on multisets, and some (Zemke worked on one
>>>> called Britton Lee) preserved repeated rows until one did a
>>>> sort. Customers, particularly those using views, had to be very
>>>> careful in what order they performed various operations.
>>> 
>>> Well, I can see why customers wouldn't be so happy about this, but it's
>>> not quite the same thing here.
>>> 
>>> The order of query operations doesn't matter in the proposed design.
>>> SPARQL has multiset semantics, so even if you query a table with discarded
>>> duplicates, the query execution is with the usual well-defined SPARQL
>>> semantics. It's only in the mapping from non-PK tables to RDF graphs that
>>> cardinality is not maintained.
>>> 
>>>> Juan brought up fixing this in v1. It's easy for v1.1 to relax rigid
>>>> constraints in v1.0, but most charters promise backward compatibility,
>>>> so v1.1 can't impose restrictions not present in v1.0.
>>> 
>>> That all depends on what we write into the spec, doesn't it? The DM spec
>>> could state that the permission for discarding duplicate rows may be
>>> removed in a future version, provided that a future R2RML adds a way of
>>> preserving cardinality on no-PK tables.
>>> 
>>>> Another issue is the performance of very common queries. Under
>>>> multiset semantics, any query which either reports the name of an
>>>> unnamed row requires the complex dance that Richard and I discussed.
>>> 
>>> Yes, these queries are slow.
>>> 
>>>> OTOH, under set semantics, any query which simply restricts or
>>>> projects some row attributes requires a distinct subselect, which is
>>>> either memory intensive or requires a sort of the table.
>>> 
>>> Well, you forget about query optimization, see below.
>>> 
>>>> For example,
>>>> a simple join to get the addresses of folks with year-old debts:
>>>> 
>>>> SELECT ?name ?city
>>>>  WHERE {
>>>>    ?debt <IOUs#name> ?name ;
>>>>          <IOUs#date> ?date ;
>>>>          <IOUs#addr> ?addr .
>>>>    ?addr <Addresses#city> ?city
>>>>    FILTER (?date < "2011-05-03"^^xsd:date)
>>>>  }
>>>> 
>>>> multiset SQL translation:
>>>> SELECT name, city
>>>>   FROM IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
>>>>  WHERE date < "2011-05-03"
>>>> 
>>>> set SQL translation:
>>>> SELECT name, city
>>>>   FROM (
>>>>     SELECT DISTINCT name, date, addr, attr4, attr5
>>>>       FROM IOUs
>>>>      ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
>>>>  WHERE date < "2011-05-03"
>>> 
>>> Not having thought about this too hard, the second query doesn't seem
>>> particularly bad. Isn't it equivalent to this?
>>> 
>>> SELECT name, city
>>>  FROM (
>>>    SELECT DISTINCT name, date, addr, attr4, attr5
>>>      FROM IOUs
>>>      WHERE date < "2011-05-03"
>>>      ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
>>> 
>>> So the duplicate removal is only necessary over the subset of the table
>>> that is actually being returned in the end. The INNER JOIN can also be
>>> moved inside the DISTINCT, I think. The DISTINCT should then be O(n log n)
>>> where n is the number of result rows, which isn't too bad.
>>> 
>>> IIRC, DISTINCT can be moved up in the algebra tree over most other
>>> operations, except for projections (which can usually be done last without
>>> much performance impact), aggregates (which require more memory than
>>> DISTINCT anyways) and LIMIT (which also limits the memory required for
>>> DISTINCT).
>>> 
>>> D2RQ is fairly smart about moving DISTINCTs around before generating the
>>> final SQL query. I'd expect that most decent query optimizers are even
>>> smarter than what we do.
>>> 
>>>> One could make a pretty good case for preserving the intuitive and
>>>> efficient query mapping for such common queries.
>>> 
>>> 1. For many of these common queries, the DISTINCT is done on a reduced
>>> intermediate result, or even on the final result set, and not on the input
>>> data. So it's not that bad.
>>> 
>>> 2. The strange contortions required for returning subjects may well
>>> reverse the argument here. You make unproven assumptions about what queries
>>> are common.
>>> 
>>> 3. Again, the proposal is *not* to abandon the cardinality-preserving
>>> query mapping. The proposal is to allow another query mapping as well, for
>>> compatibility.
>>> 
>>> Best,
>>> Richard
>>> 
>>> 
>>> 
>>>> 
>>>> 
>>>>> All the best, Ashok
>>>>> 
>>>>> On 5/3/2012 12:10 PM, Juan Sequeda wrote:
>>>>>> 
>>>>>> 
>>>>>> On Thu, May 3, 2012 at 2:01 PM, Richard Cyganiak <richard@cyganiak.de<mailto:
>>> richard@cyganiak.de>> wrote:
>>>>>> 
>>>>>>  On 3 May 2012, at 17:11, Juan Sequeda wrote:
>>>>>>> Do you accept eric's proposal (which hasn't been stated yet):
>>>>>>> 
>>>>>>> 1) Leave DM as-is
>>>>>>> 2) Add the following to R2RML
>>>>>>> 
>>>>>>> rr:subjectMap [
>>>>>>>   rr:termType rr:RowBlankNode
>>>>>>> ];
>>>>>> 
>>>>>>  (I'd prefer calling it rr:BlankNode. The absence of
>>> rr:column/rr:template/rr:constant indicates the new behaviour.)
>>>>>> 
>>>>>>  This is a new feature that was never discussed before. It's not just
>>> a tweak. No existing RDB2RDF mapping language has anything comparable. How
>>> to sensibly implement it, is a somewhat open question, AFAIK. Had this been
>>> proposed a few months ago, everyone would have said, “sounds like an R2RML
>>> 1.1 feature” and we would have postponed it without complaints.
>>>>>> 
>>>>>>  The problem at hand is the an incompatibility between two specs,
>>> let's call them A and B, in a corner case. Now given these choices:
>>>>>> 
>>>>>>  1) Add a new and somewhat risky feature to spec A, at a time when we
>>> thought we were just about to enter PR. Send all implementers of A back to
>>> the drawing board. Delay the WG for an indefinite amount of time, over a
>>> barely relevant corner case.
>>>>>> 
>>>>>>  2) Relax a constraint in spec B to say you SHOULD implement the
>>> “correct” behaviour for this corner case, but MAY also implement another
>>> not entirely unreasonable behaviour that is compatible with A as it is. Add
>>> some alarming language and say: “We expect future versions of A to remove
>>> this limitation.” No implementation changes. Go to PR in three weeks.
>>>>>> 
>>>>>>  To me, 2) makes a lot more sense than 1).
>>>>>> 
>>>>>> 
>>>>>> I agree with Richard. Option 2 seems more reasonable at the moment.
>>>>>> 
>>>>>> We already have other issues to address for a R2RML and DM 1.1
>>> version. This could be part of it. I'm not sure how this works in the
>>> standardization process, but as a group, we believe this particular issue
>>> is a corner case so it's not imperative to include it in the current
>>> version of the standard. However, if users complain about this corner case
>>> (we then realize that it isn't a corner case), we realize we were wrong
>>> from the beginning. I'm guessing this sometimes (usually?) happens in
>>> standards, right?
>>>>>> 
>>>>>> 
>>>>>>  Best,
>>>>>>  Richard
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Juan Sequeda
>>>>>>> +1-575-SEQ-UEDA
>>>>>>> www.juansequeda.com <http://www.juansequeda.com>
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, May 3, 2012 at 11:08 AM, Michael Hausenblas <
>>> michael.hausenblas@deri.org <mailto:michael.hausenblas@deri.org>> wrote:
>>>>>>> 
>>>>>>>> Were we close to closing R2RML's CR?
>>>>>>> 
>>>>>>> This was the last issue, all other have been resolved in last weeks
>>> meeting (see also my comments when I sent out the minutes [1]). Never mind,
>>> we're not extending CR but entering a second, rather short LC period.
>>>>>>> 
>>>>>>> Ivan, can you prepare a respective PROPOSAL for next week's meeting
>>> please?
>>>>>>> 
>>>>>>> Cheers,
>>>>>>>         Michael
>>>>>>> 
>>>>>>> [1]
>>> http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2012May/0005.html
>>>>>>> 
>>>>>>> --
>>>>>>> Dr. Michael Hausenblas, Research Fellow
>>>>>>> DERI - Digital Enterprise Research Institute
>>>>>>> NUIG - National University of Ireland, Galway
>>>>>>> Ireland, Europe
>>>>>>> Tel.: +353 91 495730 <tel:%2B353%2091%20495730>
>>>>>>> WebID: http://sw-app.org/mic.xhtml#i
>>>>>>> 
>>>>>>> On 3 May 2012, at 17:04, Eric Prud'hommeaux wrote:
>>>>>>> 
>>>>>>>> * Juan Sequeda <juanfederico@gmail.com <mailto:
>>> juanfederico@gmail.com>> [2012-05-03 10:50-0500]
>>>>>>>>> Looks like we have to extend CR till
>>>>>>>>> we have implementations for this corner case.
>>>>>>>> 
>>>>>>>> Were we close to closing R2RML's CR?
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> Juan Sequeda
>>>>>>>>> www.juansequeda.com <http://www.juansequeda.com>
>>>>>>>>> 
>>>>>>>>> On May 3, 2012, at 10:42 AM, Richard Cyganiak <richard@cyganiak.de<mailto:
>>> richard@cyganiak.de>> wrote:
>>>>>>>>> 
>>>>>>>>>> On 3 May 2012, at 16:25, Eric Prud'hommeaux wrote:
>>>>>>>>>>> presumes you can create tables, but yeah, conceptually easier
>>> query.
>>>>>>>>>> 
>>>>>>>>>> (It looks like most databases have a proprietary method of adding
>>> the indexes that doesn't require write access to the DB.)
>>>>>>>>>> 
>>>>>>>>>>> you can even push the symbol generation down:
>>>>>>>>>> 
>>>>>>>>>> Right.
>>>>>>>>>> 
>>>>>>>>>>>> The big remaining question is: How to handle this in R2RML?
>>>>>>>>>>> 
>>>>>>>>>>> Looking for an analog to:
>>>>>>>>>>> rr:subjectMap [
>>>>>>>>>>>    rr:column "ROWID";
>>>>>>>>>>>    rr:termType rr:BlankNode
>>>>>>>>>>> ];
>>>>>>>>>>> I'd propose:
>>>>>>>>>>> rr:subjectMap [
>>>>>>>>>>>    rr:termType rr:RowBlankNode
>>>>>>>>>>> ];
>>>>>>>>>> 
>>>>>>>>>> That's an option. Even keeping rr:BlankNode would work — the
>>> absence of an rr:column/rr:template/rr:constant might signal that a fresh
>>> blank node must be allocated for each row.
>>>>>>>>>> 
>>>>>>>>>>> Does that complicate things beyond how much a cardinality
>>> requirement necessarily complicates things?
>>>>>>>>>> 
>>>>>>>>>> Well, the spec only needs to define the graph generated by the
>>> mapping, so in terms of specification it would be a simple enough change.
>>>>>>>>>> 
>>>>>>>>>> The implications for implementers are quite significant though.
>>> It's a new feature, the implementation costs are not trivial, no existing
>>> implementation does this (AFAIK), so there's a certain amount of R&D
>>> required to show that it's implementable.
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Richard
>>>>>>>> 
>>>>>>>> --
>>>>>>>> -ericP
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> --
>>>> -ericP
>>>> 
>>> 
>>> 
> 
> -- 
> -ericP
> 


----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Friday, 4 May 2012 13:03:11 UTC