Re: Brain teaser for non-PK tables from Ivan Herman on 2012-05-04 (public-rdb2rdf-wg@w3.org from May 2012)

From: Ivan Herman <ivan@w3.org>
Date: Fri, 4 May 2012 16:12:54 +0200
To: Richard Cyganiak <richard@cyganiak.de>
Cc: Eric Prud'hommeaux <eric@w3.org>, Juan Sequeda <juanfederico@gmail.com>, ashok malhotra <ashok.malhotra@oracle.com>, Michael Hausenblas <michael.hausenblas@deri.org>, W3C RDB2RDF <public-rdb2rdf-wg@w3.org>
Message-Id: <DE874CFC-2000-40FF-82CA-8A80F910FA38@w3.org>
Richard,

I had this alternative:

http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2012May/0036.html

my goal is to simply document the present situation v.a.v. the two documents. I believe that, for the vast majority of cases, the original relationship remains and it just documents the corner case where problems arise. R2RML1.1 may then amend that at some point, but that is not our problem right now.

Cheers

Ivan

On May 4, 2012, at 16:00 , Richard Cyganiak wrote:

> Hi Eric,
> 
> On 4 May 2012, at 12:43, Eric Prud'hommeaux wrote:
>> We could move along more quickly if we:
>> 
>> 1. strike "is intended to provide a default behavior for R2RML: RDB
>>    to RDF Mapping Language" from DM
> 
> This WG is chartered to produce a mapping language, tentatively called R2RML in the charter. R2RML is the WG's main deliverable.
> 
> This WG is *not* chartered to produce a Direct Mapping. Most certainly not as a Recommendation-track document.
> 
> Earlier in the WG's lifetime, an understanding was reached that the WG will also work on the Direct Mapping as a second Recommendation that adds value to R2RML by providing a useful default mapping. I have supported this bifurcation of the WG's effort based on this understanding. I now regret this because keeping the DM compatible with R2RML has been a constant uphill battle and drain of WG resources ever since, with apparently no interest whatsoever from you in such compatibility.
> 
> Your proposal now is to cut the DM's lifeline by objecting to compatibility with R2RML 1.0.
> 
> I don't see how this WG could send something to Recommendation that isn't in the charter, and that doesn't add value to the main deliverable that the WG is chartered to produce.
> 
> So here's another attempt at saving compatibility. In the DM spec, Section 3:
> 
> REMOVE: [[
> If the table has no primary key, the row node is a fresh blank node that is unique to this row.
> ]]
> 
> ADD: [[
> If the table has no primary key, the row node is a fresh blank node. The blank node SHOULD be unique to this row (“cardinality-preserving Direct Mapping”). Alternatively, the blank node MAY be unique to the list of column values in the row (“non-cardinality-preserving Direct Mapping”). Implementations MUST document their choice in this regard.
> 
> NOTE: Only the “cardinality-preserving Direct Mapping” preserves cardinality in the case of tables with duplicate rows, and hence this behavior should be strongly preferred. The alternative behavior is permitted to maintain compatibility with R2RML 1.0, which does not allow the creation of distinct blank nodes for duplicate rows. It is expected that future versions of R2RML will address this limitation, and that a future versions of the Direct Mapping specification will then no longer permit the non-cardinality-preserving Direct Mapping.
> ]]
> 
>> 2. add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject
>>    labels are generated from column values, R2RML mappings do not
>>    preserve repeated rows in SQL databases.
> 
> +1. And add: “It is expected that a future version of R2RML will address this limitation.” And s/repeated/duplicate/ (“repeated” implies ordering and adjacency)
> 
> If this doesn't work for you, then I have to repeat my question:
> 
>>> My short response is: The proposal is *optional*. You don't have to
>>> implement it. You don't have to use implementations that don't support it.
>>> It's an extra sentence or two in the spec. There is clear guidance
>>> which option implementers should support. What harm is there in allowing
>>> the option?
> 
> Best,
> Richard
> 
> 
> 
>> 
>> Adding a per-row blank node identifier in v1.1 will be completely
>> backward-compatible with v1.0.
>> 
>> 
>>> Juan Sequeda
>>> +1-575-SEQ-UEDA
>>> www.juansequeda.com
>>> 
>>> 
>>> On Thu, May 3, 2012 at 7:27 PM, Richard Cyganiak <richard@cyganiak.de>wrote:
>>> 
>>>> Hi Eric,
>>>> 
>>>> My short response is: The proposal is *optional*. You don't have to
>>>> implement it. You don't have to use implementations that don't support it.
>>>> It's just an extra sentence or two in the spec. There is clear guidance
>>>> which option implementers should support. What harm is there in allowing
>>>> the option?
>>>> 
>>>> You offered one argument against providing this optional feature, and
>>>> that's the point about backwards compatibility. Future WGs may find it
>>>> difficult to remove this option even if the option becomes obsolete due to
>>>> a possible R2RML 1.1 update. I'll address this below.
>>>> 
>>>> On 3 May 2012, at 22:36, Eric Prud'hommeaux wrote:
>>>>> * ashok malhotra <ashok.malhotra@oracle.com> [2012-05-03 12:22-0700]
>>>>>> +1 for option 2.  Seems less onerous.   Eric?
>>>>> 
>>>>> It pains me that folks see me as obstructionist when I may well be
>>>>> saving us a 3rd LC. In June of 2006, Fred Zemke spotted a similar
>>>>> problem in the semantics of SPARQL wich took us six months to fix
>>>>> <http://www.w3.org/mid/4488B936.10705@oracle.com>.
>>>> 
>>>> The problem in SPARQL was that it specified that implementations MUST NOT
>>>> use multiset semantics.
>>>> 
>>>> The proposal on our table is to RECOMMEND multiset semantics, but state
>>>> that implementations MAY use set semantics for compatibility. This is not
>>>> comparable to the SPARQL situation.
>>>> 
>>>> I also note that the 1st LC period and the CR period have passed without
>>>> any comments on issues of cardinality.
>>>> 
>>>>> Speaking with Sam Madden, this seems like less of a corner case than
>>>>> we originally thought. He and Zemke asserted that while some base
>>>>> tables may have no uniques, it's more common for views materialized
>>>>> for performance to preserve only the information required to perform
>>>>> some aggregates. Before standardization of SQL, some relational DBs
>>>>> operated on sets, others on multisets, and some (Zemke worked on one
>>>>> called Britton Lee) preserved repeated rows until one did a
>>>>> sort. Customers, particularly those using views, had to be very
>>>>> careful in what order they performed various operations.
>>>> 
>>>> Well, I can see why customers wouldn't be so happy about this, but it's
>>>> not quite the same thing here.
>>>> 
>>>> The order of query operations doesn't matter in the proposed design.
>>>> SPARQL has multiset semantics, so even if you query a table with discarded
>>>> duplicates, the query execution is with the usual well-defined SPARQL
>>>> semantics. It's only in the mapping from non-PK tables to RDF graphs that
>>>> cardinality is not maintained.
>>>> 
>>>>> Juan brought up fixing this in v1. It's easy for v1.1 to relax rigid
>>>>> constraints in v1.0, but most charters promise backward compatibility,
>>>>> so v1.1 can't impose restrictions not present in v1.0.
>>>> 
>>>> That all depends on what we write into the spec, doesn't it? The DM spec
>>>> could state that the permission for discarding duplicate rows may be
>>>> removed in a future version, provided that a future R2RML adds a way of
>>>> preserving cardinality on no-PK tables.
>>>> 
>>>>> Another issue is the performance of very common queries. Under
>>>>> multiset semantics, any query which either reports the name of an
>>>>> unnamed row requires the complex dance that Richard and I discussed.
>>>> 
>>>> Yes, these queries are slow.
>>>> 
>>>>> OTOH, under set semantics, any query which simply restricts or
>>>>> projects some row attributes requires a distinct subselect, which is
>>>>> either memory intensive or requires a sort of the table.
>>>> 
>>>> Well, you forget about query optimization, see below.
>>>> 
>>>>> For example,
>>>>> a simple join to get the addresses of folks with year-old debts:
>>>>> 
>>>>> SELECT ?name ?city
>>>>> WHERE {
>>>>>   ?debt <IOUs#name> ?name ;
>>>>>         <IOUs#date> ?date ;
>>>>>         <IOUs#addr> ?addr .
>>>>>   ?addr <Addresses#city> ?city
>>>>>   FILTER (?date < "2011-05-03"^^xsd:date)
>>>>> }
>>>>> 
>>>>> multiset SQL translation:
>>>>> SELECT name, city
>>>>>  FROM IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
>>>>> WHERE date < "2011-05-03"
>>>>> 
>>>>> set SQL translation:
>>>>> SELECT name, city
>>>>>  FROM (
>>>>>    SELECT DISTINCT name, date, addr, attr4, attr5
>>>>>      FROM IOUs
>>>>>     ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
>>>>> WHERE date < "2011-05-03"
>>>> 
>>>> Not having thought about this too hard, the second query doesn't seem
>>>> particularly bad. Isn't it equivalent to this?
>>>> 
>>>> SELECT name, city
>>>> FROM (
>>>>   SELECT DISTINCT name, date, addr, attr4, attr5
>>>>     FROM IOUs
>>>>     WHERE date < "2011-05-03"
>>>>     ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
>>>> 
>>>> So the duplicate removal is only necessary over the subset of the table
>>>> that is actually being returned in the end. The INNER JOIN can also be
>>>> moved inside the DISTINCT, I think. The DISTINCT should then be O(n log n)
>>>> where n is the number of result rows, which isn't too bad.
>>>> 
>>>> IIRC, DISTINCT can be moved up in the algebra tree over most other
>>>> operations, except for projections (which can usually be done last without
>>>> much performance impact), aggregates (which require more memory than
>>>> DISTINCT anyways) and LIMIT (which also limits the memory required for
>>>> DISTINCT).
>>>> 
>>>> D2RQ is fairly smart about moving DISTINCTs around before generating the
>>>> final SQL query. I'd expect that most decent query optimizers are even
>>>> smarter than what we do.
>>>> 
>>>>> One could make a pretty good case for preserving the intuitive and
>>>>> efficient query mapping for such common queries.
>>>> 
>>>> 1. For many of these common queries, the DISTINCT is done on a reduced
>>>> intermediate result, or even on the final result set, and not on the input
>>>> data. So it's not that bad.
>>>> 
>>>> 2. The strange contortions required for returning subjects may well
>>>> reverse the argument here. You make unproven assumptions about what queries
>>>> are common.
>>>> 
>>>> 3. Again, the proposal is *not* to abandon the cardinality-preserving
>>>> query mapping. The proposal is to allow another query mapping as well, for
>>>> compatibility.
>>>> 
>>>> Best,
>>>> Richard
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>>> All the best, Ashok
>>>>>> 
>>>>>> On 5/3/2012 12:10 PM, Juan Sequeda wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, May 3, 2012 at 2:01 PM, Richard Cyganiak <richard@cyganiak.de<mailto:
>>>> richard@cyganiak.de>> wrote:
>>>>>>> 
>>>>>>> On 3 May 2012, at 17:11, Juan Sequeda wrote:
>>>>>>>> Do you accept eric's proposal (which hasn't been stated yet):
>>>>>>>> 
>>>>>>>> 1) Leave DM as-is
>>>>>>>> 2) Add the following to R2RML
>>>>>>>> 
>>>>>>>> rr:subjectMap [
>>>>>>>>  rr:termType rr:RowBlankNode
>>>>>>>> ];
>>>>>>> 
>>>>>>> (I'd prefer calling it rr:BlankNode. The absence of
>>>> rr:column/rr:template/rr:constant indicates the new behaviour.)
>>>>>>> 
>>>>>>> This is a new feature that was never discussed before. It's not just
>>>> a tweak. No existing RDB2RDF mapping language has anything comparable. How
>>>> to sensibly implement it, is a somewhat open question, AFAIK. Had this been
>>>> proposed a few months ago, everyone would have said, “sounds like an R2RML
>>>> 1.1 feature” and we would have postponed it without complaints.
>>>>>>> 
>>>>>>> The problem at hand is the an incompatibility between two specs,
>>>> let's call them A and B, in a corner case. Now given these choices:
>>>>>>> 
>>>>>>> 1) Add a new and somewhat risky feature to spec A, at a time when we
>>>> thought we were just about to enter PR. Send all implementers of A back to
>>>> the drawing board. Delay the WG for an indefinite amount of time, over a
>>>> barely relevant corner case.
>>>>>>> 
>>>>>>> 2) Relax a constraint in spec B to say you SHOULD implement the
>>>> “correct” behaviour for this corner case, but MAY also implement another
>>>> not entirely unreasonable behaviour that is compatible with A as it is. Add
>>>> some alarming language and say: “We expect future versions of A to remove
>>>> this limitation.” No implementation changes. Go to PR in three weeks.
>>>>>>> 
>>>>>>> To me, 2) makes a lot more sense than 1).
>>>>>>> 
>>>>>>> 
>>>>>>> I agree with Richard. Option 2 seems more reasonable at the moment.
>>>>>>> 
>>>>>>> We already have other issues to address for a R2RML and DM 1.1
>>>> version. This could be part of it. I'm not sure how this works in the
>>>> standardization process, but as a group, we believe this particular issue
>>>> is a corner case so it's not imperative to include it in the current
>>>> version of the standard. However, if users complain about this corner case
>>>> (we then realize that it isn't a corner case), we realize we were wrong
>>>> from the beginning. I'm guessing this sometimes (usually?) happens in
>>>> standards, right?
>>>>>>> 
>>>>>>> 
>>>>>>> Best,
>>>>>>> Richard
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Juan Sequeda
>>>>>>>> +1-575-SEQ-UEDA
>>>>>>>> www.juansequeda.com <http://www.juansequeda.com>
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, May 3, 2012 at 11:08 AM, Michael Hausenblas <
>>>> michael.hausenblas@deri.org <mailto:michael.hausenblas@deri.org>> wrote:
>>>>>>>> 
>>>>>>>>> Were we close to closing R2RML's CR?
>>>>>>>> 
>>>>>>>> This was the last issue, all other have been resolved in last weeks
>>>> meeting (see also my comments when I sent out the minutes [1]). Never mind,
>>>> we're not extending CR but entering a second, rather short LC period.
>>>>>>>> 
>>>>>>>> Ivan, can you prepare a respective PROPOSAL for next week's meeting
>>>> please?
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>>        Michael
>>>>>>>> 
>>>>>>>> [1]
>>>> http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2012May/0005.html
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Dr. Michael Hausenblas, Research Fellow
>>>>>>>> DERI - Digital Enterprise Research Institute
>>>>>>>> NUIG - National University of Ireland, Galway
>>>>>>>> Ireland, Europe
>>>>>>>> Tel.: +353 91 495730 <tel:%2B353%2091%20495730>
>>>>>>>> WebID: http://sw-app.org/mic.xhtml#i
>>>>>>>> 
>>>>>>>> On 3 May 2012, at 17:04, Eric Prud'hommeaux wrote:
>>>>>>>> 
>>>>>>>>> * Juan Sequeda <juanfederico@gmail.com <mailto:
>>>> juanfederico@gmail.com>> [2012-05-03 10:50-0500]
>>>>>>>>>> Looks like we have to extend CR till
>>>>>>>>>> we have implementations for this corner case.
>>>>>>>>> 
>>>>>>>>> Were we close to closing R2RML's CR?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Juan Sequeda
>>>>>>>>>> www.juansequeda.com <http://www.juansequeda.com>
>>>>>>>>>> 
>>>>>>>>>> On May 3, 2012, at 10:42 AM, Richard Cyganiak <richard@cyganiak.de<mailto:
>>>> richard@cyganiak.de>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> On 3 May 2012, at 16:25, Eric Prud'hommeaux wrote:
>>>>>>>>>>>> presumes you can create tables, but yeah, conceptually easier
>>>> query.
>>>>>>>>>>> 
>>>>>>>>>>> (It looks like most databases have a proprietary method of adding
>>>> the indexes that doesn't require write access to the DB.)
>>>>>>>>>>> 
>>>>>>>>>>>> you can even push the symbol generation down:
>>>>>>>>>>> 
>>>>>>>>>>> Right.
>>>>>>>>>>> 
>>>>>>>>>>>>> The big remaining question is: How to handle this in R2RML?
>>>>>>>>>>>> 
>>>>>>>>>>>> Looking for an analog to:
>>>>>>>>>>>> rr:subjectMap [
>>>>>>>>>>>>   rr:column "ROWID";
>>>>>>>>>>>>   rr:termType rr:BlankNode
>>>>>>>>>>>> ];
>>>>>>>>>>>> I'd propose:
>>>>>>>>>>>> rr:subjectMap [
>>>>>>>>>>>>   rr:termType rr:RowBlankNode
>>>>>>>>>>>> ];
>>>>>>>>>>> 
>>>>>>>>>>> That's an option. Even keeping rr:BlankNode would work — the
>>>> absence of an rr:column/rr:template/rr:constant might signal that a fresh
>>>> blank node must be allocated for each row.
>>>>>>>>>>> 
>>>>>>>>>>>> Does that complicate things beyond how much a cardinality
>>>> requirement necessarily complicates things?
>>>>>>>>>>> 
>>>>>>>>>>> Well, the spec only needs to define the graph generated by the
>>>> mapping, so in terms of specification it would be a simple enough change.
>>>>>>>>>>> 
>>>>>>>>>>> The implications for implementers are quite significant though.
>>>> It's a new feature, the implementation costs are not trivial, no existing
>>>> implementation does this (AFAIK), so there's a certain amount of R&D
>>>> required to show that it's implementable.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Richard
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> -ericP
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> --
>>>>> -ericP
>>>>> 
>>>> 
>>>> 
>> 
>> -- 
>> -ericP
>> 
> 
> 


----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Friday, 4 May 2012 14:10:23 UTC