Re: Brain teaser for non-PK tables from Richard Cyganiak on 2012-05-04 (public-rdb2rdf-wg@w3.org from May 2012)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Fri, 4 May 2012 01:27:11 +0100
To: Eric Prud'hommeaux <eric@w3.org>
Cc: ashok malhotra <ashok.malhotra@oracle.com>, Juan Sequeda <juanfederico@gmail.com>, Michael Hausenblas <michael.hausenblas@deri.org>, Ivan Herman <ivan@w3.org>, W3C RDB2RDF <public-rdb2rdf-wg@w3.org>
Message-Id: <0941947B-7641-4DE9-90A9-830AFCD130D6@cyganiak.de>
Hi Eric,

My short response is: The proposal is *optional*. You don't have to implement it. You don't have to use implementations that don't support it. It's just an extra sentence or two in the spec. There is clear guidance which option implementers should support. What harm is there in allowing the option?

You offered one argument against providing this optional feature, and that's the point about backwards compatibility. Future WGs may find it difficult to remove this option even if the option becomes obsolete due to a possible R2RML 1.1 update. I'll address this below.

On 3 May 2012, at 22:36, Eric Prud'hommeaux wrote:
> * ashok malhotra <ashok.malhotra@oracle.com> [2012-05-03 12:22-0700]
>> +1 for option 2.  Seems less onerous.   Eric?
> 
> It pains me that folks see me as obstructionist when I may well be
> saving us a 3rd LC. In June of 2006, Fred Zemke spotted a similar
> problem in the semantics of SPARQL wich took us six months to fix
> <http://www.w3.org/mid/4488B936.10705@oracle.com>.

The problem in SPARQL was that it specified that implementations MUST NOT use multiset semantics.

The proposal on our table is to RECOMMEND multiset semantics, but state that implementations MAY use set semantics for compatibility. This is not comparable to the SPARQL situation.

I also note that the 1st LC period and the CR period have passed without any comments on issues of cardinality.

> Speaking with Sam Madden, this seems like less of a corner case than
> we originally thought. He and Zemke asserted that while some base
> tables may have no uniques, it's more common for views materialized
> for performance to preserve only the information required to perform
> some aggregates. Before standardization of SQL, some relational DBs
> operated on sets, others on multisets, and some (Zemke worked on one
> called Britton Lee) preserved repeated rows until one did a
> sort. Customers, particularly those using views, had to be very
> careful in what order they performed various operations.

Well, I can see why customers wouldn't be so happy about this, but it's not quite the same thing here.

The order of query operations doesn't matter in the proposed design. SPARQL has multiset semantics, so even if you query a table with discarded duplicates, the query execution is with the usual well-defined SPARQL semantics. It's only in the mapping from non-PK tables to RDF graphs that cardinality is not maintained.

> Juan brought up fixing this in v1. It's easy for v1.1 to relax rigid
> constraints in v1.0, but most charters promise backward compatibility,
> so v1.1 can't impose restrictions not present in v1.0.

That all depends on what we write into the spec, doesn't it? The DM spec could state that the permission for discarding duplicate rows may be removed in a future version, provided that a future R2RML adds a way of preserving cardinality on no-PK tables.

> Another issue is the performance of very common queries. Under
> multiset semantics, any query which either reports the name of an
> unnamed row requires the complex dance that Richard and I discussed.

Yes, these queries are slow.

> OTOH, under set semantics, any query which simply restricts or
> projects some row attributes requires a distinct subselect, which is
> either memory intensive or requires a sort of the table.

Well, you forget about query optimization, see below.

> For example,
> a simple join to get the addresses of folks with year-old debts:
> 
>  SELECT ?name ?city
>   WHERE {
>     ?debt <IOUs#name> ?name ;
>           <IOUs#date> ?date ;
>           <IOUs#addr> ?addr .
>     ?addr <Addresses#city> ?city
>     FILTER (?date < "2011-05-03"^^xsd:date)
>   }
> 
> multiset SQL translation:
>  SELECT name, city
>    FROM IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
>   WHERE date < "2011-05-03"
> 
> set SQL translation:
>  SELECT name, city
>    FROM (
>      SELECT DISTINCT name, date, addr, attr4, attr5
>        FROM IOUs
>       ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
>   WHERE date < "2011-05-03"

Not having thought about this too hard, the second query doesn't seem particularly bad. Isn't it equivalent to this?

 SELECT name, city
   FROM (
     SELECT DISTINCT name, date, addr, attr4, attr5
       FROM IOUs
      WHERE date < "2011-05-03"
      ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID

So the duplicate removal is only necessary over the subset of the table that is actually being returned in the end. The INNER JOIN can also be moved inside the DISTINCT, I think. The DISTINCT should then be O(n log n) where n is the number of result rows, which isn't too bad.

IIRC, DISTINCT can be moved up in the algebra tree over most other operations, except for projections (which can usually be done last without much performance impact), aggregates (which require more memory than DISTINCT anyways) and LIMIT (which also limits the memory required for DISTINCT).

D2RQ is fairly smart about moving DISTINCTs around before generating the final SQL query. I'd expect that most decent query optimizers are even smarter than what we do.

> One could make a pretty good case for preserving the intuitive and
> efficient query mapping for such common queries.

1. For many of these common queries, the DISTINCT is done on a reduced intermediate result, or even on the final result set, and not on the input data. So it's not that bad.

2. The strange contortions required for returning subjects may well reverse the argument here. You make unproven assumptions about what queries are common.

3. Again, the proposal is *not* to abandon the cardinality-preserving query mapping. The proposal is to allow another query mapping as well, for compatibility.

Best,
Richard



> 
> 
>> All the best, Ashok
>> 
>> On 5/3/2012 12:10 PM, Juan Sequeda wrote:
>>> 
>>> 
>>> On Thu, May 3, 2012 at 2:01 PM, Richard Cyganiak <richard@cyganiak.de <mailto:richard@cyganiak.de>> wrote:
>>> 
>>>   On 3 May 2012, at 17:11, Juan Sequeda wrote:
>>>> Do you accept eric's proposal (which hasn't been stated yet):
>>>> 
>>>> 1) Leave DM as-is
>>>> 2) Add the following to R2RML
>>>> 
>>>> rr:subjectMap [
>>>>    rr:termType rr:RowBlankNode
>>>>  ];
>>> 
>>>   (I'd prefer calling it rr:BlankNode. The absence of rr:column/rr:template/rr:constant indicates the new behaviour.)
>>> 
>>>   This is a new feature that was never discussed before. It's not just a tweak. No existing RDB2RDF mapping language has anything comparable. How to sensibly implement it, is a somewhat open question, AFAIK. Had this been proposed a few months ago, everyone would have said, “sounds like an R2RML 1.1 feature” and we would have postponed it without complaints.
>>> 
>>>   The problem at hand is the an incompatibility between two specs, let's call them A and B, in a corner case. Now given these choices:
>>> 
>>>   1) Add a new and somewhat risky feature to spec A, at a time when we thought we were just about to enter PR. Send all implementers of A back to the drawing board. Delay the WG for an indefinite amount of time, over a barely relevant corner case.
>>> 
>>>   2) Relax a constraint in spec B to say you SHOULD implement the “correct” behaviour for this corner case, but MAY also implement another not entirely unreasonable behaviour that is compatible with A as it is. Add some alarming language and say: “We expect future versions of A to remove this limitation.” No implementation changes. Go to PR in three weeks.
>>> 
>>>   To me, 2) makes a lot more sense than 1).
>>> 
>>> 
>>> I agree with Richard. Option 2 seems more reasonable at the moment.
>>> 
>>> We already have other issues to address for a R2RML and DM 1.1 version. This could be part of it. I'm not sure how this works in the standardization process, but as a group, we believe this particular issue is a corner case so it's not imperative to include it in the current version of the standard. However, if users complain about this corner case (we then realize that it isn't a corner case), we realize we were wrong from the beginning. I'm guessing this sometimes (usually?) happens in standards, right?
>>> 
>>> 
>>>   Best,
>>>   Richard
>>> 
>>> 
>>> 
>>>> 
>>>> 
>>>> Juan Sequeda
>>>> +1-575-SEQ-UEDA
>>>> www.juansequeda.com <http://www.juansequeda.com>
>>>> 
>>>> 
>>>> On Thu, May 3, 2012 at 11:08 AM, Michael Hausenblas <michael.hausenblas@deri.org <mailto:michael.hausenblas@deri.org>> wrote:
>>>> 
>>>>> Were we close to closing R2RML's CR?
>>>> 
>>>> This was the last issue, all other have been resolved in last weeks meeting (see also my comments when I sent out the minutes [1]). Never mind, we're not extending CR but entering a second, rather short LC period.
>>>> 
>>>> Ivan, can you prepare a respective PROPOSAL for next week's meeting please?
>>>> 
>>>> Cheers,
>>>>          Michael
>>>> 
>>>> [1] http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2012May/0005.html
>>>> 
>>>> --
>>>> Dr. Michael Hausenblas, Research Fellow
>>>> DERI - Digital Enterprise Research Institute
>>>> NUIG - National University of Ireland, Galway
>>>> Ireland, Europe
>>>> Tel.: +353 91 495730 <tel:%2B353%2091%20495730>
>>>> WebID: http://sw-app.org/mic.xhtml#i
>>>> 
>>>> On 3 May 2012, at 17:04, Eric Prud'hommeaux wrote:
>>>> 
>>>>> * Juan Sequeda <juanfederico@gmail.com <mailto:juanfederico@gmail.com>> [2012-05-03 10:50-0500]
>>>>>> Looks like we have to extend CR till
>>>>>> we have implementations for this corner case.
>>>>> 
>>>>> Were we close to closing R2RML's CR?
>>>>> 
>>>>> 
>>>>>> Juan Sequeda
>>>>>> www.juansequeda.com <http://www.juansequeda.com>
>>>>>> 
>>>>>> On May 3, 2012, at 10:42 AM, Richard Cyganiak <richard@cyganiak.de <mailto:richard@cyganiak.de>> wrote:
>>>>>> 
>>>>>>> On 3 May 2012, at 16:25, Eric Prud'hommeaux wrote:
>>>>>>>> presumes you can create tables, but yeah, conceptually easier query.
>>>>>>> 
>>>>>>> (It looks like most databases have a proprietary method of adding the indexes that doesn't require write access to the DB.)
>>>>>>> 
>>>>>>>> you can even push the symbol generation down:
>>>>>>> 
>>>>>>> Right.
>>>>>>> 
>>>>>>>>> The big remaining question is: How to handle this in R2RML?
>>>>>>>> 
>>>>>>>> Looking for an analog to:
>>>>>>>> rr:subjectMap [
>>>>>>>>     rr:column "ROWID";
>>>>>>>>     rr:termType rr:BlankNode
>>>>>>>>  ];
>>>>>>>> I'd propose:
>>>>>>>> rr:subjectMap [
>>>>>>>>     rr:termType rr:RowBlankNode
>>>>>>>>  ];
>>>>>>> 
>>>>>>> That's an option. Even keeping rr:BlankNode would work — the absence of an rr:column/rr:template/rr:constant might signal that a fresh blank node must be allocated for each row.
>>>>>>> 
>>>>>>>> Does that complicate things beyond how much a cardinality requirement necessarily complicates things?
>>>>>>> 
>>>>>>> Well, the spec only needs to define the graph generated by the mapping, so in terms of specification it would be a simple enough change.
>>>>>>> 
>>>>>>> The implications for implementers are quite significant though. It's a new feature, the implementation costs are not trivial, no existing implementation does this (AFAIK), so there's a certain amount of R&D required to show that it's implementable.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Richard
>>>>> 
>>>>> --
>>>>> -ericP
>>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
> 
> -- 
> -ericP
>
Received on Friday, 4 May 2012 00:27:43 UTC