- From: Ivan Herman <ivan@w3.org>
- Date: Fri, 4 May 2012 16:05:27 +0200
- To: Souripriya Das <SOURIPRIYA.DAS@oracle.com>
- Cc: <ashok.malhotra@oracle.com>, <public-rdb2rdf-wg@w3.org>, <juanfederico@gmail.com>, <richard@cyganiak.de>, <michael.hausenblas@deri.org>, <eric@w3.org>
I defer, on that detail, to those of you who know these things better:-)
Ivan
On May 4, 2012, at 15:32 , Souripriya Das wrote:
> Since some commercial DBs allow NULL values in a unique key,
> please consider replacing
> "tables which have at least one unique key"
> with
> "tables which have at least one non-NULL unique key"
>
> Thanks,
> - Souri.
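
As an illustration of Souri's point, here is a minimal sketch that assumes SQLite (through Python's standard sqlite3 module) as a stand-in for the commercial databases mentioned. The table `t` and its columns `k` and `v` are hypothetical. SQLite treats NULLs as distinct under a UNIQUE constraint, so a nullable unique key does not guarantee that rows can be told apart:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: k is declared UNIQUE but is nullable.
conn.execute("CREATE TABLE t (k INTEGER UNIQUE, v TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(None, "a"), (None, "a"), (1, "b")],
)

# Both NULL-keyed rows were accepted, and they are identical.
null_key_rows = conn.execute(
    "SELECT COUNT(*) FROM t WHERE k IS NULL"
).fetchone()[0]
print(null_key_rows)  # 2

# A repeated non-NULL key value is still rejected.
try:
    conn.execute("INSERT INTO t VALUES (1, 'c')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```

Other engines may handle NULLs in unique keys differently, so this only shows that the situation can arise, not how every database behaves.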
>
> ----- Original Message -----
> From: ivan@w3.org
> To: juanfederico@gmail.com
> Cc: eric@w3.org, richard@cyganiak.de, ashok.malhotra@oracle.com, michael.hausenblas@deri.org, public-rdb2rdf-wg@w3.org
> Sent: Friday, May 4, 2012 9:22:32 AM GMT -05:00 US/Canada Eastern
> Subject: Re: Brain teaser for non-PK tables
>
>
> On May 4, 2012, at 15:18 , Juan Sequeda wrote:
>
>> This means that we would leave the DM as-is, right?
>
> On the technical side, yes. These changes are clarifications/editorial.
>
> Ivan
>
>>
>>
>> Juan Sequeda
>> +1-575-SEQ-UEDA
>> www.juansequeda.com
>>
>>
>> On Fri, May 4, 2012 at 8:15 AM, Eric Prud'hommeaux <eric@w3.org> wrote:
>> * Juan Sequeda <juanfederico@gmail.com> [2012-05-04 08:10-0500]
>>> On Fri, May 4, 2012 at 8:05 AM, Ivan Herman <ivan@w3.org> wrote:
>>>
>>>> Eric,
>>>>
>>>> this seems to be a bit drastic for my taste; I would not want to burn the
>>>> bridges between the R2RML and the DM. The fact that these two are closely
>>>> related, that, *in general*, the DM is a default case for R2RML is, I
>>>> believe, a strong feature, a good 'story'. I would not want to lose that.
>>>>
>>>> However, we have to face that there *are* cases when things do not really
>>>> fit. What about modifying the two documents as follows (note that point #2
>>>> is not strictly necessary for the discussion at hand, but it makes the
>>>> relationships even clearer and stronger):
>>>>
>>>> 1. In the DM, instead of "is intended to provide a default behavior for
>>>> R2RML: RDB to RDF Mapping Language" say "is intended to provide a default
>>>> behavior for R2RML: RDB to RDF Mapping Language for tables which have at
>>>> least one unique key"
>>>>
>>>
>>> +1
>>
>> +1
>>
>>>> 2. Add to the R2RML document (probably in the intro part): "R2RML
>>>> implementations are encouraged to provide a default mapping equivalent to
>>>> the Direct Mapping for tables which have at least one unique key"
>>>>
>>>
>>> +1
>>
>> +1
>>
>>>> 3. Add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject
>>>> labels are generated from column values, R2RML mappings do not preserve
>>>> repeated rows in SQL databases."
>>>>
>>>
>>> +1
>>
>> +1
>>
>>>> How does that sound?
>>>>
>>>> Ivan
>>>>
>>>> On May 4, 2012, at 13:43 , Eric Prud'hommeaux wrote:
>>>>
>>>>> * Juan Sequeda <juanfederico@gmail.com> [2012-05-03 20:04-0500]
>>>>>> All,
>>>>>>
>>>>>> 1) Technically we could (and maybe should) add this to the standard (both
>>>>>> DM and R2RML), however...
>>>>>> 2) We only just became aware of the problem, and somebody (Eric/Richard)
>>>>>> came up with A solution. The rest of the standard has been built on years of
>>>>>> experience. If this problem came up only now, at the last minute, it
>>>>>> means that nobody cared much about it before. That doesn't mean that they
>>>>>> won't want it now. But it does mean that we should look into it in more
>>>>>> detail, given that we know the issue exists. Down the road, we will know
>>>>>> whether it is feasible, etc.
>>>>>
>>>>> We could move along more quickly if we:
>>>>>
>>>>> 1. strike "is intended to provide a default behavior for R2RML: RDB
>>>>> to RDF Mapping Language" from DM
>>>>>
>>>>> 2. add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject
>>>>> labels are generated from column values, R2RML mappings do not
>>>>> preserve repeated rows in SQL databases."
>>>>>
>>>>> Adding a per-row blank node identifier in v1.1 will be completely
>>>>> backward-compatible with v1.0.
>>>>>
>>>>>
>>>>>> Juan Sequeda
>>>>>> +1-575-SEQ-UEDA
>>>>>> www.juansequeda.com
>>>>>>
>>>>>>
>>>>>> On Thu, May 3, 2012 at 7:27 PM, Richard Cyganiak <richard@cyganiak.de> wrote:
>>>>>>
>>>>>>> Hi Eric,
>>>>>>>
>>>>>>> My short response is: The proposal is *optional*. You don't have to
>>>>>>> implement it. You don't have to use implementations that don't support it.
>>>>>>> It's just an extra sentence or two in the spec. There is clear guidance
>>>>>>> which option implementers should support. What harm is there in allowing
>>>>>>> the option?
>>>>>>>
>>>>>>> You offered one argument against providing this optional feature, and
>>>>>>> that's the point about backwards compatibility. Future WGs may find it
>>>>>>> difficult to remove this option even if the option becomes obsolete due to
>>>>>>> a possible R2RML 1.1 update. I'll address this below.
>>>>>>>
>>>>>>> On 3 May 2012, at 22:36, Eric Prud'hommeaux wrote:
>>>>>>>> * ashok malhotra <ashok.malhotra@oracle.com> [2012-05-03 12:22-0700]
>>>>>>>>> +1 for option 2. Seems less onerous. Eric?
>>>>>>>>
>>>>>>>> It pains me that folks see me as obstructionist when I may well be
>>>>>>>> saving us a 3rd LC. In June of 2006, Fred Zemke spotted a similar
>>>>>>>> problem in the semantics of SPARQL which took us six months to fix
>>>>>>>> <http://www.w3.org/mid/4488B936.10705@oracle.com>.
>>>>>>>
>>>>>>> The problem in SPARQL was that it specified that implementations MUST NOT
>>>>>>> use multiset semantics.
>>>>>>>
>>>>>>> The proposal on our table is to RECOMMEND multiset semantics, but state
>>>>>>> that implementations MAY use set semantics for compatibility. This is not
>>>>>>> comparable to the SPARQL situation.
>>>>>>>
>>>>>>> I also note that the 1st LC period and the CR period have passed without
>>>>>>> any comments on issues of cardinality.
>>>>>>>
>>>>>>>> Speaking with Sam Madden, this seems like less of a corner case than
>>>>>>>> we originally thought. He and Zemke asserted that while some base
>>>>>>>> tables may have no uniques, it's more common for views materialized
>>>>>>>> for performance to preserve only the information required to perform
>>>>>>>> some aggregates. Before standardization of SQL, some relational DBs
>>>>>>>> operated on sets, others on multisets, and some (Zemke worked on one
>>>>>>>> called Britton Lee) preserved repeated rows until one did a
>>>>>>>> sort. Customers, particularly those using views, had to be very
>>>>>>>> careful in what order they performed various operations.
>>>>>>>
>>>>>>> Well, I can see why customers wouldn't be so happy about this, but it's
>>>>>>> not quite the same thing here.
>>>>>>>
>>>>>>> The order of query operations doesn't matter in the proposed design.
>>>>>>> SPARQL has multiset semantics, so even if you query a table with discarded
>>>>>>> duplicates, the query is executed with the usual well-defined SPARQL
>>>>>>> semantics. It's only in the mapping from non-PK tables to RDF graphs that
>>>>>>> cardinality is not maintained.
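
To make the loss of cardinality concrete, here is a minimal sketch in plain Python. It models an RDF graph as a set of triples and is not an R2RML implementation. The sample rows, the `IOUs#` predicate prefix, and the blank node label formats are assumptions made up for illustration. When the subject label is derived from column values, two identical rows produce identical triples and collapse; a fresh blank node per row keeps them apart:

```python
from itertools import count

# Hypothetical no-PK table containing a repeated row.
rows = [
    {"name": "Bob", "addr": "18"},
    {"name": "Bob", "addr": "18"},
    {"name": "Sue", "addr": "23"},
]

def value_based_graph(rows):
    """Subject label built from column values: equal rows give equal triples."""
    graph = set()
    for row in rows:
        subject = "_:" + "_".join(f"{k}-{v}" for k, v in sorted(row.items()))
        for column, value in row.items():
            graph.add((subject, f"<IOUs#{column}>", value))
    return graph

def per_row_graph(rows):
    """Fresh blank node per row: repeated rows stay distinguishable."""
    graph = set()
    fresh = count(1)
    for row in rows:
        subject = f"_:row{next(fresh)}"
        for column, value in row.items():
            graph.add((subject, f"<IOUs#{column}>", value))
    return graph

def subjects(graph):
    """Distinct subjects appearing in the graph."""
    return {s for s, _, _ in graph}

print(len(subjects(value_based_graph(rows))))  # 2: the repeated row collapsed
print(len(subjects(per_row_graph(rows))))      # 3: one subject per row
```

Because the graph is a set, the collapse happens at mapping time, before any query runs. That matches the observation above that query execution itself is unaffected.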
>>>>>>>
>>>>>>>> Juan brought up fixing this in v1. It's easy for v1.1 to relax rigid
>>>>>>>> constraints in v1.0, but most charters promise backward compatibility,
>>>>>>>> so v1.1 can't impose restrictions not present in v1.0.
>>>>>>>
>>>>>>> That all depends on what we write into the spec, doesn't it? The DM spec
>>>>>>> could state that the permission for discarding duplicate rows may be
>>>>>>> removed in a future version, provided that a future R2RML adds a way of
>>>>>>> preserving cardinality on no-PK tables.
>>>>>>>
>>>>>>>> Another issue is the performance of very common queries. Under
>>>>>>>> multiset semantics, any query which reports the name of an
>>>>>>>> unnamed row requires the complex dance that Richard and I discussed.
>>>>>>>
>>>>>>> Yes, these queries are slow.
>>>>>>>
>>>>>>>> OTOH, under set semantics, any query which simply restricts or
>>>>>>>> projects some row attributes requires a distinct subselect, which is
>>>>>>>> either memory intensive or requires a sort of the table.
>>>>>>>
>>>>>>> Well, you forget about query optimization, see below.
>>>>>>>
>>>>>>>> For example,
>>>>>>>> a simple join to get the addresses of folks with year-old debts:
>>>>>>>>
>>>>>>>> SELECT ?name ?city
>>>>>>>> WHERE {
>>>>>>>>   ?debt <IOUs#name> ?name ;
>>>>>>>>         <IOUs#date> ?date ;
>>>>>>>>         <IOUs#addr> ?addr .
>>>>>>>>   ?addr <Addresses#city> ?city
>>>>>>>>   FILTER (?date < "2011-05-03"^^xsd:date)
>>>>>>>> }
>>>>>>>>
>>>>>>>> multiset SQL translation:
>>>>>>>> SELECT name, city
>>>>>>>>   FROM IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
>>>>>>>>  WHERE date < '2011-05-03'
>>>>>>>>
>>>>>>>> set SQL translation:
>>>>>>>> SELECT name, city
>>>>>>>>   FROM (
>>>>>>>>     SELECT DISTINCT name, date, addr, attr4, attr5
>>>>>>>>       FROM IOUs
>>>>>>>>   ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
>>>>>>>>  WHERE date < '2011-05-03'
>>>>>>>
>>>>>>> Not having thought about this too hard, the second query doesn't seem
>>>>>>> particularly bad. Isn't it equivalent to this?
>>>>>>>
>>>>>>> SELECT name, city
>>>>>>>   FROM (
>>>>>>>     SELECT DISTINCT name, date, addr, attr4, attr5
>>>>>>>       FROM IOUs
>>>>>>>      WHERE date < '2011-05-03'
>>>>>>>   ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
>>>>>>>
>>>>>>> So the duplicate removal is only necessary over the subset of the table
>>>>>>> that is actually being returned in the end. The INNER JOIN can also be
>>>>>>> moved inside the DISTINCT, I think. The DISTINCT should then be O(n log n)
>>>>>>> where n is the number of result rows, which isn't too bad.
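
The claimed equivalence can be checked on a toy database. This is a minimal sketch that assumes SQLite through Python's standard sqlite3 module. The table and column names follow the queries in the thread, while the sample data and the constant names `MULTISET`, `DISTINCT_THEN_FILTER` and `FILTER_THEN_DISTINCT` are made up for illustration. It runs the multiset translation and both set translations and compares the results:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Addresses (ID INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE IOUs (name TEXT, date TEXT, addr INTEGER, attr4 TEXT, attr5 TEXT);
    INSERT INTO Addresses VALUES (18, 'Cambridge'), (23, 'Galway');
    -- Hypothetical data: the first IOU is recorded twice.
    INSERT INTO IOUs VALUES
        ('Bob', '2010-01-15', 18, 'x', 'y'),
        ('Bob', '2010-01-15', 18, 'x', 'y'),
        ('Sue', '2012-02-01', 23, 'x', 'y');
""")

MULTISET = """
    SELECT name, city
      FROM IOUs INNER JOIN Addresses ON IOUs.addr = Addresses.ID
     WHERE date < '2011-05-03'
"""

DISTINCT_THEN_FILTER = """
    SELECT name, city
      FROM (SELECT DISTINCT name, date, addr, attr4, attr5 FROM IOUs) IOUs
           INNER JOIN Addresses ON IOUs.addr = Addresses.ID
     WHERE date < '2011-05-03'
"""

FILTER_THEN_DISTINCT = """
    SELECT name, city
      FROM (SELECT DISTINCT name, date, addr, attr4, attr5
              FROM IOUs
             WHERE date < '2011-05-03') IOUs
           INNER JOIN Addresses ON IOUs.addr = Addresses.ID
"""

def run(sql):
    """Execute a query and return its rows in a stable order."""
    return sorted(conn.execute(sql).fetchall())

print(run(MULTISET))              # [('Bob', 'Cambridge'), ('Bob', 'Cambridge')]
print(run(DISTINCT_THEN_FILTER))  # [('Bob', 'Cambridge')]
print(run(FILTER_THEN_DISTINCT))  # [('Bob', 'Cambridge')]
```

The two set translations agree because the filter column `date` is one of the DISTINCT columns, so filtering before or after duplicate removal keeps the same rows. The sketch says nothing about performance, which depends on the optimizer of the database in use.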
>>>>>>>
>>>>>>> IIRC, DISTINCT can be moved up in the algebra tree over most other
>>>>>>> operations, except for projections (which can usually be done last without
>>>>>>> much performance impact), aggregates (which require more memory than
>>>>>>> DISTINCT anyway) and LIMIT (which also limits the memory required for
>>>>>>> DISTINCT).
>>>>>>>
>>>>>>> D2RQ is fairly smart about moving DISTINCTs around before generating the
>>>>>>> final SQL query. I'd expect that most decent query optimizers are even
>>>>>>> smarter than what we do.
>>>>>>>
>>>>>>>> One could make a pretty good case for preserving the intuitive and
>>>>>>>> efficient query mapping for such common queries.
>>>>>>>
>>>>>>> 1. For many of these common queries, the DISTINCT is done on a reduced
>>>>>>> intermediate result, or even on the final result set, and not on the input
>>>>>>> data. So it's not that bad.
>>>>>>>
>>>>>>> 2. The strange contortions required for returning subjects may well
>>>>>>> reverse the argument here. You make unproven assumptions about what queries
>>>>>>> are common.
>>>>>>>
>>>>>>> 3. Again, the proposal is *not* to abandon the cardinality-preserving
>>>>>>> query mapping. The proposal is to allow another query mapping as well, for
>>>>>>> compatibility.
>>>>>>>
>>>>>>> Best,
>>>>>>> Richard
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> All the best, Ashok
>>>>>>>>>
>>>>>>>>> On 5/3/2012 12:10 PM, Juan Sequeda wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, May 3, 2012 at 2:01 PM, Richard Cyganiak <richard@cyganiak.de> wrote:
>>>>>>>>>>
>>>>>>>>>> On 3 May 2012, at 17:11, Juan Sequeda wrote:
>>>>>>>>>>> Do you accept Eric's proposal (which hasn't been stated yet):
>>>>>>>>>>>
>>>>>>>>>>> 1) Leave DM as-is
>>>>>>>>>>> 2) Add the following to R2RML
>>>>>>>>>>>
>>>>>>>>>>> rr:subjectMap [
>>>>>>>>>>> rr:termType rr:RowBlankNode
>>>>>>>>>>> ];
>>>>>>>>>>
>>>>>>>>>> (I'd prefer calling it rr:BlankNode. The absence of
>>>>>>>>>> rr:column/rr:template/rr:constant indicates the new behaviour.)
>>>>>>>>>>
>>>>>>>>>> This is a new feature that was never discussed before. It's not just
>>>>>>>>>> a tweak. No existing RDB2RDF mapping language has anything comparable. How
>>>>>>>>>> to sensibly implement it is a somewhat open question, AFAIK. Had this been
>>>>>>>>>> proposed a few months ago, everyone would have said, “sounds like an R2RML
>>>>>>>>>> 1.1 feature” and we would have postponed it without complaints.
>>>>>>>>>>
>>>>>>>>>> The problem at hand is an incompatibility between two specs,
>>>>>>>>>> let's call them A and B, in a corner case. Now given these choices:
>>>>>>>>>>
>>>>>>>>>> 1) Add a new and somewhat risky feature to spec A, at a time when we
>>>>>>>>>> thought we were just about to enter PR. Send all implementers of A back to
>>>>>>>>>> the drawing board. Delay the WG for an indefinite amount of time, over a
>>>>>>>>>> barely relevant corner case.
>>>>>>>>>>
>>>>>>>>>> 2) Relax a constraint in spec B to say you SHOULD implement the
>>>>>>>>>> “correct” behaviour for this corner case, but MAY also implement another
>>>>>>>>>> not entirely unreasonable behaviour that is compatible with A as it is. Add
>>>>>>>>>> some alarming language and say: “We expect future versions of A to remove
>>>>>>>>>> this limitation.” No implementation changes. Go to PR in three weeks.
>>>>>>>>>>
>>>>>>>>>> To me, 2) makes a lot more sense than 1).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I agree with Richard. Option 2 seems more reasonable at the moment.
>>>>>>>>>>
>>>>>>>>>> We already have other issues to address for an R2RML and DM 1.1
>>>>>>>>>> version. This could be part of it. I'm not sure how this works in the
>>>>>>>>>> standardization process, but as a group, we believe this particular issue
>>>>>>>>>> is a corner case, so it's not imperative to include it in the current
>>>>>>>>>> version of the standard. However, if users complain about this corner case
>>>>>>>>>> (and we then realize that it isn't a corner case), we will know we were
>>>>>>>>>> wrong from the beginning. I'm guessing this sometimes (usually?) happens in
>>>>>>>>>> standards, right?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Richard
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Juan Sequeda
>>>>>>>>>>> +1-575-SEQ-UEDA
>>>>>>>>>>> www.juansequeda.com
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 3, 2012 at 11:08 AM, Michael Hausenblas <michael.hausenblas@deri.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Were we close to closing R2RML's CR?
>>>>>>>>>>>
>>>>>>>>>>> This was the last issue; all others have been resolved in last week's
>>>>>>>>>>> meeting (see also my comments when I sent out the minutes [1]). Never mind,
>>>>>>>>>>> we're not extending CR but entering a second, rather short LC period.
>>>>>>>>>>>
>>>>>>>>>>> Ivan, can you prepare a respective PROPOSAL for next week's meeting,
>>>>>>>>>>> please?
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Michael
>>>>>>>>>>>
>>>>>>>>>>> [1] http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2012May/0005.html
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Dr. Michael Hausenblas, Research Fellow
>>>>>>>>>>> DERI - Digital Enterprise Research Institute
>>>>>>>>>>> NUIG - National University of Ireland, Galway
>>>>>>>>>>> Ireland, Europe
>>>>>>>>>>> Tel.: +353 91 495730
>>>>>>>>>>> WebID: http://sw-app.org/mic.xhtml#i
>>>>>>>>>>>
>>>>>>>>>>> On 3 May 2012, at 17:04, Eric Prud'hommeaux wrote:
>>>>>>>>>>>
>>>>>>>>>>>> * Juan Sequeda <juanfederico@gmail.com> [2012-05-03 10:50-0500]
>>>>>>>>>>>>> Looks like we have to extend CR till
>>>>>>>>>>>>> we have implementations for this corner case.
>>>>>>>>>>>>
>>>>>>>>>>>> Were we close to closing R2RML's CR?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Juan Sequeda
>>>>>>>>>>>>> www.juansequeda.com
>>>>>>>>>>>>>
>>>>>>>>>>>>> On May 3, 2012, at 10:42 AM, Richard Cyganiak <richard@cyganiak.de> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 3 May 2012, at 16:25, Eric Prud'hommeaux wrote:
>>>>>>>>>>>>>>> presumes you can create tables, but yeah, conceptually easier query.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (It looks like most databases have a proprietary method of adding
>>>>>>>>>>>>>> the indexes that doesn't require write access to the DB.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> you can even push the symbol generation down:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Right.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The big remaining question is: How to handle this in R2RML?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking for an analog to:
>>>>>>>>>>>>>>> rr:subjectMap [
>>>>>>>>>>>>>>> rr:column "ROWID";
>>>>>>>>>>>>>>> rr:termType rr:BlankNode
>>>>>>>>>>>>>>> ];
>>>>>>>>>>>>>>> I'd propose:
>>>>>>>>>>>>>>> rr:subjectMap [
>>>>>>>>>>>>>>> rr:termType rr:RowBlankNode
>>>>>>>>>>>>>>> ];
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That's an option. Even keeping rr:BlankNode would work: the
>>>>>>>>>>>>>> absence of an rr:column/rr:template/rr:constant might signal that a fresh
>>>>>>>>>>>>>> blank node must be allocated for each row.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Does that complicate things beyond how much a cardinality
>>>>>>>>>>>>>>> requirement necessarily complicates things?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Well, the spec only needs to define the graph generated by the
>>>>>>>>>>>>>> mapping, so in terms of specification it would be a simple enough change.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The implications for implementers are quite significant though.
>>>>>>>>>>>>>> It's a new feature, the implementation costs are not trivial, no existing
>>>>>>>>>>>>>> implementation does this (AFAIK), so there's a certain amount of R&D
>>>>>>>>>>>>>> required to show that it's implementable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Richard
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> -ericP
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> -ericP
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> -ericP
>>>>>
>>>>
>>>>
>>>> ----
>>>> Ivan Herman, W3C Semantic Web Activity Lead
>>>> Home: http://www.w3.org/People/Ivan/
>>>> mobile: +31-641044153
>>>> FOAF: http://www.ivan-herman.net/foaf.rdf
>>>>
>>
>> --
>> -ericP
>>
>
>
> ----
> Ivan Herman, W3C Semantic Web Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> FOAF: http://www.ivan-herman.net/foaf.rdf
>
>
----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Friday, 4 May 2012 14:02:59 UTC