Re: Brain teaser for non-PK tables from Juan Sequeda on 2012-05-04 (public-rdb2rdf-wg@w3.org from May 2012)

From: Juan Sequeda <juanfederico@gmail.com>
Date: Fri, 4 May 2012 08:10:37 -0500
To: Ivan Herman <ivan@w3.org>
Cc: "Eric Prud'hommeaux" <eric@w3.org>, Richard Cyganiak <richard@cyganiak.de>, ashok malhotra <ashok.malhotra@oracle.com>, Michael Hausenblas <michael.hausenblas@deri.org>, W3C RDB2RDF <public-rdb2rdf-wg@w3.org>
Message-ID: <CAMVTWDwY2s8NQhu5QJOFzKf9+4jGuYmV3hp6tyxqX11BJG1qyg@mail.gmail.com>
On Fri, May 4, 2012 at 8:05 AM, Ivan Herman <ivan@w3.org> wrote:

> Eric,
>
> this seems to be a bit drastic for my taste; I would not want to burn the
> bridges between the R2RML and the DM. The fact that these two are closely
> related, that, *in general*, the DM is a default case for R2RML is, I
> believe, a strong feature, a good 'story'. I would not want to lose that.
>
> However, we have to face that there *are* cases when things do not really
> fit. What about modifying the two documents as follows (note that point #2
> is not strictly necessary for the discussion at hand, but it makes the
> relationships even clearer and stronger):
>
> 1. In the DM, instead of "is intended to provide a default behavior for
> R2RML: RDB to RDF Mapping Language" say "is intended to provide a default
> behavior for R2RML: RDB to RDF Mapping Language for tables which have at
> least one unique key"
>

+1

>
> 2. Add to the R2RML document (probably in the intro part): "R2RML
> implementations are encouraged to provide a default mapping equivalent to
> the Direct Mapping for tables which have at least one unique key"
>

+1

>
> 3. Add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject
> labels are generated from column values, R2RML mappings do not preserve
> repeated rows in SQL databases."
>

+1
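
To make the effect described in point 3 concrete, here is a minimal Python sketch. It is not taken from either spec: the table contents, the label scheme, and the function names are all made up for illustration. It treats an RDF graph as a set of triples and compares a subject label built from column values with a fresh blank node per row.

```python
from itertools import count

# Hypothetical non-PK table with one repeated row (illustrative data only).
COLUMNS = ("name", "date", "addr")
IOUS_ROWS = [
    ("Bob", "2010-01-15", 18),
    ("Sue", "2011-02-01", 23),
    ("Bob", "2010-01-15", 18),  # exact duplicate of the first row
]


def triples_for_row(subject, row):
    """One (subject, predicate, object) triple per column of the row."""
    return {(subject, f"<IOUs#{col}>", value) for col, value in zip(COLUMNS, row)}


def map_with_value_labels(rows):
    """Subject label derived from column values: equal rows share a subject."""
    graph = set()  # an RDF graph is a set of triples
    for row in rows:
        label = "_:" + "_".join(str(v) for v in row)
        graph |= triples_for_row(label, row)
    return graph


def map_with_row_labels(rows):
    """Fresh blank node per row: repeated rows stay distinguishable."""
    graph = set()
    fresh = count(1)
    for row in rows:
        graph |= triples_for_row(f"_:row{next(fresh)}", row)
    return graph


def subjects(graph):
    """Distinct subjects appearing in the graph."""
    return {s for s, _, _ in graph}


value_graph = map_with_value_labels(IOUS_ROWS)
row_graph = map_with_row_labels(IOUS_ROWS)
print(len(subjects(value_graph)), len(subjects(row_graph)))
```

Under these assumptions the value-derived labels collapse the repeated row into one subject, while the per-row labels keep all three rows apart.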

>
> How does that sound?
>
> Ivan
>
> On May 4, 2012, at 13:43 , Eric Prud'hommeaux wrote:
>
> > * Juan Sequeda <juanfederico@gmail.com> [2012-05-03 20:04-0500]
> >> All,
> >>
> >> 1) Technically we could (and maybe should) add this to the standard
> >> (both DM and R2RML); however...
> >> 2) We only just became aware of the problem, and somebody (Eric/Richard)
> >> came up with A solution. The rest of the standard has been built on years
> >> of experience. If this problem came up only now, at the last minute, it
> >> means that nobody cared much about this before. That doesn't mean that
> >> they won't want it now. But it does mean that we should look into it in
> >> more detail, given that we know the issue exists. Down the road, we will
> >> know if it is feasible, etc.
> >
> > We could move along more quickly if we:
> >
> >  1. strike "is intended to provide a default behavior for R2RML: RDB
> >     to RDF Mapping Language" from DM
> >
> >  2. add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject
> >     labels are generated from column values, R2RML mappings do not
> >     preserve repeated rows in SQL databases."
> >
> > Adding a per-row blank node identifier in v1.1 will be completely
> > backward-compatible with v1.0.
> >
> >
> >> Juan Sequeda
> >> +1-575-SEQ-UEDA
> >> www.juansequeda.com
> >>
> >>
> >> On Thu, May 3, 2012 at 7:27 PM, Richard Cyganiak <richard@cyganiak.de> wrote:
> >>
> >>> Hi Eric,
> >>>
> >>> My short response is: The proposal is *optional*. You don't have to
> >>> implement it. You don't have to use implementations that don't support
> >>> it. It's just an extra sentence or two in the spec. There is clear
> >>> guidance which option implementers should support. What harm is there
> >>> in allowing the option?
> >>>
> >>> You offered one argument against providing this optional feature, and
> >>> that's the point about backwards compatibility. Future WGs may find it
> >>> difficult to remove this option even if the option becomes obsolete
> >>> due to a possible R2RML 1.1 update. I'll address this below.
> >>>
> >>> On 3 May 2012, at 22:36, Eric Prud'hommeaux wrote:
> >>>> * ashok malhotra <ashok.malhotra@oracle.com> [2012-05-03 12:22-0700]
> >>>>> +1 for option 2.  Seems less onerous.   Eric?
> >>>>
> >>>> It pains me that folks see me as obstructionist when I may well be
> >>>> saving us a 3rd LC. In June of 2006, Fred Zemke spotted a similar
> >>>> problem in the semantics of SPARQL which took us six months to fix
> >>>> <http://www.w3.org/mid/4488B936.10705@oracle.com>.
> >>>
> >>> The problem in SPARQL was that it specified that implementations MUST
> >>> NOT use multiset semantics.
> >>>
> >>> The proposal on our table is to RECOMMEND multiset semantics, but state
> >>> that implementations MAY use set semantics for compatibility. This is
> >>> not comparable to the SPARQL situation.
> >>>
> >>> I also note that the 1st LC period and the CR period have passed
> >>> without any comments on issues of cardinality.
> >>>
> >>>> Speaking with Sam Madden, this seems like less of a corner case than
> >>>> we originally thought. He and Zemke asserted that while some base
> >>>> tables may have no uniques, it's more common for views materialized
> >>>> for performance to preserve only the information required to perform
> >>>> some aggregates. Before standardization of SQL, some relational DBs
> >>>> operated on sets, others on multisets, and some (Zemke worked on one
> >>>> called Britton Lee) preserved repeated rows until one did a
> >>>> sort. Customers, particularly those using views, had to be very
> >>>> careful in what order they performed various operations.
> >>>
> >>> Well, I can see why customers wouldn't be so happy about this, but it's
> >>> not quite the same thing here.
> >>>
> >>> The order of query operations doesn't matter in the proposed design.
> >>> SPARQL has multiset semantics, so even if you query a table with
> >>> discarded duplicates, the query execution is with the usual well-defined
> >>> SPARQL semantics. It's only in the mapping from non-PK tables to RDF
> >>> graphs that cardinality is not maintained.
> >>>
> >>>> Juan brought up fixing this in v1. It's easy for v1.1 to relax rigid
> >>>> constraints in v1.0, but most charters promise backward compatibility,
> >>>> so v1.1 can't impose restrictions not present in v1.0.
> >>>
> >>> That all depends on what we write into the spec, doesn't it? The DM
> >>> spec could state that the permission for discarding duplicate rows may
> >>> be removed in a future version, provided that a future R2RML adds a way
> >>> of preserving cardinality on no-PK tables.
> >>>
> >>>> Another issue is the performance of very common queries. Under
> >>>> multiset semantics, any query which reports the name of an
> >>>> unnamed row requires the complex dance that Richard and I discussed.
> >>>
> >>> Yes, these queries are slow.
> >>>
> >>>> OTOH, under set semantics, any query which simply restricts or
> >>>> projects some row attributes requires a distinct subselect, which is
> >>>> either memory intensive or requires a sort of the table.
> >>>
> >>> Well, you forget about query optimization, see below.
> >>>
> >>>> For example,
> >>>> a simple join to get the addresses of folks with year-old debts:
> >>>>
> >>>> SELECT ?name ?city
> >>>>  WHERE {
> >>>>    ?debt <IOUs#name> ?name ;
> >>>>          <IOUs#date> ?date ;
> >>>>          <IOUs#addr> ?addr .
> >>>>    ?addr <Addresses#city> ?city
> >>>>    FILTER (?date < "2011-05-03"^^xsd:date)
> >>>>  }
> >>>>
> >>>> multiset SQL translation:
> >>>> SELECT name, city
> >>>>   FROM IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
> >>>>  WHERE date < '2011-05-03'
> >>>>
> >>>> set SQL translation:
> >>>> SELECT name, city
> >>>>   FROM (
> >>>>     SELECT DISTINCT name, date, addr, attr4, attr5
> >>>>       FROM IOUs
> >>>>      ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
> >>>>  WHERE date < '2011-05-03'
> >>>
> >>> Not having thought about this too hard, the second query doesn't seem
> >>> particularly bad. Isn't it equivalent to this?
> >>>
> >>> SELECT name, city
> >>>  FROM (
> >>>    SELECT DISTINCT name, date, addr, attr4, attr5
> >>>      FROM IOUs
> >>>      WHERE date < '2011-05-03'
> >>>      ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
> >>>
> >>> So the duplicate removal is only necessary over the subset of the table
> >>> that is actually being returned in the end. The INNER JOIN can also be
> >>> moved inside the DISTINCT, I think. The DISTINCT should then be
> >>> O(n log n) where n is the number of result rows, which isn't too bad.
> >>>
> >>> IIRC, DISTINCT can be moved up in the algebra tree over most other
> >>> operations, except for projections (which can usually be done last
> >>> without much performance impact), aggregates (which require more memory
> >>> than DISTINCT anyways) and LIMIT (which also limits the memory required
> >>> for DISTINCT).
> >>>
> >>> D2RQ is fairly smart about moving DISTINCTs around before generating
> >>> the final SQL query. I'd expect that most decent query optimizers are
> >>> even smarter than what we do.
> >>>
> >>>> One could make a pretty good case for preserving the intuitive and
> >>>> efficient query mapping for such common queries.
> >>>
> >>> 1. For many of these common queries, the DISTINCT is done on a reduced
> >>> intermediate result, or even on the final result set, and not on the
> >>> input data. So it's not that bad.
> >>>
> >>> 2. The strange contortions required for returning subjects may well
> >>> reverse the argument here. You make unproven assumptions about what
> >>> queries are common.
> >>>
> >>> 3. Again, the proposal is *not* to abandon the cardinality-preserving
> >>> query mapping. The proposal is to allow another query mapping as well,
> >>> for compatibility.
> >>>
> >>> Best,
> >>> Richard
> >>>
> >>>
> >>>
> >>>>
> >>>>
> >>>>> All the best, Ashok
> >>>>>
> >>>>> On 5/3/2012 12:10 PM, Juan Sequeda wrote:
> >>>>>>
> >>>>>>
> >>>>>> On Thu, May 3, 2012 at 2:01 PM, Richard Cyganiak <richard@cyganiak.de> wrote:
> >>>>>>
> >>>>>>  On 3 May 2012, at 17:11, Juan Sequeda wrote:
> >>>>>>> Do you accept Eric's proposal (which hasn't been stated yet):
> >>>>>>>
> >>>>>>> 1) Leave DM as-is
> >>>>>>> 2) Add the following to R2RML
> >>>>>>>
> >>>>>>> rr:subjectMap [
> >>>>>>>   rr:termType rr:RowBlankNode
> >>>>>>> ];
> >>>>>>
> >>>>>>  (I'd prefer calling it rr:BlankNode. The absence of
> >>>>>>  rr:column/rr:template/rr:constant indicates the new behaviour.)
> >>>>>>
> >>>>>>  This is a new feature that was never discussed before. It's not just
> >>>>>>  a tweak. No existing RDB2RDF mapping language has anything comparable.
> >>>>>>  How to sensibly implement it is a somewhat open question, AFAIK. Had
> >>>>>>  this been proposed a few months ago, everyone would have said, “sounds
> >>>>>>  like an R2RML 1.1 feature” and we would have postponed it without
> >>>>>>  complaints.
> >>>>>>
> >>>>>>  The problem at hand is an incompatibility between two specs, let's
> >>>>>>  call them A and B, in a corner case. Now given these choices:
> >>>>>>
> >>>>>>  1) Add a new and somewhat risky feature to spec A, at a time when we
> >>>>>>  thought we were just about to enter PR. Send all implementers of A
> >>>>>>  back to the drawing board. Delay the WG for an indefinite amount of
> >>>>>>  time, over a barely relevant corner case.
> >>>>>>
> >>>>>>  2) Relax a constraint in spec B to say you SHOULD implement the
> >>>>>>  “correct” behaviour for this corner case, but MAY also implement
> >>>>>>  another not entirely unreasonable behaviour that is compatible with A
> >>>>>>  as it is. Add some alarming language and say: “We expect future
> >>>>>>  versions of A to remove this limitation.” No implementation changes.
> >>>>>>  Go to PR in three weeks.
> >>>>>>
> >>>>>>  To me, 2) makes a lot more sense than 1).
> >>>>>>
> >>>>>>
> >>>>>> I agree with Richard. Option 2 seems more reasonable at the moment.
> >>>>>>
> >>>>>> We already have other issues to address for an R2RML and DM 1.1
> >>>>>> version. This could be part of it. I'm not sure how this works in the
> >>>>>> standardization process, but as a group, we believe this particular
> >>>>>> issue is a corner case, so it's not imperative to include it in the
> >>>>>> current version of the standard. However, if users complain about this
> >>>>>> corner case (and we then realize that it isn't a corner case), we will
> >>>>>> know we were wrong from the beginning. I'm guessing this sometimes
> >>>>>> (usually?) happens in standards, right?
> >>>>>>
> >>>>>>
> >>>>>>  Best,
> >>>>>>  Richard
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Juan Sequeda
> >>>>>>> +1-575-SEQ-UEDA
> >>>>>>> www.juansequeda.com <http://www.juansequeda.com>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, May 3, 2012 at 11:08 AM, Michael Hausenblas <michael.hausenblas@deri.org> wrote:
> >>>>>>>
> >>>>>>>> Were we close to closing R2RML's CR?
> >>>>>>>
> >>>>>>> This was the last issue; all others have been resolved in last week's
> >>>>>>> meeting (see also my comments when I sent out the minutes [1]). Never
> >>>>>>> mind, we're not extending CR but entering a second, rather short LC
> >>>>>>> period.
> >>>>>>>
> >>>>>>> Ivan, can you prepare a respective PROPOSAL for next week's meeting
> >>>>>>> please?
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>         Michael
> >>>>>>>
> >>>>>>> [1] http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2012May/0005.html
> >>>>>>>
> >>>>>>> --
> >>>>>>> Dr. Michael Hausenblas, Research Fellow
> >>>>>>> DERI - Digital Enterprise Research Institute
> >>>>>>> NUIG - National University of Ireland, Galway
> >>>>>>> Ireland, Europe
> >>>>>>> Tel.: +353 91 495730
> >>>>>>> WebID: http://sw-app.org/mic.xhtml#i
> >>>>>>>
> >>>>>>> On 3 May 2012, at 17:04, Eric Prud'hommeaux wrote:
> >>>>>>>
> >>>>>>>> * Juan Sequeda <juanfederico@gmail.com> [2012-05-03 10:50-0500]
> >>>>>>>>> Looks like we have to extend CR till
> >>>>>>>>> we have implementations for this corner case.
> >>>>>>>>
> >>>>>>>> Were we close to closing R2RML's CR?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Juan Sequeda
> >>>>>>>>> www.juansequeda.com <http://www.juansequeda.com>
> >>>>>>>>>
> >>>>>>>>> On May 3, 2012, at 10:42 AM, Richard Cyganiak <richard@cyganiak.de> wrote:
> >>>>>>>>>
> >>>>>>>>>> On 3 May 2012, at 16:25, Eric Prud'hommeaux wrote:
> >>>>>>>>>>> presumes you can create tables, but yeah, conceptually easier query.
> >>>>>>>>>>
> >>>>>>>>>> (It looks like most databases have a proprietary method of adding
> >>>>>>>>>> the indexes that doesn't require write access to the DB.)
> >>>>>>>>>>
> >>>>>>>>>>> you can even push the symbol generation down:
> >>>>>>>>>>
> >>>>>>>>>> Right.
> >>>>>>>>>>
> >>>>>>>>>>>> The big remaining question is: How to handle this in R2RML?
> >>>>>>>>>>>
> >>>>>>>>>>> Looking for an analog to:
> >>>>>>>>>>> rr:subjectMap [
> >>>>>>>>>>>    rr:column "ROWID";
> >>>>>>>>>>>    rr:termType rr:BlankNode
> >>>>>>>>>>> ];
> >>>>>>>>>>> I'd propose:
> >>>>>>>>>>> rr:subjectMap [
> >>>>>>>>>>>    rr:termType rr:RowBlankNode
> >>>>>>>>>>> ];
> >>>>>>>>>>
> >>>>>>>>>> That's an option. Even keeping rr:BlankNode would work — the
> >>>>>>>>>> absence of an rr:column/rr:template/rr:constant might signal that
> >>>>>>>>>> a fresh blank node must be allocated for each row.
> >>>>>>>>>>
> >>>>>>>>>>> Does that complicate things beyond how much a cardinality
> >>>>>>>>>>> requirement necessarily complicates things?
> >>>>>>>>>>
> >>>>>>>>>> Well, the spec only needs to define the graph generated by the
> >>>>>>>>>> mapping, so in terms of specification it would be a simple enough
> >>>>>>>>>> change.
> >>>>>>>>>>
> >>>>>>>>>> The implications for implementers are quite significant though.
> >>>>>>>>>> It's a new feature, the implementation costs are not trivial, no
> >>>>>>>>>> existing implementation does this (AFAIK), so there's a certain
> >>>>>>>>>> amount of R&D required to show that it's implementable.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Richard
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> -ericP
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>> --
> >>>> -ericP
> >>>>
> >>>
> >>>
> >
> > --
> > -ericP
> >
>
>
> ----
> Ivan Herman, W3C Semantic Web Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> FOAF: http://www.ivan-herman.net/foaf.rdf
>
>
>
>
>
>
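
As a quick sanity check of the SQL rewrite quoted above (moving the date filter inside the DISTINCT subselect), here is a small sketch using Python's built-in sqlite3 module. The table contents are invented, the subselect alias `i` is my own choice, and SQLite stands in for whatever engine a real implementation would target, so treat it as an illustration under those assumptions rather than a proof.

```python
import sqlite3

# In-memory database with made-up rows; IOUs has no key and one repeated row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Addresses (ID INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE IOUs (name TEXT, date TEXT, addr INTEGER, attr4 TEXT, attr5 TEXT);
    INSERT INTO Addresses VALUES (18, 'Cambridge'), (23, 'Galway');
    INSERT INTO IOUs VALUES
        ('Bob', '2010-01-15', 18, 'x', 'y'),
        ('Bob', '2010-01-15', 18, 'x', 'y'),
        ('Sue', '2011-02-01', 23, 'x', 'y'),
        ('Ann', '2012-01-01', 23, 'x', 'y');
""")

# Multiset translation: the repeated IOUs row shows up twice in the result.
MULTISET = """
SELECT name, city
  FROM IOUs INNER JOIN Addresses ON IOUs.addr = Addresses.ID
 WHERE date < '2011-05-03'
"""

# Set translation with the filter applied after the DISTINCT subselect.
FILTER_OUTSIDE = """
SELECT name, city
  FROM (SELECT DISTINCT name, date, addr, attr4, attr5 FROM IOUs) AS i
       INNER JOIN Addresses ON i.addr = Addresses.ID
 WHERE date < '2011-05-03'
"""

# Same query with the filter pushed inside the DISTINCT subselect.
FILTER_INSIDE = """
SELECT name, city
  FROM (SELECT DISTINCT name, date, addr, attr4, attr5
          FROM IOUs
         WHERE date < '2011-05-03') AS i
       INNER JOIN Addresses ON i.addr = Addresses.ID
"""


def run(sql):
    """Execute a query and return its rows in a stable (sorted) order."""
    return sorted(conn.execute(sql).fetchall())


print(run(MULTISET))
print(run(FILTER_OUTSIDE))
print(run(FILTER_INSIDE))
```

On this sample data the two set translations return the same rows, and the multiset translation returns one extra row for the repeated IOU.
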
Received on Friday, 4 May 2012 13:11:30 UTC