Re: Brain teaser for non-PK tables from Juan Sequeda on 2012-05-04 (public-rdb2rdf-wg@w3.org from May 2012)

From: Juan Sequeda <juanfederico@gmail.com>
Date: Fri, 4 May 2012 08:33:33 -0500
To: Souripriya Das <SOURIPRIYA.DAS@oracle.com>
Cc: ivan@w3.org, ashok.malhotra@oracle.com, public-rdb2rdf-wg@w3.org, richard@cyganiak.de, michael.hausenblas@deri.org, eric@w3.org
Message-ID: <CAMVTWDwKKtL0QMw_ttVB9GNz+CRaetC8V1XsuyCsCjPV5SBGVA@mail.gmail.com>
I would substitute unique key for primary key.

Juan Sequeda
+1-575-SEQ-UEDA
www.juansequeda.com


On Fri, May 4, 2012 at 8:32 AM, Souripriya Das <SOURIPRIYA.DAS@oracle.com> wrote:

> Since some commercial DBs allow NULL unique key,
> please consider replacing
> "tables which have at least one unique key"
> with
> "tables which have at least one non-NULL unique key"
>
> Thanks,
> - Souri.
>
> ----- Original Message -----
> From: ivan@w3.org
> To: juanfederico@gmail.com
> Cc: eric@w3.org, richard@cyganiak.de, ashok.malhotra@oracle.com,
> michael.hausenblas@deri.org, public-rdb2rdf-wg@w3.org
> Sent: Friday, May 4, 2012 9:22:32 AM GMT -05:00 US/Canada Eastern
> Subject: Re: Brain teaser for non-PK tables
>
>
> On May 4, 2012, at 15:18 , Juan Sequeda wrote:
>
> > This means that we would leave the DM as-is, right?
>
> On the technical side, yes. These changes are clarifications/editorial.
>
> Ivan
>
> >
> >
> > Juan Sequeda
> > +1-575-SEQ-UEDA
> > www.juansequeda.com
> >
> >
> > On Fri, May 4, 2012 at 8:15 AM, Eric Prud'hommeaux <eric@w3.org> wrote:
> > * Juan Sequeda <juanfederico@gmail.com> [2012-05-04 08:10-0500]
> > > On Fri, May 4, 2012 at 8:05 AM, Ivan Herman <ivan@w3.org> wrote:
> > >
> > > > Eric,
> > > >
> > > > > this seems to be a bit drastic for my taste; I would not want to burn the
> > > > > bridges between the R2RML and the DM. The fact that these two are closely
> > > > > related, that, *in general*, the DM is a default case for R2RML is, I
> > > > > believe, a strong feature, a good 'story'. I would not want to lose that.
> > > >
> > > > > However, we have to face that there *are* cases when things do not really
> > > > > fit. What about modifying the two documents as follows (note that point #2
> > > > > is not strictly necessary for the discussion at hand, but it makes the
> > > > > relationships even clearer and stronger):
> > > >
> > > > > 1. In the DM, instead of "is intended to provide a default behavior for
> > > > > R2RML: RDB to RDF Mapping Language" say "is intended to provide a default
> > > > > behavior for R2RML: RDB to RDF Mapping Language for tables which have at
> > > > > least one unique key"
> > > >
> > >
> > > +1
> >
> > +1
> >
> > > > > 2. Add to the R2RML document (probably in the intro part): "R2RML
> > > > > implementations are encouraged to provide a default mapping equivalent to
> > > > > the Direct Mapping for tables which have at least one unique key"
> > > >
> > >
> > > +1
> >
> > +1
> >
> > > > > 3. Add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject
> > > > > labels are generated from column values, R2RML mappings do not preserve
> > > > > repeated rows in SQL databases."
> > > >
> > >
> > > +1
> >
> > +1
> >
> > > > How does that sound?
> > > >
> > > > Ivan
> > > >
> > > > On May 4, 2012, at 13:43 , Eric Prud'hommeaux wrote:
> > > >
> > > > > * Juan Sequeda <juanfederico@gmail.com> [2012-05-03 20:04-0500]
> > > > >> All,
> > > > >>
> > > > > >> 1) Technically we could (and maybe should) add this to the standard (both
> > > > > >> DM and R2RML) however...
> > > > > >> 2) We only just became aware of the problem and somebody (Eric/Richard) came
> > > > > >> up with A solution. The rest of the standard has been built on years of
> > > > > >> experience. If this problem came up just now, at the last minute, it
> > > > > >> means that nobody cared much about this before. That doesn't mean that they
> > > > > >> won't want it now. But it does mean that we should look into it in more
> > > > > >> detail, given that we know the issue exists. Down the road, we will know if
> > > > > >> it is feasible, etc.
> > > > >
> > > > > We could move along more quickly if we:
> > > > >
> > > > > >  1. strike "is intended to provide a default behavior for R2RML: RDB
> > > > > >     to RDF Mapping Language" from DM
> > > > > >
> > > > > >  2. add a Note to R2RML 6.1: "Because rr:IRI and rr:BlankNode subject
> > > > > >     labels are generated from column values, R2RML mappings do not
> > > > > >     preserve repeated rows in SQL databases."
> > > > >
> > > > > Adding a per-row blank node identifier in v1.1 will be completely
> > > > > backward-compatible with v1.0.
> > > > >
> > > > >
> > > > >> Juan Sequeda
> > > > >> +1-575-SEQ-UEDA
> > > > >> www.juansequeda.com
> > > > >>
> > > > >>
> > > > > >> On Thu, May 3, 2012 at 7:27 PM, Richard Cyganiak <richard@cyganiak.de> wrote:
> > > > >>
> > > > >>> Hi Eric,
> > > > >>>
> > > > > >>> My short response is: The proposal is *optional*. You don't have to
> > > > > >>> implement it. You don't have to use implementations that don't support it.
> > > > > >>> It's just an extra sentence or two in the spec. There is clear guidance
> > > > > >>> which option implementers should support. What harm is there in allowing
> > > > > >>> the option?
> > > > >>>
> > > > > >>> You offered one argument against providing this optional feature, and
> > > > > >>> that's the point about backwards compatibility. Future WGs may find it
> > > > > >>> difficult to remove this option even if the option becomes obsolete due to
> > > > > >>> a possible R2RML 1.1 update. I'll address this below.
> > > > >>>
> > > > >>> On 3 May 2012, at 22:36, Eric Prud'hommeaux wrote:
> > > > > >>>> * ashok malhotra <ashok.malhotra@oracle.com> [2012-05-03 12:22-0700]
> > > > > >>>>> +1 for option 2.  Seems less onerous.   Eric?
> > > > > >>>>
> > > > > >>>> It pains me that folks see me as obstructionist when I may well be
> > > > > >>>> saving us a 3rd LC. In June of 2006, Fred Zemke spotted a similar
> > > > > >>>> problem in the semantics of SPARQL which took us six months to fix
> > > > > >>>> <http://www.w3.org/mid/4488B936.10705@oracle.com>.
> > > > >>>
> > > > > >>> The problem in SPARQL was that it specified that implementations MUST NOT
> > > > > >>> use multiset semantics.
> > > > >>>
> > > > > >>> The proposal on our table is to RECOMMEND multiset semantics, but state
> > > > > >>> that implementations MAY use set semantics for compatibility. This is not
> > > > > >>> comparable to the SPARQL situation.
> > > > >>>
> > > > > >>> I also note that the 1st LC period and the CR period have passed without
> > > > > >>> any comments on issues of cardinality.
> > > > >>>
> > > > > >>>> Speaking with Sam Madden, this seems like less of a corner case than
> > > > > >>>> we originally thought. He and Zemke asserted that while some base
> > > > > >>>> tables may have no uniques, it's more common for views materialized
> > > > > >>>> for performance to preserve only the information required to perform
> > > > > >>>> some aggregates. Before standardization of SQL, some relational DBs
> > > > > >>>> operated on sets, others on multisets, and some (Zemke worked on one
> > > > > >>>> called Britton Lee) preserved repeated rows until one did a
> > > > > >>>> sort. Customers, particularly those using views, had to be very
> > > > > >>>> careful in what order they performed various operations.
> > > > >>>
> > > > > >>> Well, I can see why customers wouldn't be so happy about this, but it's
> > > > > >>> not quite the same thing here.
> > > > >>>
> > > > > >>> The order of query operations doesn't matter in the proposed design.
> > > > > >>> SPARQL has multiset semantics, so even if you query a table with discarded
> > > > > >>> duplicates, the query execution is with the usual well-defined SPARQL
> > > > > >>> semantics. It's only in the mapping from non-PK tables to RDF graphs that
> > > > > >>> cardinality is not maintained.
> > > > >>>
> > > > > >>>> Juan brought up fixing this in v1. It's easy for v1.1 to relax rigid
> > > > > >>>> constraints in v1.0, but most charters promise backward compatibility,
> > > > > >>>> so v1.1 can't impose restrictions not present in v1.0.
> > > > >>>
> > > > > >>> That all depends on what we write into the spec, doesn't it? The DM spec
> > > > > >>> could state that the permission for discarding duplicate rows may be
> > > > > >>> removed in a future version, provided that a future R2RML adds a way of
> > > > > >>> preserving cardinality on no-PK tables.
> > > > >>>
> > > > > >>>> Another issue is the performance of very common queries. Under
> > > > > >>>> multiset semantics, any query which either reports the name of an
> > > > > >>>> unnamed row requires the complex dance that Richard and I discussed.
> > > > >>>
> > > > >>> Yes, these queries are slow.
> > > > >>>
> > > > > >>>> OTOH, under set semantics, any query which simply restricts or
> > > > > >>>> projects some row attributes requires a distinct subselect, which is
> > > > > >>>> either memory intensive or requires a sort of the table.
> > > > >>>
> > > > >>> Well, you forget about query optimization, see below.
> > > > >>>
> > > > >>>> For example,
> > > > >>>> a simple join to get the addresses of folks with year-old debts:
> > > > >>>>
> > > > >>>> SELECT ?name ?city
> > > > >>>>  WHERE {
> > > > >>>>    ?debt <IOUs#name> ?name ;
> > > > >>>>          <IOUs#date> ?date ;
> > > > >>>>          <IOUs#addr> ?addr .
> > > > >>>>    ?addr <Addresses#city> ?city
> > > > >>>>    FILTER (?date < "2011-05-03"^^xsd:date)
> > > > >>>>  }
> > > > >>>>
> > > > >>>> multiset SQL translation:
> > > > >>>> SELECT name, city
> > > > >>>>   FROM IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
> > > > > >>>>  WHERE date < '2011-05-03'
> > > > >>>>
> > > > >>>> set SQL translation:
> > > > >>>> SELECT name, city
> > > > >>>>   FROM (
> > > > >>>>     SELECT DISTINCT name, date, addr, attr4, attr5
> > > > >>>>       FROM IOUs
> > > > >>>>      ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
> > > > > >>>>  WHERE date < '2011-05-03'
> > > > >>>
> > > > > >>> Not having thought about this too hard, the second query doesn't seem
> > > > > >>> particularly bad. Isn't it equivalent to this?
> > > > >>>
> > > > >>> SELECT name, city
> > > > >>>  FROM (
> > > > >>>    SELECT DISTINCT name, date, addr, attr4, attr5
> > > > >>>      FROM IOUs
> > > > > >>>      WHERE date < '2011-05-03'
> > > > >>>      ) IOUs INNER JOIN Addresses ON IOUs.addr=Addresses.ID
> > > > >>>
> > > > > >>> So the duplicate removal is only necessary over the subset of the table
> > > > > >>> that is actually being returned in the end. The INNER JOIN can also be
> > > > > >>> moved inside the DISTINCT, I think. The DISTINCT should then be O(n log n)
> > > > > >>> where n is the number of result rows, which isn't too bad.
> > > > >>>
> > > > > >>> IIRC, DISTINCT can be moved up in the algebra tree over most other
> > > > > >>> operations, except for projections (which can usually be done last without
> > > > > >>> much performance impact), aggregates (which require more memory than
> > > > > >>> DISTINCT anyways) and LIMIT (which also limits the memory required for
> > > > > >>> DISTINCT).
> > > > >>>
> > > > > >>> D2RQ is fairly smart about moving DISTINCTs around before generating the
> > > > > >>> final SQL query. I'd expect that most decent query optimizers are even
> > > > > >>> smarter than what we do.
> > > > >>>
> > > > > >>>> One could make a pretty good case for preserving the intuitive and
> > > > > >>>> efficient query mapping for such common queries.
> > > > >>>
> > > > > >>> 1. For many of these common queries, the DISTINCT is done on a reduced
> > > > > >>> intermediate result, or even on the final result set, and not on the input
> > > > > >>> data. So it's not that bad.
> > > > >>>
> > > > > >>> 2. The strange contortions required for returning subjects may well
> > > > > >>> reverse the argument here. You make unproven assumptions about what queries
> > > > > >>> are common.
> > > > >>>
> > > > > >>> 3. Again, the proposal is *not* to abandon the cardinality-preserving
> > > > > >>> query mapping. The proposal is to allow another query mapping as well, for
> > > > > >>> compatibility.
> > > > >>>
> > > > >>> Best,
> > > > >>> Richard
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>>
> > > > >>>>
> > > > >>>>> All the best, Ashok
> > > > >>>>>
> > > > >>>>> On 5/3/2012 12:10 PM, Juan Sequeda wrote:
> > > > >>>>>>
> > > > >>>>>>
> > > > > >>>>>> On Thu, May 3, 2012 at 2:01 PM, Richard Cyganiak <richard@cyganiak.de> wrote:
> > > > >>>>>>
> > > > >>>>>>  On 3 May 2012, at 17:11, Juan Sequeda wrote:
> > > > >>>>>>> Do you accept eric's proposal (which hasn't been stated yet):
> > > > >>>>>>>
> > > > >>>>>>> 1) Leave DM as-is
> > > > >>>>>>> 2) Add the following to R2RML
> > > > >>>>>>>
> > > > >>>>>>> rr:subjectMap [
> > > > >>>>>>>   rr:termType rr:RowBlankNode
> > > > >>>>>>> ];
> > > > >>>>>>
> > > > > >>>>>>  (I'd prefer calling it rr:BlankNode. The absence of
> > > > > >>>>>>  rr:column/rr:template/rr:constant indicates the new behaviour.)
> > > > >>>>>>
> > > > > >>>>>>  This is a new feature that was never discussed before. It's not just
> > > > > >>>>>>  a tweak. No existing RDB2RDF mapping language has anything comparable.
> > > > > >>>>>>  How to sensibly implement it is a somewhat open question, AFAIK. Had
> > > > > >>>>>>  this been proposed a few months ago, everyone would have said, “sounds
> > > > > >>>>>>  like an R2RML 1.1 feature” and we would have postponed it without
> > > > > >>>>>>  complaints.
> > > > >>>>>>
> > > > > >>>>>>  The problem at hand is an incompatibility between two specs,
> > > > > >>>>>>  let's call them A and B, in a corner case. Now given these choices:
> > > > >>>>>>
> > > > > >>>>>>  1) Add a new and somewhat risky feature to spec A, at a time when we
> > > > > >>>>>>  thought we were just about to enter PR. Send all implementers of A
> > > > > >>>>>>  back to the drawing board. Delay the WG for an indefinite amount of
> > > > > >>>>>>  time, over a barely relevant corner case.
> > > > >>>>>>
> > > > > >>>>>>  2) Relax a constraint in spec B to say you SHOULD implement the
> > > > > >>>>>>  “correct” behaviour for this corner case, but MAY also implement
> > > > > >>>>>>  another not entirely unreasonable behaviour that is compatible with A
> > > > > >>>>>>  as it is. Add some alarming language and say: “We expect future
> > > > > >>>>>>  versions of A to remove this limitation.” No implementation changes.
> > > > > >>>>>>  Go to PR in three weeks.
> > > > >>>>>>
> > > > >>>>>>  To me, 2) makes a lot more sense than 1).
> > > > >>>>>>
> > > > >>>>>>
> > > > > >>>>>> I agree with Richard. Option 2 seems more reasonable at the moment.
> > > > >>>>>>
> > > > > >>>>>> We already have other issues to address for an R2RML and DM 1.1
> > > > > >>>>>> version. This could be part of it. I'm not sure how this works in the
> > > > > >>>>>> standardization process, but as a group, we believe this particular
> > > > > >>>>>> issue is a corner case so it's not imperative to include it in the
> > > > > >>>>>> current version of the standard. However, if users complain about this
> > > > > >>>>>> corner case (we then realize that it isn't a corner case), we realize
> > > > > >>>>>> we were wrong from the beginning. I'm guessing this sometimes
> > > > > >>>>>> (usually?) happens in standards, right?
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>  Best,
> > > > >>>>>>  Richard
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> Juan Sequeda
> > > > >>>>>>> +1-575-SEQ-UEDA
> > > > >>>>>>> www.juansequeda.com <http://www.juansequeda.com>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > > >>>>>>> On Thu, May 3, 2012 at 11:08 AM, Michael Hausenblas <michael.hausenblas@deri.org> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Were we close to closing R2RML's CR?
> > > > >>>>>>>
> > > > > >>>>>>> This was the last issue; all others have been resolved in last week's
> > > > > >>>>>>> meeting (see also my comments when I sent out the minutes [1]). Never
> > > > > >>>>>>> mind, we're not extending CR but entering a second, rather short LC
> > > > > >>>>>>> period.
> > > > >>>>>>>
> > > > > >>>>>>> Ivan, can you prepare a respective PROPOSAL for next week's meeting
> > > > > >>>>>>> please?
> > > > >>>>>>>
> > > > >>>>>>> Cheers,
> > > > >>>>>>>         Michael
> > > > >>>>>>>
> > > > >>>>>>> [1]
> > > > >>>
> > > >
> http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2012May/0005.html
> > > > >>>>>>>
> > > > >>>>>>> --
> > > > >>>>>>> Dr. Michael Hausenblas, Research Fellow
> > > > >>>>>>> DERI - Digital Enterprise Research Institute
> > > > >>>>>>> NUIG - National University of Ireland, Galway
> > > > >>>>>>> Ireland, Europe
> > > > > >>>>>>> Tel.: +353 91 495730
> > > > >>>>>>> WebID: http://sw-app.org/mic.xhtml#i
> > > > >>>>>>>
> > > > >>>>>>> On 3 May 2012, at 17:04, Eric Prud'hommeaux wrote:
> > > > >>>>>>>
> > > > > >>>>>>>> * Juan Sequeda <juanfederico@gmail.com> [2012-05-03 10:50-0500]
> > > > >>>>>>>>> Looks like we have to extend CR till
> > > > >>>>>>>>> we have implementations for this corner case.
> > > > >>>>>>>>
> > > > >>>>>>>> Were we close to closing R2RML's CR?
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>> Juan Sequeda
> > > > >>>>>>>>> www.juansequeda.com <http://www.juansequeda.com>
> > > > >>>>>>>>>
> > > > > >>>>>>>>> On May 3, 2012, at 10:42 AM, Richard Cyganiak <richard@cyganiak.de> wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>>> On 3 May 2012, at 16:25, Eric Prud'hommeaux wrote:
> > > > > >>>>>>>>>>> presumes you can create tables, but yeah, conceptually easier query.
> > > > >>>>>>>>>>
> > > > > >>>>>>>>>> (It looks like most databases have a proprietary method of adding
> > > > > >>>>>>>>>> the indexes that doesn't require write access to the DB.)
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> you can even push the symbol generation down:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Right.
> > > > >>>>>>>>>>
> > > > > >>>>>>>>>>>> The big remaining question is: How to handle this in R2RML?
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Looking for an analog to:
> > > > >>>>>>>>>>> rr:subjectMap [
> > > > >>>>>>>>>>>    rr:column "ROWID";
> > > > >>>>>>>>>>>    rr:termType rr:BlankNode
> > > > >>>>>>>>>>> ];
> > > > >>>>>>>>>>> I'd propose:
> > > > >>>>>>>>>>> rr:subjectMap [
> > > > >>>>>>>>>>>    rr:termType rr:RowBlankNode
> > > > >>>>>>>>>>> ];
> > > > >>>>>>>>>>
> > > > > >>>>>>>>>> That's an option. Even keeping rr:BlankNode would work; the
> > > > > >>>>>>>>>> absence of an rr:column/rr:template/rr:constant might signal that
> > > > > >>>>>>>>>> a fresh blank node must be allocated for each row.
> > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> Does that complicate things beyond how much a cardinality
> > > > > >>>>>>>>>>> requirement necessarily complicates things?
> > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Well, the spec only needs to define the graph generated by the
> > > > > >>>>>>>>>> mapping, so in terms of specification it would be a simple enough
> > > > > >>>>>>>>>> change.
> > > > >>>>>>>>>>
> > > > > >>>>>>>>>> The implications for implementers are quite significant though.
> > > > > >>>>>>>>>> It's a new feature, the implementation costs are not trivial, no
> > > > > >>>>>>>>>> existing implementation does this (AFAIK), so there's a certain
> > > > > >>>>>>>>>> amount of R&D required to show that it's implementable.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Best,
> > > > >>>>>>>>>> Richard
> > > > >>>>>>>>
> > > > >>>>>>>> --
> > > > >>>>>>>> -ericP
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>>> --
> > > > >>>> -ericP
> > > > >>>>
> > > > >>>
> > > > >>>
> > > > >
> > > > > --
> > > > > -ericP
> > > > >
> > > >
> > > >
> > > > ----
> > > > Ivan Herman, W3C Semantic Web Activity Lead
> > > > Home: http://www.w3.org/People/Ivan/
> > > > mobile: +31-641044153
> > > > FOAF: http://www.ivan-herman.net/foaf.rdf
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> >
> > --
> > -ericP
> >
>
>
> ----
> Ivan Herman, W3C Semantic Web Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> FOAF: http://www.ivan-herman.net/foaf.rdf
>
>
>
>
>
>
>
Received on Friday, 4 May 2012 13:34:30 UTC
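
The three SQL translations quoted in the thread (multiset, set, and the variant with the date filter pushed inside the DISTINCT subselect) can be compared on a small table that has no primary or unique key. The following is only a rough sketch under stated assumptions: SQLite (through Python's sqlite3 module) stands in for whatever database a mapping would sit on, the sample rows and city names are invented for illustration, attr4 and attr5 are the placeholder columns from the quoted queries, and the date literal is written with single quotes as a string.

```python
import sqlite3

# In-memory database; the schema and rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Addresses (ID INTEGER PRIMARY KEY, city TEXT);
    -- IOUs has no primary or unique key, so repeated rows are allowed.
    CREATE TABLE IOUs (name TEXT, date TEXT, addr INTEGER, attr4 TEXT, attr5 TEXT);
    INSERT INTO Addresses VALUES (1, 'Cambridge'), (2, 'Galway');
    INSERT INTO IOUs VALUES
        ('Bob', '2011-01-15', 1, 'x', 'y'),
        ('Bob', '2011-01-15', 1, 'x', 'y'),  -- exact duplicate of the row above
        ('Sue', '2011-03-01', 2, 'x', 'y'),
        ('Ann', '2012-02-01', 2, 'x', 'y');  -- too recent, filtered out
""")

# Multiset translation: repeated rows of IOUs survive into the result.
MULTISET_SQL = """
    SELECT name, city
      FROM IOUs INNER JOIN Addresses ON IOUs.addr = Addresses.ID
     WHERE date < '2011-05-03'
"""

# Set translation: duplicates are removed from the whole table first.
SET_SQL = """
    SELECT name, city
      FROM (
        SELECT DISTINCT name, date, addr, attr4, attr5
          FROM IOUs
           ) IOUs INNER JOIN Addresses ON IOUs.addr = Addresses.ID
     WHERE date < '2011-05-03'
"""

# Set translation with the filter moved inside the DISTINCT subselect,
# so duplicate removal only sees the rows that pass the date filter.
PUSHED_SQL = """
    SELECT name, city
      FROM (
        SELECT DISTINCT name, date, addr, attr4, attr5
          FROM IOUs
         WHERE date < '2011-05-03'
           ) IOUs INNER JOIN Addresses ON IOUs.addr = Addresses.ID
"""


def run(sql):
    """Run a query and return its rows sorted, since SQL gives no row order."""
    return sorted(conn.execute(sql).fetchall())


multiset_rows = run(MULTISET_SQL)
set_rows = run(SET_SQL)
pushed_rows = run(PUSHED_SQL)

print("multiset:", multiset_rows)
print("set:     ", set_rows)
print("pushed:  ", pushed_rows)
```

On this sample data the multiset translation returns the repeated Bob row twice, while both set translations return it once and agree with each other. That agreement on one small table is consistent with the equivalence suggested in the thread but does not prove it in general, and it says nothing about how any particular optimizer would actually plan these queries.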