Re: ISSUE-9 Another question about Generate Blank Nodes from Juan Sequeda on 2011-02-01 (public-rdb2rdf-wg@w3.org from February 2011)

From: Juan Sequeda <juanfederico@gmail.com>
Date: Mon, 31 Jan 2011 23:10:32 -0600
To: "Eric Prud'hommeaux" <eric@w3.org>
Cc: RDB2RDF Working Group WG <public-rdb2rdf-wg@w3.org>
Message-ID: <AANLkTimoOjuGwGSA71maepRr-bqey27zX=81nHmZfF9=@mail.gmail.com>
late night, so I won't respond to all...


On Mon, Jan 31, 2011 at 1:34 PM, Eric Prud'hommeaux <eric@w3.org> wrote:

> * Juan Sequeda <juanfederico@gmail.com> [2011-01-31 10:11-0600]
> > This may be something we have talked about, so sorry if I'm asking about
> > something that already has an answer.
> >
> > We assume that a table that does not have a primary key will have a blank
> > node as the Row identifier for each tuple.
> >
> > But what happens if the table does not have a primary key but does have
> > one or more candidate keys? Are we still generating a blank node as the
> > Row identifier for each tuple? Or could we consider building an IRI with
> > the candidate keys?
>
> That would make the rule a bit more complicated to explain to users
> and would lead to some design questions: Which candidate key would
> dominate when there were several to choose from (e.g. the Projects
> table)? How would the dominant key's value be available when
> generating reference triples which link to a non-dominant key?
>

Excellent point! This would just open a can of worms.

So I guess that we can just keep it simple and state that if a table does
not have a pk, then even if it has one or more candidate keys, we stick
with generating Blank Nodes.
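
For illustration only, the two-case rule could be sketched as below. This is
a minimal Python sketch under stated assumptions, not the direct mapping's
normative definition: the function name row_node, the People table in the
usage note, the Table/col=value IRI shape (borrowed from the example IRIs
quoted further down), the percent-encoding of values, and the _:bN label
scheme are all hypothetical.

```python
import itertools
from urllib.parse import quote

# Counter for fresh blank node labels (hypothetical labelling scheme).
_fresh = itertools.count()


def row_node(table, row, primary_key=None):
    """Return the node that identifies one row.

    table       -- table name, e.g. "Projects"
    row         -- dict mapping column name to value
    primary_key -- list of primary key column names, or None if absent

    Candidate (UNIQUE) keys are deliberately not a parameter:
    under the two-case rule they play no part in naming the row.
    """
    if primary_key:
        # Case 1: the row node is a function of the primary key value.
        parts = ",".join(
            f"{col}={quote(str(row[col]))}" for col in primary_key
        )
        return f"<{table}/{parts}>"
    # Case 2: no primary key, so every row gets a new blank node.
    return f"_:b{next(_fresh)}"
```

Calling row_node("People", {"ID": 8}, ["ID"]) gives an IRI built from the key
value, while two calls on the same Projects row without a primary key give
two distinct blank nodes.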


> One use case I want to be sure to address is that of a typical
> warehouse merging data from multiple sources, re-populated at a
> regular interval (say 3am daily). Sometimes they don't have a primary
> key (the candidate keys serve for linking purposes) because those keys
> would change every day. Sometimes they do have a primary key but its
> volatility dictates that the key is a secret used only by the import
> scripts.
>
>
> > Consider the following example
> >
> > Schema
> > Projects(lead, name, deptName, deptCity) where UNIQUE(name, deptName,
> > deptCity)
>
> I read this as a superkey encompassing the two candidate keys
> described in
> <http://www.w3.org/2001/sw/rdb2rdf/directMapping/#ref-no-pk>.
>
> > Instances
> > Projects(8, pencil survey, accounting, cambridge)
> > Projects(8, eraser survey, accounting, cambridge)
> >
> > For each tuple we could create a fresh blank node, or we could create a
> > Row IRI for each tuple using the candidate key:
> >
> > <Projects/name=pencil survey,deptName=accounting,deptCity=cambridge>
> > <Projects/name=eraser survey,deptName=accounting,deptCity=cambridge>
> >
> > These IRIs are unique because they come from unique keys.
> >
> > What is the consensus here? I do not think this case is covered in the
> > current direct mapping doc (right, Eric?)
>
> The modeling you're exploring isn't used in the direct mapping doc,
> but the use case is addressed. "Referencing tables with empty primary
> keys" includes the table with two unique keys and no primary keys that
> you describe above. The generated graph maintains referential
> integrity by labeling the triples from one row of the Projects table
> as _:c and using that as the object of all arcs which reference that
> row.
>
> I think the simple consistency of the current rule will appeal more to
> users and implementers. We now have two cases:
>
>  table has a primary key → row node is a function of that primary
>  key value.
>
>  table has no primary key → row node is a new blank node.
>
> We would otherwise have three cases:
>
>  table has a primary key (and any number of candidate keys) → row
>  node is a function of that primary key value.
>
>  table has no primary key and no candidate key → row node is a new
>  blank node.
>
>  table has no primary key and some candidate keys → row node is a
>  function of those candidate key values.
>
>
> > Cheers
> >
> > Juan Sequeda
> > +1-575-SEQ-UEDA
> > www.juansequeda.com
> >
> >
> > On Fri, Jan 21, 2011 at 2:41 PM, RDB2RDF Working Group Issue Tracker
> > <sysbot+tracker@w3.org> wrote:
> >
> > >
> > > ISSUE-9 (bn_directmapping): Generate Blank Nodes for duplicate tuples
> > > [Direct Mapping]
> > >
> > > http://www.w3.org/2001/sw/rdb2rdf/track/issues/9
> > >
> > > Raised by: Juan Sequeda
> > > On product: Direct Mapping
> > >
> > > Given a table that does not have a primary key, which has duplicate
> > > tuples, a different blank node must be created for each tuple.
> > >
> > > In the "Direct Mapping as rules" section of the Direct Mapping document,
> > > we described this scenario by using all the values of the tuple to create
> > > the blank node [1] [2]. However, there is a bug, raised by Alexandre [3].
> > > The issue is that datalog cannot deal with duplicates. Consequently,
> > > Marcelo raised the point that we can use simple versions of datalog that
> > > can deal with duplicate solutions.
> > >
> > > Possible solutions:
> > >
> > > 1) assume that each table implicitly has a row id which is part of its
> > > set of attributes. The row id is unique.
> > > 2) associate with each tuple an annotation that corresponds to the
> > > multiplicity of the tuple in the database. This annotation function
> > > corresponds to the function card in the definition of the semantics of
> > > SPARQL.
> > >
> > >
> > > [1] http://www.w3.org/TR/2010/WD-rdb-direct-mapping-20101118/#rules_table_triples_no_pk
> > > [2] http://www.w3.org/TR/2010/WD-rdb-direct-mapping-20101118/#rules_literal_triples_no_pk
> > > [3] http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2011Jan/0044.html
> > >
> > >
> > >
> > >
>
> --
> -ericP
>
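
The two options listed under "Possible solutions" in the quoted ISSUE-9 text
could be sketched roughly as follows. This is only an illustrative Python
sketch, assuming rows are plain tuples; the helper names and the _:bN label
scheme are hypothetical and come from neither the issue nor the direct
mapping document.

```python
from collections import Counter


def with_row_ids(rows):
    """Option 1: pair each tuple with an implicit, unique row id."""
    return list(enumerate(rows))


def with_multiplicity(rows):
    """Option 2: annotate each distinct tuple with its multiplicity."""
    return Counter(rows)


def blank_nodes_from_multiplicity(card):
    """Give each occurrence of each tuple its own blank node label."""
    labels = {}
    n = 0
    for tup, count in card.items():
        labels[tup] = [f"_:b{n + i}" for i in range(count)]
        n += count
    return labels


# A table with no primary key that holds one duplicated tuple.
rows = [
    (8, "pencil survey", "accounting", "cambridge"),
    (8, "pencil survey", "accounting", "cambridge"),
    (8, "eraser survey", "accounting", "cambridge"),
]
ids = with_row_ids(rows)
card = with_multiplicity(rows)
labels = blank_nodes_from_multiplicity(card)
```

Either way the duplicated tuple ends up with two distinct blank nodes: option
1 tells the copies apart by row id, option 2 by counting occurrences.
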
Received on Tuesday, 1 February 2011 06:16:25 UTC