example SQL mapping to direct graph

* Alexandre Bertails <bertails@w3.org> [2011-01-19 09:07-0500]
> On Wed, 2011-01-19 at 14:03 +0100, Boris Villazón Terrazas wrote:
> > On 19/01/2011 13:22, Harry Halpin wrote:
> > >> On Jan 18, 2011, at 19:05 , Alexandre Bertails wrote:
> > >> [snip]
> > >>> Juan, in order to help you with your action, I would like you to
> > >>> consider the following:
> > >>>
> > >>> [[
> > >>> CREATE TABLE Debts (
> > >>> Name varchar(50),
> > >>> Amount Integer
> > >>> );
> > >>> INSERT INTO Debts (Name, Amount) VALUES("juan", 50);
> > >>> INSERT INTO Debts (Name, Amount) VALUES("juan", 50);
> > >>> ]]
> > >>>
> > >>> Using this very simple RDB [1] example, can you go through all the
> > >>> phases that lead to the RDF where I owe you 100?
> > >>
> > >> Alex, for my understanding: what I would expect to see in RDF are two
> > >> pairs of identical triples with different subjects. How would a direct
> > >> mapping produce anything whereby I owe Juan 100? I will owe 50 twice,
> > >> but the fact that this is a hundred is something that the RDF per se
> > >> cannot express...
> > > Otherwise known as "RDF can't add, and (up until recently) neither can
> > > SPARQL". Which is a feature, not a bug, re decidability :)
> > >
> > > However, I'm assuming Alex is asking: with Datalog semantics, do you
> > > get *two* distinct pairs of triples from those two insert statements
> > > (multiset) or just one (set)?
> > >
> > > Multisets allow repeated membership, while sets don't, I think.
> > Thanks Alex for pointing out this case.
> > This case was included by Eric within the test cases
> > http://www.w3.org/2001/sw/rdb2rdf/wiki/R2RML_Test_Cases_v1#DB5
> 
> You're right. I just made it simpler and put it in the body of the email
> so I was sure that people would look at it :-)
> 
> > So, as you said, the question is: what is the default mapping for this case?
> 
> Two pairs of triples, with different bnodes as subjects.
> 
> > How can we face this case with R2RML? And what would be the expected result?
> 
> I have no idea.

I think "facing the case" means making sure the cardinality of SPARQL
result set matches that of the intuitive SQL result table. For instance,
  SELECT ?to ?amnt { ?debt Debts:Name ?to ; Debts:Amount ?amnt }
should give back two rows when
  SELECT Debts.Name, Debts.Amount FROM Debts
would give two rows. Likewise
  SELECT SUM(?amnt) { ?debt :Amount ?amnt }
should be the same as
  SELECT SUM(Debts.amnt) FROM Debts
. This is handled in the direct mapping by creating a blank node for
each row in each table which does not have a primary key. You could
handle this in Datalog by ammending
  http://www.w3.org/mid/20101116000533.GI13630@w3.org
to include an SQL sequence generator (à la PRIMARY KEY AUTO
INCREMENT) in a copy of each of the database tables which has no
primary key. Tweaking from an earlier proposal:
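
For instance, a minimal sketch of that copy, assuming MySQL syntax
(the table name Debts2 and its id column are hypothetical):

CREATE TABLE Debts2 (id INT AUTO_INCREMENT PRIMARY KEY,
                     Name varchar(50),
                     Amount Integer);
INSERT INTO Debts2 (Name, Amount) SELECT Name, Amount FROM Debts;
-- each row now carries a distinct id from which to mint a distinct
-- bnode, so the two identical debts stay distinct

Tweaking from an earlier proposal: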

* Eric Prud'hommeaux <eric@w3.org> [2010-11-15 19:05-0500]
> The current text assumes the trivial mapping from an SQL table to
> datalog facts. Eliding column names, we see:
> 
>   ┌┤foo├────┬────┐
>   │ X  │ Y  │ Z  │
>   │ x1 │ y1 │ z1 │ => foo(x1, y1, z1)
>   │ x2 │ y2 │ z2 │    foo(x2, y2, z2)
>   └────┴────┴────┘
> 
> If we use, yes, HOL, to create our datalog facts, we can capture
> things like the relation name, attribute (column) names, even the
> datatypes and keys. (This effectively recapitulates the catalog.)
> 
>   ┌┤foo.schema├─────┬────┬───────┬────┬──────┐
>   │ name │ a1 │ t1  │ a2 │ t2    │ a3 │ t3   │
>   │  foo │  X │ int │  Y │ float │  Z │ blob │
>   └──────┴────┴─────┴────┴───────┴────┴──────┘

Further reflection suggests a vertical encoding like:

  ┌┤__attrs├─┬──────┬───────┐
  │ relation │ attr │  type │
  │ foo      │ X    │   int │
  │ foo      │ Y    │ float │
  │ foo      │ Z    │  blob │
  └──────────┴──────┴───────┘

>   ┌┤foo.data├─┬────┬────┐
>   │ name │ a1 │ a2 │ a3 │
>   │  foo │ x1 │ y1 │ z1 │
>   │  foo │ x2 │ y2 │ z2 │
>   └──────┴────┴────┴────┘

  ┌┤__data├──┬─────┬──────┬───────┐
  │ relation │ row │ attr │ value │
  │ foo      │   1 │ X    │ "1"   │
  │ foo      │   1 │ Y    │ "1.0" │
  │ foo      │   1 │ Z    │ "z1"  │
  │ foo      │   2 │ X    │ "2"   │
  │ foo      │   2 │ Y    │ "2.0" │
  │ foo      │   2 │ Z    │ "z2"  │
  └──────────┴─────┴──────┴───────┘
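
As a sanity check (a sketch, using the __data table as coded in SQL
further down in this mail), the horizontal foo can be recovered from
the vertical encoding by self-joining on row, so nothing is lost:

  SELECT x.value AS X, y.value AS Y, z.value AS Z
    FROM __data AS x, __data AS y, __data AS z
   WHERE x.relation="foo" AND y.relation="foo" AND z.relation="foo"
     AND x.attr="X" AND y.attr="Y" AND z.attr="Z"
     AND x.row=y.row AND y.row=z.row;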


> From there, you have all of your column names, relation names, etc. to
> use FOL to generate the IRIs and literals for the RDF relation. But,
> if you're actually working in a type-safe FOL, you can encode your RDF
> in a relation that has types (beyond strings):
> 
>   ┌┤foo.graph├┬────────┬─────────┬──────┬────────┬──────────┬────────────────┬──────────┐
>   │ sIRI      │ sBNode │ p       │ oIRI │ oBNode │ oLexical │ oDatatype      │ oLangTag │
>   │ http…X=x1 │ NULL   │ http…#X │ NULL │ NULL   │       x1 │ http…integer   │          │
>   │ http…X=x1 │ NULL   │ http…#Y │ NULL │ NULL   │       y1 │ http…float     │          │
>   │ http…X=x1 │ NULL   │ http…#Z │ NULL │ NULL   │       z1 │ http…hexBinary │          │
>   └───────────┴────────┴─────────┴──────┴────────┴──────────┴────────────────┴──────────┘

I coded this transformation in SQL so folks can play along with the
tools they have at home:

-- Create the sample database (one table):
CREATE TABLE foo (X INT,
                  Y FLOAT,
                  Z BLOB);
INSERT INTO foo (X, Y, Z) VALUES (1, 1.0, "z1"),
                                 (2, 2.0, "z2");

-- Higher-order import of table and attribute names into domain of discourse:
CREATE TABLE __attrs (relation VARCHAR(20),
                      attr VARCHAR(20),
                      type VARCHAR(20));
INSERT INTO __attrs (relation, attr, type) VALUES ("foo", "X", "INT"),
                                                  ("foo", "Y", "FLOAT"),
                                                  ("foo", "Z", "BLOB");
CREATE TABLE __data (relation VARCHAR(20),
                     row INT,
                     attr VARCHAR(20),
                     value VARCHAR(20));
INSERT INTO __data (relation, row, attr, value) VALUES ("foo", 1, "X", "1"),
                                                       ("foo", 1, "Y", "1.0"),
                                                       ("foo", 1, "Z", "z1"),
                                                       ("foo", 2, "X", "2"),
                                                       ("foo", 2, "Y", "2.0"),
                                                       ("foo", 2, "Z", "z2");
-- We could capture the keys info in a __keys table, but it's
-- not necessary for this example.
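-- A hypothetical __keys sketch, were we to need it:
--   CREATE TABLE __keys (relation VARCHAR(20),
--                        attr VARCHAR(20),
--                        keyName VARCHAR(20));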

-- Representation of RDF graph:
CREATE TABLE __graph (sIRI VARCHAR(30),
                      sBNode VARCHAR(30),
                      pIRI VARCHAR(30),
                      oIRI VARCHAR(30),
                      oBNode VARCHAR(30),
                      oLiteral VARCHAR(30),
                      oDatatype VARCHAR(30),
                      oLang VARCHAR(10));

-- insert triples like (BNode, IRI, Literal)s
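-- (note, a presentational shortcut: the "_:"-labeled subject below is
-- written to the sIRI column; a stricter encoding would put it in
-- sBNode and leave sIRI NULL)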
INSERT INTO __graph (sIRI, pIRI, oLiteral, oDatatype)
   SELECT CONCAT("_:", __data.relation, __data.row),
          CONCAT(__data.relation, "#", __data.attr),
          CONCAT("\"", __data.value, "\""), 
          CONCAT("xsd:", __attrs.type)
     FROM __data, __attrs
    WHERE __data.relation=__attrs.relation
      AND __data.attr=__attrs.attr;
-- Et voila, the literal triples (with a BNode subject):
SELECT * FROM __graph;
┌────────┬────────┬───────┬──────┬────────┬──────────┬───────────┬───────┐
│ sIRI   │ sBNode │ pIRI  │ oIRI │ oBNode │ oLiteral │ oDatatype │ oLang │
│ _:foo1 │ NULL   │ foo#X │ NULL │ NULL   │ "1"      │ xsd:INT   │ NULL  │
│ _:foo1 │ NULL   │ foo#Y │ NULL │ NULL   │ "1.0"    │ xsd:FLOAT │ NULL  │
│ _:foo1 │ NULL   │ foo#Z │ NULL │ NULL   │ "z1"     │ xsd:BLOB  │ NULL  │
│ _:foo2 │ NULL   │ foo#X │ NULL │ NULL   │ "2"      │ xsd:INT   │ NULL  │
│ _:foo2 │ NULL   │ foo#Y │ NULL │ NULL   │ "2.0"    │ xsd:FLOAT │ NULL  │
│ _:foo2 │ NULL   │ foo#Z │ NULL │ NULL   │ "z2"     │ xsd:BLOB  │ NULL  │
└────────┴────────┴───────┴──────┴────────┴──────────┴───────────┴───────┘

The skolemizing approach in the direct mapping wiki would look more like:

INSERT INTO __graph (sIRI, pIRI, oLiteral, oDatatype)
   SELECT CONCAT("_:", __data.relation, __name.skolem),
          CONCAT(__data.relation, "#", __data.attr),
          CONCAT("\"", __data.value, "\""),
          CONCAT("xsd:", __attrs.type)
     FROM __data, __attrs, (
            -- (GROUP_CONCAT order is unspecified in SQL; a robust
            -- version would ORDER BY the column position)
            SELECT __data.relation, __data.row,
                   GROUP_CONCAT(__data.value SEPARATOR "~") AS skolem
              FROM __data GROUP BY relation, row ) AS __name
    WHERE __name.relation=__data.relation
      AND __name.row=__data.row
      AND __data.relation=__attrs.relation
      AND __data.attr=__attrs.attr;
┌───────────────┬────────┬───────┬──────┬────────┬──────────┬───────────┬───────┐
│ sIRI          │ sBNode │ pIRI  │ oIRI │ oBNode │ oLiteral │ oDatatype │ oLang │
│ _:foo1~1.0~z1 │ NULL   │ foo#X │ NULL │ NULL   │ "1"      │ xsd:INT   │ NULL  │
│ _:foo1~1.0~z1 │ NULL   │ foo#Y │ NULL │ NULL   │ "1.0"    │ xsd:FLOAT │ NULL  │
│ _:foo1~1.0~z1 │ NULL   │ foo#Z │ NULL │ NULL   │ "z1"     │ xsd:BLOB  │ NULL  │
│ _:foo2~2.0~z2 │ NULL   │ foo#X │ NULL │ NULL   │ "2"      │ xsd:INT   │ NULL  │
│ _:foo2~2.0~z2 │ NULL   │ foo#Y │ NULL │ NULL   │ "2.0"    │ xsd:FLOAT │ NULL  │
│ _:foo2~2.0~z2 │ NULL   │ foo#Z │ NULL │ NULL   │ "z2"     │ xsd:BLOB  │ NULL  │
└───────────────┴────────┴───────┴──────┴────────┴──────────┴───────────┴───────┘

which would give you

  _:Debtsjuan~50 <Debts#Name> "juan"^^xsd:STRING .
  _:Debtsjuan~50 <Debts#Amount> "50"^^xsd:INT .

twice (which would collapse to once, an RDF graph being a set), instead
of the more faithful

  _:Debts1 <Debts#Name> "juan"^^xsd:STRING .
  _:Debts1 <Debts#Amount> "50"^^xsd:INT .
  _:Debts2 <Debts#Name> "juan"^^xsd:STRING .
  _:Debts2 <Debts#Amount> "50"^^xsd:INT .
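
(To replay this at home, a sketch reusing the __attrs/__data tables
above; the type names here are naive stand-ins:

INSERT INTO __attrs (relation, attr, type) VALUES ("Debts", "Name", "STRING"),
                                                  ("Debts", "Amount", "INT");
INSERT INTO __data (relation, row, attr, value) VALUES ("Debts", 1, "Name", "juan"),
                                                       ("Debts", 1, "Amount", "50"),
                                                       ("Debts", 2, "Name", "juan"),
                                                       ("Debts", 2, "Amount", "50");

Re-running the two INSERT INTO __graph statements then adds the
row-numbered and skolemized Debts triples above, alongside the foo
ones.)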


Important: getting cardinality right matters. Relational heads like to
work over sets, but ultimately the practical need of not having to
guarantee uniqueness always wins. The set-based relational engines of
the early '80s all either changed to multisets (changing many
customers' queries) or folded. By '86, the SQL standard had
established multiset semantics, maintaining cardinality even when
uniqueness was projected away. At the end of 2006, Fred Zemke
interceded to keep the SPARQL semantics from having ambiguous
cardinality, which cost months but gave us an invaluable precision in
the semantics.

> The Denotational Semantics for the Direct Mapping (aka. the algebras and
> the associated mapping function) *already* works on top of multisets
> (and sets and lists), both in the data model for RDB and in the
> higher-order functions we define/use to manipulate this data-structure.
> 
> Alexandre.
> 
> > Boris

-- 
-ericP

Received on Wednesday, 19 January 2011 16:30:19 UTC