Re: DM implementation feedback: implementability for tables w/o primary key

* Richard Cyganiak <richard@cyganiak.de> [2012-04-22 23:37+0100]
> For tables without primary key, the DM requires that a “fresh” blank node be allocated to each row.
> 
> This behaviour is encoded in the “IOU” test case:
> http://www.w3.org/2001/sw/rdb2rdf/test-cases/#DirectGraphTC0005
> 
> Implementing this is easy enough when dumping the direct graph into a file, but I believe it is impossible when we need to “hold onto” one of the blank nodes and retrieve additional information about it, e.g., in SPARQL-to-SQL rewriters or implementations of RDF APIs such as the Jena API on top of a DM'd database.
> 
> The case of concern is when the table has neither a PK nor a Unique Key, and the DB engine doesn't support some sort of internal unique row identifier such as Oracle's ROWID. Note that Core SQL 2008 doesn't require anything like ROWID (as far as I can tell), and many DB engines including MySQL, PostgreSQL, SQL Server and HSQLDB don't have a suitable equivalent.

You never need to hold on to any blank node in a table with no unique keys because there can't be a foreign key pointing to such a table. The grammar
[[
  <referential constraint definition> ::=
    FOREIGN KEY <left paren> <referencing columns> <right paren>
      <references specification>
  
  <references specification> ::=
    REFERENCES <referenced table and columns>
      [ MATCH <match type> ] [ <referential triggered action> ]
  
  <match type> ::=
      FULL
    | PARTIAL
    | SIMPLE
  
  <referencing columns> ::=
    <reference column list>
  
  <referenced table and columns> ::=
    <table name> [ <left paren> <reference column list> <right paren> ]
  
  <reference column list> ::=
    <column name list>
]]
is supplemented with constraints that the <reference column list> identify
[[
  … the set of <column name>s contained in the <unique column list> of a unique constraint of the
  referenced table …
]]
or, if there's just a <table name>, that the names of the <referencing columns> correspond to the names of a columns in a PRIMARY KEY.


> (Most if not all of these databases have some equivalent to Oracle's ROWNUM, which looks somewhat promising for implementing this, but I cannot work out any way to actually do it.)
> 
> One immediate corollary of the specified behaviour is that the DM cannot be implemented in R2RML.

Given two databases which differ by whether there's a primary key on a referenced table:
WithPK: [[
  CREATE TABLE "People" (                                       CREATE TABLE "Addresses" (
          "ID" INT,                                               "ID" INT, PRIMARY KEY("ID"),
          "fname" CHAR(10),                                       "city" CHAR(10),
          "addr" INT,                                             "state" CHAR(2)
          FOREIGN KEY("addr") REFERENCES "Addresses"("ID")      )
  )
]]

WithoutPK: [[
  CREATE TABLE "People" (                                       CREATE TABLE "Addresses" (
          "ID" INT,                                               "ID" INT, KEY("ID"), /* unique but not primary */
          "fname" CHAR(10),                                       "city" CHAR(10),
          "addr" INT,                                             "state" CHAR(2)
          FOREIGN KEY("addr") REFERENCES "Addresses"("ID")      )
  )
]]

I would expect either an R2RML or DM-aware query mapper to map
[[
  SELECT ?city
   WHERE {
    ?who <People#fname> "Bob" ; <People#addr> ?addr .
    ?addr <Addresses#city> ?city
   }
]]
to map to an SQL query like
[[
  SELECT city
    FROM People, Addresses
   WHERE People.addr=Addresses.ID
     AND People.fname="Bob"
]]
regardless of whether ID was a primary key. The proposal below encourages
[[
  SELECT city
    FROM People, Addresses
   WHERE People.fname="Bob"
]]
which would emit every address in the database.


> I propose that the requirement for a “fresh” blank node should be relaxed, and implementations that assign the same blank node to identically-valued rows should be considered conforming too. This would require a change to the DirectGraphTC0005 test case mentioned above – using the same blank node _:a instead of _:a and _:c should be acceptable.
> 
> (I wish all databases had ROWID!)
> 
> Best,
> Richard

-- 
-ericP

Received on Monday, 23 April 2012 01:28:51 UTC