Re: ISSUE-41 bNode semantics: Information preservation

Hi Marcelo,

Thank you for the very clear analysis. I agree with all your points, and would like to add the following.

Information preservation is a very useful notion. But it is not as strong as I'd like. It does not constrain the mapping to a translation where database and RDF, intuitively and formally, “mean” the same.

Let us consider a slightly sick example to show what I mean: a mapping that adds 13 to each number in a database while translating to RDF. That's clearly not desirable. But it would be information preserving according to (1), because it is possible to construct a reverse mapping F that subtracts 13. It's also information preserving according to (2), because for every Q it is possible to form a query Q* that does the appropriate subtractions to yield the same result as Q.
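
To make the example concrete, here is a minimal Python sketch. It is only a toy model under my own assumptions: rows are plain dicts standing in for both the relational instance and the RDF graph, the canonical function T is taken to be the identity, and the names M, F, Q and Q_star are illustrative rather than taken from any specification.

```python
OFFSET = 13  # the arbitrary shift from the example above

def M(instance):
    """Hypothetical 'sick' mapping: shift every number by OFFSET."""
    return [{k: v + OFFSET for k, v in row.items()} for row in instance]

def F(translated):
    """Reverse mapping: subtract OFFSET again, so that F(M(I)) == I."""
    return [{k: v - OFFSET for k, v in row.items()} for row in translated]

def Q(instance):
    """Example query over the original data: ids of rows with age > 30."""
    return sorted(row["id"] for row in instance if row["age"] > 30)

def Q_star(translated):
    """Rewritten query over the shifted data; it undoes the shift inline."""
    return sorted(
        row["id"] - OFFSET
        for row in translated
        if row["age"] - OFFSET > 30
    )

I = [{"id": 1, "age": 25}, {"id": 2, "age": 41}]

assert F(M(I)) == I           # information preserving in the sense of (1)
assert Q_star(M(I)) == Q(I)   # information preserving in the sense of (2)
assert M(I) != I              # yet the translated data states different numbers
```

Both conditions hold, although the translated data plainly does not say what the database says.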

A stronger notion that prevents such mappings would be semantics preservation. M is semantics preserving if I and M(S,I), expressed in an appropriate logic, are equivalent, so that I <=> M(S,I).

Unfortunately, we know that there cannot be a semantics preserving mapping from relational databases to RDF in the presence of NULLs. NULLs in SQL can indicate the absence of a value, and negation cannot be expressed in RDF.

However, if we can find a mapping M that is information preserving *and* fulfills the following:

I => M(S,I)

then we have quite a strong statement of correctness. M retains all information; the RDF may have lost some of the semantics of the original database, but at least it never contradicts the semantics of the DB.

I believe that I => M(S,I) holds for the direct mapping as currently defined, and as you said, with the addition of schema information it would be easy to show that it is information preserving according to (1).
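
As a rough illustration of the one-directional condition, here is a Python sketch. It is a deliberate simplification and not the direct mapping itself: statements are modelled as (row key, attribute, value) tuples, a NULL cell is modelled as the database asserting "no value", and I => M(S,I) is approximated by checking that every statement in the output is also asserted by the instance. The function names are hypothetical.

```python
def db_facts(rows, key="ID"):
    """Everything the instance asserts, including 'no value' (None) for NULLs."""
    return {(r[key], a, v) for r in rows for a, v in r.items()}

def direct_like(rows, key="ID"):
    """Simplified stand-in for the direct mapping: one statement per non-NULL cell."""
    return {(r[key], a, v) for r in rows for a, v in r.items() if v is not None}

def plus13(rows, key="ID"):
    """The 'add 13' mapping from the earlier example, in the same toy encoding."""
    return {(r[key], a, v + 13) for r in rows for a, v in r.items() if v is not None}

I = [{"ID": 1, "A": None}, {"ID": 2, "A": 7}]

# Every statement produced is backed by the database: models I => M(S,I).
assert direct_like(I) <= db_facts(I)
# The NULL statement is lost, so the two are not equivalent.
assert direct_like(I) != db_facts(I)
# The 'add 13' mapping produces statements the database does not assert.
assert not plus13(I) <= db_facts(I)
```

In this encoding the NULL-dropping mapping passes the check while the "add 13" mapping fails it, which is the distinction the extra condition is meant to capture.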

Best,
Richard




On 24 May 2011, at 03:09, Marcelo Arenas wrote:

> Dear All,
> 
> As far as I can see, two alternative ways of defining information
> preservation have been discussed by the group. Let me try to explain
> these two alternatives.
> 
> Assume that M is a mapping that takes as input a relational database
> schema S and an instance I of S, and produces an RDF graph (this
> function could be the direct mapping defined in
> http://www.w3.org/TR/2011/WD-rdb-direct-mapping-20110324/). Then we
> have the following alternatives.
> 
> (1) We say that M is information preserving if there exists a mapping
> F such that: (1) F takes as input an RDF graph and produces a
> relational instance, and (2) for every relational schema S and
> instance I of S:
> 
> F(M(S,I)) = I
> 
> That is, one can reconstruct the original instance I by using the
> information in M(S,I).
> 
> (2) Assume given a canonical function T that specifies how to
> translate relational tuples into solution mappings
> (http://www.w3.org/TR/rdf-sparql-query/#sparqlSolutions). Then we say
> that mapping M is information preserving if for every relational
> algebra query Q over a relational schema S, there exists a SPARQL
> query Q* such that for every instance I of S:
> 
> T(Q(I)) = Q*(M(S,I))
> 
> That is, the answer to Q over I is "equal" to the answer of Q* over
> M(S,I) (more precisely, the translation according to T of the set of
> tuples that form the answer to Q over I is equal to the set of
> solution mappings that form the answer to Q* over M(S,I)).
> 
> 
> In my opinion, (1) is a simple and natural definition. The direct
> mapping defined in
> http://www.w3.org/TR/2011/WD-rdb-direct-mapping-20110324/ is not
> information preserving according to (1). But if this mapping is
> modified to generate triples that store the initial relational schema
> (in particular, triples for storing the names of the attributes of a
> relation), then the mapping will be information preserving according
> to (1).
> 
> The main drawback of (1) is that it does not impose any restriction on
> the function F. Notion (2) tries to overcome this limitation by
> imposing the restriction that it must be possible to answer every
> relational algebra query Q over the initial data by using a SPARQL
> query Q* over the translated data (notice that this definition does
> not impose any restriction on the SPARQL operators used in Q*). But to
> use notion (2), one needs to choose a canonical function T for
> translating relational tuples into solution mappings. For example, if
> this canonical function is defined by associating to each relational
> tuple t a solution mapping mu such that (notice that relational tuples
> are treated as functions):
> 
> - the domain of mu is equal to the domain of t
> - mu(A) = t(A) if t(A) is not null, and mu(A) is a fresh blank node if
> t(A) is null
> 
> Then we have that the direct mapping defined in
> http://www.w3.org/TR/2011/WD-rdb-direct-mapping-20110324/ is not
> information preserving. On the other hand, if the canonical function T
> is defined by associating to each relational tuple t a solution
> mapping mu such that:
> 
> - the domain of mu is equal to the set of attributes A such that t(A)
> is not null
> - mu(A) = t(A) if t(A) is not null
> 
> Then we have that the direct mapping defined in
> http://www.w3.org/TR/2011/WD-rdb-direct-mapping-20110324/ is
> information preserving.
> 
> In my opinion, one of the first questions to answer is how canonical
> mapping T should be defined. Or, more concretely, suppose that we are
> given the following tuple from Enrico's example:
> 
> t(ID) = 1
> t(A) = NULL
> 
> Which one of the following mappings represents the information in this
> tuple?
> 
> (a) mu_1 with domain {ID, A} and such that mu_1(ID) = 1 and mu_1(A) =
> _:b
> (b) mu_2 with domain {ID} and such that mu_2(ID) = 1
> 
> 
> I hope that we can answer this question.
> 
> All the best,
> 
> Marcelo
> 

Received on Tuesday, 24 May 2011 11:01:06 UTC