Re: Fear for explicit NULL values from Juan Sequeda on 2011-06-14 (public-rdb2rdf-wg@w3.org from June 2011)

From: Juan Sequeda <juanfederico@gmail.com>
Date: Tue, 14 Jun 2011 11:05:57 -0500
To: Enrico Franconi <franconi@inf.unibz.it>
Cc: RDB2RDF WG <public-rdb2rdf-wg@w3.org>
Message-ID: <BANLkTin5AytOimu2BDiL63ZqL-o3RP=1SA@mail.gmail.com>
On Tue, Jun 14, 2011 at 10:39 AM, Enrico Franconi <franconi@inf.unibz.it> wrote:

> I don't have an answer, and I'd be happy to work offline with you and
> Marcelo on this; I have been working on the semantics of normative SQL null
> values for 8 months.
> Obviously, I do have an answer in the case you materialise the NULLs - you
> still didn't say why you don't like my proposal.
>

As I previously mentioned, most if not all databases will have null values,
so your proposal would not allow the direct mapping to be used.

The following is my personal view and vision: I believe that the direct
mapping will be crucial for the RDF/semantic web and data integration world
for the following reasons:

1) Instead of a blank R2RML file, you will first direct map your database to
generate a pre-populated R2RML file. D2R does this. And so does Revelytix
(if I'm not wrong). They may or may not follow the direct mapping standard,
but the general idea is to have *a* direct mapping so the author of the
R2RML file doesn't start blank
2) People may have really nicely modeled databases, or only want to export a
few tables to RDF. Even though this is not part of the group's charter, I
foresee people implementing Direct Mapping+ where you choose which tables
you want to direct map. In the end, all you need to do is a string
substitution from the automatically generated labels (ex:name) to the labels
that you really want (foaf:name)
3) If you are comfortable with semantic web technologies, you can direct map
your database and then customize the output with SPARQL construct and/or RIF
4) If you are comfortable with the database, and have access to create views
and have the Direct Mapping+, you can create the views for the data that you
want to export and then only direct map.

These are just 4 possible use cases that I see for the direct mapping. It was
very encouraging to talk to people at Semtech about RDB2RDF because they are
starting to realize the benefits of RDF and they are now thinking a bit ahead
of themselves: "if I'm going to use RDF for data integration... how do I get
my rdb data into RDF then?".

So... let's not kill the direct mapping please.


> But let's do it offline.
>

Sounds good.

> --e.
>
>
> On 14 Jun 2011, at 17:26, Juan Sequeda <juanfederico@gmail.com> wrote:
>
>
> On Jun 14, 2011, at 10:19 AM, Enrico Franconi <franconi@inf.unibz.it>
> wrote:
>
> I am now seeing something concrete, finally.
>
>
> Thanks. I've actually had this email in draft for a while.
>
> This is actually part of the paper that Marcelo Arenas, Dan Miranker and I
> are writing.
>
> We need to double check the generality of the approach.
>
>
> Now we can work on this together. I'm sure we can find the correct solution
> instead of coming up with a dramatic proposal :)
>
> For example: what if you have a query asking just for the values of an
> attribute which may contain NULL values (so you don't output the id)?
> And: how do you solve my query (c) in the wiki? It seems to me that you
> need also to have a notion of the schema of the answer set.
> These are the kind of questions I'd like to see answered :-)
>
>
> I need to relook at this. Do you have an answer?
>
> Enrico, do you think we can answer your questions over the phone in the
> next hour? Otherwise, I propose that we skip this topic on today's call so
> we can make progress on other issues. It looks like we can make more
> progress over email. Is this ok with you?
>
> If so, would this be ok with the chairs?
>
> Btw, I'll be in Chile next week with Marcelo and we could have a separate
> call with interested parties to address this particular issue. My only
> concern is getting things done by the Sept 1 deadline.
>
> Looking forward to your comments
>
>
> On 14 Jun 2011, at 15:39, Juan Sequeda <juanfederico@gmail.com> wrote:
>
> Why?
>
> Because, IMO, this is what the general RDF audience wants, and we should
> create a standard that people are going to use. That is why I disagree with
> your proposal, Enrico. We can't simply state that the direct mapping is not
> applicable 95% of the time. Then we would have wasted 2 years of work on the
> direct mapping.
>
> Our task is to bridge the gap and make sure that everything works... and I
> believe that everything will work. The main concern is that if we do not map
> the NULLs, our mapping will not be information preserving. In other words,
> "how to rebuild the correct answers with explicit NULLs using the direct
> mapping". So let me break this down. I believe information preservation
> holds in the following way:
>
> Let S be a relational schema and Q a relational query over S. Then there
> exists a sparql query Q* such that for every instance I of S:
>
> T(Q(S,I)) = Q*(M(S,I))
>
> Any relational query Q can be broken down into a set of identity queries,
> each of which is essentially SELECT * FROM table, one for every table that
> is part of the query. This identity relational query is equal to the
> following sparql query:
>
> (?x A1 ?A1) OPT ... OPT (?x An ?An)
>
> where A1, ..., An are the attributes of the table. This is where the schema
> comes in. We need to know all the attributes that are part of each table so
> we can build this sparql query.
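
A minimal sketch of how the identity query above could be generated from a
table's schema. This is only an illustration, not the working group's
algorithm: the function name identity_query, the base IRI
http://example.com/ and the table#attribute IRI layout are assumptions made
for the example. As in the pattern above, the first attribute is matched as
a required triple and every remaining attribute is wrapped in OPTIONAL.

```python
def identity_query(table, attributes, base="http://example.com/"):
    """Build a SPARQL query mirroring SELECT * FROM table.

    Hypothetical helper: the IRI layout <base + table + '#' + attribute>
    is an assumption for this sketch, not part of any specification.
    """
    first, rest = attributes[0], attributes[1:]
    variables = " ".join(f"?{a}" for a in attributes)
    # The first attribute is required, following (?x A1 ?A1).
    patterns = [f"  ?x <{base}{table}#{first}> ?{first} ."]
    # Every other attribute is optional, so a missing triple leaves
    # the variable unbound instead of dropping the whole row.
    patterns += [
        f"  OPTIONAL {{ ?x <{base}{table}#{a}> ?{a} }}" for a in rest
    ]
    return f"SELECT {variables} WHERE {{\n" + "\n".join(patterns) + "\n}"


print(identity_query("person", ["id", "age"]))
```

The list of attributes has to come from the relational schema, which is why
the schema is needed to build the query.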
>
> So now what we are missing are the Nulls.
>
> In the sparql query Q*, the solution mapping does not output nulls. But the
> result of the relational query Q does output nulls. This is where function T
> comes in. Function T maps a relational query output to a sparql solution
> mapping... and all this function does is "not output the nulls". Given that
> we have the schema we can reconstruct the nulls. For example, if I have the
> following:
>
> Q(S,I) = {id = 1, age = null}
>
> Then
>
> T(Q(S,I)) = {id = 1}
>
> This is going to be equal to the sparql solution mapping. If we want to
> reverse this with T', given the schema we know that the attributes consist
> of "id" and "age", and because the solution mapping only consists of "id",
> all the missing attributes are mapped to null.
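
A minimal sketch of the two functions described above, assuming a relational
row is represented as a Python dict with None standing in for NULL and the
schema is a list of attribute names. The names T and T_inverse and this
representation are assumptions for illustration only.

```python
def T(row):
    """Map a relational row to a solution mapping by dropping NULLs."""
    return {attr: value for attr, value in row.items() if value is not None}


def T_inverse(solution, schema):
    """Rebuild the relational row: attributes missing from the
    solution mapping are mapped back to NULL (None)."""
    return {attr: solution.get(attr) for attr in schema}


schema = ["id", "age"]
row = {"id": 1, "age": None}

solution = T(row)
assert solution == {"id": 1}
assert T_inverse(solution, schema) == row
```

The round trip only works because T_inverse is given the schema; without it,
there is no way to know that "age" was ever an attribute of the row.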
>
> In conclusion, we need to be explicit about this function T and state what
> it does. My proposal is that in the direct mapping we have this function T,
> which maps the null value to "nothing", and T', which maps the missing
> attributes to null. I believe that with my proposal, everything should work.
>
> Enrico, where am I wrong?
>
>
> Juan Sequeda
> +1-575-SEQ-UEDA
> <http://www.juansequeda.com>
>
>
> On Mon, Jun 13, 2011 at 3:05 PM, Enrico Franconi <franconi@inf.unibz.it>
> wrote:
>
>> I have the impression that people are considering the presence of explicit
>> NULL values in the data and in the answers as "polluting". In RDBs NULLs are
>> everywhere, in the data and in the answers, since day one. You don't have an
>> option not to see them in the data or in the answer. They are just there,
>> and they have a specific meaning and behaviour (which is the same in Oracle,
>> M$-SQL-server, etc). Why, in mapping RDBs to RDF graphs, do you want to
>> hide them as if they are bearing a chronic disease? And by doing that, why
>> do you want to hamper the possibility to keep in the RDF graph the same
>> behaviour (and meaning) NULLs had in the original RDB?
>> --e.
>>
>
>
Received on Tuesday, 14 June 2011 16:07:53 UTC