Re: Fear for explicit NULL values from Juan Sequeda on 2011-06-14 (public-rdb2rdf-wg@w3.org from June 2011)

From: Juan Sequeda <juanfederico@gmail.com>
Date: Tue, 14 Jun 2011 11:36:31 -0500
To: Enrico Franconi <franconi@inf.unibz.it>
Cc: RDB2RDF WG <public-rdb2rdf-wg@w3.org>
Message-ID: <BANLkTim1bQCz-PyBc0edCprv1rFx6L+GZQ@mail.gmail.com>
From a theoretical perspective and with my academic hat on... I don't care.
Actually, in our paper, we are doing both. These approaches have their
own interesting properties.

From a developer and reality perspective... people are not expecting nulls
in their RDF. They will be confused. They don't want to see nulls! See [1].

Let's try to make everybody happy. I think we can :)

[1] http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2011May/0062.html

Juan Sequeda
+1-575-SEQ-UEDA
www.juansequeda.com


On Tue, Jun 14, 2011 at 11:29 AM, Enrico Franconi <franconi@inf.unibz.it> wrote:

>
> On 14 Jun 2011, at 18:05, Juan Sequeda wrote:
>
> On Tue, Jun 14, 2011 at 10:39 AM, Enrico Franconi <franconi@inf.unibz.it> wrote:
>
>> I don't have an answer, and I'd be happy to work offline with you and
>> Marcelo on this; I have been working on the semantics of normative SQL null
>> values for 8 months.
>> Obviously, I do have an answer in the case you materialise the NULLs - you
>> still didn't say why you don't like my proposal.
>>
>
> As I previously mentioned, most if not all databases will have null values,
> so your proposal would not allow the direct mapping to be used.
>
>
> I meant: my proposal to materialise the NULL values. So I agree with your
> view below.
> --e.
>
>
> The following is my personal view and vision: I believe that the direct
> mapping will be crucial for the RDF/semantic web and data integration world
> for the following reasons:
>
> 1) Instead of a blank R2RML file, you will first direct map your database
> to generate a pre-populated R2RML file. D2R does this. And so does Revelytix
> (if I'm not wrong). They may or may not follow the direct mapping standard,
> but the general idea is to have *a* direct mapping so the author of the
> R2RML file doesn't start blank
> 2) People may have really nicely modeled databases, or only want to export a
> few tables to RDF. Even though this is not part of the group's charter, I
> foresee people implementing Direct Mapping+ where you choose which tables
> you want to direct map. In the end, all you need to do is a string
> substitution from the automatically generated labels (ex:name) to the labels
> that you really want (foaf:name)
> 3) If you are comfortable with semantic web technologies, you can direct
> map your database and then customize the output with SPARQL construct and/or
> RIF
> 4) If you are comfortable with the database, and have access to create
> views and have the Direct Mapping+, you can create the views for the data
> that you want to export and then only direct map.
>
> These are just 4 possible use cases that I see for the direct mapping. It
> was very encouraging to talk to people at Semtech about RDB2RDF because they
> are starting to realize the benefits of RDF and they are now thinking a bit
> ahead of themselves: "if I'm going to use RDF for data integration... how do
> I get my rdb data into RDF then?".
>
> So... let's not kill the direct mapping please.
>
>
>> But let's do it offline.
>>
>
> Sounds good.
>
>> --e.
>>
>>
>> On 14 Jun 2011, at 17:26, Juan Sequeda <juanfederico@gmail.com> wrote:
>>
>>
>> On Jun 14, 2011, at 10:19 AM, Enrico Franconi <franconi@inf.unibz.it> wrote:
>>
>> I am now seeing something concrete, finally.
>>
>>
>> Thanks. I've actually had this email in draft for a while.
>>
>> This is actually part of the paper that Marcelo Arenas, Dan Miranker and I
>> are writing.
>>
>> We need to double check the generality of the approach.
>>
>>
>> Now we can work on this together. I'm sure we can find the correct
>> solution instead of coming up with a dramatic proposal :)
>>
>> For example: what if you have a query asking just for the values of an
>> attribute which may contain NULL values (so you don't output the id)?
>> And: how do you solve my query (c) in the wiki? It seems to me that you
>> need also to have a notion of the schema of the answer set.
>> These are the kind of questions I'd like to see answered :-)
>>
>>
>> I need to relook at this. Do you have an answer?
>>
>> Enrico, do you think we can answer your questions over the phone in the
>> next hour? Otherwise, I propose that we skip this topic on today's call so
>> we can have progress on other issues. It looks like we can have more
>> progress over email? Is this ok with you?
>>
>> If so, would this be ok with the chairs?
>>
>> Btw, I'll be in Chile next week with Marcelo and we could have a separate
>> call with interested parties to address this particular issue. My only
>> concern is to get things done for the sept 1 deadline
>>
>> Looking forward to your comments
>>
>>
>> On 14 Jun 2011, at 15:39, Juan Sequeda <juanfederico@gmail.com> wrote:
>>
>> Why?
>>
>> Because, IMO, this is what the general RDF audience wants, and we should
>> create a standard that people are going to use. That is why I disagree with
>> your proposal, Enrico. We can't simply state that the direct mapping is not
>> applicable 95% of the time. Then we would have wasted 2 years of work on the
>> direct mapping.
>>
>> Our task is to bridge the gap and make sure that everything works... and I
>> believe that everything will work. The main concern is that if we do not map
>> the NULLs, our mapping will not be information preserving. In other words,
>> "how to rebuild the correct answers with explicit NULLs using the direct
>> mapping". So let me break this down. I believe information preservation
>> holds in the following way:
>>
>> Let S be a relational schema and Q a relational query over S. Then there
>> exists a SPARQL query Q* such that for every instance I of S:
>>
>> T(Q(S,I)) = Q*(M(S,I))
>>
>> Any relational query Q can be broken down into a set of identity queries,
>> each of which is essentially SELECT * FROM table, one for every table that
>> is part of the query. This identity relational query is equal to the
>> following SPARQL query:
>>
>> (?x A1 ?A1) OPT ... OPT (?x An ?An)
>>
>> where Ai is every attribute of the table. This is where the schema comes in. We
>> need to know all the attributes that are part of each table so we can build
>> this sparql query.
>>
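A minimal sketch of how the identity query described above might be generated from the schema, written in Python. The function name `identity_query`, the `http://example.org/person#` prefix, and the decision to treat one key attribute as mandatory (assumed never NULL) are illustrative assumptions, not something the direct mapping draft defines.

```python
def identity_query(prefix, key, attributes):
    """Build a SPARQL SELECT string mirroring `SELECT * FROM table`.

    prefix     -- IRI prefix for attribute predicates (hypothetical scheme)
    key        -- attribute assumed to be non-NULL in every row
    attributes -- remaining attribute names, each of which may be NULL
    """
    variables = " ".join(f"?{a}" for a in [key, *attributes])
    lines = [
        f"SELECT ?x {variables} WHERE {{",
        # The key triple anchors the row; it is required, not optional.
        f"  ?x <{prefix}{key}> ?{key} .",
    ]
    for a in attributes:
        # A NULL was never mapped to a triple, so each nullable
        # attribute has to be matched with OPTIONAL.
        lines.append(f"  OPTIONAL {{ ?x <{prefix}{a}> ?{a} }}")
    lines.append("}")
    return "\n".join(lines)


print(identity_query("http://example.org/person#", "id", ["age"]))
```

Rows whose `age` is NULL still appear in the result; the `?age` variable is simply left unbound for them.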
>> So now what we are missing are the Nulls.
>>
>> In the sparql query Q*, the solution mapping does not output nulls. But
>> the result of the relational query Q does output nulls. This is where
>> function T comes in. Function T maps a relational query output to a sparql
>> solution mapping... and all this function does is "not output the nulls".
>> Given that we have the schema we can reconstruct the nulls. For example, if
>> I have the following:
>>
>> Q(S,I) = {id = 1, age = null}
>>
>> Then
>>
>> T(Q(S,I)) = {id = 1}
>>
>> This is going to be equal to the SPARQL solution mapping. If we want to
>> reverse this with T', then given the schema we know that the attributes
>> consist of "id" and "age", and because the solution mapping only consists of
>> "id", all the missing attributes are mapped to null.
>>
>> In conclusion, we need to be explicit about this function T and state what
>> it does. My proposal is that in the direct mapping we have this function T,
>> which maps the null value to "nothing", and T' will map the missing attributes
>> to null. I believe that with my proposal, everything should work.
>>
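The two functions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a row is modelled as a dict, SQL NULL as `None`, and the names `T` and `T_inverse` (for T') as well as the list-of-attribute-names schema are hypothetical choices, not part of any specification.

```python
def T(row):
    """Map a relational result row to a solution mapping by dropping NULLs."""
    return {attr: value for attr, value in row.items() if value is not None}


def T_inverse(schema, solution):
    """Rebuild the relational row: every schema attribute that is missing
    from the solution mapping is filled in as NULL (None)."""
    return {attr: solution.get(attr) for attr in schema}


row = {"id": 1, "age": None}      # Q(S,I) from the example above
solution = T(row)                 # only the bound attribute survives
restored = T_inverse(["id", "age"], solution)
assert restored == row            # the NULL is recovered from the schema
```

The round trip relies on knowing which attributes the answer is supposed to contain, which is the "schema of the answer set" question raised earlier in the thread.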
>> Enrico, where am I wrong?
>>
>>
>> Juan Sequeda
>> +1-575-SEQ-UEDA
>> www.juansequeda.com
>>
>>
>> On Mon, Jun 13, 2011 at 3:05 PM, Enrico Franconi <franconi@inf.unibz.it> wrote:
>>
>>> I have the impression that people are considering the presence of
>>> explicit NULL values in the data and in the answers as "polluting". In RDBs
>>> NULLs are everywhere, in the data and in the answers, since day one. You
>>> don't have an option not to see them in the data or in the answer. They are
>>> just there, and they have a specific meaning and behaviour (which is the
>>> same in Oracle, M$-SQL-server, etc). Why, in mapping RDBs to RDF graphs, do
>>> you want to hide them as if they were carrying a chronic disease? And by
>>> doing that, why do you want to hamper the possibility of keeping in the RDF
>>> graph the same behaviour (and meaning) NULLs had in the original RDB?
>>> --e.
>>>
>>
>>
>
>
Received on Tuesday, 14 June 2011 16:37:20 UTC