Re: a rigorous review of http://www.w3.org/2001/sw/rdb2rdf/wiki/RDBNullValues#Direct_Mapping from Ivan Herman on 2011-06-17 (public-rdb2rdf-wg@w3.org from June 2011)

From: Ivan Herman <ivan@w3.org>
Date: Fri, 17 Jun 2011 14:53:21 +0200
To: Eric Prud'hommeaux <eric@w3.org>
Cc: Enrico Franconi <franconi@inf.unibz.it>, RDB2RDF WG <public-rdb2rdf-wg@w3.org>
Message-Id: <DB6FD98B-7263-4BDF-B7E0-FE763AAAEE29@w3.org>
Eric,

just to make it clear for the rest of the WG, too. We passed a resolution on the last telco:

[[[
RESOLUTION: Re ISSUE-42, the WG decides add to the Direct Mapping spec the generating of rdfs:range (and possibly other) schema triples and also add a note to the Direct Mapping spec saying: "The DM does not generate triples for NULL values. For a detailed discussion of this issue, see the following document" which points to a separate working group note
]]]

and your claim based on the analysis below is that even those extra statements are unnecessary in practice in view of the practical ways the DM is used. I can see value of reducing extra dependencies in the document and in the generated triples...

For myself, I find your arguments compelling, but let us see what the others would say...

Ivan



On Jun 15, 2011, at 17:20 , Eric Prud'hommeaux wrote:

> * Enrico Franconi <franconi@inf.unibz.it> [2011-06-15 12:52+0200]
>> I don't know what is the purpose you were imagining for this example.
> 
> Your text "You don't care, but in SQL there is a huge difference. As a matter of fact in SQL you can model both and they would behave in different ways. And the data we are talking about does come from SQL RDBs." in <http://www.w3.org/mid/9B2183C1-7559-4E42-A0B8-A3288FE90C3D@inf.unibz.it>, indicated that this would provide evidence that the direct graph was missing critical information which was available to SQL.
> 
>> I made it with the only purpose to show that:
>> 
>> 1) If you want to model absence of values in SQL you should not use NULL values, but you should use explicit tables obtained by the decomposition of attributes with potential missing values;
>> 
>> 2) In SQL, having absent values gives a different behaviour from having NULL values (see query (c) in the wiki);
> 
> Query (c) doesn't ask the same question of the SQL and RDF representations. You propose that SQL: (select ID from R) MINUS (select ID from R) yields an empty set (true) while the SPARQL analog on the DG: (?X type R) MINUS (?X A :bn) does not. The SPARQL analog for the SQL query is (_:x ID ?id) MINUS (_:y ID ?id), which yields an empty set for any input. The premis of all of your queries in <http://www.w3.org/2001/sw/rdb2rdf/wiki/RDBNullValues#Direct_Mapping> is that the correct answer includes a NULL. This is fundamentally different from how RDF encodes information. The other issue you call out in <RDBNullValues#Direct_Mapping> is that two different databases can create the same DG. That's not a problem; the querier can get at all of the information they want by composing schema-relevent queries.
> 
> The Contacts database emulates exactly the behavior you were calling out in DB1 and contrasts it with another row:
>  ┌┤Contacts├──────┐
>  │ name │ company │
>  ├──────┼─────────┤
>  │  Sue │    NULL │ isomorphic to DB1 row 1: ID→name, A→company
>  │  Bob │   BobCo │ another potential row in DB1's schema
>  └──────┴─────────┘
> Let's assume your schema constraints:
>  [Contacts.name is primary key]
>  [Contacts.company is nullable]
> DG:
>  <Contacts/name=Bob> a <Contacts> ; <Contacts#name> "Bob" ; <Contacts#company> "BobCo" .
>  <Contacts/name=Sue> a <Contacts> ; <Contacts#name> "Sue" .
> You propose that we model NULLs explicity, so I'll call that DGNULL:
> DGNULL:
>  <Contacts/name=Bob> a <Contacts> ; <Contacts#name> "Bob" ; <Contacts#company> "BobCo" .
>  <Contacts/name=Sue> a <Contacts> ; <Contacts#name> "Sue" ; <Contacts#company> <NULL> .
> 
> Following is a translation of the queries expressed in <http://www.w3.org/2001/sw/rdb2rdf/wiki/RDBNullValues#Direct_Mapping> into SPARQL over the Contacts DM (and DMNULL).
> 
> 
> == Query A ==
> 
> Your query (a) asks for everyone with a company ?X:
>  { _:bn a <Contacts> .
>    _:bn <Contacts#company> ?X }:
>      DG                                    DGNULL
>  ┌─────────┐                           ┌─────────┐
>  │ ?X      │                         │ ?X      │
>  │ "BobCo" │                         │ "BobCo" │
>  └─────────┘                         │  <NULL> │
>                           └─────────┘
>                      Analogous SQL:
>  SELECT Contacts.company AS X          SELECT Contacts.company AS X
>    FROM Contacts                         FROM Contacts
>   WHERE company IS NOT NULL
> 
> Given the results you reported ("in DB1: { {?X=null} }"), it appears you wanted a null-able solution:
> Query (a)' asks for everyone with an OPTIONAL company ?X:
> { ?who a <Contacts>
>  OPTIONAL { ?who <Contacts#company> ?X } }:
>      DG                                    DGNULL
>  ┌─────────┐                           ┌─────────┐
>  │ ?X      │    │ ?X      │
>  ├─────────┤    │  <NULL> │
>  │ UNBOUND │    │ "BobCo" │
>  │ "BobCo" │    └─────────┘
>  └─────────┘
>                      Analogous SQL:
>  SELECT Contacts.company AS X
>    FROM Contacts
> The queries over DG are consistent with the behavior of SPARQL over conventional RDF graphs; that is, if you want something to be OPTIONAL, you indicate so in the query.
> You propose adding constraints to the SPARQL query to filter out <NULL> when you e.g. want a non-nullable answer. Simply adding "FILTER (?X != <NULL>)" doesn't work because e.g. strings can't be compared to IRIs:
>      DG                                    DGNULL
>  ┌────┐                                 ┌────┐
>  │ ?X │     │ ?X │
>  └────┘     └────┘
> so we add a guard to make sure we're comparing apples and apples (and still let the oranges through) FILTER (!isIRI(?X) || ?X != <NULL>):
>      DG                                    DGNULL
>  ┌─────────┐                            ┌─────────┐
>  │ ?X      │     │ ?X      │
>  │ "BobCo" │     │ "BobCo" │
>  └─────────┘     └─────────┘
>> From the perspective of query (a), DG is a better choice as it meets user needs without introducing error-prone conventions on top of current RDF practice.
> 
> 
> == Query B ==
> 
> Query (b) asks for the values in the column company after a self-join of Contacts over the column company:
>  { _:bn1 a <Contacts> .
>    _:bn1 <Contacts#company> ?X .
>    _:bn2 a <Contacts> .
>    _:bn2 <Contacts#company> ?X }
>      DG                                   DGNULL
>  ┌─────────┐                           ┌─────────┐
>  │ ?X      │    │ ?X      │
>  │ "BobCo" │    │ "BobCo" │
>  └─────────┘    │  <NULL> │
>      └─────────┘
> Guessing your intent from the stated results ("in DB1: {} in DB2: {}"), I surmise that you further stipulate that _:bn1 != _:bn2, which requires qualified variables:
>  { ?bn1 a <Contacts> .
>    ?bn1 <Contacts#company> ?X .
>    ?bn2 a <Contacts> .
>    ?bn2 <Contacts#company> ?X
>    FILTER (?bn1 != ?bn2) }
>      DG                                    DGNULL
>  ┌────┐                                 ┌────┐
>  │ ?X │     │ ?X │
>  └────┘     └────┘
> Query (b) does not motivate a design change.
> 
> 
> == Query C ==
> 
> I confess I don't understand this one. There's a SQL query "(select ID from R) MINUS (select ID from R)" which I see as
>  { { ?who <Contacts#name> ?ID }
>    MINUS { ?who <Contacts#name> ?ID } }
>      DG                                   DGNULL
>  ┌──────┬─────┐                        ┌──────┬─────┐
>  │ ?who │ ?ID │   │ ?who │ ?ID │
>  └──────┴─────┘   └──────┴─────┘
> agreeing with your expected answers: "in DB1: {} in DB2: {}". You also ask "(?X type R) MINUS (?X A :bn) ", or
>  { { ?X a <Contacts> }
>    MINUS { ?X <Contacts#company> _:bn } }
>      DG                                   DGNULL
> ┌─────────────────────┐                 ┌────┐
> │ ?X                  │   │ ?X │
> │ <Contacts/name=Sue> │   └────┘
> └─────────────────────┘
> but this is a different question than the SQL question. It says "tell me all of the contacts which have no company. The answer in DGNULL is clearly incorrect here.
> 
>> From this I conclude that query (c) motivates no design change.
> 
> 
>> 3) The DMs coming from the representation in SQL with NULLs and the representation in SQL with absent values are "in spirit" the same, namely they state in both cases (modulo some mismatch in the signature) that the value is absent. I am not claiming that the DM are/should be the same for different encodings of conceptually similar data: that's why I'm just saying that they are "in spirit" the same.
> 
> got it. I agree that the 1NF examples are more fruitful.
> 
> 
>> cheers
>> --e.
>> 
>> PS: note that I really meant Q(x), not Q(y,x).
>> 
>> On 15 Jun 2011, at 05:47, Eric Prud'hommeaux wrote:
>> 
>>> * Enrico Franconi <franconi@inf.unibz.it> [2011-06-14 22:21+0200]
>>>> On 14 Jun 2011, at 21:45, Eric Prud'hommeaux wrote:
>>>> 
>>>>> * Enrico Franconi <franconi@inf.unibz.it> [2011-06-14 17:35+0200]
>>>>>> As I said, if you are a good db designer, you would design the schema of your db in a way that no attribute is nullable if you want to represent just total absence of values. How? By decomposing the potentially nullable attributes as separate (pseudo binary) relations (primary key of the relation + the attribute), and by adding a foreign key. The attributes in this relation will never have NULL values, since the absence of a value would be represented as the absence of the tuple. This is exactly your proposed DM where the null values just mean absence of information.
>>>>> 
>>>>> it sounds like you have an example in mind. if you share it with us, we can use it to make informed modelling decisions.
>>>>> 
>>>>>> On the other hand, in SQL I can also write a relationship with some nullable attributes. In this case I mean something different, namely the ambiguity between the total absence of a value and its presence but with an unknown specification.
>>>>>> Queries over nullable attributes may have the NULL value in the answer; its presence may affect further queries, such as in the query (c) in the wiki.
>>>> 
>>>> From your example:
>>>> 
>>>> 
>>>> ┌┤Contacts├──────┐
>>>> │ name │ company │
>>>> ├──────┼─────────┤
>>>> │  Bob │   BobCo │
>>>> │  Sue │    NULL │
>>>> └──────┴─────────┘
>>>> [Contacts.name is primary key]
>>>> [Contacts.company is nullable]
>>> 
>>> DG:
>>> <Contacts/name=Bob> <Contacts#name> "Bob" ; <Contacts#company> "BobCo" .
>>> <Contacts/name=Sue> <Contacts#name> "Sue" .
>>> 
>>> 
>>>> If you really mean that Sue does NOT have any company associated to it, the proper modelling would be:
>>>> 
>>>> ┌┤Person│
>>>> │ name │
>>>> ├──────┤
>>>> │  Bob │
>>>> │  Sue │
>>>> └──────┘
>>>> [Person.name is primary key]
>>>> 
>>>> ┌┤hasContact├────┐
>>>> │ name │ company │
>>>> ├──────┼─────────┤
>>>> │  Bob │   BobCo │
>>>> └──────┴─────────┘
>>>> [{hasContact.name,hasContact.company} is primary key]
>>>> [foreign key from Person.name to hasContact.name]
>>>> [hasContact.company not-nullable]
>>> 
>>> DG':
>>> <Person/name=Bob> <Person#name> "Bob" .
>>> <Person/name=Sue> <Person#name> "Sue" .
>>> <hasContact/name=Bob,company=BobCo> <hasContact#name> <Person/name=Bob> ;
>>>                                   <hasContact#company> "BobCo" .
>>> 
>>> 
>>>> Note that the latter database is exactly the same encoding as the one produced by the DM from the former database (the foreign key being encoded with the rdf:type), namely the one where the intent of the designer is to capture the ABSENCE of a value.
>>>> 
>>>> Indeed, if I ask for all the companies, I'd get from the first database:
>>>> Q(x) :- Contacts(y,x)
>>> 
>>> Should that be Q(y,x), given the result {BobCo, NULL} below?
>>> Presuming so, and converting to SQL:
>>> SELECT name, company FROM Contacts
>>> 
>>>> --> {BobCo, NULL}
>>>> while I'd get from the second database:
>>>> Q'(x) :- Person(y),hasContact(y,x)
>>> 
>>> SQL: SELECT Person.name FROM Person INNER JOIN hasContacts ON hasContact.name=Person.name
>>> 
>>>> --> {BobCo}
>>>> 
>>>> So, I get definitely two different answers. Note again that a direct use of the DM (without the schema) consistently gives you the second answer.
>>>> 
>>>> Now, Juan claims that with the schema we can reconstruct the answer of the first kind from representations of the second kind. I believe that it is possible - after all, we do not have information loss if we have the schema. The problem could be that relational algebra (e.g., SPARQL) may not be expressive enough to reconstruct systematically the right answers from the representations of the second kind for all relational algebra queries. That's way I did not guarantee a positive outcome from this investigation.
>>> 
>>> Above, you've outlined two ways to encode application data, requiring the data to be accessed by different SQL (or relational algebra, or tuple calculus) queries. Likewise, the DG is different for the two databases, as are the SPARQL queries required to access this data:
>>> 
>>> DG: SELECT ?n ?c { ?who <Contacts#name> "Bob" 
>>>       OPTIONAL { ?who <Contacts#company> ?c } }
>>> 
>>> DG': SELECT ?n { ?contact <hasContact#name> ?who ;
>>>                         <hasContact#company> "BobCo" .
>>>                ?who <Person#name> ?n }
>>> 
>>> It is beyond our goals (nor is it achievable) to produce a direct graph which is constant through all potential encodings of conceptually similar application data. I say that our needs are met by the current DM with no modification.
>>> 
>>> 
>>>> cheers
>>>> --e.
>>>> 
>>>> 
>>>> 
>>> 
>>> -- 
>>> -ericP
>> 
> 
> -- 
> -ericP
> 


----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
PGP Key: http://www.ivan-herman.net/pgpkey.html
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Friday, 17 June 2011 12:50:32 UTC