Re: Q: ISSUE-42 bNode semantics

On 22 May 2011, at 21:52, Richard Cyganiak wrote:

> On 22 May 2011, at 18:44, Enrico Franconi wrote:
>>> Both of these omit the really interesting case of equality tests (WHERE col1=col2 in SQL; FILTER(?col1=?col2) in SPARQL).
>> 
>> No. CQs (or SPJ queries) *do* include joins - namely equality.
> 
> I meant that col1 and col2 are in the same table. That's not a join, it's a selection in relational algebra terms.

Nope. 
SELECT * FROM R WHERE R.A1=R.A2 is a join.
SELECT * FROM R WHERE R.A1=c    is a select.
It universally known that CQs (existential conjunctive fragment of FOL), SPJ queries (the SPJ fragment of relational algebra), SFW queries (the positive select-from-where fragment of SQL) are equally expressive. Just to set a common terminology.

>>> The thing is that NULL=NULL is not true in SQL, but "NULL^^rdb2rdf:null="NULL^^rdb2rdf:null is true in SPARQL. (The translation in the wiki is incorrect for this case.)
>> 
>> That's exactly my argument. And that's why I introduce in the wiki the (?X ≠ NULL) conjunct to let things behave well again. It can be proved that this is enough to recover correctness and completeness in the case of SPJ.
> 
> Adding a (?x != "NULL"^^rdb2rdf:null) condition on join variables makes the joins work, but not equality constraints in the general case. You have to re-implement SQL's 3VL to make the general case work. That might well be possible, but it's certainly a bit more complicated than what you have shown so far.

As I said, this is correct for CQ/SPJ/SFW queries; I don't know at this point what do you really mean for equality constraints; if they are what I said, then it is still correct. 
And as I said, going more expressive than that requires some thinking, but for the boolean combinations of CQs it is just enough to read the SQL specs for the booleans (where basically it is said how to deal with the duplicates of NULLs); that's why I was proposing this fragment. Other extensions seem more problematic (to me!) since at that point we are going to have a (syntactic) mismatch between SQL and SPARQL (see Arenas&Gutierrez) and finding a "natural" correspondence may be hard.

>>> The non-null-preserving RDB-to-RDF translation (with a naive query translation) *is* complete and correct for query answering, if one considers only QA over BGPs using SPARQL semantics. (Because where SQL selection semantics allows null values in query results, BGP matching semantics simply rejects such tuples.)
>> 
>> I also believe this is true (on the basis that a null being an absent value corresponds to a failure of the join), but the whole point is that it doesn't scale up when you make the query language slightly more expressive than CQs (or BGPs) in absence of schema information *and* peculiar translation of the queries. See in the wiki my example with MINUS.
> 
> I don't know what you mean here. BGP matching doesn't include MINUS. Neither does SPARQL (at least in 1.0). SPARQL 1.0 has !BOUND, which is easily translated to SQL.

MINUS can be easily emulated in SPARQL. This non direct correspondence between SQL and SPARQL is exactly what worries me - as I said in my previous paragraph above. E.g., it is natural to define boolean combinations of CQs in SQL and in Relational Algebra (RA), but this has an unnatural correspondence in SPARQL. However, in order to understand the behaviour of SQL NULLs, we have to start with the SQL specs, following its grammar.

>>> I believe that it the non-null-preserving translation, in the presence of schema information, is actually correct and complete for all of SPARQL 1.0.
>> 
>> That's the tricky bit: "in the presence of schema information"
> 
> I don't see why that's tricky -- to map the data, you need schema information anyway; and I don't see how you can translate “SELECT * FROM table1” to SPARQL without schema information either.

Sure. But the tricky part is how to exploit the schema information to devise the translation of the query...

>> (and I add *and* peculiar translation of the queries).
> 
> Sure.

Exactly - this is what I meant.

>> Maybe yes, maybe not. Personally, I'd try not to reinvent the hot water, and I'd go through the same route as SQL, since this is exactly what we are mapping from.
> 
> We are mapping the *data* from relational to RDF. But mapping the *queries* from SPARQL to SQL makes more sense because that is what implementers actually need.

Indeed.

>>> I think this can be shown using the literature on SPARQL-to-SQL translation, e.g., [1][2][3].
>> 
>> Please tell me exactly where to look at.
> 
> [1] is a good place to start. It shows how to map all of SPARQL to equivalent SQL queries.

But we don't need that, we need the opposite. And we need SQL semantics of NULLs to start with, not the bnode semantics for NULLs.

> It assumes a single-table three-column s,p,o database though, rather than the traditional domain database schemas we are considering here, so it would need some examination on how to adapt it to this 
> 
>>> If we had to chose between basing the criteria for correctness on SPARQL (BGP matching) semantics or SQL semantics (like in the proposal on the wiki), then I would argue that SPARQL is more appropriate.
>> 
>> I don't understand what are you saying here. The correctness is about the QA problem, which involves both SQL and SPARQL.
> 
> SQL semantics includes NULL. SPARQL semantics doesn't -- it just has unbound which works differently. Therefore, when translating SPARQL to SQL, dealing with the issue of NULL is easier.

Indeed, there are no nulls :-)

> Add an IS NOT NULL whenever a variable is bound, and COALESCE on join variables that could potentially be unbound, and that should cover everything.

Yes, yes, but this does not help our issue at hand.

>>> So the expressivity that is ultimately needed is complete SPARQL, and numerous implementations of this have been around for years.
>> 
>> Of course again. But the issue is how to prove the correctness of the whole endeavour. You don't want really to have a standard without be certain that what you are proposing is correct. It seems to me that NULL values introduce enough complexity for a WG: usually things like these should be already in the practice and the theory before they become standard, and AFAIK NULLs are not in this position.
> 
> NULLs may not be in the theory of RDB-to-RDF translation, but they certainly are in practice already. The implementations I'm familiar with do not generate triples for NULL values, and at least none of the 15,000 users who have downloaded D2RQ have complained about that.

But this does not mean that we can standardise something without understanding the meaning beyond that, but most importantly how it relates with the source information in the RDB: namely, is the whole process sound & complete?

>>> That being said, I would disagree that query answering is sufficient for establishing the correctness of a mapping. Using only this criterion would potentially allow for multiple different correct and complete mappings.
>> 
>> Of course, as soon as they all give you the very same undistinguishable answer, I don't see why people can not have their own optimised version of the translation.
> 
> Just to be clear: The QA-only criterion would allow for multiple correct and complete mappings that produce *different* results. For example, creating a distinct blank node for each NULL in an input table, instead of "NULL"^^rdb2rdf:null -- I think the query translation in this case would be even easier than in the proposal on the wiki.
> 
> And the whole point of the direct mapping effort is to have a single, standard mapping.

I accept the idea to have a standard mapping, so that one can build interoperable systems. 
However, (1) your example does not work (since I couldn't distinguish bnodes from NULLs, which is at the heart of the SQL based proposal to deal with NULLs), and (2) the criterion of soundness and completeness by itself enforces the fact that the results are always the same (otherwise it wouldn't be sound and complete!).

>>> I think that a further criterion has to be compatibility with the formal RDF semantics.
>> 
>> But this is impossible with NULL values, it seems to me.
> 
> I thought you had agreed above that the null-ingoring mapping is correct with regard to SPARQL BGP matching,

Yes.

> and perhaps all of SPARQL 1.0.

No. 
And this is why I said it is possibly impossible :-)

> It satisfies the QA criterion (same responses over the RDF and the RDB), and is compatible with RDF semantics.

Only for BGPs and CQs.

>> Remember that you are modelling NULL values, which really mean the disjunction between absence of information and the existence of information, by just saying that they mean absence of information.
> 
> RDF can't express “there is no value”. So in my eyes, *any* mapping is necessarily incomplete because it cannot capture that notion.

Indeed this is exactly what I am saying!!! (Well, to be more precise, we need the “there is no value or there is an unknown value”)

> But that's not a problem. RDF is monotonic, so simply not translating these cells yields incomplete but correct information. (This is a different notion of completeness and correctness, obviously.)

Please define your notion of soundness and incompleteness. 
RDF is monotonic but RDBs are *not*. Since you start with RDBs and you want to encode faithfully their information in RDF *plus* SPARQL on top of it, I conclude that with respect to the logic, the encoding in RDF+SPARQL has to me non-monotonic. 
Anyway, independently on what each if us means by monotonic in this context (I suggest to drop this particular issue...), clearly you are trying to say that your mapping is more elegant, for some notion of elegance (maybe your notion of correctness and completeness): FINE. Still, the issue remains. How to deal with cases beyond BGPs. I already know exactly how to deal with booleans of CQs, and I propose a way to go on further if there is interest and resources in the WG (basically look at what has been done in SQL, where NULLs are actually materialised and therefore they behave as standard constants with some *noted* exceptions in the specs). I still don't see how can we have a clue in your case, and the papers you referred to do not help.

>> While for some cases this may work, in general it is not guaranteed to work, and it wouldn't work when you introduce proper disjunctive information such as in OWL (DL or Lite fragments, however are they called now...).
> 
> What do you mean by “would not work”? What problem would there be with OWL?

OK, this is tricky to explain. Basically, when RDF graphs really have to be considered as open world information (i.e., ABoxes) and where there is real negation and real disjunction, then the representation of disjunctive information such as "no value or some unknown value" as a simple "no value" is going to make a real difference.
But again, this is not so important, by now. With schema information, I can always generate a RDF graph with explicit NULLs (which may be digested by OWL) from your graph without NULLs (and viceversa). What you can not reproduce is the behaviour of SPARQL from one case to the other, obviously.

>> I'd be happy to realise that what you are aiming at could be true, but it sounds like a dream...
>> I try to be more realistic, and I propose that RDF is just a mean to obtain something sound and complete.
>> 
>>> Furthermore, when talking about completeness, I submit that given a correct mapping graph G, any graph G' is also correct (but possibly incomplete) if G RDF-entails G'. Under this -- in my eyes very appropriate -- notion of correctness, it is easy to show that the non-null-translating mapping is correct (but incomplete) since it is RDF-entailed by any null-translating mapping.
>> 
>> Interesting property, but since you push for your own mapping :-) I guess you have to show some (in)formal evidence that this can be achieved.
> 
> Well, let's assume your mapping which translates NULL to a special literal is correct.
> An RDF graph entails any of its subgraphs. The mapping which ignores NULLs is a subgraph of your mapping for any given database. So your mapping RDF-entails the null-ignoring mapping, which would therefore also be correct.

I don't think you can make this line of reasoning.
Take just two standard NULL-free non-empty tables R1 and R2. Consider the query R1 MINUS R2. The graph with R1 and the empty table R2 is RDF-entailed. But now the previous query may get a bigger answer, and therefore it is unsound. Similar observations can be made with aggregated queries. So: we can not have the property you desire, not even for the standard cases.

cheers
--e.

> 
> Best,
> Richard
> 
> 
>> 
>> cheers
>> --e.
>> 
>>> 
>>> Best,
>>> Richard
>>> 
>>> [1] Chebotko et al, Semantics preserving SPARQL-to-SQL translation
>>>  http://www.sciencedirect.com/science/article/pii/S0169023X09000469
>>> [2] Pérez et al, Semantics and complexity of SPARQL
>>>  http://portal.acm.org/citation.cfm?id=1567278
>>> [3] Cyganiak, A relational algebra for SPARQL
>>>  http://www.hpl.hp.com/techreports/2005/HPL-2005-170.pdf
>>> 
>>> 
>>>> On 21 May 2011, at 16:23, Richard Cyganiak <richard@cyganiak.de> wrote:
>>>> 
>>>>> On 20 May 2011, at 23:04, Enrico Franconi wrote:
>>>>>>> Your argument hinges on a claim that one proposal is correct, and another is incorrect. Can you state the criteria that an RDB2RDF mapping has to fulfill so that you consider it correct and complete?
>>>>>> 
>>>>>> As I say in the wiki, I consider the query answering (QA) problem. A translation from data and queries in RDB/SQL to RDF/SPARQL is sound whenever any tuple returned in the translated QA problem is also returned in the original QA problem; it is complete if any tuple returned in the original QA problem is also returned in the translated QA problem.
>>>>> 
>>>>> What expressivity of queries does the translation of queries have to cover? SPARQL, SQL or something else?
>>>>> 
>>>>> Thanks,
>>>>> Richard
>>>> 
>>> 
>> 
> 

Received on Sunday, 22 May 2011 21:54:59 UTC